Revision 37368

Added by Marek Horst about 9 years ago

#1315: Propagating confidenceLevel to DocumentToConceptIds. Updating the Pig transformer script by introducing a concept identifier deduplication UDF that picks the record with the highest confidence level for each identifier; introducing unit and integration tests. Propagating the changes to the document-to-concepts exporter module.

View differences:

transformer.pig
@@ -7,13 +7,15 @@
 'index', '0',
 'schema', '$schema_output_document_to_research_initiatives');
 
+define DEDUPLICATE_IDS_WITH_CONFIDENCE eu.dnetlib.iis.transformers.udfs.DeduplicateIdsWithConfidence;
+
 documentToResearchInitiative = load '$input_document_to_research_initiative' using avro_load_document_to_research_initiative;
 
 researchInitiativeGroupped = group documentToResearchInitiative by documentId;
 researchInitiative = foreach researchInitiativeGroupped {
-    ids = foreach documentToResearchInitiative generate conceptId;
-    distinctIds = distinct ids;
-    generate group as documentId, distinctIds as conceptIds;
+    idsWithConfidence = foreach documentToResearchInitiative generate conceptId as id, confidenceLevel;
+    dedupIdsWithConfidence = DEDUPLICATE_IDS_WITH_CONFIDENCE(idsWithConfidence);
+    generate group as documentId, dedupIdsWithConfidence as concepts;
 }
 
 store researchInitiative into '$output_document_to_research_initiatives' using avro_store_document_to_research_initiatives;
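
The source of DeduplicateIdsWithConfidence is not part of this diff, so the following is only a rough sketch of the behaviour the commit message describes: for each concept identifier, keep the record with the highest confidence level. It is plain Java without any Pig dependency. The class names ConceptSketch and DeduplicateIdsSketch, the float type of confidenceLevel, and the tie-breaking rule (the first record seen wins) are assumptions made for illustration, not taken from the actual UDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one (id, confidenceLevel) tuple of the Pig bag.
final class ConceptSketch {
    final String id;
    final float confidenceLevel; // assumed type; the real schema may differ

    ConceptSketch(String id, float confidenceLevel) {
        this.id = id;
        this.confidenceLevel = confidenceLevel;
    }
}

final class DeduplicateIdsSketch {

    // Keeps one record per id: the one with the highest confidence level.
    // On equal confidence the first record seen is kept (an assumption).
    static List<ConceptSketch> deduplicate(List<ConceptSketch> concepts) {
        // LinkedHashMap preserves the order in which ids first appear.
        Map<String, ConceptSketch> best = new LinkedHashMap<>();
        for (ConceptSketch candidate : concepts) {
            ConceptSketch current = best.get(candidate.id);
            if (current == null || candidate.confidenceLevel > current.confidenceLevel) {
                best.put(candidate.id, candidate);
            }
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<ConceptSketch> input = Arrays.asList(
                new ConceptSketch("c1", 0.5f),
                new ConceptSketch("c2", 0.9f),
                new ConceptSketch("c1", 0.8f));
        for (ConceptSketch c : deduplicate(input)) {
            System.out.println(c.id + " " + c.confidenceLevel);
        }
        // prints:
        // c1 0.8
        // c2 0.9
    }
}
```

In the script above, the equivalent step runs once per documentId group, so each document ends up with at most one entry per concept identifier in its concepts field.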
