Revision 37368
Added by Marek Horst about 9 years ago
| Old | New | transformer.pig |
|---|---|---|
| 7 | 7 | `'index', '0',` |
| 8 | 8 | `'schema', '$schema_output_document_to_research_initiatives');` |
| 9 | 9 | |
| | 10 | `define DEDUPLICATE_IDS_WITH_CONFIDENCE eu.dnetlib.iis.transformers.udfs.DeduplicateIdsWithConfidence;` |
| | 11 | |
| 10 | 12 | `documentToResearchInitiative = load '$input_document_to_research_initiative' using avro_load_document_to_research_initiative;` |
| 11 | 13 | |
| 12 | 14 | `researchInitiativeGroupped = group documentToResearchInitiative by documentId;` |
| 13 | 15 | `researchInitiative = foreach researchInitiativeGroupped {` |
| 14 | | `ids = foreach documentToResearchInitiative generate conceptId;` |
| 15 | | `distinctIds = distinct ids;` |
| 16 | | `generate group as documentId, distinctIds as conceptIds;` |
| | 16 | `idsWithConfidence = foreach documentToResearchInitiative generate conceptId as id, confidenceLevel;` |
| | 17 | `dedupIdsWithConfidence = DEDUPLICATE_IDS_WITH_CONFIDENCE(idsWithConfidence);` |
| | 18 | `generate group as documentId, dedupIdsWithConfidence as concepts;` |
| 17 | 19 | `}` |
| 18 | 20 | |
| 19 | 21 | `store researchInitiative into '$output_document_to_research_initiatives' using avro_store_document_to_research_initiatives;` |
#1315: propagating confidenceLevel to DocumentToConceptIds. Updated the Pig transformer script with a concept identifier deduplication UDF that picks the record with the highest confidence level; added unit and integration tests. Propagated the changes to the document-to-concepts exporter module.
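
The deduplication rule described in the commit message (keep, for each concept identifier, the record with the highest confidence level) can be sketched in plain Java. This is only an illustration and not the actual `DeduplicateIdsWithConfidence` UDF: the class `DedupSketch`, the `Concept` type, and the `deduplicate` method are hypothetical names, the confidence level is assumed to be a non-null `float`, and the tie-breaking and ordering behaviour are assumptions rather than something this changeset confirms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the core logic of the UDF; it does not use the Pig API.
final class DedupSketch {

    // One (id, confidenceLevel) pair, mirroring the tuples built in the Pig script.
    static final class Concept {
        final String id;
        final float confidenceLevel; // assumption: a non-null float

        Concept(String id, float confidenceLevel) {
            this.id = id;
            this.confidenceLevel = confidenceLevel;
        }
    }

    // Keeps one entry per id: the one with the highest confidence level.
    // Assumptions: on a tie the first entry seen wins, and ids keep first-seen order.
    static List<Concept> deduplicate(List<Concept> concepts) {
        Map<String, Concept> best = new LinkedHashMap<>();
        for (Concept candidate : concepts) {
            Concept current = best.get(candidate.id);
            if (current == null || candidate.confidenceLevel > current.confidenceLevel) {
                best.put(candidate.id, candidate);
            }
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<Concept> input = Arrays.asList(
                new Concept("c1", 0.5f),
                new Concept("c2", 0.9f),
                new Concept("c1", 0.8f));
        // The duplicate "c1" collapses to the entry with the higher confidence level.
        for (Concept c : deduplicate(input)) {
            System.out.println(c.id + " " + c.confidenceLevel);
        }
    }
}
```

Compared with the removed `distinct ids` step, which only dropped repeated identifiers, a rule like this keeps the confidence value alongside each surviving identifier, which is what the new `concepts` field in the script's output carries.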