#1498 adding missing position field
#1498 removing obsolete ingest pmc citation resources
#1498 introducing major citation-related refactoring: new generic direct citation matching moved to processing phase, position field introduced in all citation schemas, collapser updated to take position into account when merging citation details coming from 3 sources: fuzzy citationmatching, direct citationmatching, references metadata
#1422 fixing Java Heap Space error while executing checksum postprocessing workflow on pmc plaintexts
#1381 porting pmc citations ingestion from cascading framework to pig. Moving code from icm-iis-ingest-pmc to icm-iis-transformers including integration tests, removing obsolete scala code along with unneeded dependencies. Switching subworkflow in primary workflow.
expecting null affiliations instead of empty array
adding missing affiliations field in input data, removing duplicates from output
adding missing affiliations field in integration test expected output
#1315 propagating confidenceLevel to DocumentToConceptIds. Updating PIG transformer script by introducing concept identifiers deduplication UDF function picking record with the highest confidence level, introducing unit and integration tests. Propagating changes in document to concepts exporter module.
removing obsolete test resources
#1329 adding affiliations field in ExtractedDocumentMetadata PMC schema. Metadata extraction code refactoring by extracting code responsible for building Affiliation avro records to AffiliationBuilder class and sharing it with pmc ingestion. Implementing affiliations ingestion functionality in PmcXmlHandler covered with unit tests. Adding affiliations field support in ingest pmc metadata transformer.
#1301 skipping transformation when input set to $UNDEFINED$ value
#1301 removing redundant schema parameter
#1301 introducing generic avro to json transformer
bugfix: adding missing start element
#1257 dropping schema generation related hacks in all PIG modules, switching to literal schema parameters
Removing usage of working_dir from Java workflow node.
#1210 introducing generic PIG module filtering inferred data by confidence level
#1195 removing obsolete ports docreation and datasetid from hbase mapred import, removing references to those ports in workflow.xml files, updating transformer by removing filtering by datasetid due to decisions made in #1072
introducing repeatable ordering of citations by sorting them by citation rawText
#1169 fixing duplicate context issue, introducing integration test proving implemented solution works properly
simplifying schema related PIG parameters
#1147 introducing union4 pig script
#1133 dropping useless working_dir creation for java nodes
#1133 dropping useless working_dir creation for pig nodes
#118 introducing website usage community filter filtering out publication identifiers based on ids set retrieved from InformationSpace. This is required to exclude removed publications which were still present in logs.
#118 removing obsolete and duplicate transformer
updating job.properties
#919 renaming DocumentToResearchInitiative to DocumentToConceptId and DocumentToResearchInitiatives to DocumentToConceptIds
#1019 introducing integration test
#919 introducing integration test input and output
#919 introducing integration test containing empty input and output
#919 introducing project to concept transformer module
#1019 introducing PIG module transforming pmc ingested metadata into common extracted document metadata
#963 propagating dataset -> mdstore from import to exporting phase: importer produces DocumentToMDStore datastore utilized by exporter module. Updating transformer definition to handle DocumentToMDStore instead of Identifier schema
#913 renaming DocumentContentUrl#contentSize to DocumentContentUrl#contentSizeKB changing field type from int to long, importing content size from ObjectStoreFile#fileSizeKB, updating dnet-objectstore-rmi dependency from 1.0.0 to 2.0.1-SNAPSHOT
#913 supplementing json files with newly introduced DocumentContentUrl#contentSize field value set to null
#913 introducing DocumentContentUrl#contentSize field, handling it properly in all PIG transformers
#840 moving IdentifierMapping from importer to common package
#840 renaming DeduplicationMapping to more generic IdentifierMapping
adding missing affiliation fields: countryCode, address, renaming country to countryName
#757 introducing doitooaid transformer processing DocumentMetadata datastore holding metadata imported from InformationSpace and creating datastore holding <doi,oaid> pairs which will be used by pmc ingestor for matching references identified by doi
null reference ids removed
updating default job.properties
removing memory-related properties; fixing #757 should solve all memory-related problems
#568 introducing citations grouping by sourceDocumentId, still to be adjusted for ingested pmc citations outcome which currently seems to hang up
#577 introducing UDF producing empty map, two transformers building common Citation datastore from citationmatching and pmc ingestion outcome. Both are required by collapser.
introducing importer/plaintext/skip_extracted transformer required for plaintext import caching
#354 removing obsolete transformers/export/person transformer along with tests
#354 removing obsolete transformers/export/inferenced_document_without_imported_data transformer along with tests
#354 removing obsolete transformers/export/identifier/referenceddatasets transformer along with tests
#354 removing obsolete transformers/export/identifier/documents transformer along with tests
#354 removing obsolete transformers/export/document transformer along with tests
replacing redundant transformers/ingest/pmc/citations with already existing transformers/importer/documentmetadata/idextractor
adding missing "confidenceLevel" field