merging trunk changes with IIS-CDH-5.3.0 branch
#1197 introducing job.properties changes aligning paths to rumcajs cluster HDFS structure
creating IIS-CDH-5.3.0 branch
updating job.properties
#1017 fixing PMC and DOI identifiers retrieval from avro map: addressing by Utf8 objects not by String
#1017 accepting ExtractedDocumentMetadata instead of DocumentText at PMC citation ingestion input. Aligning integration test and importer workflow.
#1017 introducing new PMC metadata ingestion, currently extracting references, journal and pages fields. Replacing DOM/XPath based citations ingestion with much faster SAX version. Changing pmidtooaid transformer to use ExtractedDocumentMetadata instead of parsing XML file. Enabling PMC metadata ingestion in common/import.
#955 fixing reference raw text generation for pretty printed NLM documents
#840 renaming DeduplicationMapping to more generic IdentifierMapping
#840 moving IdentifierMapping from importer to common package
#757 adding reduce phase for filtering out pmids by article type: the map phase groups PmidMapping objects by pmid and the reduce phase filters out duplicates
#757 introducing article type extraction along with unit test. Article type is required to filter out PMC duplicates and keep only the proper types
#757 fixing pmid and doi matching, fixing sourceDocumentId and destinationDocumentId generation
Stub of a solution to the task #576: Ingestion of metadata from EuropePMC.
Refactored code to use the XPathEvaluator.fromString method.
updating default job properties
renaming workflow to ingest_pmc_plaintext
replacing "result" string with Type.result.name()
dir names in parameters should not contain nameNode
rename a field