merging trunk changes with IIS-CDH-5.3.0 branch
creating IIS-CDH-5.3.0 branch
#1195 removing obsolete ports docreation and datasetid from hbase mapred import, removing references to those ports in workflow.xml files, updating transformer by removing filtering by datasetid due to decisions made in #1072
#1176 introducing side products removal in common import by maintaining remove_sideproducts flag set to true by default. Notice: do not provide any output directory location pointing to workingDir subdirectory!
#1147 introducing HTML import and HTML plaintext ingestion in main workflows: primary and preprocessing
#1133 dropping unnecessary working_dir creation for java nodes
#1070 introducing support for multiple context identifiers, replacing import_project_concepts_context_id IIS input parameter with import_project_concepts_context_ids_csv
#1022 introducing PMC extracted document metadata collapser removing duplicates before sending output to PMC citation ingestion module
#919 enabling concepts matching for FET projects in mainworkflows: import, export, primary and preprocessing
#1017 accepting ExtractedDocumentMetadata instead of DocumentText at PMC citation ingestion input. Aligning integration test and importer workflow.
#1022 introducing extracted document metadata collapser at importing phase. Propagating extracted document metadata (including PMC ingested metadata) to processing part of workflow, where it can be exploited by citation matching module. Introducing citations collapser in last stage of processing phase, collapsing ingested citations with matched citations.
#1017 introducing new PMC metadata ingestion, currently extracting references, journal and pages fields. Replacing DOM/XPath based citations ingestion with much faster SAX version. Changing pmidtooaid transformer to utilize ExtractedDocumentMetadata instead of parsing XML file. Enabling PMC metadata ingestion in common/import.
#963 propagating dataset -> mdstore from import to exporting phase: importer produces DocumentToMDStore datastore utilized by exporter module. Updating transformer definition to handle DocumentToMDStore instead of Identifier schema
introducing support for active_existence_filter, set to true by default. Setting this parameter to false allows processing contents that have no counterpart among metadata records retrieved from HBase. This solution was required to e.g. process ubiquity contents which were not present in HBase dump metadata.
renaming metadataextraction_excluded_ids to more appropriate metadataextraction_excluded_checksums
#913 introducing support for max file size parameter, currently checked against Content-Length header
#757 hooking up ingest_pmc_idmapping_pmidtooaid subworkflow with mainworkflows/common/import. From now on citations are matched by pmid as well.
introducing merge_body_with_updates flag support in common/import, setting to true in statistics workflow
introducing regex support in result approver to support iis::* kind of provenance, updating workflow definitions with proper regex values
#840 moving IdentifierMapping from importer to common package
#840 renaming DeduplicationMapping to more generic IdentifierMapping
#757 aligning common importer with current API of PMC citations ingestion
introducing trust_level_threshold support in common import workflow
allowing overriding inference_provenance_blacklist default 'iis' value, which will be required in mainworkflows/statistics where inferred document to project relations should be taken into account
fixing output port names: removing default values for citation_pmc and dataset, setting proper output_citation_pmc in both preprocessing and primary workflows
skipping PMC citations ingestion when citationmatching algorithm is not enabled
integrating pmc citations ingestion with primary workflow, adjusting port names, deduplicating dependencies
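
Note on #1176: the entry warns against pointing any output directory at a workingDir subdirectory, presumably because side products removal would delete such an output together with the other intermediate data. A minimal sketch of a check that could be run before submitting a workflow, assuming HDFS-style POSIX paths. The helper name and the example paths are hypothetical and not part of IIS:

```python
from pathlib import PurePosixPath


def is_inside_working_dir(output_dir: str, working_dir: str) -> bool:
    """Return True if output_dir is working_dir itself or any location below it.

    Hypothetical helper, not part of IIS. Paths are compared lexically,
    so '..' segments, symlinks and URI schemes (hdfs://...) are not resolved.
    """
    out = PurePosixPath(output_dir)
    work = PurePosixPath(working_dir)
    # 'parents' lists every ancestor of 'out', from the nearest up to the root.
    return out == work or work in out.parents


# Example paths are made up for illustration.
assert is_inside_working_dir("/user/iis/working_dir/out", "/user/iis/working_dir")
assert not is_inside_working_dir("/user/iis/output", "/user/iis/working_dir")
```

Comparing whole path components rather than string prefixes keeps a sibling such as /user/iis/working_dir_old from being mistaken for a subdirectory.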