#919 renaming DocumentToResearchInitiative to DocumentToConceptId and DocumentToResearchInitiatives to DocumentToConceptIds
#1022 introducing PMC extracted document metadata collapser removing duplicates before sending output to PMC citation ingestion module
#919 adding missing i/o ports related to FET projects reference extraction
#919 enabling concepts matching for FET projects in mainworkflows: import, export, primary and preprocessing
#1017 accepting ExtractedDocumentMetadata instead of DocumentText at PMC citation ingestion input. Aligning integration test and importer workflow.
#1022 introducing extracted document metadata collapser at importing phase. Propagating extracted document metadata (including PMC ingested metadata) to processing part of workflow, which can be exploited by citation matching module. Introducing citations collapser in last stage of processing phase, collapsing ingested citations with matched citations.
#1017 introducing new PMC metadata ingestion, currently extracting references, journal and pages fields. Replacing DOM/XPath based citations ingestion with much faster SAX version. Changing pmidtooaid transformer to utilize ExtractedDocumentMetadata instead of parsing XML file. Enabling PMC metadata ingestion in common/import.
#963 propagating dataset -> mdstore from import to exporting phase: importer produces DocumentToMDStore datastore utilized by exporter module. Updating transformer definition to handle DocumentToMDStore instead of Identifier schema
introducing separate citations json containing expected results, not enabled in workflow yet
updating job.properties
updating job.properties: adding metadataextraction_excluded_checksums=4f5cc34f137de4dc89766a9366ca66de,6495a568200b1cee40baa00072b1800a
introducing support for active_existence_filter, set to true by default. Setting this parameter to false allows processing contents that have no counterpart among metadata records retrieved from HBase. This solution was required to e.g. process ubiquity contents which were not present in HBase dump metadata.
fixing citations schema type
renaming metadataextraction_excluded_ids to more appropriate metadataextraction_excluded_checksums
#913 introducing support for max file size parameter, currently checked against Content-Length header
enabling document classification and research initiatives reference extraction algorithms
#757 hooking up ingest_pmc_idmapping_pmidtooaid subworkflow with mainworkflows/common/import. From now on citations are matched by pmid as well.
#883 providing blacklisted_objectstores_csv input parameter set to $UNDEFINED$ value by default
input port name fix: input_citation->input_citations
introducing merge_body_with_updates flag support in common/import, setting to true in statistics workflow
introducing regex support in result approver to support iis::* kind of provenance, updating workflow definitions with proper regex values
#840 moving IdentifierMapping from importer to common package
#840 renaming DeduplicationMapping to more generic IdentifierMapping
#757 aligning common importer with current API of PMC citations ingestion
disabling workflow tests
#637 renaming document_extractedMetadata algorithm to more descriptive document_affiliations, propagating changes to action set identifier properties names
removing extracted_metadata.json which will not be checked anymore
reenabling PMC ingestion when citationmatching flag is set
updating job properties
skipping extracted_metadata comparison which is cumbersome due to frequent changes and large volume of references
introducing newly added address field in json record
fixing field names after recent Affiliation.avdl refactoring and adding countryCode field, renaming country to countryName
setting export_action_set_id_entity_dataset to $UNDEFINED$ by default, this should not be required because dataset reference extraction module might be deactivated. The check will be performed at dataset entity exporter module, and an exception will be raised when the value is not set.
#757 temporarily disabling PMC ingestion until fixing openaire identifiers building process
#568, #577 enabling proper citations export by introducing PMC citation ingestion and citation matching outcome merging and grouping for exporting purposes. Introducing union instead of collapser which should be introduced in near future.
updating expected output
removing output_citation_pmc port duplicate
updating performance test
moving ACM importer to icm-iis-mainworkflows due to extending dependencies with cermine, introducing performance tests
removing common import input parameters which are not required in this context
introducing trust_level_threshold support in statistics workflow
introducing trust_level_threshold support in common import workflow
providing default value for action_set_id_entity_dataset set to $UNDEFINED$. This change is required when exporting in statistics export mode where no entities are exported and such parameter should not be required.
introducing dedicated statistics mainworkflow encapsulating importing, processing and exporting phases. This workflow was introduced explicitly for statistics purposes because we want to operate over InformationSpace imported data, in contrast to primary workflow where some of the statistics input was inferred and it wasn't clear whether it would become part of InformationSpace.
allowing overriding inference_provenance_blacklist default 'iis' value, which will be required in mainworkflows/statistics where inferred document to project relations should be taken into account
setting default $undefined$ value for 'input_aux_dataset_existing_id'
setting default undefined values for 'mdstore_service_location' and 'dataset_mdstore_ids_csv'
#9059 reverting #717 change: shortening app_path for primary workflow due to the fix applied by Paweł on WF_JOBS MODIFY mysql table: changing varchar(255) to mediumtext.
updating expected record content
fixing placeholder name
#717 shortening app_path for primary workflow
fixing output port names: removing default values for citation_pmc and dataset, setting proper output_citation_pmc in both preprocessing and primary workflows
#717 shortening app_path for preprocessing workflow and subworkflows
#712 introducing plaintext caching
shortening node names
updating workingDir for generating empty outputs: removing import_dataset part
updating expected extracted metadata
Fixing names of parameters accepted by workflow nodes
skipping PMC citations ingestion when citationmatching algorithm is not enabled
shortening transformer_export_documentto* action names to be less than 50 characters
#354 hooking up primary/main workflow with documenttodataset and documenttoproject transformers skipping export of already existing relations in HBase
updating default job.properties
#486 fixing integration test: introducing missing document_text_wos input port for primary/processing
#486 introducing last piece missing: text collapser in front of referenceextraction_researchinitiatives joining text contents coming from already existing document_text input port and newly introduced document_text_wos input port providing WoS contents
#486 bugfix: reordering existence filter with id replacer: we need to update identifiers first, then update existence filter
integrating pmc citations ingestion with primary workflow, adjusting port names, deduplicating dependencies
renaming input ports from input_citation to input_citations to be aligned with exporter subworkflow
skipping exporting citation matching outcome
updating expected references output for doc=id-3
fixing affiliations and positions in authors details
fixing HBase model json representation to be compliant with most recent dnet-openaire-data-protos:3.0.0-SNAPSHOT model: complex relation identifiers, dataInfo on fields level etc
introducing additional logging
setting excluded_ids to undefined value