updating job.properties, disabling all algorithms by default
#1248 fixing transition node name postprocessing-joining to merge-joining
#1248 introducing fault subdirectory support in all workflows wrapping metadataextraction subworkflow, up to the processing and primary root workflows. This should prevent the fault directory from being removed when the ${remove_sideproducts} flag is enabled; it will be propagated along with metadata and plaintext.
updating job.properties
Removing usage of working_dir from Java workflow node.
updating README file
#1187 moving changelog contents to redmine wiki
#1198 aligning IIS dependencies and java code to CDH5.3.0 cluster
#1197 introducing job.properties changes aligning paths to rumcajs cluster HDFS structure
creating IIS-CDH-5.3.0 branch
reenabling document to project reference import validation
updating expected documents list
temporarily skipping docproject validation
#1195 removing obsolete ports docreation and datasetid from hbase mapred import, removing references to those ports in workflow.xml files, updating transformer by removing filtering by datasetid due to decisions made in #1072
fixing json escape character by putting \\ in place of \
extending mapreduce metadata importer test with validating import of different kinds of relations and dataset identifier
removing obsolete citations
updating confidence level value to 1.0 for records coming from PMC
removing obsolete pdf directory
adding missing "confidenceLevel" field
#1187 introducing IIS changelog
maintaining pmc citation and testing citations merging process
reintroducing multiple citations after introducing sorting in transformer
limiting citations count to 1 until the results order produced by citation matching module is repeatable
including: FET project reference extraction, EGI case, dataset reference extraction outcome validation
enabling citation matching algorithm
updating expected citations
removing comment
primary processing integration test major refactoring: dropping cermine execution and providing plaintext and extracted metadata as json records
#1176 defining remove_sideproducts property in workflows headers
#1176 introducing side products removal in common import by maintaining remove_sideproducts flag set to true by default. Notice: do not provide any output directory location pointing to workingDir subdirectory!
removing duplicate collapser import and aligning workflow definition
#1172 introducing support for active_export parameter in both preprocessing and primary workflows
#1153 utilizing ${user.name} placeholder in ${workingDir} generation process, copying version.properties from oozie_app to mark execution environment with application version
#1147 introducing HTML import and HTML plaintext ingestion in main workflows: primary and preprocessing
#1147 renaming icm-iis-ingest-webcrawl module to icm-iis-ingest to make it more generic so it could contain not only webcrawl related ingesters but html ingesters as well
overriding memory parameter due to test cluster memory limitations, setting it to Xmx256m
overriding memory parameter due to test cluster memory limitations, setting it to Xmx128m
overriding memory parameter due to test cluster memory limitations, setting it to Xmx512m
updating expected classes in integration test after recent #720 change and fixing confidence level distribution
overriding memory parameter due to test cluster memory limitations
#1133 dropping useless working_dir creation for java nodes
#1038 introducing ranges in dependencies definition for all IIS modules
#118 fixing typos
#118 introducing website usage analysis as integral part of primary workflow
#118 propagating configuration in main workflow.xml
introducing explicitly defined icm-iis-schemas SNAPSHOT dependency to prevent resolving earlier, released transitive version
#118 upgrading IIS dependencies to most recent snapshots
#118 updating job.properties
#118 introducing uoa-iis-websiteusage dependency in mainworkflows
comments added
#118 introducing mainworkflows_websiteusage_document_main workflow binding all subworkflows required to process logs and generate document similarities
#1083 enabling webcrawl ingester module extracting FX field from plaintext before executing project reference extraction
updating default job properties
#720 fixing document classification algorithm confidence level distribution, switching mainworkflows pom dependency to the fixed document classification snapshot
#1070 updating import_project_concepts_context_ids_csv default value to "fet-fp7,fet-h2020"
#1070 introducing support for multiple context identifiers, replacing import_project_concepts_context_id IIS input parameter with import_project_concepts_context_ids_csv
#1072 transition fix: replacing forking_skip_imported_data with export
#1072 dropping IIS feature filtering out already existing project and dataset references from IIS export
#1065 upgrading icm-iis-parent-container dependency from 1.0.0 to 1.0.1-SNAPSHOT after introducing FCT support
#1065 upgrading uoa-iis-referenceextraction dependency from 1.0.0 to 1.0.1-SNAPSHOT after introducing FCT support
[maven-release-plugin] prepare for next development iteration
[maven-release-plugin] copy for tag icm-iis-mainworkflows-1.0.0
[maven-release-plugin] prepare release icm-iis-mainworkflows-1.0.0
changing snapshot dependencies to released ones
#1044 upgrading dependencies to released versions and parent version to most recent snapshot for unreleased modules
introducing scm definition
#919 renaming DocumentToResearchInitiative to DocumentToConceptId and DocumentToResearchInitiatives to DocumentToConceptIds
#1022 introducing PMC extracted document metadata collapser removing duplicates before sending output to PMC citation ingestion module
#919 adding missing i/o ports related to FET projects reference extraction
#919 enabling concepts matching for FET projects in mainworkflows: import, export, primary and preprocessing
#1017 accepting ExtractedDocumentMetadata instead of DocumentText at PMC citation ingestion input. Aligning integration test and importer workflow.
#1022 introducing extracted document metadata collapser at importing phase. Propagating extracted document metadata (including PMC ingested metadata) to processing part of workflow, where it can be exploited by citation matching module. Introducing citations collapser in last stage of processing phase, collapsing ingested citations with matched citations.
#1017 introducing new PMC metadata ingestion, currently extracting references, journal and pages fields. Replacing DOM/XPath based citations ingestion with much faster SAX version. Changing pmidtooaid transformer to utilize ExtractedDocumentMetadata instead of parsing XML file. Enabling PMC metadata ingestion in common/import.
#963 propagating dataset -> mdstore from import to exporting phase: importer produces DocumentToMDStore datastore utilized by exporter module. Updating transformer definition to handle DocumentToMDStore instead of Identifier schema
introducing separate citations json containing expected results, not enabled in workflow yet
updating job.properties: adding metadataextraction_excluded_checksums=4f5cc34f137de4dc89766a9366ca66de,6495a568200b1cee40baa00072b1800a
introducing support for active_existence_filter, set to true by default. Setting this parameter to false allows processing contents that have no counterpart among metadata records retrieved from HBase. This solution was required to e.g. process ubiquity contents which were not present in HBase dump metadata.
fixing citations schema type
renaming metadataextraction_excluded_ids to more appropriate metadataextraction_excluded_checksums
#913 introducing support for max file size parameter, currently checked against Content-Length header