merging trunk changes with IIS-CDH-5.3.0 branch
renaming test resources to be compliant with windows file system naming requirements: replacing '|' with '_'
renaming test resources to be compliant with windows file system naming requirements
fixing destination id in expected citation record
updating fundingtree value to xml representation and changing expected fundingclass as outcome
#1498 adding missing propagate configuration element
#1498 adding missing collapsers_basic_collapser in imports.txt file
#1498 introducing major citations related refactoring including new generic direct citation matching moved to processing phase, introduced position field in all citations schemas and updated collapser taking position into account when merging citations details coming from 3 various sources: fuzzy citationmatching, direct citationmatching, references metadata
updating job.properties
#1397 removing obsolete parameters in subworkflow actions definitions
#1209 introducing support for trust level thresholds provided as IIS input parameter
removing obsolete quick run workflows
#1212 updating classification test expected results after fixing typo: dccclasses->ddcclasses in taxonomies.db
#1383 replacing explicitly defined test cluster properties with init-test-cluster-config maven profile usage
updating properties
#1381 porting pmc citations ingestion from cascading framework to pig. Moving code from icm-iis-ingest-pmc to icm-iis-transformers including integration tests, removing obsolete scala code along with unneeded dependencies. Switching subworkflow in primary workflow.
updating expected classes, setting acm classes
updating expected classes
#1339 fixing input_dedup_map in pmc citation ingestion when match_content_with_metadata=false. It should be set statically rather than dynamically; it will be enabled only when metadata_import is enabled
#1339 replacing active_existence_filter flag with match_content_with_metadata and changing identifiers matching logic: when flag is disabled, content identifiers will be neither filtered nor deduplicated against metadata identifiers. Up until now, when active_existence_filter flag was disabled, contents were deduplicated, which is not desired when running IIS in standalone mode on contents having their representatives in HBase
#1329 enabling pmc ingestion when active_metadataextraction_export flag is enabled
introducing missing pdb reference extraction parameters
bugfix: renaming obsolete decision-export to decision-export-to-hbase
#1308 reverting uri:oozie:distcp-action:0.2 change: version is not properly recognized by oozie 3.3.2-cdh4.3.1
#1260 enabling document to protein databank reference extraction in primary workflow, supporting 3 new parameters: active_referenceextraction_pdb, export_action_set_id_document_pdb, export_referenceextraction_pdb_url_root
#1315 providing missing confidenceLevel
#1315 updating expected jsons in integration test after DocumentToConceptIds schema refactoring
#1301 introducing explicit export mode flags: active_export_to_hbase and active_export_to_json. This way both exports can be enabled or both of them can be disabled.
#1308 switching distcp namespace to uri:oozie:distcp-action:0.2
reverting 37153 rev change by removing oozie-sharelib-distcp dependency from pom.xml file and relying on oozie.use.system.libpath=true set among job.properties
removing log.txt
disabling provided scope for hbase-client dependency
adding oozie-sharelib-distcp dependency missing in cdh5
disabling export by setting active_export flag to false. Results will be converted to JSON records
#1301 introducing common/export_to_json and utilizing this subworkflow in both primary and preprocessing workflows executing it when active_export=false which means hbase export is disabled
#118 explicitly defining input_document_websiteusage_similarity parameter. This is not a bug fix because exporter works properly without explicitly defining input port due to propagate-configuration mode, but we should have all input port definitions aligned to avoid confusion.
bugfix: adding missing removal of existing fault directory
upgrading xmlns version to 0.4 in order to support global element
setting remove_sideproducts to false, otherwise whole workingDir will be erased
changing home dir to /mnt/tmp
updating deploy.info with new IIS test cluster parameters
#1257 raising oozie.action.max.output.data to 8192
#1257 dropping schema generation related hacks in all map-reduce modules, switching to literal schema parameters
updating job.properties, disabling all algorithms by default
#1248 fixing transition node name postprocessing-joining to merge-joining
#1248 introducing fault subdirectory support in all workflows wrapping metadataextraction subworkflow up to the processing and primary root workflows. This should prevent fault directory from being removed when ${remove_sideproducts} flag is enabled, it will be propagated along with metadata and plaintext.
Removing usage of working_dir from Java workflow node.
updating README file
#1187 moving changelog contents to redmine wiki
#1198 aligning IIS dependencies and java code to CDH5.3.0 cluster
#1197 introducing job.properties changes aligning paths to rumcajs cluster HDFS structure
creating IIS-CDH-5.3.0 branch
reenabling document to project reference import validation
updating expected documents list
temporarily skipping docproject validation
#1195 removing obsolete ports docreation and datasetid from hbase mapred import, removing references to those ports in workflow.xml files, updating transformer by removing filtering by datasetid due to decisions made in #1072
fixing json escape character by putting \\ in place of \
extending mapreduce metadata importer test with validating import of different kinds of relations and dataset identifier
removing obsolete citations
updating confidence level value to 1.0 for record coming from PMC
removing obsolete pdf directory
adding missing "confidenceLevel" field
#1187 introducing IIS changelog
maintaining pmc citation and testing citations merging process
reintroducing multiple citations after introducing sorting in transformer
limiting citations count to 1 until results order produced by citation matching module is repeatable
including: FET project reference extraction, EGI case, dataset reference extraction outcome validation
enabling citation matching algorithm
updating expected citations
removing comment
primary processing integration test major refactoring: dropping cermine execution and providing plaintext and extracted metadata as json records
#1176 defining remove_sideproducts property in workflows headers
#1176 introducing side products removal in common import by maintaining remove_sideproducts flag set to true by default. Notice: do not provide any output directory location pointing to workingDir subdirectory!
removing duplicate collapser import and aligning workflow definition
#1172 introducing support for active_export parameter in both preprocessing and primary workflows
#1153 utilizing ${user.name} placeholder in ${workingDir} generation process, copying version.properties from oozie_app to mark execution environment with application version
#1147 introducing HTML import and HTML plaintext ingestion in main workflows: primary and preprocessing
#1147 renaming icm-iis-ingest-webcrawl module to icm-iis-ingest to make it more generic so it could contain not only webcrawl related ingesters but html ingesters as well
overriding memory parameter due to test cluster memory limitations, setting it to Xmx256m
overriding memory parameter due to test cluster memory limitations, setting it to Xmx128m
overriding memory parameter due to test cluster memory limitations, setting it to Xmx512m
updating expected classes in integration test after recent #720 change and fixing confidence level distribution