merging trunk changes with IIS-CDH-5.3.0 branch
#1498 adding missing position field
#1498 removing obsolete ingest pmc citation resources
#1498 introducing major citation-related refactoring: new generic direct citation matching moved to processing phase, position field introduced in all citation schemas, collapser updated to take position into account when merging citation details coming from 3 sources: fuzzy citationmatching, direct citationmatching, references metadata
Initial commit - settings and project files added to svn:ignore
#1422 fixing Java Heap Space error while executing checksum postprocessing workflow on pmc plaintexts
#1381 porting pmc citations ingestion from cascading framework to pig. Moving code from icm-iis-ingest-pmc to icm-iis-transformers including integration tests, removing obsolete scala code along with unneeded dependencies. Switching subworkflow in primary workflow.
expecting null affiliations instead of empty array
adding missing affiliations field in input data, removing duplicates from output
adding missing affiliations field in integration test expected output
#1315 propagating confidenceLevel to DocumentToConceptIds. Updating PIG transformer script by introducing concept identifiers deduplication UDF function picking record with the highest confidence level, introducing unit and integration tests. Propagating changes in document to concepts exporter module.
removing obsolete test resources
#1329 adding affiliations field in ExtractedDocumentMetadata PMC schema. Metadata extraction code refactoring by extracting code responsible for building Affiliation avro records to AffiliationBuilder class and sharing it with pmc ingestion. Implementing affiliations ingestion functionality in PmcXmlHandler covered with unit tests. Adding affiliations field support in ingest pmc metadata transformer.
#1306 introducing dummy field in DocumentId schema required to work around https://issues.apache.org/jira/browse/PIG-3358 issue. Handling dummy field in transformer pig scripts where required. Should be reverted as soon as PIG-3358 issue is fixed
#1312 wrapping tuple schema returned by outputSchema() method as described in PIG-3082
removing oozie-sharelib-distcp dependency from pom.xml file and relying on oozie.use.system.libpath=true set in job.properties
replacing icm-iis-3rdparty-pig-avrostorage dependency with original piggybank
#1301 skipping transformation when input set to $UNDEFINED$ value
#1301 removing redundant schema parameter
#1301 introducing generic avro to json transformer
bugfix: adding missing start element
#1257 raising oozie.action.max.output.data to 8192
#1257 dropping schema generation related hacks in all PIG modules, switching to literal schema parameters
#1135 switching icm-iis-parent-container version to 1.0.1-SNAPSHOT in order to include workingDir related changes made in icm-iis-core
Removing usage of working_dir from Java workflow node.
#1210 introducing generic PIG module filtering inferred data by confidence level
#1198 aligning IIS dependencies and java code to CDH5.3.0 cluster
#1197 introducing job.properties changes aligning paths to rumcajs cluster HDFS structure
creating IIS-CDH-5.3.0 branch
introducing branches folder
#1195 removing obsolete ports docreation and datasetid from hbase mapred import, removing references to those ports in workflow.xml files, updating transformer by removing filtering by datasetid due to decisions made in #1072
introducing repeatable ordering of citations by sorting them by citation rawText
#1169 fixing duplicate context issue, introducing integration test proving implemented solution works properly
simplifying schema related PIG parameters
#1147 introducing union4 pig script
#1133 dropping useless working_dir creation for java nodes
#1133 dropping useless working_dir creation for pig nodes
#1038 introducing ranges in dependencies definition for all IIS modules
#118 introducing website usage community filter filtering out publication identifiers based on ids set retrieved from InformationSpace. This is required to exclude removed publications which were still present in logs.
#118 removing obsolete and duplicate transformer
updating job.properties
[maven-release-plugin] prepare for next development iteration
[maven-release-plugin] copy for tag icm-iis-transformers-1.0.0
[maven-release-plugin] prepare release icm-iis-transformers-1.0.0
#1044 pre-release switching to released version of parent pom and released dependencies
introducing scm definition
#919 renaming DocumentToResearchInitiative to DocumentToConceptId and DocumentToResearchInitiatives to DocumentToConceptIds
#1019 introducing integration test
#919 introducing integration test input and output
#919 introducing integration test containing empty input and output
#919 introducing project to concept transformer module
#1019 introducing PIG module transforming pmc ingested metadata into common extracted document metadata
#963 propagating dataset -> mdstore from import to exporting phase: importer produces DocumentToMDStore datastore utilized by exporter module. Updating transformer definition to handle DocumentToMDStore instead of Identifier schema
introducing embedded integration test entry
#913 renaming DocumentContentUrl#contentSize to DocumentContentUrl#contentSizeKB, changing field type from int to long, importing content size from ObjectStoreFile#fileSizeKB, updating dnet-objectstore-rmi dependency from 1.0.0 to 2.0.1-SNAPSHOT
#913 supplementing json files with newly introduced DocumentContentUrl#contentSize field value set to null
#913 introducing DocumentContentUrl#contentSize field, handling it properly in all PIG transformers
#840 moving IdentifierMapping from importer to common package
#840 renaming DeduplicationMapping to more generic IdentifierMapping
introducing cloudera repository in parent container, removing repository definitions from individual IIS modules
adding missing affiliation fields: countryCode, address, renaming country to countryName
created tag folder for release
#757 introducing doitooaid transformer processing DocumentMetadata datastore holding metadata imported from InformationSpace and creating datastore holding <doi,oaid> pairs which will be used by pmc ingestor for matching references identified by doi
null reference ids removed
updating default job.properties
removing memory-related properties; fixing #757 should solve all memory-related problems
#568 introducing citations grouping by sourceDocumentId, still to be adjusted for ingested pmc citations outcome, which currently seems to hang
#577 introducing UDF producing empty map, two transformers building common Citation datastore from citationmatching and pmc ingestion outcome. Both are required by collapser.
introducing importer/plaintext/skip_extracted transformer required for plaintext import caching
#354 removing obsolete transformers/export/person transformer along with tests
#354 removing obsolete transformers/export/inferenced_document_without_imported_data transformer along with tests
#354 removing obsolete transformers/export/identifier/referenceddatasets transformer along with tests
#354 removing obsolete transformers/export/identifier/documents transformer along with tests
#354 removing obsolete transformers/export/document transformer along with tests
replacing redundant transformers/ingest/pmc/citations with already existing transformers/importer/documentmetadata/idextractor
adding missing "confidenceLevel" field
introducing deploy.info file for module icm-iis-transformers