# Date Author Comment
38122 08/07/2015 05:36 PM Marek Horst

updating job.properties

38032 30/06/2015 06:49 PM Marek Horst

updating job.properties

38007 29/06/2015 03:14 PM Marek Horst

#1397 removing obsolete parameters in subworkflow actions definitions

37976 26/06/2015 05:48 PM Marek Horst

#1209 introducing support for trust level thresholds provided as IIS input parameter

37972 26/06/2015 04:05 PM Marek Horst

removing obsolete quick run workflows

37883 19/06/2015 04:35 PM Marek Horst

updating job.properties

37873 19/06/2015 02:06 PM Marek Horst

#1381 porting pmc citations ingestion from cascading framework to pig. Moving code from icm-iis-ingest-pmc to icm-iis-transformers including integration tests, removing obsolete scala code along with unneeded dependencies. Switching subworkflow in primary workflow.

37872 19/06/2015 01:48 PM Marek Horst

updating job.properties

37585 29/05/2015 04:17 PM Marek Horst

#1339 fixing input_dedup_map in pmc citation ingestion when match_content_with_metadata=false. Should not be set dynamically but statically, it will be enabled only when metadata_import is enabled

37561 29/05/2015 02:27 PM Marek Horst

#1339 replacing active_existence_filter flag with match_content_with_metadata and changing identifiers matching logic: when the flag is disabled, content identifiers will be neither filtered nor deduplicated against metadata identifiers. Up until now, when the active_existence_filter flag was disabled, contents were deduplicated, which is not desired when running IIS in standalone mode on contents having their representatives in HBase

37533 28/05/2015 04:16 PM Marek Horst

#1329 enabling pmc ingestion when active_metadataextraction_export flag is enabled

37469 26/05/2015 10:25 AM Marek Horst

bugfix: renaming obsolete decision-export to decision-export-to-hbase

37464 25/05/2015 10:13 PM Marek Horst

#1308 reverting uri:oozie:distcp-action:0.2 change: version is not properly recognized by oozie 3.3.2-cdh4.3.1

37432 25/05/2015 01:15 PM Marek Horst

#1260 enabling document to protein databank reference extraction in primary workflow, supporting 3 new parameters: active_referenceextraction_pdb, export_action_set_id_document_pdb, export_referenceextraction_pdb_url_root

37231 14/05/2015 11:56 AM Marek Horst

#1301 introducing explicit export mode flags: active_export_to_hbase and active_export_to_json. This way both exports can be enabled or both of them can be disabled.
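
The effect of the two flags can be sketched as follows. This is only an illustration, not the workflow definition itself: the helper names `parse_flag` and `enabled_exports`, the target labels "hbase" and "json", and the assumption that a missing flag counts as disabled are all hypothetical.

```python
def parse_flag(value):
    """Interpret a job property value as a boolean (assumed 'true'/'false' strings)."""
    return str(value).strip().lower() == "true"


def enabled_exports(properties):
    """Return the export targets switched on by the two independent flags.

    Assumption: a missing flag is treated as disabled.
    """
    targets = []
    if parse_flag(properties.get("active_export_to_hbase", "false")):
        targets.append("hbase")
    if parse_flag(properties.get("active_export_to_json", "false")):
        targets.append("json")
    return targets
```

Because the flags are independent, all four combinations are possible, including running both exports or none.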

37194 13/05/2015 12:49 PM Marek Horst

#1308 switching distcp namespace to uri:oozie:distcp-action:0.2

37095 11/05/2015 11:46 AM Marek Horst

disabling export by setting active_export flag to false. Results will be converted to JSON records

37026 07/05/2015 01:27 PM Marek Horst

#1301 introducing common/export_to_json and utilizing this subworkflow in both primary and preprocessing workflows executing it when active_export=false which means hbase export is disabled

36989 06/05/2015 04:43 PM Marek Horst

#118 explicitly defining input_document_websiteusage_similarity parameter. This is not a bug fix because the exporter works properly without an explicitly defined input port due to propagate-configuration mode, but we should have all input port definitions aligned to avoid confusion.

36473 20/04/2015 12:53 PM Marek Horst

bugfixing existing fault removal which was missing

36469 20/04/2015 10:53 AM Marek Horst

bugfixing existing fault removal which was missing

36443 17/04/2015 02:56 PM Marek Horst

setting remove_sideproducts to false, otherwise the whole workingDir will be erased

36286 09/04/2015 07:10 PM Marek Horst

#1257 dropping schema generation related hacks in all map-reduce modules, switching to literal schema parameters

35985 03/04/2015 01:11 PM Marek Horst

updating job.properties, disabling all algorithms by default

35967 02/04/2015 10:17 PM Marek Horst

#1248 fixing transition node name postprocessing-joining to merge-joining

35946 02/04/2015 05:52 PM Marek Horst

#1248 introducing fault subdirectory support in all workflows wrapping metadataextraction subworkflow up to the processing and primary root workflows. This should prevent fault directory from being removed when ${remove_sideproducts} flag is enabled, it will be propagated along with metadata and plaintext.

35935 02/04/2015 03:59 PM Marek Horst

updating job.properties

35701 27/03/2015 06:18 AM Mateusz Kobos

Removing usage of working_dir from Java workflow node.

35229 11/03/2015 01:14 PM Marek Horst

#1195 removing obsolete ports docreation and datasetid from hbase mapred import, removing references to those ports in workflow.xml files, updating transformer by removing filtering by datasetid due to decisions made in #1072

35057 04/03/2015 05:30 PM Marek Horst

#1176 defining remove_sideproducts property in workflows headers

35048 04/03/2015 04:44 PM Marek Horst

#1176 introducing side products removal in common import by maintaining remove_sideproducts flag set to true by default.
Notice: do not provide any output directory location pointing to workingDir subdirectory!

35042 04/03/2015 03:01 PM Marek Horst

removing duplicate collapser import and aligning workflow definition

35031 04/03/2015 01:08 PM Marek Horst

#1172 introducing support for active_export parameter in both preprocessing and primary workflows

35030 04/03/2015 12:16 PM Marek Horst

updating job.properties

34958 02/03/2015 05:21 PM Marek Horst

#1153 utilizing ${user.name} placeholder in ${workingDir} generation process, copying version.properties from oozie_app to mark execution environment with application version

34914 27/02/2015 07:34 PM Marek Horst

#1147 introducing HTML import and HTML plaintext ingestion in main workflows: primary and preprocessing

34893 27/02/2015 05:32 PM Marek Horst

updating job.properties

34702 20/02/2015 07:17 PM Marek Horst

#1133 dropping useless working_dir creation for java nodes

34574 18/02/2015 03:49 PM Marek Horst

#118 fixing typos

34572 18/02/2015 03:32 PM Marek Horst

updating job.properties

34563 18/02/2015 01:26 PM Marek Horst

#118 introducing website usage analysis as integral part of primary workflow

34535 16/02/2015 06:52 PM Marek Horst

updating job.properties

34533 16/02/2015 06:35 PM Marek Horst

#118 propagating configuration in main workflow.xml

34530 16/02/2015 05:57 PM Marek Horst

#118 updating job.properties

34519 13/02/2015 07:00 PM Marek Horst

comments added

34516 13/02/2015 05:55 PM Marek Horst

#118 introducing mainworkflows_websiteusage_document_main workflow binding all subworkflows required to process logs and generate document similarities

34434 11/02/2015 02:26 PM Marek Horst

#1083 enabling webcrawl ingester module extracting FX field from plaintext before executing project reference extraction

34433 11/02/2015 02:26 PM Marek Horst

updating default job properties

34213 02/02/2015 06:22 PM Marek Horst

#1070 updating import_project_concepts_context_ids_csv default value to "fet-fp7,fet-h2020"

34212 02/02/2015 06:21 PM Marek Horst

#1070 introducing support for multiple context identifiers, replacing import_project_concepts_context_id IIS input parameter with import_project_concepts_context_ids_csv
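
A comma-separated parameter such as import_project_concepts_context_ids_csv could be split into individual context identifiers along these lines. The function name `parse_context_ids` is hypothetical, and trimming whitespace and skipping empty entries are assumptions rather than documented IIS behaviour.

```python
def parse_context_ids(csv_value):
    """Split a CSV parameter value into a list of context identifiers.

    Assumption: surrounding whitespace is ignored and empty entries are skipped.
    """
    return [part.strip() for part in csv_value.split(",") if part.strip()]
```

With the value "fet-fp7,fet-h2020" this yields two identifiers, fet-fp7 and fet-h2020.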

34207 02/02/2015 05:39 PM Marek Horst

updating job.properties

34008 20/01/2015 04:34 PM Marek Horst

#1072 transition fix: replacing forking_skip_imported_data with export

34007 20/01/2015 03:56 PM Marek Horst

#1072 dropping IIS feature filtering out already existing project and dataset references from IIS export

33398 15/12/2014 12:25 PM Marek Horst

updating job.properties

33355 11/12/2014 08:36 PM Marek Horst

updating job.properties

33249 09/12/2014 06:41 PM Marek Horst

#919 renaming DocumentToResearchInitiative to DocumentToConceptId and DocumentToResearchInitiatives to DocumentToConceptIds

33228 09/12/2014 11:02 AM Marek Horst

#1022 introducing PMC extracted document metadata collapser removing duplicates before sending output to PMC citation ingestion module

33184 04/12/2014 04:09 PM Marek Horst

#919 enabling concepts matching for FET projects in mainworkflows: import, export, primary and preprocessing

33105 28/11/2014 06:13 PM Marek Horst

#1017 accepting ExtractedDocumentMetadata instead of DocumentText at PMC citation ingestion input. Aligning integration test and importer workflow.

33098 28/11/2014 04:27 PM Marek Horst

#1022 introducing extracted document metadata collapser at importing phase.
Propagating extracted document metadata (including PMC ingested metadata) to processing part of workflow, which can be exploited by citation matching module.
Introducing citations collapser in last stage of processing phase collapsing ingested citations with matched citations.

32943 21/11/2014 05:50 PM Marek Horst

#1017 introducing new PMC metadata ingestion currently extracting references, journal and pages fields.
Replacing DOM/XPath based citations ingestion with much faster SAX version. Changing pmidtooaid transformer utilizing ExtractedDocumentMetadata instead of parsing XML file. Enabling PMC metadata ingestion in common/import.

32829 17/11/2014 03:45 PM Marek Horst

#963 propagating dataset -> mdstore from import to exporting phase: importer produces DocumentToMDStore datastore utilized by exporter module. Updating transformer definition to handle DocumentToMDStore instead of Identifier schema

32824 17/11/2014 03:42 PM Marek Horst

updating job.properties

32823 17/11/2014 03:42 PM Marek Horst

updating job.properties

32167 04/11/2014 02:04 PM Marek Horst

updating job.properties

32166 04/11/2014 02:01 PM Marek Horst

updating job.properties

32165 04/11/2014 02:01 PM Marek Horst

updating job.properties

32164 04/11/2014 02:00 PM Marek Horst

updating job.properties

32162 04/11/2014 01:44 PM Marek Horst

updating job.properties

32045 31/10/2014 02:59 PM Marek Horst

updating job.properties: adding metadataextraction_excluded_checksums=4f5cc34f137de4dc89766a9366ca66de,6495a568200b1cee40baa00072b1800a

32043 31/10/2014 02:45 PM Marek Horst

updating job.properties

32042 31/10/2014 02:45 PM Marek Horst

introducing support for active_existence_filter, set to true by default. Setting this parameter to false allows processing contents that have no counterpart among the metadata records retrieved from HBase. This solution was required to e.g. process ubiquity contents which were not present in HBase dump metadata.

31835 28/10/2014 02:24 PM Marek Horst

updating job.properties

31759 27/10/2014 06:20 PM Marek Horst

renaming metadataextraction_excluded_ids to more appropriate metadataextraction_excluded_checksums

31758 27/10/2014 06:11 PM Marek Horst

#913 introducing support for max file size parameter, currently checked against Content-Length header
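
A size check against the response header could look roughly like the sketch below. The function name `exceeds_max_size`, the byte unit of the limit, and the decision to let through responses with a missing or malformed header are assumptions for illustration; the actual module may behave differently.

```python
def exceeds_max_size(headers, max_file_size_bytes):
    """Return True when the declared Content-Length is above the limit.

    Header names are compared case-insensitively. Assumption: a missing or
    unparsable header does not cause rejection.
    """
    for name, value in headers.items():
        if name.lower() == "content-length":
            try:
                return int(value) > max_file_size_bytes
            except (TypeError, ValueError):
                return False
    return False
```

Note that such a check relies on the size declared by the server, so a response without the header is not limited by it.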

31667 23/10/2014 04:13 PM Marek Horst

updating job.properties

31498 20/10/2014 06:03 PM Marek Horst

#757 hooking up ingest_pmc_idmapping_pmidtooaid subworkflow with mainworkflows/common/import. From now on citations are matched by pmid as well.

31434 17/10/2014 06:30 PM Marek Horst

updating job.properties

31428 17/10/2014 03:56 PM Marek Horst

#883 providing blacklisted_objectstores_csv input parameter set to $UNDEFINED$ value by default

31422 17/10/2014 12:54 PM Marek Horst

updating job.properties

31410 16/10/2014 05:48 PM Marek Horst

input port name fix: input_citation->input_citations

31267 10/10/2014 03:37 PM Marek Horst

introducing merge_body_with_updates flag support in common/import, setting to true in statistics workflow

31250 09/10/2014 03:33 PM Marek Horst

introducing regex support in result approver to support iis::* kind of provenance, updating workflow definitions with proper regex values
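
Matching an iis::* kind of provenance with a regular expression can be sketched as below. The function name `is_approved`, the sample provenance values and the use of a full match (rather than a partial one) are assumptions made for this example, not the approver's actual code.

```python
import re


def is_approved(provenance, pattern):
    """Return True when the whole provenance string matches the regex pattern."""
    return re.fullmatch(pattern, provenance) is not None


# Hypothetical pattern covering every provenance that starts with "iis::".
IIS_PROVENANCE_PATTERN = r"iis::.*"
```

For instance, is_approved("iis::some_algorithm", IIS_PROVENANCE_PATTERN) accepts the value, while a provenance with a different prefix is rejected.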

31228 08/10/2014 06:19 PM Marek Horst

#840 moving IdentifierMapping from importer to common package

31222 08/10/2014 06:12 PM Marek Horst

#840 renaming DeduplicationMapping to more generic IdentifierMapping

31216 08/10/2014 05:56 PM Marek Horst

#757 aligning common importer with current API of PMC citations ingestion

31154 06/10/2014 03:47 PM Marek Horst

#637 renaming document_extractedMetadata algorithm to more descriptive document_affiliations, propagating changes to action set identifier properties names

31033 02/10/2014 02:15 PM Marek Horst

reenabling PMC ingestion when citationmatching flag is set

30981 01/10/2014 06:22 PM Marek Horst

updating job properties

30006 04/09/2014 01:10 PM Marek Horst

setting export_action_set_id_entity_dataset to $UNDEFINED$ by default, this should not be required because dataset reference extraction module might be deactivated. Check will be performed at dataset entity exporter module and when value is not set - exception will be raised.

29982 03/09/2014 05:53 PM Marek Horst

#757 temporarily disabling PMC ingestion until fixing openaire identifiers building process

29967 03/09/2014 11:04 AM Marek Horst

#568, #577 enabling proper citations export by introducing PMC citation ingestion and citation matching outcome merging and grouping for exporting purposes. Introducing union instead of collapser which should be introduced in near future.

29893 28/08/2014 01:38 PM Marek Horst

removing output_citation_pmc port duplicate

29835 22/08/2014 05:38 PM Marek Horst

removing common import input parameters which are not required in this context

29827 22/08/2014 02:34 PM Marek Horst

introducing trust_level_threshold support in statistics workflow

29826 22/08/2014 02:27 PM Marek Horst

introducing trust_level_threshold support in common import workflow

29821 22/08/2014 01:13 PM Marek Horst

providing default value for action_set_id_entity_dataset set to $UNDEFINED$. This change is required when exporting in statistics export mode where no entities are exported and such parameter should not be required.

29819 22/08/2014 12:56 PM Marek Horst

introducing dedicated statistics mainworkflow encapsulating importing, processing and exporting phases. This workflow was introduced explicitly for statistics purposes because we want to operate over InformationSpace imported data, in contrast to the primary workflow, where some of the statistics input was inferred and it wasn't clear whether it would become part of InformationSpace.