Subsystems » History » Version 8
Paolo Manghi, 04/05/2015 12:43 PM
h1. OpenAIRE infrastructure sub-systems

The OpenAIRE infrastructure features a number of sub-systems, dedicated to four main activities:

* [[aggregation_subsystem|Aggregation sub-system]]: collects [[information_package|information packages]] and publication texts (e.g. PDF, XML, HTML files) from data sources; depending on the type of each package (e.g. Dublin Core, DataCite, or CERIF-XML metadata records, or proprietary formats), the system transforms it into a "cleaned" metadata record with uniform structure and semantics, matching the specification of the OpenAIRE data model;
* [[deduplication_subsystem|De-duplication sub-system]]: given as input the native information space graph (as generated by the data provision sub-system), the system identifies duplicates among objects of the same entity type; for each entity, it generates a set of similarity relationships between pairs of objects identified as duplicates, which the data provision sub-system then uses to generate a disambiguated information space;
* [[informationinference_subsystem|Information inference sub-system]]: given as input the latest public information space graph (disambiguated and enriched by inference in the previous round) and the full texts of publications, the system applies a number of mining algorithms ("modules"); for each mining module, the system produces a set of inferred information (called an ActionSet), which the data provision sub-system uses to enrich the information space graph;
* [[dataprovision_subsystem|Data provision sub-system]]: given as input the cleaned metadata records yielded by the aggregation sub-system, the similarity relationships most recently yielded by the de-duplication sub-system, and the inference ActionSets most recently yielded by the information inference sub-system, the data provision sub-system:
*# populates an initial bare-aggregation information space graph,
*# enriches the graph with similarity relationships and runs an object merging algorithm to remove duplicates,
*# enriches the graph with inferred information,
*# instantiates (publishes) the graph over three back-ends serving different use cases: a full-text index, an OAI-PMH publisher, and a PostgreSQL statistics database (a LOD back-end is being developed in OpenAIRE2020).
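
The first three data provision steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the OpenAIRE implementation: the function names, the dictionary-based graph, the @merged_from@ and @inferred@ keys, and the union-find merging strategy are all hypothetical choices made for the example. Publishing to the back-ends (step 4) is left out.

```python
from collections import defaultdict


def merge_duplicates(objects, similarity_pairs):
    """Group objects linked by similarity pairs and merge each group.

    objects: dict mapping object id -> dict of attributes
    similarity_pairs: iterable of (id_a, id_b) pairs flagged as duplicates
    Returns a dict mapping representative id -> merged attributes.
    (Hypothetical helper: union-find is an assumption, not the OpenAIRE algorithm.)
    """
    parent = {oid: oid for oid in objects}

    def find(oid):
        # Follow parent links to the root, shortening the path on the way.
        while parent[oid] != oid:
            parent[oid] = parent[parent[oid]]
            oid = parent[oid]
        return oid

    for a, b in similarity_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as root so the result is deterministic.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep

    groups = defaultdict(list)
    for oid in objects:
        groups[find(oid)].append(oid)

    merged = {}
    for root, members in groups.items():
        attrs = {}
        for oid in sorted(members):
            for key, value in objects[oid].items():
                attrs.setdefault(key, value)  # first value seen for a key wins
        attrs["merged_from"] = sorted(members)  # hypothetical bookkeeping field
        merged[root] = attrs
    return merged


def apply_action_sets(graph, action_sets):
    """Attach inferred information to the graph, one ActionSet per mining module.

    action_sets: dict mapping module name -> list of (object id, key, value).
    Actions that target an id absent from the graph are skipped.
    """
    for module_name, actions in action_sets.items():
        for oid, key, value in actions:
            if oid in graph:
                graph[oid].setdefault("inferred", []).append(
                    {"key": key, "value": value, "module": module_name}
                )
    return graph


def provision(records, similarity_pairs, action_sets):
    # Step 1: bare-aggregation graph built from the cleaned records.
    graph = {r["id"]: {k: v for k, v in r.items() if k != "id"} for r in records}
    # Step 2: merge objects connected by similarity relationships.
    graph = merge_duplicates(graph, similarity_pairs)
    # Step 3: enrich the graph with inferred information.
    return apply_action_sets(graph, action_sets)
```

Treating similarity relationships as edges and merging each connected group is one simple way to obtain a single object per group of duplicates; the actual merging algorithm may differ.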

p=. !{width:80%}openaire_infra.png!
_Figure 1 OpenAIRE2020 infrastructure sub-systems._

The three sub-systems [[aggregation_subsystem|Aggregation sub-system]], [[deduplication_subsystem|De-duplication sub-system]], and [[dataprovision_subsystem|Data provision sub-system]] contribute to the population and enrichment of the OpenAIRE information space graph. In particular, they bring in:

* objects of different entities collected from data sources (originally collected as information packages);
* new objects ("representative objects") obtained by merging duplicate objects collected from data sources;
* new objects, relationships, or attributes inferred by mining PDFs of publications or by mining the information space graph itself.

The OpenAIRE data model entities and their properties are therefore extended with [[type_provenance|Provenance]] fields, which record the origin of each piece of information and the process that brought it into the graph.
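
As a rough illustration of what such provenance information could look like, the sketch below attaches a provenance record to each graph object and distinguishes the three origins listed above (collected, merged, inferred). The class names, field names, and origin labels are assumptions made for this example; the actual fields are those defined on the Provenance page.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Provenance:
    # All field names are hypothetical, chosen for illustration only.
    origin: str                   # "collected", "merged" or "inferred"
    source: Optional[str] = None  # data source id, or mining module name
    derived_from: List[str] = field(default_factory=list)  # ids of input objects


@dataclass
class GraphObject:
    object_id: str
    entity_type: str
    provenance: Provenance


def describe(obj: GraphObject) -> str:
    """Return a one-line, human-readable account of where an object came from."""
    p = obj.provenance
    if p.origin == "collected":
        return f"{obj.object_id}: collected from {p.source}"
    if p.origin == "merged":
        return f"{obj.object_id}: representative of {', '.join(p.derived_from)}"
    if p.origin == "inferred":
        return f"{obj.object_id}: inferred by {p.source}"
    raise ValueError(f"unknown origin: {p.origin}")
```

For example, a representative object built from two duplicates would carry the ids of both inputs in @derived_from@, so a consumer of the graph can trace it back to the collected records.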