Aggregation subsystem » History » Version 2

Paolo Manghi, 13/05/2015 03:09 PM

h1. Aggregation subsystem

As shown in Figure 1, the OpenAIRE infrastructure collects objects from a set of data sources of different typologies, e.g. repositories, CRISs, dataset archives, aggregators, entity registries, and journals. All of these contain information about different interrelated objects of the OpenAIRE data model. In particular, data sources of the same typology may deliver metadata records that contain information about different entities. For example, an OpenAIRE-compliant repository delivers information packages (i.e., metadata records) that contain information about the publication result, the persons who created the result, the projects funding it, and the instances of the result, whereas a DRIVER-compliant repository does not provide information about project entities.

In the following we shall call original entities the entities collected from data sources, hence from “authoritative” providers of data, into the Information Space. The layer of original entities includes entities and the relationships between them as collected from the different data sources (i.e., mapped from their original structure onto that of the OpenAIRE data model). The layer is “stateless”, in the sense that entities have “reproducible” identifiers derived by combining the identifiers of the entities in the original data sources with a data source identifier assigned by OpenAIRE. In other words, if the same entity or relationship is collected more than once from the same data source, it is transparently overwritten.

p=. !{width:65%}informationPackages_Objects.jpg!
Figure 1 – Entity layers: native objects
With respect to data population, i.e. the process of collecting data from data sources and mapping them onto the information space, this section introduces the notions of:

* Information packages and population workflows: entity ingestion from data sources into the information space;
* How to assign stateless (and permanent) identifiers to the entities when they enter the information space.

For each of these aspects, we shall provide an explanation of the problem, an extension of the data model to handle the problem, and, where necessary, a solution to the problem using the updated model.
h2. Information packages

In OpenAIRE, entities are collected from external data sources in the form of “information packages”. This notion generalizes the OpenAIRE scenario of importing bibliographic metadata records from repository data sources to other data source typologies and other types of (primary) entities. In particular, we shall call information package a file in some interpretable format (e.g. XML) that contains the identifier and the information (e.g., properties) relative to one entity of a given entity type, called the primary entity. An information package may also contain information (but not necessarily the identifier) relative to other entities (likely of different entity types), called sub-entities, which must be directly or indirectly associated with the primary entity of the package. Figure 7 shows an example of an information package whose primary entity is 1: for example, an information package from OpenDOAR describes a repository data source and can be identified by the corresponding OpenDOAR identifier. Its sub-entities are those from 2 to 6: for example, an OpenDOAR package also contains information about the organization responsible for the repository data source.
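As a rough illustration of the distinction between a primary entity and its sub-entities, the following Python sketch parses a made-up XML package. The element names (@package@, @primary@, @sub@, @property@), the attribute names, and the sample identifier are assumptions invented for this example; they are not the structures defined by the OpenAIRE guidelines.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical package layout: the element and attribute names below are
# invented for illustration and are not taken from the OpenAIRE guidelines.
SAMPLE_PACKAGE = """
<package>
  <primary type="datasource" id="opendoar:1234">
    <property name="officialname">Example Repository</property>
  </primary>
  <sub type="organization">
    <property name="legalname">Example University</property>
  </sub>
</package>
"""


@dataclass
class Entity:
    entity_type: str
    identifier: Optional[str]  # sub-entities may come without an identifier
    properties: Dict[str, str] = field(default_factory=dict)


def _read_entity(element: ET.Element) -> Entity:
    """Collect the type, the optional identifier and the properties of one entity."""
    props = {
        p.get("name", ""): (p.text or "").strip()
        for p in element.findall("property")
    }
    return Entity(element.get("type", ""), element.get("id"), props)


def parse_package(xml_text: str) -> Tuple[Entity, List[Entity]]:
    """Return the primary entity and the list of sub-entities of a package."""
    root = ET.fromstring(xml_text.strip())
    primary_el = root.find("primary")
    if primary_el is None or primary_el.get("id") is None:
        raise ValueError("a package needs an identified primary entity")
    return _read_entity(primary_el), [_read_entity(el) for el in root.findall("sub")]
```

Calling @parse_package(SAMPLE_PACKAGE)@ yields a data source primary entity carrying its identifier and one organization sub-entity whose identifier is @None@, mirroring the OpenDOAR example in the text.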
h2. Population Workflows

Original entities are collected from information packages originating from various data sources. We call population workflow the process that takes an information package from a data source, extracts its primary entity and its related entities, and stores them in the OpenAIRE information space. A workflow therefore depends on:

* The data source typology (including the expert-validated entity pool);
* The access method, namely (i) the protocol required to get the data (e.g., OAI-PMH, JDBC, FTP) and (ii) the corresponding access configuration (e.g., entry point, parameters);
* The primary entity type of the information packages;
* The XML structure of the information packages at hand, which depends on the primary entity type.
Note that data sources of the same typology may deliver information packages relative to different primary entity types (in general we can assume they will do so from different access points). For example, CRIS systems may expose both publication and project primary entities through OAI-PMH.

Information package structure (OpenAIRE guidelines). The OpenAIRE infrastructure will include services capable of handling the automated collection of entities from data sources according to given population workflows. To this aim, the OpenAIRE guidelines will describe which XML information package structure should be expected for each population workflow triple available to the system:

<datasource typology, access method, primary entity type> → XML information package structure

WP6 will develop services to automatically process the information packages and insert the corresponding entities into the information space.

Information package heterogeneity and harmonization. Unfortunately, the “raw” information packages exported by data sources will likely not match the information package structures identified in the previous step. For example, CRIS systems generally support OAI-PMH harvesting, but may export information packages relative to the same entities (e.g., projects, publications) in different XML formats. To address this, WP6 will update its transformation services to map the specific structures exposed by a data source through a given workflow onto the expected information package structure.
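The mapping from workflow triples to expected package structures, and the harmonization of “raw” packages, could be sketched as follows. This is only an illustrative Python sketch: the registry contents, the structure names, and the idea of harmonizing by renaming fields are assumptions made for the example, not a description of the WP6 services.

```python
from typing import Dict, NamedTuple


class WorkflowKey(NamedTuple):
    """The population workflow triple described in the text."""
    datasource_typology: str
    access_method: str
    primary_entity_type: str


# Placeholder registry: the structure names are hypothetical and do not
# come from the OpenAIRE guidelines.
EXPECTED_STRUCTURE: Dict[WorkflowKey, str] = {
    WorkflowKey("repository", "OAI-PMH", "publication"): "publication-package",
    WorkflowKey("cris", "OAI-PMH", "publication"): "publication-package",
    WorkflowKey("cris", "OAI-PMH", "project"): "project-package",
}


def expected_structure(key: WorkflowKey) -> str:
    """Look up the package structure expected for a workflow triple."""
    try:
        return EXPECTED_STRUCTURE[key]
    except KeyError:
        raise LookupError(f"no population workflow registered for {key}") from None


def harmonize(raw_record: Dict[str, str], field_mapping: Dict[str, str]) -> Dict[str, str]:
    """Rename data-source-specific fields to the names of the expected structure.

    Fields without a mapping are dropped; this is a simplification chosen
    for the sketch.
    """
    return {
        field_mapping[name]: value
        for name, value in raw_record.items()
        if name in field_mapping
    }
```

Two CRIS systems exporting projects in different XML formats would share the same workflow triple, and hence the same expected structure, but each would need its own @field_mapping@.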
p=. !{width:65%}informationPackages_Ingestion.jpg!
Figure 2 – Information packages: ingestion workflows (AM = Access Method, DS = Data Source typology, PET = Primary Entity Type, F = information package Format). Data sources of the same typology export primary entities of the same type through different “raw” information package formats.
h2. Identity of original entities

Original entities reach the information space through different workflows. Once they enter the information space, they must be assigned a unique “stateless” identifier. The data sources of such entities are not under the control of the OpenAIRE infrastructure and may at any moment delete, update, or add entities or relationships between them. Hence, it is particularly important that such identifiers are generated from the incoming information packages in a stateless and stable way, that is: “if the same entity enters the information space at different times, it will be assigned the same identifier”.

To this aim, the OpenAIRE infrastructure constructs identifiers for the primary entity and the sub-entities of an information package by combining three levels of scope: data sources, their primary entities, and the sub-entities of such primary entities. More specifically:

* Infrastructure scope: all data sources are registered and assigned a unique identifier in OpenAIRE;
* Data source scope: each information package from a data source contains one primary entity with an identifier that is unique in the context of that data source;
* Primary entity scope: information packages may contain a number of sub-entities relative to the primary entity; unlike primary entities, sub-entities do not necessarily come with an identifier (data source scope), but can generally be uniquely identified within the scope of the primary entity based on their properties. The process of identifying such “unique information” depends heavily on the given information package structure.
Primary entity identifiers. The generation of stateless identifiers for primary entities is based on a data source scope strategy. Independently of the workflow, the type of entity, and the kind of data source, primary entity identifiers are always obtained by concatenating the name space prefix of the data source with the primary entity identifier (using an underscore):

nameSpacePrefix_mainEntityID

One might consider an infrastructure scope strategy, which assumes that all primary entity identifiers are persistent identifiers and therefore unique across data sources. In OpenAIRE this is not generally the case, hence we adopt a common and safe identifier generation strategy.
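The concatenation rule, together with the “transparent override” behaviour it enables, can be sketched in a few lines of Python. The in-memory dictionary standing in for the information space, the function names, and the sample prefix and identifier are all assumptions made for illustration.

```python
from typing import Dict

# Stand-in for the information space: a plain dictionary keyed by identifier.
# This is an assumption of the sketch, not the actual OpenAIRE storage layer.
information_space: Dict[str, dict] = {}


def primary_entity_id(namespace_prefix: str, main_entity_id: str) -> str:
    """Build nameSpacePrefix_mainEntityID as described in the text."""
    return f"{namespace_prefix}_{main_entity_id}"


def store_primary_entity(namespace_prefix: str, main_entity_id: str, properties: dict) -> str:
    """Store an entity under its reproducible identifier.

    Because the identifier depends only on the data source prefix and the
    original identifier, collecting the same entity again replaces the
    previous copy instead of creating a duplicate.
    """
    entity_id = primary_entity_id(namespace_prefix, main_entity_id)
    information_space[entity_id] = dict(properties)
    return entity_id
```

Storing the same record twice from the same data source leaves a single entry holding the most recent properties, whereas the same original identifier coming from a different data source yields a distinct entry.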
Sub-entity identifiers. Identifiers can be assigned to sub-entities following different strategies, some more “optimistic” and some more “pessimistic” about the ability to infer unambiguous “unique information” for sub-entities from their properties in the information package. For example, one may adopt:

* Infrastructure scope strategy: sub-entities with the same “unique information” are collapsed into the same entity across different data sources (infrastructure scope). Entity splitting will identify and resolve possible entity “overloads” in a second stage.
<pre>
uniqueInformation
</pre>
* Data source scope strategy: sub-entities with the same “unique information” are collapsed into the same entity, but only within the same data source scope. Entity splitting will identify and resolve possible entity “overloads” in a second stage.
<pre>
nameSpacePrefix_uniqueInformation
</pre>
* Primary entity scope strategy: sub-entities with the same “unique information” are collapsed into the same entity, but only within the same primary entity scope (see Figure 3). De-duplication of entities will resolve redundancy in a second stage.
<pre>
nameSpacePrefix_mainEntityID_uniqueInformation
</pre>
The strategy used to assign identifiers to sub-entities depends on the specific workflow, and hence on the structure of the corresponding information packages.
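The three identifier shapes listed above can be compared side by side with a small Python sketch. How the “unique information” is derived is specific to each information package structure; here it is assumed, purely for illustration, to be an MD5 digest of a few normalized property values. The names @Scope@, @unique_information@ and @sub_entity_id@ are hypothetical.

```python
import hashlib
from enum import Enum
from typing import Dict, Sequence


class Scope(Enum):
    INFRASTRUCTURE = "infrastructure"
    DATA_SOURCE = "data_source"
    PRIMARY_ENTITY = "primary_entity"


def unique_information(properties: Dict[str, str], keys: Sequence[str]) -> str:
    """Derive the "unique information" of a sub-entity from chosen properties.

    Assumption of this sketch: values are lower-cased, whitespace is
    collapsed, and the result is hashed to obtain a compact token.
    """
    normalized = "|".join(
        " ".join(str(properties.get(key, "")).lower().split()) for key in keys
    )
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def sub_entity_id(scope: Scope, unique_info: str,
                  namespace_prefix: str, main_entity_id: str) -> str:
    """Compose a sub-entity identifier according to the chosen scope strategy."""
    if scope is Scope.INFRASTRUCTURE:
        return unique_info
    if scope is Scope.DATA_SOURCE:
        return f"{namespace_prefix}_{unique_info}"
    if scope is Scope.PRIMARY_ENTITY:
        return f"{namespace_prefix}_{main_entity_id}_{unique_info}"
    raise ValueError(f"unknown scope: {scope}")
```

Under these assumptions, a person appearing in two records of the same data source receives one identifier with the data source scope strategy and two identifiers with the primary entity scope strategy, which is why the former relies on later entity splitting and the latter on later de-duplication.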
p=. !{width:65%}identifiers_primaryObjectScope.jpg!
Figure 3 – Assigning unique identifiers to sub-entities: primary entity scope
p=. !{width:65%}identifiers_dataSourceScope.jpg!
Figure 4 – Assigning unique identifiers to sub-entities: data source scope