OpenAIRE core data model
The main entities of the OpenAIRE information space are: datasets, publications, persons, organisations, funders, funding streams, projects, and data sources.
Figure 1 OpenAIRE Data Model: core entities and relationships.
In our reasoning we generalize the concepts of dataset and publication to that of project result, so that further kinds of research output can be included. OpenAIRE initially proposes two kinds of results: datasets (e.g., experimental data, software products) and publications; others (e.g., patents) can be added in the future. In addition, every project result is associated with one or more instances, since different “physical representations” of the same result may exist. For example, the same publication may be kept in two different repositories, each exposing the payload file (e.g., PDF) at a different internet location (URL). Moreover, an instance of a result is represented as a combination of one or more web resources, each relative to a sub-part of the result, and of the internet data sources from which those resources are made available.
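The result/instance/web-resource structure described above can be sketched as plain data classes. This is only an illustration, not the OpenAIRE schema: the class names, field names, repository names, and URLs are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WebResource:
    """One sub-part of a result (e.g., the PDF payload) at an internet location."""
    url: str


@dataclass
class Instance:
    """A "physical representation" of a result, as exposed by one data source."""
    data_source: str  # hypothetical: name of the hosting data source
    resources: List[WebResource] = field(default_factory=list)


@dataclass
class Result:
    """A project result: a publication or a dataset (other kinds may be added)."""
    title: str
    kind: str  # "publication" or "dataset"
    instances: List[Instance] = field(default_factory=list)


# The same publication kept in two repositories, each exposing the PDF at its own URL.
paper = Result(
    title="An example publication",
    kind="publication",
    instances=[
        Instance("Repository A", [WebResource("https://repo-a.example.org/paper.pdf")]),
        Instance("Repository B", [WebResource("https://repo-b.example.org/files/42.pdf")]),
    ],
)

# All internet locations at which this one result can be obtained.
locations = [res.url for inst in paper.instances for res in inst.resources]
```

Keeping instances as a list on the result, rather than as separate results, reflects the point that the two repository copies are the same publication.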
Similarly, the notion of authors of publications or datasets is extended to that of persons, so that the same set also includes people connected to project funding or to organizations. For example, “authorship” relationships between results and persons represent the fact that a given person has (co-)authored a given result while being affiliated with a given organization.
Organizations include companies, research centers, and institutions that are involved as project partners or are responsible for operating data sources. Information about organizations will initially be collected from CORDA and CRIS systems, as it relates to projects, or be entered by users, for example to complete authorship information in the database.
Of crucial interest to OpenAIRE is also the identification of the funders (e.g., European Commission, Wellcome Trust, FCT Portugal, NWO The Netherlands) that co-funded the projects that led to a given result. A funder can be associated with a list of funding streams (e.g., FP7 for the EC), which identify the strands of funding that the funder offers. Funding streams can be nested to form a tree of sub-funding streams, and projects are typically associated with the funding stream “leaves” of such trees.
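As a rough sketch of the funder, funding stream, and project relationships described above, the nesting can be modeled as a tree whose leaves carry the projects. The class names, the sub-stream names "Area-X" and "Topic-Y", and the project acronym "DEMO" are hypothetical placeholders, not real OpenAIRE or FP7 identifiers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FundingStream:
    name: str
    # The parent link is excluded from repr/eq to avoid cycles through children.
    parent: Optional["FundingStream"] = field(default=None, repr=False, compare=False)
    children: List["FundingStream"] = field(default_factory=list)

    def add_child(self, name: str) -> "FundingStream":
        """Create a sub-funding stream nested under this one."""
        child = FundingStream(name, parent=self)
        self.children.append(child)
        return child

    def is_leaf(self) -> bool:
        return not self.children

    def path(self) -> List[str]:
        """Names from the root funding stream down to this one."""
        names = []
        node: Optional[FundingStream] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


@dataclass
class Funder:
    name: str
    streams: List[FundingStream] = field(default_factory=list)


@dataclass
class Project:
    acronym: str
    funding_stream: FundingStream  # typically a leaf of the tree


ec = Funder("European Commission")
fp7 = FundingStream("FP7")
ec.streams.append(fp7)
area = fp7.add_child("Area-X")      # hypothetical sub-funding stream
topic = area.add_child("Topic-Y")   # hypothetical leaf
project = Project("DEMO", topic)    # hypothetical project
```

Walking `path()` from a project's leaf back to the root recovers the full funding context of that project.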
Finally, OpenAIRE entity instances are created from data collected from data sources of different kinds, such as publication repositories, dataset archives, and CRIS systems. Data sources export information packages (e.g., XML records, HTTP responses, RDF data, JSON) that may contain information on one or more such entities and possibly on relationships between them. Once a piece of information is extracted from such a package and inserted into the information space as an entity, it is important that it keeps provenance information about the originating data source. This gives visibility to the data source, and also makes it possible to reconstruct the very same piece of information if problems arise.
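The provenance requirement can be illustrated with a small sketch in which every entity extracted from an information package is stamped with its origin. The JSON package layout, the `record_id` field, and the function name `extract_entities` are assumptions made for this example; real packages may be XML or RDF and are structured differently.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Provenance:
    data_source: str  # the originating data source
    record_id: str    # hypothetical: identifier of the package within that source


@dataclass
class Entity:
    kind: str         # e.g., "publication", "person"
    properties: dict
    provenance: Provenance


def extract_entities(package: str, data_source: str) -> List[Entity]:
    """Split one JSON information package into entities, stamping each with its origin."""
    record = json.loads(package)
    origin = Provenance(data_source, record["id"])
    return [Entity(e["kind"], e["properties"], origin) for e in record["entities"]]


# A made-up package describing a publication and one of its authors.
package = json.dumps({
    "id": "rec-001",
    "entities": [
        {"kind": "publication", "properties": {"title": "An example publication"}},
        {"kind": "person", "properties": {"name": "A. Author"}},
    ],
})

entities = extract_entities(package, "Repository A")
```

Because each entity carries the data source and record identifier, the originating package can be located and re-processed if the entity ever needs to be rebuilt.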
Figure 2 OpenAIRE Data Model: core entities and provenance information.
OpenAIRE and the CERIF semantic layer
For more information, see the CERIF web site: http://www.eurocris.org
According to CERIF's data model vision, (i) the “horizontal” classification of entities (e.g., by vocabularies of terms) is not modeled through properties bound to given controlled vocabularies, and (ii) semantic relationships between entities are not modeled by adding dedicated relationships. In both cases, CERIF introduces a flexible modeling mechanism that allows classification semantics to be injected into “semantics-agnostic” entities and relationships. The mechanism is obtained by introducing two entities, Schemes and Classes, such that:
- Class: A Class represents one term of a classification (e.g., a vocabulary or a taxonomy) under a given Scheme. It is characterized by the following properties: a Code, which represents the persistent identifier associated with the term (e.g., real-world classifications, such as ISO vocabularies for countries, have a standard identification code for terms), a name, an acronym, a description, a StartDate, and an EndDate.
- Scheme: A Scheme identifies the existence of a classification scheme, which is modeled as a set of Class objects. A Scheme is characterized by the following properties: a Code, which represents the persistent identifier associated with the Scheme (e.g., real-world schemes, such as taxonomies, may have a standard identification code), a name, an acronym, a description, a StartDate, and an EndDate.
According to CERIF's definition, Classes and Schemes can themselves be interlinked to form arbitrarily complex lattices of Classes and Schemes, respectively.
In OpenAIRE we adopt a lighter interpretation: we introduce the Scheme/Class pair whenever we need a property of type Qualifier, i.e., a property whose value comes from a controlled vocabulary, or a relationship between core entities in the model. This mechanism makes it possible to inject relationship semantics and vocabularies into the data model flexibly.
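A minimal sketch of the lighter Scheme/Class interpretation follows. The property names mirror those listed above (Code, name, acronym, description, StartDate, EndDate), but the Python classes, the `qualify` helper, and the scheme code "iso3166-1-alpha2" are assumptions chosen for illustration and are not taken from the OpenAIRE implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Class:
    """One term of a classification under a given Scheme."""
    code: str
    name: str
    acronym: Optional[str] = None
    description: Optional[str] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None


@dataclass
class Scheme:
    """A classification scheme, modeled as a set of Class objects keyed by code."""
    code: str
    name: str
    classes: Dict[str, Class] = field(default_factory=dict)

    def add(self, cls: Class) -> None:
        self.classes[cls.code] = cls


@dataclass(frozen=True)
class Qualifier:
    """A property value drawn from a controlled vocabulary: a (Scheme, Class) pair."""
    scheme_code: str
    class_code: str


def qualify(scheme: Scheme, class_code: str) -> Qualifier:
    """Build a Qualifier, rejecting terms that the scheme does not define."""
    if class_code not in scheme.classes:
        raise ValueError(f"{class_code!r} is not a term of scheme {scheme.code!r}")
    return Qualifier(scheme.code, class_code)


# A country vocabulary; the scheme code is a hypothetical label.
countries = Scheme(code="iso3166-1-alpha2", name="Countries")
countries.add(Class("IT", "Italy"))
countries.add(Class("PT", "Portugal"))

country = qualify(countries, "PT")
```

Storing only the two codes in the Qualifier keeps the entity itself semantics-agnostic: the meaning of "PT" lives in the Scheme, not in the entity that uses it.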
OpenAIRE entities, relationships and types
The entities in the data model belong to the following categories:
- Core entities: the entities whose information is continuously and incrementally fed to the information space and is of interest to OpenAIRE end-users; namely Result (Publication and Dataset), Person, Organization, DataSource (Repository, Dataset Archive, CRIS, Aggregator, Entity Registry), Project, Funder, Funding Stream.
- Linking entities: entities that model relationships by connecting two or more core entities in a semantics-agnostic way; namely, those denoted by the Entity1_Entity2 notation (see the CERIF semantic layer above).
- Types: types define structured values for entity properties. Structured values do not correspond to objects, i.e., they have no identity and cannot be shared by different objects.
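The three categories can be contrasted in a short sketch. Core entities carry an identifier, a linking entity named with the Entity1_Entity2 notation connects two of them and takes its meaning from a Scheme/Class pair, and a type is a plain value compared by content. The name `Person_Result`, the identifiers, and the scheme and class codes are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Qualifier:
    """A type: a structured value with no identity, compared by content."""
    scheme_code: str
    class_code: str


@dataclass
class Person:
    """A core entity: identified by its id."""
    id: str
    name: str


@dataclass
class Result:
    """A core entity: identified by its id."""
    id: str
    title: str


@dataclass
class Person_Result:
    """A linking entity: semantics-agnostic until qualified by a Scheme/Class pair."""
    person_id: str
    result_id: str
    semantics: Qualifier


author = Person(id="person-1", name="A. Author")                # hypothetical ids
paper = Result(id="result-1", title="An example publication")

# The relationship semantics are injected via a hypothetical scheme and class code.
authorship = Person_Result(
    person_id=author.id,
    result_id=paper.id,
    semantics=Qualifier("person-result-relations", "isAuthorOf"),
)

# Two structured values with the same content are interchangeable.
same_value = Qualifier("person-result-relations", "isAuthorOf") == authorship.semantics
```

Adding a new kind of relationship between persons and results then only requires a new Class in the scheme, not a new linking entity.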