Paolo Manghi, 27/04/2015 04:29 PM
OpenAIRE core data model
The main entities of the OpenAIRE information space are: datasets, publications, persons, organisations, funders, funding streams, projects, and data sources.
In our reasoning we generalize the concept of datasets and publications to that of project result, so as to be able to include further kinds of research outputs. OpenAIRE initially proposes two kinds of results: datasets (e.g., experimental data, software products) and publications, but others can be added in the future (e.g., patents). Besides, each project result is always associated with one or more instances, in the sense that different “physical representations” of the same result may exist. For example, the same publication may be kept in two different repositories, each exposing the payload file (e.g., PDF) at a different internet location (URL). Moreover, an instance of a result is represented as a combination of one or more web resources, corresponding to the sub-parts of the result, and of the internet data sources from which such resources are made available.
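The result, instance, and web resource structure described above can be sketched as follows. This is only an illustrative sketch, not the OpenAIRE schema: the class names, field names, and example URLs are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WebResource:
    # URL of one sub-part of the result (e.g., the PDF payload).
    url: str


@dataclass
class Instance:
    # One "physical representation" of a result, as exposed by a data source.
    data_source: str
    web_resources: List[WebResource] = field(default_factory=list)


@dataclass
class Result:
    title: str
    kind: str  # "publication" or "dataset"; further kinds may be added later
    instances: List[Instance] = field(default_factory=list)

    def locations(self) -> List[str]:
        """Return every URL at which some part of this result is available."""
        return [wr.url for inst in self.instances for wr in inst.web_resources]


# The same publication kept in two repositories (hypothetical URLs).
paper = Result(
    title="An example publication",
    kind="publication",
    instances=[
        Instance("Repository A", [WebResource("https://repo-a.example.org/paper.pdf")]),
        Instance("Repository B", [WebResource("https://repo-b.example.org/files/paper.pdf")]),
    ],
)
```

Calling `paper.locations()` lists both URLs, one per instance, while the publication itself remains a single result.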
Similarly, we extend the notion of authors of publications or datasets to that of persons, so as to include in the same set people connected to project funding or to organizations. For example, “authorship” relationships between results and persons represent the fact that a given person has (co-)authored a given result while being affiliated with a given organization.
Organizations include companies, research centers or institutions involved as project partners or responsible for operating data sources. Information about organizations will initially be collected from CORDA and CRIS systems, as related to projects, or be entered by users, for example to complete authorship information in the database.
Of crucial interest to OpenAIRE is also the identification of the funders (e.g., European Commission, Wellcome Trust, FCT Portugal, NWO The Netherlands) which co-funded the projects that have led to a given result. Funders can be associated with a list of funding streams (e.g., FP7 for the EC), which identify the strands of funding made available by the funder. Funding streams can be nested to form a tree of sub-funding streams, and projects are typically associated with the funding stream “leaves” of such trees.
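The nesting of funding streams can be illustrated with a small tree structure. This is a minimal sketch under stated assumptions: the class and method names are invented for the example, and the sub-stream names are hypothetical placeholders rather than real FP7 strands.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FundingStream:
    name: str
    children: List["FundingStream"] = field(default_factory=list)

    def add(self, name: str) -> "FundingStream":
        """Create a nested sub-funding stream and return it."""
        child = FundingStream(name)
        self.children.append(child)
        return child

    def leaves(self) -> List["FundingStream"]:
        """Return the leaf streams of the tree rooted at this stream."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


@dataclass
class Funder:
    name: str
    funding_streams: List[FundingStream] = field(default_factory=list)


@dataclass
class Project:
    title: str
    funding_stream: FundingStream  # typically a leaf of a funder's tree


fp7 = FundingStream("FP7")
stream_a = fp7.add("Sub-stream A")            # hypothetical name
stream_a1 = stream_a.add("Sub-stream A.1")    # hypothetical name
stream_b = fp7.add("Sub-stream B")            # hypothetical name
ec = Funder("European Commission", [fp7])
project = Project("Example project", funding_stream=stream_a1)
leaf_names = [s.name for s in fp7.leaves()]
```

Here the project points at a leaf of the tree, and the funder can be recovered by walking up from that leaf; storing a parent reference on each stream would be one way to make that lookup direct.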
Finally, OpenAIRE entity instances are created out of data collected from various data sources of different kinds, such as publication repositories, dataset archives, CRIS systems, etc. Data sources export information packages (e.g., XML records, HTTP responses, RDF data, JSON) that may contain information on one or more of such entities and possibly relationships between them. Once each piece of information is extracted from such packages and inserted into the information space as an entity, it is important that it keeps provenance information about the originating data source. This gives visibility to the data source, and also enables the reconstruction of the very same piece of information if problems arise.
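The provenance requirement can be sketched as follows: every entity extracted from an information package carries a reference to the data source and package it came from. The package layout, the field names, and the `extract_entities` helper are all assumptions made for illustration and do not reflect an actual OpenAIRE format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Provenance:
    data_source: str  # the originating data source
    package_id: str   # identifier of the information package (hypothetical field)


@dataclass
class Entity:
    entity_type: str
    properties: Dict[str, str]
    provenance: Provenance


def extract_entities(package: dict, data_source: str) -> List[Entity]:
    """Turn one already-parsed package into entities that remember their origin."""
    prov = Provenance(data_source=data_source, package_id=package["id"])
    return [Entity(e["type"], e["properties"], prov) for e in package["entities"]]


# A hypothetical package, e.g. parsed from a JSON record.
package = {
    "id": "record-1",
    "entities": [
        {"type": "publication", "properties": {"title": "An example publication"}},
        {"type": "person", "properties": {"name": "An example author"}},
    ],
}
entities = extract_entities(package, data_source="Repository A")
```

Because each entity keeps its `provenance`, the originating package can be located and re-processed if an entity needs to be rebuilt.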
The OpenAIRE data model is inspired by the CERIF data model. The following sections first describe CERIF's “semantic layer” mechanism and its meta-concepts, and then introduce a categorisation of the entities.
OpenAIRE relationships and the CERIF semantic layer
According to the CERIF semantic layer, (i) “horizontal” classification of entities (e.g., by vocabularies of terms) is not modeled through properties associated with given controlled vocabularies, and (ii) semantic relationships between entities are not modeled by adding dedicated relationships. In both cases, CERIF introduces a flexible modeling mechanism which allows injecting classification semantics into “semantics-agnostic” entities and relationships. The mechanism is obtained by introducing two entities, Schemes and Classes, such that:
- Class: A Class represents one term of a classification (e.g., a vocabulary or taxonomy) under a given Scheme. As such it is characterized by the following properties: a Code, which represents the persistent identifier associated with the term (e.g., real-world classifications, such as ISO vocabularies for countries, have a standard identification code for terms), a name, an acronym, a description, a StartDate, and an EndDate.
- Scheme: A Scheme identifies the existence of a classification scheme, which is modeled as a set of Class objects. A Scheme is characterized by the following properties: a Code, which represents the persistent identifier associated with the Scheme (e.g., real-world schemes, such as taxonomies, may have a standard identification code), a name, an acronym, a description, a StartDate, and an EndDate.
According to the CERIF interpretation, Classes and Schemes can themselves be interlinked to form arbitrarily complex lattices of Classes and Schemes, respectively. In OpenAIRE we adopt a lighter interpretation and exploit such mechanisms to dynamically inject relationship semantics and vocabularies into the data model.
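The Scheme and Class mechanism, in the lighter interpretation described above, can be sketched as follows. The property names follow the lists given earlier; everything else (the Python class layout, the `Relationship` type, and the scheme codes and identifiers used in the example) is an assumption made for illustration, not the actual CERIF or OpenAIRE definition.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class Class:
    code: str  # persistent identifier of the term
    name: str
    acronym: str = ""
    description: str = ""
    start_date: Optional[date] = None
    end_date: Optional[date] = None


@dataclass
class Scheme:
    code: str  # persistent identifier of the scheme
    name: str
    acronym: str = ""
    description: str = ""
    start_date: Optional[date] = None
    end_date: Optional[date] = None
    classes: Dict[str, Class] = field(default_factory=dict)

    def add(self, cls: Class) -> Class:
        """Register a term under this scheme, keyed by its code."""
        self.classes[cls.code] = cls
        return cls


@dataclass(frozen=True)
class Relationship:
    # A semantics-agnostic link: its meaning is injected through a Class.
    source_id: str
    target_id: str
    semantics: Class


# A vocabulary used for classification (the scheme code is hypothetical).
countries = Scheme(code="countries", name="Country codes")
italy = countries.add(Class(code="IT", name="Italy"))

# A vocabulary of relationship semantics (hypothetical scheme and term).
rel_types = Scheme(code="result-person-relations", name="Result/person relationships")
authorship = rel_types.add(Class(code="authorship", name="Authorship"))

link = Relationship(source_id="result-1", target_id="person-1", semantics=authorship)
```

Adding a new kind of relationship then only requires a new Class under the relevant Scheme, rather than a new dedicated relationship in the model.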
OpenAIRE classes of entities
The entities in the data model can be grouped in the following way:
- Main entities: the entities whose information is continuously and incrementally fed to the information space; namely Result (Publication and Dataset), Person, Organization, DataSource (Repository, Dataset Archive, CRIS, Aggregator, Entity Registry), and Project;
- Structural entities: the entities added to the model to represent complex information about an entity; namely Instances, WebResources, Titles, Dates, Identities, and Subjects;
- Static entities: entities whose content is inserted in the information space at some point in time; namely Funding, Class, and Scheme;
- Linked entities (CERIF notation): relationship entities, used to connect in a semantic-agnostic way two or more main entities; namely, those denoted by an Entity1_Entity2 notation.
OpenAIRE main entities
- Results
- Persons
- Projects