Open Science is gradually becoming the modus operandi in research practices, affecting the way researchers
collaborate and publish, discover, and access scientific knowledge.
Scientists are increasingly publishing research results beyond the article, to share all scientific
products (metadata and files) generated during an experiment, such as datasets, software, and experiments.
They publish in scholarly communication data sources (e.g. institutional repositories, data archives,
software repositories), rely where possible on persistent identifiers (e.g. DOI, ORCID, Grid.ac, PDBs),
specify semantic links to other research products (e.g. supplementedBy, citedBy, versionOf), and possibly
to projects and/or the related funders.
By following such practices, scientists are implicitly constructing the Global Open Science Graph, where
by "graph" we mean a collection of objects interlinked by semantic relationships.
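As an illustration of this definition only, a graph of this kind can be sketched as a set of identified objects plus typed links between them. The class names, identifiers, and methods below are hypothetical and are not taken from the OpenAIRE data model; the relation name "supplementedBy" is one of the examples mentioned above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    source: str    # identifier of the source object
    relation: str  # semantic relationship, e.g. "supplementedBy"
    target: str    # identifier of the target object


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # identifier -> metadata
    links: set = field(default_factory=set)    # set of Link objects

    def add_node(self, node_id, **metadata):
        self.nodes[node_id] = metadata

    def add_link(self, source, relation, target):
        # Both endpoints must already be present in the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before linking")
        self.links.add(Link(source, relation, target))

    def related(self, node_id, relation):
        # Targets reachable from node_id through the given relation.
        return sorted(
            link.target
            for link in self.links
            if link.source == node_id and link.relation == relation
        )


# Hypothetical identifiers, for illustration only.
graph = Graph()
graph.add_node("pub:1", type="publication")
graph.add_node("data:1", type="dataset")
graph.add_link("pub:1", "supplementedBy", "data:1")
```

Here `graph.related("pub:1", "supplementedBy")` returns the dataset linked to the publication, which is the kind of traversal a set of semantic links makes possible.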
The OpenAIRE Research Graph includes metadata and links between scientific products (e.g. literature,
datasets, software, and "other research products"), organizations, funders, funding streams, projects,
communities, and (provenance) data sources. The details of the graph data model can be found
on Zenodo.
The Graph is obtained as an aggregation of the metadata and links collected from ~70,000
trusted sources, further enriched with metadata and links provided by:
OpenAIRE collects metadata records from more than 70,000 scholarly communication sources from all over the world, including Open Access institutional repositories, data archives, and journals. All the metadata records (i.e. descriptions of research products) are put together in a data lake, along with records from Crossref, Unpaywall, ORCID, Grid.ac, and information about projects provided by national and international funders. Dedicated inference algorithms, applied to the metadata and to the full texts of Open Access publications, enrich the content of the data lake with links between research results and projects, author affiliations, subject classifications, and links to entries from domain-specific databases. Duplicated organisations and results are identified and merged to obtain an open, trusted, public resource enabling explorations of the scholarly communication landscape like never before.
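The merging of duplicated results can be illustrated with a deliberately simplified sketch. It assumes that two records are duplicates when they share the same DOI after normalisation; this is an assumption made for the example, not a description of the actual OpenAIRE de-duplication algorithm, and the record fields and source names are hypothetical.

```python
from collections import defaultdict


def normalise_doi(doi):
    # DOIs are case-insensitive; also drop surrounding whitespace.
    return doi.strip().lower()


def merge_duplicates(records):
    # Group records that share the same normalised DOI.
    groups = defaultdict(list)
    for record in records:
        groups[normalise_doi(record["doi"])].append(record)

    merged = []
    for doi, group in groups.items():
        merged.append({
            "doi": doi,
            # Keep the title of the first record seen (a simplification).
            "title": group[0]["title"],
            # Remember every source that contributed to the merged record.
            "sources": sorted({r["source"] for r in group}),
        })
    return merged


# Hypothetical records, for illustration only.
records = [
    {"doi": "10.1234/ABC", "title": "A study", "source": "repository-a"},
    {"doi": "10.1234/abc ", "title": "A study", "source": "crossref"},
    {"doi": "10.5678/xyz", "title": "Another study", "source": "repository-b"},
]
merged = merge_duplicates(records)
```

The merged record keeps track of the sources it came from, so provenance is not lost when duplicates are collapsed.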
The aggregation processes run continuously and apply the vocabularies as they are at a given
moment in time.
A vocabulary may change after the aggregation of a data source has finished, in which case the
aggregated content does not reflect the current status of the controlled vocabularies.
In addition, the integration of ScholeXplorer and DOIBoost, and some enrichment processes applied
to the raw and to the de-duplicated graph, may introduce values that do not comply with the
current status of the OpenAIRE controlled vocabularies.
For these reasons, we included a final cleansing step at the end of the workflow materialisation.
The output of the final cleansing step is the final version of the OpenAIRE Research Graph.
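A cleansing step of this kind can be sketched as follows. The vocabulary, the field name, and the fallback term are hypothetical and chosen only for the example; the sketch assumes a vocabulary is a mapping from known synonyms to accepted terms, which is not necessarily how the OpenAIRE vocabularies are represented.

```python
# Hypothetical controlled vocabulary: synonym (lower case) -> accepted term.
ACCESS_VOCABULARY = {
    "open access": "OPEN",
    "open": "OPEN",
    "restricted": "RESTRICTED",
    "closed access": "CLOSED",
}
FALLBACK_TERM = "UNKNOWN"  # hypothetical term for unmappable values


def cleanse(records, field_name, vocabulary, fallback):
    valid_terms = set(vocabulary.values())
    cleaned = []
    for record in records:
        value = record.get(field_name) or ""
        if value in valid_terms:
            # Already compliant with the current vocabulary.
            term = value
        else:
            # Map known synonyms; anything else becomes the fallback term.
            term = vocabulary.get(value.strip().lower(), fallback)
        # Return a copy so the input records are left untouched.
        cleaned.append({**record, field_name: term})
    return cleaned


# Hypothetical records, for illustration only.
records = [
    {"id": "r1", "access": "Open Access"},
    {"id": "r2", "access": "RESTRICTED"},
    {"id": "r3", "access": "embargo-2017"},
]
cleaned = cleanse(records, "access", ACCESS_VOCABULARY, FALLBACK_TERM)
```

Running the step last, on the fully enriched and de-duplicated records, means that values introduced by any earlier stage are checked against the same, current version of the vocabulary.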
The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party applications and organizations, such as:
The OpenAIRE Research Graph is also processed by a pipeline that extracts the statistics and produces the charts for funders, research initiatives, infrastructures, and policy makers that you can see on MONITOR. Based on the information available in the graph, OpenAIRE provides a set of indicators for monitoring funding and research impact and the uptake of Open Science publishing practices, such as Open Access publishing of publications and datasets, availability of interlinks between research products, availability of post-print versions in institutional or thematic Open Access repositories, etc.
The OpenAIRE Research Graph runs on a wide variety of hardware and software. As of December 2019, the hardware infrastructure is the following: