OpenAIRE Research Graph » History » Version 33

Alessia Bardi, 10/11/2021 02:56 PM
Moving to dedicated section (Deduplication)

h1. The OpenAIRE Research Graph

{{>toc}}

The OpenAIRE Research Graph is one of the largest open collections of scholarly records worldwide, and is key to fostering Open Science and establishing its practices in daily research activities.
Conceived as a public and transparent good, populated from data sources trusted by scientists, the Graph aims to bring discovery, monitoring, and assessment of science back into the hands of the scientific community.

Imagine a vast collection of research products, all linked together, contextualised and openly available. For the past ten years OpenAIRE has been working to gather this valuable record. It is a massive collection of metadata and links between scientific products (articles, datasets, software, and other research products) and entities such as organisations, funders, funding streams, projects, communities, and data sources.

As of today, the OpenAIRE Research Graph aggregates around 450 million metadata records, with their links, collected from about 10,000 data sources trusted by scientists, including:
* Repositories registered in OpenDOAR or re3data.org
* Open Access journals registered in DOAJ
* [[DOIBoost#Inputs|Crossref]]
* [[DOIBoost#Inputs|Unpaywall]]
* [[DOIBoost#Inputs|ORCID]]
* [[DOIBoost#Inputs|Microsoft Academic Graph]]
* DataCite

After cleaning, deduplication, enrichment and full-text mining, the graph is analysed to produce statistics for OpenAIRE MONITOR (https://monitor.openaire.eu) and the Open Science Observatory (https://osobservatory.openaire.eu), made discoverable via OpenAIRE EXPLORE (https://explore.openaire.eu), and made programmatically accessible as described at https://develop.openaire.eu.
JSON dumps are also published on Zenodo.

TODO: image of high-level data model (entities and semantic relationships, we can draw here: https://docs.google.com/drawings/d/1c4s7Pk2r9NgV_KXkmX6mwCKBIQ-yK3_m6xsxB-3Km1s/edit)

h2. Graph Data Dumps

Several dumps are available to suit different needs. All of them are published in the "OpenAIRE Research Graph community on Zenodo":https://zenodo.org/communities/openaire-research-graph.
Here we provide detailed documentation about the full dump:

* JSON dump: https://doi.org/10.5281/zenodo.3516917
* JSON schema: https://doi.org/10.5281/zenodo.4238938
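
As an illustration of how a downloaded dump file might be consumed, here is a minimal Python sketch. This page does not describe how the dump is packaged, so the sketch assumes that a dump part is a gzip-compressed file with one JSON record per line. The field name passed to @count_by_field@ is hypothetical; the JSON schema linked above is the place to check the actual field names.

```python
import gzip
import json
from collections import Counter
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a gzip-compressed file.

    Assumption: the file holds one JSON object per line (JSON Lines).
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def count_by_field(path: str, field: str) -> Counter:
    """Count records by the value of a top-level field.

    Records that lack the field are counted under "<missing>".
    """
    counts: Counter = Counter()
    for record in iter_records(path):
        counts[str(record.get(field, "<missing>"))] += 1
    return counts
```

For example, @count_by_field("part-00000.json.gz", "type")@ would tally the records of one file by their @type@ value, where both the file name and the field name are placeholders. Reading line by line keeps memory use flat, which matters for files of this size.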

h3. [[Json schema]]

h3. [[FAQ]]

h2. Graph provision processes

TODO:
3. Processes
* harvesting (not sure what Ioanna wants: is what we have on graph.openaire.eu OK?)
* transformation (not sure what Ioanna wants: is what we have on graph.openaire.eu OK?)
* -doiboost- (not in processing, in [[#Aggregation business logic by major sources]])
* direct Zenodo updates
* deduplication (started)
* inference
* Funder ingestion ("Harry is speaking with funders, gets the list of projects, inference rules, ...")
For each process we need a description of what it does, what its input is, which part of the graph it affects, and anything else of importance (e.g. "if there is no input value for X, then we assume Y and assign the value Z to A").

[[OpenAIRE entity identifier and PID mapping policy]]

h3. Aggregation business logic by major sources

TODO:
2. input sources
* repositories
* journals
* -DOIBoost-
** -MAG-
** -Crossref-
** -Unpaywall-
* projects
* organizations
* OpenOrgs
* OpenCitations
* OpenAPC
* etc.

For each input source class, we need to know what it contains, the format, approximate numbers, peculiarities, and important aspects of the data (e.g. "Crossref provides us with the authoritative list of publishers!"). Anything else of importance?

DOIBoost is the intersection of Crossref, Unpaywall, Microsoft Academic Graph and ORCID.
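
Read literally, the sentence above describes a join on DOI across the four sources. The following Python sketch illustrates that reading only; it is not the DOIBoost implementation, which is documented on the DOIBoost page. The record layout, the source names used as dictionary keys, and the DOI normalisation rule (trim and lower-case) are all assumptions made for the example.

```python
from typing import Dict, Mapping

Record = Dict[str, object]


def normalise_doi(doi: str) -> str:
    """Trim and lower-case a DOI so the same DOI matches across sources.

    Assumption: this simple rule is enough for the illustration.
    """
    return doi.strip().lower()


def intersect_by_doi(
    sources: Mapping[str, Mapping[str, Record]]
) -> Dict[str, Dict[str, Record]]:
    """Return, for each DOI present in every source, its per-source records.

    `sources` maps a source name to a {doi: record} mapping. If two DOIs of
    one source normalise to the same key, the later one wins.
    """
    normalised = {
        name: {normalise_doi(doi): record for doi, record in records.items()}
        for name, records in sources.items()
    }
    if not normalised:
        return {}
    # Strict intersection: keep only DOIs that occur in all sources.
    common = set.intersection(*(set(records) for records in normalised.values()))
    return {
        doi: {name: records[doi] for name, records in normalised.items()}
        for doi in sorted(common)
    }
```

The strict intersection is the literal reading of the sentence. If the intended behaviour is instead to keep every Crossref record and enrich it where the other sources have a match, the set intersection would be replaced by the Crossref key set.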

h4. [[DOIBoost]]

h4. [[Datacite]]

h4. [[EuropePMC]]

The strategy for resolving links between publications and datasets is defined by Scholexplorer.

[[Scholexplorer]]

[[Deduplication]]

h2. TODOs

* OpenAIRE entity identifier & PID mapping policy (started, to be completed by Claudio and/or Michele DB)
* Aggregation business logic by major sources:
** -Unpaywall integration-
** -Crossref integration-
** -ORCID integration-
** -Cross cleaning actions: hostedBy patch-
** Scholexplorer business logic (relationship resolution)
** DataCite
** EuropePMC
** more…
* Deduplication business logic (started, to be completed by Michele DB)
** For research outputs ( -publications- , datasets, software, other research products)
** For research organizations
* Enrichment
** Mining business logic
** Deduction-based inference
** Propagation business logic
* Post-cleaning business logic
* FAQ