OpenAIRE Research Graph » History » Version 3

Alessia Bardi, 05/11/2021 01:10 PM
Copied content from dedup document (to be updated)


h1. The OpenAIRE Research Graph

The OpenAIRE Research Graph is one of the largest open scholarly record collections worldwide, key in fostering Open Science and establishing its practices in daily research activities.
Conceived as a public and transparent good, populated out of data sources trusted by scientists, the Graph aims at bringing discovery, monitoring, and assessment of science back into the hands of the scientific community.

Imagine a vast collection of research products all linked together, contextualised and openly available. For the past ten years OpenAIRE has been working to gather this valuable record. It is a massive collection of metadata and links between scientific products (articles, datasets, software, and other research products) and entities such as organisations, funders, funding streams, projects, communities, and data sources.

As of today, the OpenAIRE Research Graph aggregates around 450 million metadata records, with links, collected from 10K data sources trusted by scientists, including:

* Repositories registered in OpenDOAR or re3data.org
* Open Access journals registered in DOAJ
* Crossref
* Unpaywall
* ORCID
* Microsoft Academic Graph
* DataCite

After cleaning, deduplication, enrichment and full-text mining processes, the graph is analysed to produce statistics for OpenAIRE MONITOR (https://monitor.openaire.eu) and the Open Science Observatory (https://osobservatory.openaire.eu), made discoverable via OpenAIRE EXPLORE (https://explore.openaire.eu), and made programmatically accessible as described at https://develop.openaire.eu.
JSON dumps are also published on Zenodo.

TODO: image of high-level data model (entities and semantic relationships, we can draw here: https://docs.google.com/drawings/d/1c4s7Pk2r9NgV_KXkmX6mwCKBIQ-yK3_m6xsxB-3Km1s/edit)

h2. Graph Data Dumps

To serve different user needs, several dumps are available, all published in the "OpenAIRE Research Graph community on Zenodo":https://zenodo.org/communities/openaire-research-graph.
Here we provide detailed documentation about the full dump:

* JSON dump: https://doi.org/10.5281/zenodo.3516917
* JSON schema: https://doi.org/10.5281/zenodo.4238938

[[Json schema]]
[[FAQ]]

h2. Graph provision processes

h3. Deduplication business logic

h4. Deduplication business logic for research results

Metadata records about the same scholarly work can be collected from different providers. Each metadata record may carry different information because, for example, some providers are not aware of links to projects, keywords or other details. Another common case is when OpenAIRE collects one metadata record from a repository about a pre-print and another record from a journal about the published article. For the provision of statistics, OpenAIRE must identify those cases and “merge” the two metadata records, so that the scholarly work is counted only once in the statistics OpenAIRE produces.

Duplicates among research results are identified only among results of the same type (publications, datasets, software, other research products). If two duplicate results are aggregated one as a dataset and one as software, for example, they will never be compared and will never be identified as duplicates.
OpenAIRE supports different deduplication strategies based on the type of results.

*Methodology overview*

The deduplication process can be divided into two different phases:

* Candidate identification (clustering)
* Candidate matching (blocking)

The implementation of each phase differs according to the type of results being processed.

*Strategy for publications*

TODO: UPDATE

_Candidate identification (clustering)_
Due to the high number of metadata records collected by OpenAIRE, it is not feasible to compare every metadata record with every other one.
The goal of this phase is to limit the number of comparisons by creating groups (or clusters) of records that are likely to be similar. A record can be added to more than one group. The idea is that there is no need to compare two publications whose titles are completely different.
Inclusion in a group is decided by 3 clustering functions (see Section “Clustering functions” for details about each clustering function) that work on titles and DOIs in order to create clusters:

* whose publications have similar titles (clustering functions “Suffixprefix” and “ngrampairs”);
* whose publications have the same DOI, regardless of whether it is written in lower, upper or mixed case letters (clustering function “lowercase”).
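
The grouping step described above can be sketched in a few lines of Python. This is only an illustration, not the OpenAIRE implementation: the record layout (dictionaries with @id@, @title@ and @doi@), the function names and the simplistic @title_prefix_key@ (a stand-in for the real title-based clustering functions) are all assumptions. It shows how one record can land in several clusters, and how lower-casing makes DOIs written in different cases collide.

```python
from collections import defaultdict


def doi_key(record):
    """Clustering key for DOIs: lower-case so that case variants collide."""
    doi = record.get("doi")
    return [doi.lower()] if doi else []


def title_prefix_key(record):
    """Hypothetical stand-in for a title-based clustering function."""
    words = record.get("title", "").lower().split()
    return [words[0][:3]] if words else []


def build_clusters(records, key_functions):
    """Group record ids under every key emitted by every clustering function."""
    clusters = defaultdict(set)
    for record in records:
        for fn in key_functions:
            for key in fn(record):
                clusters[(fn.__name__, key)].add(record["id"])
    # A cluster with a single record produces no comparisons, so drop it.
    return {key: ids for key, ids in clusters.items() if len(ids) > 1}


records = [
    {"id": "r1", "title": "Search for the Higgs Boson", "doi": "10.1234/ABC"},
    {"id": "r2", "title": "Search for the Higgs boson", "doi": "10.1234/abc"},
    {"id": "r3", "title": "Deep learning for graphs", "doi": None},
]
clusters = build_clusters(records, [doi_key, title_prefix_key])
```

Here @r1@ and @r2@ end up together in two clusters (one per clustering function), while @r3@ shares no key with any other record and is never compared.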

_Candidate matching (blocking)_
Once the clusters have been composed, the algorithm proceeds with the comparisons.
Still, a single cluster may contain too many records for all pairwise comparisons to be feasible, hence we have introduced the concept of a “sliding window”.
With this mechanism, a window of a fixed size (currently set at 200) is slid over the group and only records within the window are compared with each other. In order to maximize the probability that duplicated records fall within the sliding window bounds, the records in the cluster are ordered on the values of some of their attributes. Specifically, the publication metadata records of each cluster are ordered lexicographically on a normalized version of their titles.
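
A minimal sketch of the sliding window, under stated assumptions: the title normalisation used here (lower-casing and keeping only letters and digits) and the names @normalize_title@ and @window_pairs@ are hypothetical, since the actual normalisation is not specified in this page. Comparing each record with the @window_size - 1@ records that follow it in the sorted order is equivalent to comparing all records that share a window position as the window moves one record at a time.

```python
import re


def normalize_title(title):
    """Assumed normalisation: lower-case, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def window_pairs(cluster, window_size=200):
    """Yield every pair of records that falls inside the same window."""
    ordered = sorted(cluster, key=lambda r: normalize_title(r["title"]))
    for i, left in enumerate(ordered):
        # Only the next window_size - 1 records share a window with `left`.
        for right in ordered[i + 1 : i + window_size]:
            yield left, right
```

For a cluster of n records this produces roughly n × (window size) comparisons instead of n × (n - 1) / 2, which is the reason the records are sorted first: likely duplicates need to sit close to each other to be compared at all.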

Each record in the sliding window is compared to all other records in the sliding window. Comparisons are driven by a decision tree that can be depicted as in figure 1.

* Sufficient conditions (in orange in figure 1) are applied first: if the PIDs of the two records are the same (applying the condition function “pidMatch”), then the two records are duplicates. Otherwise, the algorithm proceeds with the next conditions (in yellow in figure 1).
* If the titles of the two records contain numbers and these numbers are not the same, then the records are not duplicates (condition function “titleVersionMatch”).
* If the two records have different numbers of authors (condition function “sizeMatch” on metadata field “author”), then the records are not duplicates. If both yellow conditions are satisfied, the algorithm proceeds with the last comparison (in blue in figure 1).
* The titles of the two records are normalised and compared for similarity by applying the Levenshtein distance algorithm (condition function called “LevenshteinTitle”). The function returns a similarity score in the range [0,1], where 0 means “very different” and 1 means “equal”. If the score is greater than or equal to 0.99, the two records are identified as duplicates.
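
The order of the checks can be sketched as follows. This is a simplified illustration under several assumptions: records are dictionaries with @pids@ (a list of (type, value) pairs), @title@ and @authors@; the PID check is reduced to "at least one PID in common"; only Arabic numbers are extracted from titles; and the edit distance is mapped to [0,1] by dividing by the length of the longer title, which is one common normalisation and not necessarily the one used by OpenAIRE.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def title_similarity(a, b):
    """Map edit distance to [0, 1], where 1 means equal (assumed normalisation)."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def are_duplicates(r1, r2, threshold=0.99):
    # Sufficient condition (simplified): the records share a PID.
    if set(r1["pids"]) & set(r2["pids"]):
        return True
    # Necessary condition: the numbers found in the two titles must agree.
    if re.findall(r"\d+", r1["title"]) != re.findall(r"\d+", r2["title"]):
        return False
    # Necessary condition: same number of authors.
    if len(r1["authors"]) != len(r2["authors"]):
        return False
    # Final check: title similarity against the threshold.
    return title_similarity(r1["title"], r2["title"]) >= threshold
```

With a threshold of 0.99, a single differing character is tolerated only in titles of at least 100 characters, so this last check behaves almost like an equality test on normalised titles.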

*Strategy for datasets*

*Strategy for software*

*Strategy for other types of research products*

*Clustering functions*

_NgramPairs_
It produces a list of concatenations of pairs of ngrams generated from different words.
Example:
Input string: “Search for the Standard Model Higgs Boson”
Parameters: ngram length = 3
List of ngrams: “sea”, “sta”, “mod”, “hig”
Ngram pairs: “seasta”, “stamod”, “modhig”
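
The example can be reproduced with a short Python sketch. The stopword list, the function name and the @max_ngrams@ cap (set to 4 here only because the example lists four ngrams) are assumptions made for illustration; they are not taken from the OpenAIRE code.

```python
STOPWORDS = {"for", "the", "of", "a", "an", "and", "in", "on"}  # assumed list


def ngram_pairs(text, ngram_len=3, max_ngrams=4):
    """Concatenate the leading ngrams of consecutive non-stopword words."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    # One ngram per word: its first `ngram_len` characters.
    ngrams = [w[:ngram_len] for w in words if len(w) >= ngram_len][:max_ngrams]
    return [first + second for first, second in zip(ngrams, ngrams[1:])]


print(ngram_pairs("Search for the Standard Model Higgs Boson"))
# ['seasta', 'stamod', 'modhig']
```

Each returned string is a clustering key: two titles that share at least one key end up in the same cluster.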

_SuffixPrefix_
It produces ngram pairs in a particular way: it concatenates the suffix of a word with the prefix of the next word in the input string.
Example:
Input string: “Search for the Standard Model Higgs Boson”
Parameters: suffix and prefix length = 3
Output list: “ardmod” (suffix of the word “Standard” + prefix of the word “Model”), “rchsta” (suffix of the word “Search” + prefix of the word “Standard”)
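
A comparable sketch for this function, again under assumptions: stopwords are dropped before pairing (which is what makes “Search” and “Standard” adjacent in the example), and a hypothetical @max_pairs@ cap of 2 is applied only so that the output matches the two keys listed above.

```python
STOPWORDS = {"for", "the", "of", "a", "an", "and", "in", "on"}  # assumed list


def suffix_prefix(text, length=3, max_pairs=2):
    """Join the suffix of each word with the prefix of the following word."""
    words = [w for w in text.lower().split()
             if w not in STOPWORDS and len(w) >= length]
    keys = [left[-length:] + right[:length]
            for left, right in zip(words, words[1:])]
    return keys[:max_pairs]


print(suffix_prefix("Search for the Standard Model Higgs Boson"))
# ['rchsta', 'ardmod']
```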

*Conditional functions*

_PidMatch_
Compares two sets of persistent identifiers [type, value]. The condition is satisfied when the majority of PIDs are in common.

_SizeMatch_
Compares the number of occurrences of two repeatable fields. The condition is satisfied when the number matches.

_TitleVersionMatch_
Compares two titles. The condition is satisfied when the numbers (Arabic or Roman) contained in the title fields are the same.
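
The three conditions might look as follows in Python. Several points are assumptions rather than documented behaviour: “majority” is read as “more than half of the smaller PID set is shared”; Roman numerals are detected per token with a regular expression, so ordinary words that happen to be valid numerals (such as “I” or “mix”) would be counted too; and Arabic and Roman forms of the same number are not treated as equal.

```python
import re

# Matches a whole token that is a well-formed Roman numeral (lower case).
ROMAN_RE = re.compile(
    r"^(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$"
)


def pid_match(pids_a, pids_b):
    """Assumed reading of "majority": over half of the smaller set is shared."""
    a, b = set(pids_a), set(pids_b)
    if not a or not b:
        return False
    return len(a & b) > min(len(a), len(b)) / 2


def size_match(values_a, values_b):
    """True when two repeatable fields have the same number of occurrences."""
    return len(values_a) == len(values_b)


def title_numbers(title):
    """Arabic or Roman number tokens found in a title, in order of appearance."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return [t for t in tokens if t.isdigit() or ROMAN_RE.match(t)]


def title_version_match(title_a, title_b):
    """True when both titles contain the same numbers."""
    return title_numbers(title_a) == title_numbers(title_b)
```

For instance, @title_version_match("Graph Theory Part I", "Graph Theory Part II")@ is false, which keeps two volumes of the same work from being merged even though their titles are nearly identical.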

h3. TODOs

* OpenAIRE entity identifier & PID mapping policy
* Aggregation business logic by major sources:
** Unpaywall integration
** Crossref integration
** ORCID integration
** Cross cleaning actions: hostedBy patch
** Scholexplorer business logic (relationship resolution)
** DataCite
** EuropePMC
** more….
* Deduplication business logic
** -For research outputs-
** For research organizations
* Enrichment
** Mining business logic
** Deduction-based inference
** Propagation business logic
* Post-cleaning business logic
* FAQ