OpenAIRE Research Graph » History » Version 11

Alessia Bardi, 05/11/2021 03:51 PM
Description for DOIBoost

h1. The OpenAIRE Research Graph

The OpenAIRE Research Graph is one of the largest open scholarly record collections worldwide, and is key to fostering Open Science and establishing its practices in daily research activities.
Conceived as a public and transparent good, populated from data sources trusted by scientists, the Graph aims to bring discovery, monitoring, and assessment of science back into the hands of the scientific community.

Imagine a vast collection of research products, all linked together, contextualised and openly available. For the past ten years OpenAIRE has been working to gather this valuable record. It is a massive collection of metadata and links between scientific products (articles, datasets, software, and other research products) and entities such as organisations, funders, funding streams, projects, communities, and data sources.

As of today, the OpenAIRE Research Graph aggregates around 450M metadata records, with links, collected from about 10K data sources trusted by scientists, including:
* Repositories registered in OpenDOAR or re3data.org
* Open Access journals registered in DOAJ
* Crossref
* Unpaywall
* ORCID
* Microsoft Academic Graph
* DataCite

After cleaning, deduplication, enrichment and full-text mining, the graph is analysed to produce statistics for OpenAIRE MONITOR (https://monitor.openaire.eu) and the Open Science Observatory (https://osobservatory.openaire.eu), made discoverable via OpenAIRE EXPLORE (https://explore.openaire.eu), and made programmatically accessible as described at https://develop.openaire.eu.
JSON dumps are also published on Zenodo.

TODO: image of high-level data model (entities and semantic relationships, we can draw here: https://docs.google.com/drawings/d/1c4s7Pk2r9NgV_KXkmX6mwCKBIQ-yK3_m6xsxB-3Km1s/edit)

h2. Graph Data Dumps

To serve different user needs, several dumps are available, all under the "OpenAIRE Research Graph community on Zenodo":https://zenodo.org/communities/openaire-research-graph.
Here we provide detailed documentation about the full dump:

* JSON dump: https://doi.org/10.5281/zenodo.3516917
* JSON schema: https://doi.org/10.5281/zenodo.4238938

[[Json schema]]
[[FAQ]]

h2. Graph provision processes

h3. OpenAIRE entity identifier & PID mapping policy

(copied from https://docs.google.com/document/d/1PnvZpmhbanJu3AeOT-zdIyMKIHoGKC4_Z0UtDFDZAeM/edit#)

OpenAIRE assigns an internal identifier to each object it collects.
By default, the internal identifier is generated as @sourcePrefix::md5(localId)@ where
* @sourcePrefix@ is a 12-character namespace prefix assigned to the data source at registration time
* @localId@ is the identifier assigned to the object by the data source

After years of operation, we can say that:
* @localId@ values are unstable
* objects can disappear from sources
* PIDs provided by sources that are not PID agencies (authoritative sources for a specific type of PID) are often wrong (e.g. a pre-print carrying the DOI of the published version, DOIs with typos)

Therefore, when the record is collected from an authoritative source:
* the identity of the record is generated from the PID, like @pidTypePrefix::md5(lowercase(doi))@
* the PID is added in a @pid@ element of the data model.

When the record is collected from a source which is not authoritative for any type of PID:
* the identity of the record is generated as usual from the local identifier;
* the PID, if available, is added as an @alternateIdentifier@.

As of November 2021, the following data sources are used as "PID authorities":

| PID Type | Prefix (12 chars) | Authority |
| doi | @doi_________@ | Crossref, DataCite, Zenodo |
| pmc | @pmc_________@ | Europe PubMed Central, PubMed Central |
| pmid | @pmid________@ | Europe PubMed Central, PubMed Central |
| arXiv | @arXiv_______@ | arXiv.org e-Print Archive |
| handle | @handle______@ | any repository |
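
The two identifier patterns described above can be illustrated with a minimal Python sketch. It is not the OpenAIRE implementation: the function names are hypothetical, the prefixes are copied from the table, and lowercasing every PID type (the text only states it for DOIs) is an assumption.

```python
import hashlib

# Prefixes copied from the PID authority table (each is 12 characters long).
PID_PREFIXES = {
    "doi": "doi_________",
    "pmc": "pmc_________",
    "pmid": "pmid________",
    "arXiv": "arXiv_______",
    "handle": "handle______",
}


def md5_hex(value: str) -> str:
    """Return the hexadecimal MD5 digest of a UTF-8 encoded string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def default_identifier(source_prefix: str, local_id: str) -> str:
    """sourcePrefix::md5(localId), for sources that are not PID authorities."""
    if len(source_prefix) != 12:
        raise ValueError("the namespace prefix must be 12 characters long")
    return f"{source_prefix}::{md5_hex(local_id)}"


def pid_identifier(pid_type: str, pid_value: str) -> str:
    """pidTypePrefix::md5(lowercase(pid)), for records from a PID authority.

    Assumption: the PID is lowercased for every PID type, not only for DOIs.
    """
    return f"{PID_PREFIXES[pid_type]}::{md5_hex(pid_value.lower())}"
```

With this sketch, @pid_identifier("doi", "10.5281/ZENODO.3516917")@ and @pid_identifier("doi", "10.5281/zenodo.3516917")@ yield the same identifier, which is the point of lowercasing the PID before hashing.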

TODO: WHAT HAPPENS FOR RECORDS WITH BOTH pmc and pmid? pmc wins?

OpenAIRE also performs duplicate identification (see the dedicated section for details).
All duplicates are "merged" together into a "representative record", which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated records).
The following strategy is applied to generate the OpenAIRE identifier of a representative record, to ensure it is as stable as possible:

TODO

h3. Aggregation business logic by major sources

h4. DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID

The idea behind DOIBoost and its origin are described in the following paper (and related resources):

* La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11. Open Access version available at: https://doi.org/10.5281/zenodo.1441071

In short, the goal is to enrich the records available on Crossref with what is available on Unpaywall, Microsoft Academic Graph, and ORCID, intersecting all those datasets by DOI.
The generation of DOIBoost consists of the following phases:

# Filter out Crossref records that:
* have a blank title
* have one of the following publishers: "Test accounts", "CrossRef Test Account"
* have no authors with valid names, where valid means: not blank and different from all strings in this list: @List(",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na;")@
* have "Addie Jackson" as author and "Elsevier BV" as publisher (empirically, these appear to be test records)
# Intersect Crossref with Unpaywall by DOI (DOIBoost1). The records are enriched with:
* TODO: AUTHORS?
* one @instance@ with
** the @best_oa_location@ of Unpaywall
** @color@ set as follows: @green@ if the host is a repository; @gold@ if the host is the publisher and the journal is open access; @hybrid@ if the host is the publisher, the journal is not open access, but there is a license; @bronze@ if no license is available.
# Intersect DOIBoost1 with ORCID (DOIBoost2). The records are enriched with the ORCID identifiers of their authors.
# Intersect DOIBoost2 with Microsoft Academic Graph (DOIBoost3). The records are enriched with:
* abstracts
* MAG identifiers of authors
* affiliation relationships
* subjects (MAG FieldsOfStudy)
* conference or journal information (in the @journal@ field) TODO: or @container@, in case of the dump?
* [TO BE REMOVED] instances with URL from MAG
# Enrich DOIBoost3 with hosting data sources (@hostedby@) and access right information. In this phase we intersect DOIBoost3 with a dataset composed of journals from OpenAIRE, Crossref, and the ISSN gold list. Each journal comes with its International Standard Serial Numbers (issn, eissn, lissn) and, when available, a flag that tells whether the journal is open access. The intersection is done on the basis of the International Standard Serial Numbers. The records with a matching @journal.[l|e]issn@ are enriched as follows:
* Each instance gains the @hostedby@ information.
* If the journal is open access, the access rights of the instances are also set to "Open Access" with "gold" route.

The @hostedby@ of records that do not match is set to the "Unknown Repository".
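
The Crossref filter of phase 1 and the @color@ rule of phase 2 can be sketched in Python as follows. This is only an illustration under assumptions: the record layout (a dict with @title@, @publisher@ and a list of author names), the function names, the case-insensitive comparison of author names, and the @host_type@ values are hypothetical and not taken from the DOIBoost code.

```python
from typing import Optional

# Strings that do not count as valid author names (list copied from phase 1).
INVALID_AUTHOR_NAMES = {
    ",", "none none", "none, none", "none &na;", "(:null)",
    "test test test", "test test", "test", "&na; &na;",
}
TEST_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}


def has_valid_name(name: Optional[str]) -> bool:
    """A name is valid if it is not blank and not in the invalid list.

    Assumption: the comparison ignores case and surrounding whitespace.
    """
    cleaned = (name or "").strip()
    return bool(cleaned) and cleaned.lower() not in INVALID_AUTHOR_NAMES


def keep_crossref_record(record: dict) -> bool:
    """Return False for records that phase 1 would discard."""
    title = (record.get("title") or "").strip()
    publisher = record.get("publisher") or ""
    authors = record.get("authors") or []
    if not title:
        return False
    if publisher in TEST_PUBLISHERS:
        return False
    if not any(has_valid_name(author) for author in authors):
        return False
    if "Addie Jackson" in authors and publisher == "Elsevier BV":
        return False
    return True


def oa_color(host_type: str, journal_is_oa: bool, license_id: Optional[str]) -> str:
    """Apply the green / gold / hybrid / bronze rule of phase 2."""
    if host_type == "repository":
        return "green"
    if journal_is_oa:
        return "gold"
    if license_id:
        return "hybrid"
    return "bronze"
```

The colour rule is written as a chain of early returns so that the order of the checks mirrors the order in which the cases are listed in phase 2.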

h3. Deduplication business logic

h4. Deduplication business logic for research results

Metadata records about the same scholarly work can be collected from different providers. Each metadata record may carry different information because, for example, some providers are not aware of links to projects, keywords or other details. Another common case is when OpenAIRE collects one metadata record from a repository about a pre-print and another record from a journal about the published article. For the provision of statistics, OpenAIRE must identify those cases and “merge” the metadata records, so that the scholarly work is counted only once in the statistics OpenAIRE produces.

Duplicates are searched for only among research results of the same type (publications, datasets, software, other research products). If two duplicate results are aggregated one as a dataset and one as software, for example, they will never be compared and will never be identified as duplicates.
OpenAIRE supports different deduplication strategies based on the type of results.

*Methodology overview*

The deduplication process can be divided into three phases:
* Candidate identification (clustering)
* Decision tree
* Creation of representative record

The implementation of each phase differs based on the type of results being processed.

*Strategy for publications*

_Candidate identification (clustering)_

Due to the high number of metadata records collected by OpenAIRE, it would not be feasible to compare every pair of metadata records.
The goal of this phase is to limit the number of comparisons by creating groups (or clusters) of records that are likely “similar”. Every record can be added to more than one group.
Inclusion in a group is decided by 2 clustering functions:
* Lowercase: doi (in pid list and alternate identifiers list)
* WordsStatsSuffixPrefixChain: suffixprefix with statistics on the full title (number_of_words & number_of_letters%10)

Example:
If the title is: “Search for the Standard Model Higgs Boson”
The clustering function produces 2 keys (i.e. adds the publication to two clusters): [5-3-seaardmod, 5-3-rchstadel]
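
The following Python sketch reproduces the two keys of the example above. It is a reconstruction from the example only, not the actual clustering code: the function name, the rule that words of three characters or fewer are ignored, and the way letters are counted (length of the remaining words joined by single spaces) are all assumptions chosen because they yield @5-3-@ for this title.

```python
from typing import List


def words_stats_suffix_prefix_chain(title: str, n: int = 3, mod: int = 10) -> List[str]:
    """Build two clustering keys from the first three significant title words."""
    # Assumption: short words such as "for" and "the" are ignored.
    words = [w for w in title.lower().split() if len(w) > 3]
    if len(words) < 3:
        return []
    # Assumption: letters are counted on the filtered title, spaces included.
    letters = len(" ".join(words))
    stats = f"{len(words)}-{letters % mod}-"
    first, second, third = words[0], words[1], words[2]
    return [
        # prefix + suffix + prefix
        stats + first[:n] + second[-n:] + third[:n],
        # suffix + prefix + suffix
        stats + first[-n:] + second[:n] + third[-n:],
    ]


keys = words_stats_suffix_prefix_chain("Search for the Standard Model Higgs Boson")
print(keys)  # ['5-3-seaardmod', '5-3-rchstadel']
```

Two records end up in the same cluster as soon as they share at least one key, so small differences at the end of a title do not prevent them from being compared.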

_Decision tree_

For each pair of publications in a cluster the following strategy (depicted in the figure below) is applied.
First, the pid lists (in the @pid@ and @alternateid@ elements) are cross-compared. If 50% of the pids are in common, the titles are compared using the Levenshtein distance with a low threshold (0.9).
Otherwise, the number of authors and the title version are checked. If both are equal, the titles are compared using the Levenshtein distance with a higher threshold (0.99).
The publications are matched as duplicates if the resulting score is higher than the threshold; in every other case they are considered distinct publications.

!dedup-results.png!
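
The decision steps above can be sketched in Python as follows. This is an illustration under assumptions, not the OpenAIRE implementation: the title score is taken to be @1 - distance / length of the longer title@ on lowercased titles, the pid overlap is measured against the shorter pid list and must reach at least 50%, and the record fields (@title@, @pids@, @authors@, @title_version@) are hypothetical names.

```python
from typing import Set


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def title_score(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 means identical titles (ignoring case)."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def pid_overlap(p1: Set[str], p2: Set[str]) -> float:
    """Share of common pids, relative to the shorter list (an assumption)."""
    if not p1 or not p2:
        return 0.0
    return len(p1 & p2) / min(len(p1), len(p2))


def are_duplicates(r1: dict, r2: dict) -> bool:
    score = title_score(r1["title"], r2["title"])
    if pid_overlap(set(r1["pids"]), set(r2["pids"])) >= 0.5:
        return score > 0.9
    same_author_count = len(r1["authors"]) == len(r2["authors"])
    same_version = r1.get("title_version") == r2.get("title_version")
    if same_author_count and same_version:
        return score > 0.99
    return False
```

In this sketch, sharing pids lowers the bar for the title comparison, while records without common pids must have almost identical titles to be matched.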

_Creation of representative record_

TODO

*Strategy for datasets*

*Strategy for software*

*Strategy for other types of research products*

*Clustering functions*

_NgramPairs_

It produces a list of concatenations of a pair of ngrams generated from different words.
Example:
Input string: “Search for the Standard Model Higgs Boson”
Parameters: ngram length = 3
List of ngrams: “sea”, “sta”, “mod”, “hig”
Ngram pairs: “seasta”, “stamod”, “modhig”
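
A minimal Python sketch that reproduces the NgramPairs example is given below. The function name and the parameters @max_ngrams@ and @min_word_len@ are assumptions: they are set so that short words (“for”, “the”) are skipped and only the first four ngrams are kept, as the example suggests, and may not match the real configuration.

```python
from typing import List


def ngram_pairs(text: str, ngram_len: int = 3, max_ngrams: int = 4,
                min_word_len: int = 4) -> List[str]:
    """Concatenate the leading ngrams of consecutive significant words."""
    # Assumption: words shorter than min_word_len are ignored.
    words = [w for w in text.lower().split() if len(w) >= min_word_len]
    # Assumption: only the first max_ngrams ngrams are used.
    ngrams = [w[:ngram_len] for w in words][:max_ngrams]
    return [left + right for left, right in zip(ngrams, ngrams[1:])]


print(ngram_pairs("Search for the Standard Model Higgs Boson"))
# ['seasta', 'stamod', 'modhig']
```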

_SuffixPrefix_

It produces ngram pairs in a particular way: it concatenates the suffix of a word with the prefix of the next word in the input string.
Example:
Input string: “Search for the Standard Model Higgs Boson”
Parameters: suffix and prefix length = 3
Output list: “ardmod” (suffix of the word “Standard” + prefix of the word “Model”), “rchsta” (suffix of the word “Search” + prefix of the word “Standard”)
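
The SuffixPrefix example can be reproduced with the short Python sketch below. As before, this is a hedged illustration: the function name, the @max_keys@ limit of 2 (inferred from the example listing only two keys) and the rule that short words are skipped are assumptions, and the order of the returned keys is not meant to be significant.

```python
from typing import List


def suffix_prefix(text: str, length: int = 3, max_keys: int = 2,
                  min_word_len: int = 4) -> List[str]:
    """Concatenate the suffix of each word with the prefix of the next one."""
    # Assumption: words shorter than min_word_len are ignored.
    words = [w for w in text.lower().split() if len(w) >= min_word_len]
    keys = [left[-length:] + right[:length]
            for left, right in zip(words, words[1:])]
    # Assumption: only the first max_keys keys are kept.
    return keys[:max_keys]


print(suffix_prefix("Search for the Standard Model Higgs Boson"))
# ['rchsta', 'ardmod']
```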

h3. TODOs

* OpenAIRE entity identifier & PID mapping policy (started, to be completed by Claudio and/or Michele DB)
* Aggregation business logic by major sources:
** Unpaywall integration
** Crossref integration
** ORCID integration
** Cross cleaning actions: hostedBy patch
** Scholexplorer business logic (relationship resolution)
** DataCite
** EuropePMC
** more…
* Deduplication business logic (started, to be completed by Michele DB)
** For research outputs ( -publications- , datasets, software, orp)
** For research organizations
* Enrichment
** Mining business logic
** Deduction-based inference
** Propagation business logic
* Post-cleaning business logic
* FAQ