DOIBoost » History » Version 52

Miriam Baglioni, 11/03/2022 04:58 PM

h1. DOIBoost

{{>toc}}

h2. DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID

The idea behind DOIBoost and its origin are described in the following paper (and related resources):

* La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11 . Open Access version available at: https://doi.org/10.5281/zenodo.1441071

In short, the goal is to enrich the records available on Crossref with what is available on Unpaywall, Microsoft Academic Graph (MAG), and ORCID, by intersecting all those datasets by DOI. As a consequence, DOIBoost does not contain any record from MAG, Unpaywall, or ORCID whose DOI is not available in Crossref.
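
The intersection by DOI can be pictured as a left join that starts from Crossref. The sketch below is only an illustration under stated assumptions: the function name @enrich_by_doi@, the in-memory dictionaries, and the sample DOIs are hypothetical, and the keys of the other datasets are assumed to be lower-cased DOIs already.

```python
from typing import Any, Dict

Record = Dict[str, Any]


def enrich_by_doi(crossref: Dict[str, Record], **sources: Dict[str, Record]) -> Dict[str, Record]:
    """Keep every Crossref record and attach the matching record of each source, if any."""
    enriched = {}
    for doi, record in crossref.items():
        key = doi.lower()
        merged = dict(record)
        for name, dataset in sources.items():
            if key in dataset:
                merged[name] = dataset[key]
        enriched[key] = merged
    return enriched


# Invented sample data: one DOI is shared, one exists only in Unpaywall.
crossref = {"10.1000/a": {"title": "A"}, "10.1000/b": {"title": "B"}}
unpaywall = {"10.1000/a": {"is_oa": True}, "10.1000/zzz": {"is_oa": True}}
result = enrich_by_doi(crossref, unpaywall=unpaywall)
```

Records that exist only in the other datasets (here @10.1000/zzz@) never reach the output, which matches the statement above.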

Each Crossref record is enriched with:
* ORCID identifiers of authors from ORCID
* Open Access instance (with OA color/route and license) from Unpaywall
* the following information from MAG:
** abstracts
** MAG identifiers of authors
** affiliation (result - organization) relationships
** subjects (MAG FieldsOfStudy)
** conference or journal information

The Open Access status is also set by intersecting the journal information of a record with the journal lists available from DOAJ and the Gold ISSN list.

h3. Inputs

* *Crossref*: dump available to Crossref subscribers via the MetadataPlus service, updated once a month.
* *Microsoft Academic Graph*: version downloaded on 2021-02-15. We plan to take the latest version in Dec 2021, before MAG is retired.
* *ORCID*: baseline dump obtained on 2020-10-13, updated every week from the "ORCID public API":https://info.orcid.org/documentation/features/public-api/
* *Unpaywall*: public database snapshot downloaded in March 2021. Unpaywall updates it twice a year (https://unpaywall.org/products/snapshot)

The construction of the DOIBoost dataset consists of the following phases:

h3. 1 Filtering

A Crossref record is discarded if any of the following holds:

* it has a blank title
** Examples:
*** 10.1093/rheumatology/41.7.837
*** 10.1093/qjmed/95.7.430
*** 10.1371/journal.pone.0171434.g005
* its publisher is one of: @"Test accounts"@, @"CrossRef Test Account"@
** Examples from https://api.crossref.org/works?query.publisher-name=%22Test%20accounts%22
*** 10.1007/bf00344543
*** 10.1007/bf00186154
*** 10.1306/64ed947a-1724-11d7-8645000102c1865d
* it has no author with a valid name, where valid means not blank and different from every string in this list: @List(",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na;")@
** Examples for blank authors:
*** 10.1108/00070709810247807
*** 10.1016/s1074-9098(02)00346-5
*** 10.1136/heart.88.1.6
** Examples for "none" author from https://api.crossref.org/works?query.author=%22none%22
*** 10.4007/annals.2016.184.3.11
*** 10.4007/annals.2012.176.1.6
*** 10.2172/6393585
** Examples for "test" author from https://api.crossref.org/works?query.author=%22test%22
*** 10.5116/ijme.54ca.a5ae
*** 10.5755/j01.ss.71.2.544
*** 10.5755/j01.ee.22.2.319
* it has @"Addie Jackson"@ as author and @"Elsevier BV"@ as publisher (empirically, these are test records)
** Examples from https://api.crossref.org/works?query.author=Addie+Jackson&query.publisher-name=%22Elsevier%20BV%22
*** 10.2139/ssrn.2082156
*** 10.2139/ssrn.2202300
*** 10.2139/ssrn.2255657
* its @type@ field is not one of the following values:
** @"book-section"@
** @"book"@
** @"book-chapter"@
** @"book-part"@
** @"book-series"@
** @"book-set"@
** @"book-track"@
** @"edited-book"@
** @"reference-book"@
** @"monograph"@
** @"journal-article"@
** @"dissertation"@
** @"other"@
** @"peer-review"@
** @"proceedings"@
** @"proceedings-article"@
** @"reference-entry"@
** @"report"@
** @"report-series"@
** @"standard"@
** @"standard-series"@
** @"posted-content"@
** @"dataset"@

Records with @type=dataset@ are mapped to OpenAIRE results of type dataset. All others are mapped to OpenAIRE results of type publication.
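
The filtering criteria above can be sketched as a single predicate over a Crossref record in its JSON form. This is not the DOIBoost implementation: the function names are hypothetical, and two details are assumptions of this sketch, namely that author names are compared case-insensitively as "given family", and that the publisher and the "Addie Jackson" checks are exact string matches.

```python
INVALID_AUTHOR_NAMES = {",", "none none", "none, none", "none &na;", "(:null)",
                        "test test test", "test test", "test", "&na; &na;"}
TEST_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}
ACCEPTED_TYPES = {
    "book-section", "book", "book-chapter", "book-part", "book-series", "book-set",
    "book-track", "edited-book", "reference-book", "monograph", "journal-article",
    "dissertation", "other", "peer-review", "proceedings", "proceedings-article",
    "reference-entry", "report", "report-series", "standard", "standard-series",
    "posted-content", "dataset",
}


def author_fullname(author):
    """Join the Crossref 'given' and 'family' fields, skipping missing parts."""
    parts = (author.get("given", ""), author.get("family", ""))
    return " ".join(p.strip() for p in parts if p and p.strip())


def has_valid_author(record):
    """True if at least one author name is neither blank nor a known placeholder."""
    for author in record.get("author", []):
        name = author_fullname(author).lower()
        if name and name not in INVALID_AUTHOR_NAMES:
            return True
    return False


def is_addie_jackson_test_record(record):
    names = {author_fullname(a) for a in record.get("author", [])}
    return "Addie Jackson" in names and record.get("publisher") == "Elsevier BV"


def keep_record(record):
    """Apply the five exclusion criteria in the order they are listed."""
    titles = record.get("title") or []
    if not any(t.strip() for t in titles):
        return False
    if record.get("publisher") in TEST_PUBLISHERS:
        return False
    if not has_valid_author(record):
        return False
    if is_addie_jackson_test_record(record):
        return False
    return record.get("type") in ACCEPTED_TYPES


def result_type(record):
    """Datasets stay datasets; every other accepted type becomes a publication."""
    return "dataset" if record.get("type") == "dataset" else "publication"


# Invented sample record.
sample = {"title": ["A study"], "publisher": "Some Press", "type": "journal-article",
          "author": [{"given": "Ada", "family": "Lovelace"}]}
```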

h3. 2 Mapping Crossref properties into the OpenAIRE Research Graph

Properties in OpenAIRE results are set based on the logic described in the following table:

|_.OpenAIRE Result field path|_.Crossref path(s)|_.Notes|
| id | doi | id in the form @doi_________::md5(doi)@ |
| dateofcollection | indexed.datetime | |
| lastupdatetimestamp | indexed.timestamp | |
| type | type | @dataset@ if the Crossref type is dataset, @publication@ otherwise (based on the logic [[DOIBoost#1-Filtering|described above]]) |
| originalId | doi, clinical-trial-number, alternative-id | |
| pid | | The scheme tells the type of PID, the value contains the actual value |
| pid.scheme | | Default value: doi |
| pid.value | doi | The DOI is normalised and lower-cased |
| maintitle | title | |
| subtitle | subtitle | |
| author | author | if available, the sequence is mapped to rank and the ORCID is also mapped |
| author.name | author.given | |
| author.surname | author.family | |
| author.fullname | author.given author.family | |
| author.rank | | based on the order, starts from 1 |
| author.pid | | only if the ORCID is available |
| author.pid.id.scheme | | Default 'orcid_pending' (meaning that the id is not confirmed by ORCID) |
| author.pid.id.value | author.ORCID | |
| author.pid.provenance.provenance | | Default 'Harvested' |
| author.pid.provenance.trust | | Default '0.9' |
| description | abstract | |
| subject | subject | with classid='keywords', i.e. no controlled vocabularies for Crossref subjects |
| publicationdate | issued.datetime or, if not available, created.datetime | |
| publisher | publisher | |
| source | source | only if the record is not of type @book@ |
| source | concatenation of @container-title.head@ "ISBN: " @ISBN.head@ | only if the record is of type @book@ |
| container | | It is set only for publications with information about the journal they were published in. |
| container.name | container-title.head | |
| container.issnOnline | issn-type.value | if issn-type.type='electronic' |
| container.issnPrinted | issn-type.value | if issn-type.type='print' |
| container.vol | volume | |
| container.sp | page | before '-' |
| container.ep | page | after '-' |
| instance | | One instance is created with the DOI URL |
| instance.accessright | | Values in @instance.accessright.code@ and @instance.accessright.label@ are set based on license and dateofacceptance:
- UNKNOWN: if the license is blank
- OPEN ACCESS: if the license is a CC license, an ACS license, or an APA license (considered OPEN also by Unpaywall, see "Unpaywall FAQ":https://support.unpaywall.org/support/solutions/articles/44002063718-what-is-an-oa-license- for details), or if it is an OUP license, but only once 12 months have passed since the publication date
- EMBARGO: OUP license, within 12 months of the publication date
- CLOSED: if there is a license not covered by the previous cases |
| instance.accessright.code | | Code from the "COAR vocabulary for access right":http://vocabularies.coar-repositories.org/documentation/access_rights/ |
| instance.accessright.label | | One of: OPEN, RESTRICTED, CLOSED, EMBARGO |
| instance.accessright.scheme | | Scheme that defines the code and label, i.e. the URL to the "COAR vocabulary for access right":http://vocabularies.coar-repositories.org/documentation/access_rights/ |
| instance.accessright.openAccessRoute | | only if instance.accessright.value = 'OPEN ACCESS'. Default is 'hybrid'. The route is corrected in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list. |
| instance.license | license.URL | If there is a @license.content-version='vor'@, then this is used. Otherwise the first license entry is used. |
| instance.pid | | The scheme tells the type of PID, the value contains the actual value |
| instance.pid.scheme | | Default value: doi |
| instance.pid.value | doi | The DOI is normalised and lower-cased |
| instance.publicationdate | issued.datetime or, if not available, created.datetime | |
| instance.refereed | | set to 'peerReviewed' only if @relation.has-review.id@ is not empty. UNKNOWN otherwise. |
| instance.type | subtype | mapped using the "OpenAIRE vocabulary for result typologies":https://api.openaire.eu/vocabularies/dnet:result_typologies |
| instance.url | doi | Full URL of the DOI |

All other fields of the [[Json_schema]] not mentioned in the table contain empty values.
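
The license selection and the access-right rule from the table can be sketched as follows. The wiki does not say how a license is recognised as CC, ACS, APA, or OUP, so the URL fragments in @LICENSE_FAMILIES@ are purely hypothetical, as are the function names. "12 months" is approximated as the same calendar day one year later.

```python
from datetime import date

# Hypothetical URL fragments used to recognise each license family (assumption).
LICENSE_FAMILIES = {
    "creativecommons.org": "cc",
    "pubs.acs.org": "acs",
    "apa.org": "apa",
    "academic.oup.com": "oup",
}


def pick_license(licenses):
    """Prefer the entry with content-version 'vor', otherwise take the first one."""
    for entry in licenses:
        if entry.get("content-version") == "vor":
            return entry.get("URL")
    return licenses[0].get("URL") if licenses else None


def license_family(license_url):
    url = license_url.lower()
    for fragment, family in LICENSE_FAMILIES.items():
        if fragment in url:
            return family
    return None


def one_year_after(day):
    try:
        return day.replace(year=day.year + 1)
    except ValueError:  # 29 February in a leap year
        return day.replace(year=day.year + 1, day=28)


def access_right(license_url, publication_date, today):
    """Return UNKNOWN, OPEN ACCESS, EMBARGO or CLOSED following the table above."""
    if not license_url or not license_url.strip():
        return "UNKNOWN"
    family = license_family(license_url)
    if family in {"cc", "acs", "apa"}:
        return "OPEN ACCESS"
    if family == "oup":
        return "OPEN ACCESS" if today >= one_year_after(publication_date) else "EMBARGO"
    return "CLOSED"


# Invented sample license list.
licenses = [
    {"URL": "https://example.org/tdm-license", "content-version": "tdm"},
    {"URL": "https://creativecommons.org/licenses/by/4.0/", "content-version": "vor"},
]
chosen = pick_license(licenses)
```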

All the records from Crossref are related to the datasource with @name=Crossref@ and @id=openaire____::081b82f96300b6a6e3d282bad31cb6e2@
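
As an illustration of the @id@, @instance.url@, and @container.sp@ / @container.ep@ rows of the mapping table, a minimal sketch follows. The exact DOI normalisation is not described on this page: trimming whitespace and lower-casing is an assumption, as is computing the MD5 on the normalised DOI. The function names and the sample DOI are hypothetical.

```python
import hashlib


def normalise_doi(doi):
    """Assumed normalisation: trim surrounding whitespace and lower-case."""
    return doi.strip().lower()


def openaire_id(doi):
    """Build an identifier of the form doi_________::md5(doi)."""
    digest = hashlib.md5(normalise_doi(doi).encode("utf-8")).hexdigest()
    return "doi_________::" + digest


def doi_url(doi):
    """Full URL of the DOI, used for the instance created from Crossref."""
    return "https://doi.org/" + normalise_doi(doi)


def split_pages(page):
    """Return (start page, end page): the text before and after the first '-'."""
    start, _, end = page.partition("-")
    return start.strip(), end.strip()


record_id = openaire_id(" 10.1000/ABC.123 ")
```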

Possible improvements:
* map @clinical-trial-number@ and @alternative-id@ to alternateIdentifiers?
* verify whether Crossref has a property for @language@, @country@, @container.issnLinking@, @container.iss@, @container.edition@, @container.conferenceplace@ and @container.conferencedate@
* use a different approach to set the @refereed@ field and improve its coverage?

h3. 2 Map Crossref links to projects/funders

Links to funding available in Crossref are mapped as funding relationships (@result@ -- @isProducedBy@ --> @project@) applying the following mapping:

|_.Funder|_.Grant code|_.Link to|
| DOI: {10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665} OR name: 'European Union’s Horizon 2020 research and innovation program' | series of 4-9 digits in @award@ | Link to H2020 project |
| DOI: {10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780} | series of 4-9 digits in @award@ | Link to FP7 project |
| DOI: 10.13039/501100000781 OR name: 'European Union's' | series of 4-9 digits in @award@ | Link to FP7 or H2020 project |
| DOI: 10.13039/100000001 | @award@ | Link to NSF project |
| DOI: 10.13039/501100001665 OR name: {'The French National Research Agency (ANR)', 'The French National Research Agency'} | @award@ | Link to ANR project |
| DOI: 10.13039/501100002341 | @award@ | Link to Academy of Finland project |
| DOI: 10.13039/501100001602 | @award@, removing the initial 'SFI' if present | Link to SFI project |
| DOI: 10.13039/501100000923 | @award@ | Link to ARC project |
| DOI: 10.13039/501100000038 | @award@ is ignored: the project codes in Crossref cannot be mapped to project codes in OpenAIRE | Link to NSERC (@unidentified@ project) |
| DOI: 10.13039/501100000155 | @award@ is ignored: the project codes in Crossref cannot be mapped to project codes in OpenAIRE | Link to SSHRC (@unidentified@ project) |
| DOI: 10.13039/501100000024 | @award@ is ignored: the project codes in Crossref cannot be mapped to project codes in OpenAIRE | Link to CIHR (@unidentified@ project) |
| DOI: 10.13039/501100002848 OR name: 'CONICYT, Programa de Formación de Capital Humano Avanzado' | @award@ | Link to CONICYT project |
| DOI: 10.13039/501100003448 | series of 4-9 digits in @award@ | Link to GSRT project |
| DOI: 10.13039/501100010198 | @award@ | Link to SGOV project |
| DOI: 10.13039/501100004564 | series of 4-9 digits in @award@ | Link to MESTD project |
| DOI: 10.13039/501100003407 | @award@ | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (@unidentified@ project) is also generated |
| DOI: {10.13039/501100006588, 10.13039/501100004488} | @award@, removing the 'Project No' and 'HRZZ' prefixes, if present | Link to HRZZ or MZOS project |
| DOI: 10.13039/501100006769 | @award@ | Link to Russian Science Foundation project |
| DOI: 10.13039/501100001711 | @award@ after '_' and before '/' | Link to SNSF project |
| DOI: 10.13039/501100004410 | @award@ | Link to TUBITAK project |
| DOI: 10.13039/100004440 OR name: 'Wellcome Trust Masters Fellowship' | @award@ | Link to Wellcome Trust specific project and to the @unidentified@ project. |
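
The grant-code rules in the table above can be sketched with a few small helpers. The helper names and the sample award strings are invented. Two behaviours are assumptions of this sketch: a run of more than 9 digits is rejected rather than truncated, and an SNSF award without '_' is used from its start.

```python
import re

# A run of 4 to 9 digits that is not part of a longer digit run (assumption).
DIGIT_CODE = re.compile(r"(?<!\d)\d{4,9}(?!\d)")


def digit_code(award):
    """First series of 4-9 digits in the award (H2020, FP7, GSRT, MESTD rows)."""
    match = DIGIT_CODE.search(award)
    return match.group(0) if match else None


def sfi_code(award):
    """Award without the initial 'SFI', if present."""
    award = award.strip()
    return award[3:].strip() if award.startswith("SFI") else award


def hrzz_code(award):
    """Award without the 'Project No' and 'HRZZ' prefixes, if present."""
    award = award.strip()
    for prefix in ("Project No", "HRZZ"):
        if award.startswith(prefix):
            award = award[len(prefix):].strip()
    return award


def snsf_code(award):
    """Part of the award after '_' and before '/'."""
    tail = award.split("_", 1)[-1]
    return tail.split("/", 1)[0].strip()


code = digit_code("grant agreement No 654321")
```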

h3. 3 Intersect Crossref with Unpaywall by DOI (DOIBoost1)

The fields we consider from Unpaywall are:
* @is_oa@
* @best_oa_location@
* @oa_status@

The Crossref results that intersect by DOI with Unpaywall records are enriched with one additional @instance@ with the following properties:

|_.OpenAIRE Result field path |_.Unpaywall field path |_.Notes |
| instance | | created only if @is_oa@ is true and a @best_oa_location@ is available |
| instance.accessright | | default value Open Access: we do not add instances if Unpaywall says there is no open version |
| instance.accessright.code | | Open Access code from the "COAR vocabulary for access right":http://vocabularies.coar-repositories.org/documentation/access_rights/ |
| instance.accessright.label | | Always OPEN |
| instance.accessright.scheme | | Scheme that defines the code and label, i.e. the URL to the "COAR vocabulary for access right":http://vocabularies.coar-repositories.org/documentation/access_rights/ |
| instance.accessright.openAccessRoute | @oa_status@ | |
| instance.url | @best_oa_location@ | |
| instance.license | @best_oa_location.license@ | |
| instance.pid | | The scheme tells the type of PID, the value contains the actual value |
| instance.pid.scheme | | Default value: doi |
| instance.pid.value | doi | The DOI is normalised and lower-cased |

For the definition of Unpaywall's @oa_status@ refer to the "Unpaywall FAQ":https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-

The record will also feature a relation to the Unpaywall data source: @name="UnpayWall"@, @id=openaire____::8ac8380272269217cb09a928c8caa993@.
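
A minimal sketch of how the additional instance could be derived from an Unpaywall record, following the table above. The function name and the flat shape of the returned dictionary are hypothetical. Reading the URL from @best_oa_location.url@ is an assumption, since the table only names @best_oa_location@.

```python
def unpaywall_instance(doi, unpaywall_record):
    """Return the extra Open Access instance, or None if Unpaywall reports no open version."""
    location = unpaywall_record.get("best_oa_location")
    if not unpaywall_record.get("is_oa") or not location:
        return None
    return {
        "accessright": {
            "label": "OPEN",
            "openAccessRoute": unpaywall_record.get("oa_status"),
        },
        "url": location.get("url"),  # assumed sub-field of best_oa_location
        "license": location.get("license"),
        "pid": {"scheme": "doi", "value": doi.strip().lower()},
    }


# Invented sample records.
open_record = {
    "is_oa": True,
    "oa_status": "green",
    "best_oa_location": {"url": "https://repository.example.org/item/1", "license": "cc-by"},
}
closed_record = {"is_oa": False, "oa_status": "closed", "best_oa_location": None}

instance = unpaywall_instance("10.1000/ABC", open_record)
```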

h3. 4 Intersect DOIBoost1 with ORCID (DOIBoost2)

The fields we consider from ORCID are:
* @doi@
* @authors@, a list of authors, each with optional @name@, @surname@, @creditName@, @oid@

|_.OpenAIRE field path|_.ORCID path|_.Notes|
| pid | doi | |
| author.name | capitalize(name) | only mapped if not blank |
| author.surname | capitalize(surname) | only mapped if not blank |
| author.fullname | | if name and surname are not blank, they are concatenated (capitalize(name) capitalize(surname)), otherwise we use the creditName |
| author.pid | | only if the ORCID is available |
| author.pid.id.scheme | | Default 'orcid' (meaning that it is confirmed by ORCID, in contrast to the 'orcid_pending' set from Crossref and Unpaywall) |
| author.pid.id.value | oid | |
| author.pid.provenance.provenance | | Default 'Harvested' |
| author.pid.provenance.trust | | Default '0.9' |

The records are enriched with the ORCID identifiers of their authors.

TODO: Update with the new approach implemented by Miriam.

The current approach is:
* if the number of authors from Crossref equals the number of authors from ORCID, we pick the list of authors with more PIDs and try to enrich it with the PIDs from the other list, based on the Jaro-Winkler distance between authors' names, surnames, or fullnames, depending on which properties are available;
* if the numbers of authors differ, we take the longer list and try to enrich it with the PIDs from the other list, based on the Jaro-Winkler distance between authors' names, surnames, or fullnames, depending on which properties are available.
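
The current approach can be sketched as follows. This is an illustration, not the DOIBoost code. It compares full names only, uses a hypothetical similarity threshold of 0.9, and prefers the Crossref list when both lists carry the same number of PIDs. The wiki specifies none of these three choices. The Jaro-Winkler similarity is implemented in its standard form (prefix scale 0.1, prefix length up to 4).

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4 chars)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


def count_pids(authors):
    return sum(1 for a in authors if a.get("pid"))


def merge_authors(crossref_authors, orcid_authors, threshold=0.9):
    """Pick the base list as described above and copy PIDs from the other list."""
    if len(crossref_authors) == len(orcid_authors):
        orcid_first = count_pids(orcid_authors) > count_pids(crossref_authors)
    else:
        orcid_first = len(orcid_authors) > len(crossref_authors)
    base, other = (orcid_authors, crossref_authors) if orcid_first else (crossref_authors, orcid_authors)
    candidates = [o for o in other if o.get("pid")]
    merged = []
    for author in base:
        author = dict(author)
        if not author.get("pid") and candidates:
            name = author["fullname"].lower()
            best = max(candidates, key=lambda o: jaro_winkler(name, o["fullname"].lower()))
            if jaro_winkler(name, best["fullname"].lower()) >= threshold:
                author["pid"] = best["pid"]
        merged.append(author)
    return merged


# Invented sample data with a placeholder identifier.
crossref_authors = [{"fullname": "Ada Lovelace"}, {"fullname": "Charles Babbage"}]
orcid_authors = [{"fullname": "Ada Lovelace", "pid": "0000-0000-0000-0000"}]
merged = merge_authors(crossref_authors, orcid_authors)
```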

Miriam will modify the process to ensure that:
* the list of authors from Crossref always "wins"
* the identifiers from ORCID "win"

h3. 5 Intersect DOIBoost2 with Microsoft Academic Graph (DOIBoost3)

*Important Notes*
* Only papers with a DOI are considered
* Since the same DOI can be associated with multiple versions of an item, each with a different MAG PaperId, we take only one per DOI (the last one we process). We call this dataset @Papers_distinct@
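
The construction of @Papers_distinct@ can be sketched as a "last one wins" grouping by DOI. The function name and the sample rows are hypothetical, and lower-casing the DOI before grouping is an assumption of this sketch.

```python
def papers_distinct(papers):
    """Keep one MAG paper per DOI: a later row overwrites an earlier one."""
    by_doi = {}
    for paper in papers:
        doi = paper.get("Doi")
        if not doi or not doi.strip():
            continue  # only papers with a DOI are considered
        by_doi[doi.strip().lower()] = paper
    return by_doi


# Invented sample rows: two versions share a DOI, one row has no DOI.
rows = [
    {"PaperId": 1, "Doi": "10.1000/ABC"},
    {"PaperId": 2, "Doi": "10.1000/abc"},
    {"PaperId": 3, "Doi": None},
]
distinct = papers_distinct(rows)
```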

When mapping MAG records to the OpenAIRE Research Graph, we consider the following MAG tables:
* @PaperAbstractsInvertedIndex@: for the paper abstracts
* @Authors@: for the authors. The MAG data is pre-processed by grouping authors by PaperId
* @Affiliations@ and @PaperAuthorAffiliations@: to generate links between publications and organisations
* @Journals@ and @ConferenceInstances@: joined with @Papers_distinct@ to obtain the information about the venues where the paper was published
* [TO BE REMOVED] @PaperUrls@: to create one instance for the OpenAIRE publication
* @FieldsOfStudy@: to add subjects

The records are enriched with:
* abstracts
* MAG identifiers of authors
* affiliation relationships
* subjects (MAG FieldsOfStudy)
* conference or journal information (in the @journal@ field) TODO: or @container@, in case of the dump?
* [TO BE REMOVED] instances with URL from MAG

TODO: ensure we use the field names of the public dump

|_.OpenAIRE path|_.MAG table|_.MAG path(s)|_.Notes|
| pid | Papers_distinct | Doi | |
| pid.scheme | | | Default value: doi |
| pid.value | | doi | The DOI is normalised and lower-cased |
| originalId | Papers_distinct | PaperId | |
| title | Papers_distinct | PaperTitle | as main title |
| title | Papers_distinct | OriginalTitle | as alternative title |
| source | Papers_distinct | BookTitle | |
| dateofacceptance | Papers_distinct | Date | first 10 chars, if not blank |
| publisher | Papers_distinct, Journal | Publisher | |
| description | PaperAbstractsInvertedIndex | IndexedAbstract | |
| container | ConferenceInstances, Journals | | |
| container.name | | DisplayName | |
| container.conferencePlace | | Location | |
| container.conferenceDate | | StartDate and EndDate | Value created as the concatenation of the first 10 chars of each (separated by '-'), if both are not blank |
| container.sp | | FirstPage | |
| container.ep | | EndPage | |
| container.issnPrinted | | Issn | |
| container.vol | Papers_distinct | Volume | |
| container.iss | Papers_distinct | Issue | |
| subject | FieldsOfStudy | subjects | All subjects from MAG are set with a dedicated marker in the classname/classid 'Microsoft Academic Graph Classification'/MAG.
We create one subject per DisplayName, per MainType and, if the MainType is in the format @x.y@, one subject also for the first token (i.e. @x@) |
| subject.value | | subjects.DisplayName, subjects.MainType, split(subjects.MainType, '.').head | All subjects from MAG are set with a dedicated marker in the classname/classid 'Microsoft Academic Graph Classification'/MAG |
| author | Authors, PaperAuthorAffiliations | | |
| author.rank | | sequenceNumber | |
| author.fullName | | DisplayName | if not blank |
| author.affiliation | | affiliation | if not null |
| author.pid | | AuthorId | MAG id of the author as URL. TO BE REMOVED? |
| instance | PaperUrls | | TO BE REMOVED. Currently maps to the MAG URL and to any URLs in SourceUrl |
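
Two of the less obvious rules in the table, the subject derivation and the @container.conferenceDate@ value, can be sketched as follows. The function names, the flat dictionary shape, and the sample values are hypothetical. Only the rules themselves come from the table.

```python
MAG_CLASSID = "MAG"
MAG_CLASSNAME = "Microsoft Academic Graph Classification"


def mag_subjects(field_of_study):
    """One subject per DisplayName, per MainType and, for 'x.y' MainTypes, one for 'x'."""
    values = [field_of_study["DisplayName"]]
    main_type = field_of_study.get("MainType")
    if main_type:
        values.append(main_type)
        if "." in main_type:
            values.append(main_type.split(".")[0])
    return [{"value": v, "classid": MAG_CLASSID, "classname": MAG_CLASSNAME} for v in values]


def conference_date(start_date, end_date):
    """First 10 chars of StartDate and EndDate joined by '-', if both are not blank."""
    if start_date and start_date.strip() and end_date and end_date.strip():
        return f"{start_date[:10]}-{end_date[:10]}"
    return None


# Invented sample field of study.
subjects = mag_subjects({"DisplayName": "Data mining", "MainType": "technology.software"})
```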

h3. 6 Enrich DOIBoost3 with hosting data sources (@hostedby@) and access right information

In this phase, we intersect DOIBoost3 with a dataset composed of journals from OpenAIRE, Crossref, and the ISSN gold list. Each journal comes with its International Standard Serial Numbers (issn, eissn, lissn) and, when available, a flag that tells whether the journal is open access. The intersection is based on the International Standard Serial Numbers. The records with a matching @journal.[l|e]issn@ are enriched as follows:
* Each instance gains the @hostedby@ information corresponding to the journal
* If the journal is open access, the access rights of the instances are also set to "Open Access" with "gold" route (because, by construction, the journals we know to be open are from DOAJ or the Gold ISSN list)

The hostedby of records that do not match is set to the "Unknown Repository".
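
The hostedby enrichment can be sketched as a lookup in an index keyed by ISSN. The function names, the @name@ and @open_access@ keys of a journal entry, the flat instance layout, and the sample ISSN are hypothetical. Comparing ISSNs case-insensitively is an assumption.

```python
UNKNOWN_REPOSITORY = "Unknown Repository"
ISSN_KEYS = ("issn", "eissn", "lissn")


def build_issn_index(journals):
    """Map every known ISSN variant of a journal to the journal entry."""
    index = {}
    for journal in journals:
        for key in ISSN_KEYS:
            value = journal.get(key)
            if value:
                index[value.upper()] = journal
    return index


def apply_hostedby(record, issn_index):
    """Set hostedby (and gold Open Access, when applicable) on every instance."""
    container = record.get("journal") or {}
    issns = [container.get(key) for key in ISSN_KEYS]
    journal = next((issn_index[i.upper()] for i in issns if i and i.upper() in issn_index), None)
    for instance in record.get("instance", []):
        if journal is None:
            instance["hostedby"] = UNKNOWN_REPOSITORY
            continue
        instance["hostedby"] = journal["name"]
        if journal.get("open_access"):
            instance["accessright"] = "Open Access"
            instance["openAccessRoute"] = "gold"
    return record


# Invented sample data.
index = build_issn_index([{"name": "Journal of Examples", "issn": "1234-5678", "open_access": True}])
matched = apply_hostedby({"journal": {"eissn": "1234-5678"}, "instance": [{}]}, index)
unmatched = apply_hostedby({"journal": {"issn": "0000-0000"}, "instance": [{}]}, index)
```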