DOIBoost » History » Version 15

Alessia Bardi, 10/11/2021 03:49 PM

h1. DOIBoost

h2. DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID

The idea behind DOIBoost and its origin are described in the following paper (and related resources):

* La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11. Open Access version available at: https://doi.org/10.5281/zenodo.1441071

In short, the goal is to enrich the records available in Crossref with the information available in Unpaywall, Microsoft Academic Graph, and ORCID, intersecting all of these datasets by DOI.
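
As a rough illustration of what "intersecting by DOI" means, the following Python sketch joins toy records on their DOI. The record layout, the @normalize_doi@ helper, and the trim-and-lowercase rule are assumptions made for this example; they are not taken from the actual DOIBoost implementation.

```python
def normalize_doi(doi):
    """Hypothetical normalization: trim whitespace and lowercase the DOI."""
    return doi.strip().lower()


def intersect_by_doi(crossref_records, other_records):
    """Yield (crossref_record, other_record) pairs that share the same DOI."""
    other_by_doi = {normalize_doi(r["doi"]): r for r in other_records}
    for record in crossref_records:
        match = other_by_doi.get(normalize_doi(record["doi"]))
        if match is not None:
            yield record, match


# Toy data: only the first Crossref record has a counterpart in the other dataset.
crossref = [
    {"doi": "10.1000/ABC", "title": "A"},
    {"doi": "10.1000/xyz", "title": "B"},
]
unpaywall = [{"doi": "10.1000/abc", "is_oa": True}]
pairs = list(intersect_by_doi(crossref, unpaywall))
```

Crossref records without a counterpart are simply not paired here; in DOIBoost they remain in the output without the additional information.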

h3. Inputs

* *Crossref*: dump available to Crossref subscribers via the Metadata Plus service, updated once a month.
* *Microsoft Academic Graph*: version downloaded on 2021-02-15. We plan to take a final version in December 2021, before MAG is retired.
* *ORCID*: baseline dump obtained on 13.10.2020, updated every week from the ORCID API available at URL
* *Unpaywall*: public database snapshot downloaded in March 2021. Unpaywall updates it twice a year (https://unpaywall.org/products/snapshot).

The generation of DOIBoost consists of the following phases:

h3. 1 Filter out Crossref records that

* have a blank title
* have one of the following publishers: "Test accounts", "CrossRef Test Account"
* have no authors with valid names, where valid means: not blank and different from all strings in this list: @List(",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na;")@
* have "Addie Jackson" as author and "Elsevier BV" as publisher (empirically, these are test records)
* do not have one of the following values in the field @type@:
** "book-section"
** "book"
** "book-chapter"
** "book-part"
** "book-series"
** "book-set"
** "book-track"
** "edited-book"
** "reference-book"
** "monograph"
** "journal-article"
** "dissertation"
** "other"
** "peer-review"
** "proceedings"
** "proceedings-article"
** "reference-entry"
** "report"
** "report-series"
** "standard"
** "standard-series"
** "posted-content"
** "dataset"

Records with @type=dataset@ are mapped to OpenAIRE results of type dataset. All others are mapped to OpenAIRE results of type publication.
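
The filtering rules above can be sketched as a single predicate. This is a minimal Python illustration, not the production code: the flat record layout (@title@, @publisher@, @authors@ as a list of full-name strings, @type@), the function names, and the case-insensitive comparison of author names are all assumptions.

```python
INVALID_AUTHOR_NAMES = {
    ",", "none none", "none, none", "none &na;", "(:null)",
    "test test test", "test test", "test", "&na; &na;",
}
TEST_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}
ACCEPTED_TYPES = {
    "book-section", "book", "book-chapter", "book-part", "book-series",
    "book-set", "book-track", "edited-book", "reference-book", "monograph",
    "journal-article", "dissertation", "other", "peer-review", "proceedings",
    "proceedings-article", "reference-entry", "report", "report-series",
    "standard", "standard-series", "posted-content", "dataset",
}


def has_valid_author(record):
    """True if at least one author name is non-blank and not a known placeholder."""
    for name in record.get("authors", []):
        cleaned = name.strip()
        # Assumption: placeholders are compared case-insensitively.
        if cleaned and cleaned.lower() not in INVALID_AUTHOR_NAMES:
            return True
    return False


def keep_record(record):
    """Return True if the Crossref record survives all the filters of phase 1."""
    authors = record.get("authors", [])
    publisher = record.get("publisher")
    if not (record.get("title") or "").strip():
        return False
    if publisher in TEST_PUBLISHERS:
        return False
    if not has_valid_author(record):
        return False
    if "Addie Jackson" in authors and publisher == "Elsevier BV":
        return False
    return record.get("type") in ACCEPTED_TYPES


def openaire_result_type(record):
    """Datasets stay datasets; every other accepted type becomes a publication."""
    return "dataset" if record["type"] == "dataset" else "publication"
```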

h3. 2 Map Crossref links to projects/funders

Links to funding available in Crossref are mapped as funding relationships (@result@ -- @isProducedBy@ --> @project@) by applying the following mapping:

| *funder* | *grant code* | *Link to* |
| DOI: {10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665} OR name: 'European Union’s Horizon 2020 research and innovation program' | series of 4-9 digits in @award@ | Link to H2020 project |
| DOI: {10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780} | series of 4-9 digits in @award@ | Link to FP7 project |
| DOI: 10.13039/501100000781 OR name: 'European Union's' | series of 4-9 digits in @award@ | Link to FP7 or H2020 project |
| DOI: 10.13039/100000001 | @award@ | Link to NSF project |
| DOI: 10.13039/501100001665 OR name: {'The French National Research Agency (ANR)', 'The French National Research Agency'} | @award@ | Link to ANR project |
| DOI: 10.13039/501100002341 | @award@ | Link to Academy of Finland project |
| DOI: 10.13039/501100001602 | @award@, removing the initial 'SFI' if present | Link to SFI project |
| DOI: 10.13039/501100000923 | @award@ | Link to ARC project |
| DOI: 10.13039/501100000038 | @award@ is ignored: the project codes in Crossref cannot be mapped to project codes in OpenAIRE | Link to NSERC (@unidentified@ project) |
| DOI: 10.13039/501100000155 | @award@ is ignored: the project codes in Crossref cannot be mapped to project codes in OpenAIRE | Link to SSHRC (@unidentified@ project) |
| DOI: 10.13039/501100000024 | @award@ is ignored: the project codes in Crossref cannot be mapped to project codes in OpenAIRE | Link to CIHR (@unidentified@ project) |
| DOI: 10.13039/501100002848 OR name: 'CONICYT, Programa de Formación de Capital Humano Avanzado' | @award@ | Link to CONICYT project |
| DOI: 10.13039/501100003448 | series of 4-9 digits in @award@ | Link to GSRT project |
| DOI: 10.13039/501100010198 | @award@ | Link to SGOV project |
| DOI: 10.13039/501100004564 | series of 4-9 digits in @award@ | Link to MESTD project |
| DOI: 10.13039/501100003407 | @award@ | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (@unidentified@ project) is also generated |
| DOI: {10.13039/501100006588, 10.13039/501100004488} | @award@, removing the 'Project No' and 'HRZZ' prefixes, if present | Link to HRZZ or MZOS project |
| DOI: 10.13039/501100006769 | @award@ | Link to Russian Science Foundation project |
| DOI: 10.13039/501100001711 | @award@ after '_' and before '/' | Link to SNSF project |
| DOI: 10.13039/501100004410 | @award@ | Link to TUBITAK project |
| DOI: 10.13039/100004440 OR name: 'Wellcome Trust Masters Fellowship' | @award@ | Link to Wellcome Trust specific project and to the @unidentified@ project |
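
To make the "grant code" column more concrete, here is a small Python sketch of three of the extraction rules (digit series, SFI prefix removal, SNSF code between '_' and '/'). The function names, the dispatch table, and the exact regular expression are assumptions for illustration: in particular, the sketch assumes that the digit series must not be part of a longer run of digits, which the table does not state.

```python
import re

# Assumption: a "series of 4-9 digits" is a run of digits not embedded in a longer run.
DIGIT_RUN = re.compile(r"(?<!\d)\d{4,9}(?!\d)")


def digit_series(award):
    """Return the first run of 4 to 9 digits found in the award string, or None."""
    match = DIGIT_RUN.search(award)
    return match.group(0) if match else None


def sfi_code(award):
    """Drop an initial 'SFI' prefix, if present, and surrounding whitespace."""
    award = award.strip()
    return award[3:].strip() if award.startswith("SFI") else award


def snsf_code(award):
    """Keep the part of the award after the first '_' and before the next '/'."""
    after_underscore = award.split("_", 1)[-1]
    return after_underscore.split("/", 1)[0].strip()


# Hypothetical dispatch table covering a few rows of the mapping above.
EXTRACTORS = {
    "10.13039/501100003448": digit_series,  # GSRT
    "10.13039/501100001602": sfi_code,      # SFI
    "10.13039/501100001711": snsf_code,     # SNSF
    "10.13039/100000001": str.strip,        # NSF: award used as is
}


def project_code(funder_doi, award):
    """Return the project code to look up, or None if the funder is not mapped."""
    extractor = EXTRACTORS.get(funder_doi)
    return extractor(award) if extractor else None
```

Note that a plain digit-series rule picks the first qualifying run, so an award such as "H2020 654024" would yield "2020"; how the real mapping handles such cases is not described here.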

h3. 3 Intersect Crossref with Unpaywall by DOI (DOIBoost1)

The fields we consider from Unpaywall are:

* @is_oa@
* @best_oa_location@
* @oa_status@

The Crossref records that intersect by DOI with Unpaywall records are enriched with:

TODO: ensure we refer to json fields of the public dump

|_.OpenAIRE Result field path |_.Unpaywall field path |_.Notes |
| result.instance | | created only if @is_oa@ is true and a @best_oa_location@ is available |
| result.instance.collectedfrom.name | | default value "Unpaywall" |
| result.instance.collectedfrom.id | | default value TODO |
| result.instance.url | @best_oa_location@ | |
| result.instance.license | @best_oa_location.license@ | |
| result.instance.pid | @doi@ | |
| result.instance.accessright | | default value "Open Access": we do not add instances if Unpaywall says there is no open version |
| result.instance.accessright.route | @oa_status@ | |

For the definition of Unpaywall's @oa_status@, refer to the "Unpaywall FAQ":https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-
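
The mapping in the table can be sketched in Python as follows. The output is a plain dictionary that only mirrors the field paths listed above; the @label@ key, the use of @best_oa_location.url@ for the instance URL, and the function name are assumptions, and @collectedfrom.id@ is left out because its value is still marked as TODO.

```python
def unpaywall_instance(unpaywall_record):
    """Build an instance from an Unpaywall record, or None if there is no open version."""
    location = unpaywall_record.get("best_oa_location")
    if not unpaywall_record.get("is_oa") or not location:
        # No instance is added when Unpaywall reports no open access version.
        return None
    return {
        "collectedfrom": {"name": "Unpaywall"},
        # Assumption: the URL is read from the "url" key of best_oa_location.
        "url": location.get("url"),
        "license": location.get("license"),
        "pid": unpaywall_record.get("doi"),
        "accessright": {
            "label": "Open Access",
            "route": unpaywall_record.get("oa_status"),
        },
    }
```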

h3. 4 Intersect DOIBoost1 with ORCID (DOIBoost2)

The records are enriched with the ORCID identifiers of their authors.

* If the number of authors from Crossref equals the number of authors from ORCID, we pick the list of authors with more PIDs and try to enrich it with the PIDs from the other list, based on the Jaro-Winkler distance between authors' names, surnames, or full names, depending on which properties are available.
* If the numbers of authors differ, we take the longer list and try to enrich it with the PIDs from the other author list, based on the Jaro-Winkler distance between authors' names, surnames, or full names, depending on which properties are available.
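
The author matching described above can be sketched in Python as follows. This is a simplified illustration under several assumptions: authors are dictionaries with a @fullname@ and an optional @orcid@ key, only full names are compared (the real process also uses names and surnames when available), and the 0.9 similarity threshold is a hypothetical value.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro-Winkler similarity: Jaro boosted by a common prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def count_pids(authors):
    return sum(1 for a in authors if a.get("orcid"))


def pick_base(first, second):
    """Same size: the list with more PIDs is the base; otherwise the longer list."""
    if len(first) == len(second):
        return (first, second) if count_pids(first) >= count_pids(second) else (second, first)
    return (first, second) if len(first) > len(second) else (second, first)


def similarity(a, b):
    return jaro_winkler(a["fullname"].lower(), b["fullname"].lower())


def enrich_with_pids(crossref_authors, orcid_authors, threshold=0.9):
    """Copy ORCID iDs onto the base list for authors whose names are similar enough."""
    base, other = pick_base(crossref_authors, orcid_authors)
    candidates = [o for o in other if o.get("orcid")]
    enriched = []
    for author in base:
        author = dict(author)  # do not modify the input records
        if not author.get("orcid"):
            best = max(candidates, key=lambda o: similarity(author, o), default=None)
            if best is not None and similarity(author, best) >= threshold:
                author["orcid"] = best["orcid"]
        enriched.append(author)
    return enriched
```

The sketch returns a similarity (1.0 for identical strings) rather than a distance; the corresponding distance is one minus this value.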

h3. 5 Intersect DOIBoost2 with Microsoft Academic Graph (DOIBoost3)

The records are enriched with:

* abstracts
* MAG identifiers of authors
* affiliation relationships
* subjects (MAG FieldsOfStudy)
* conference or journal information (in the @journal@ field) TODO: or @container@, in case of the dump?
* [TO BE REMOVED] instances with URL from MAG

h3. 6 Enrich DOIBoost3 with hosting data sources (@hostedby@) and access right information

In this phase we intersect DOIBoost3 with a dataset composed of journals from OpenAIRE, Crossref, and the ISSN gold list. Each journal comes with its International Standard Serial Numbers (issn, eissn, lissn) and, when available, a flag that tells whether the journal is open access. The intersection is based on the International Standard Serial Numbers. The records with a matching @journal.[l|e]issn@ are enriched as follows:

* Each instance gains the @hostedby@ information.
* If the journal is open access, the access rights of the instances are also set to "Open Access" with "gold" route.

The @hostedby@ of records that do not match is set to "Unknown Repository".
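
A minimal Python sketch of this last phase is shown below. The dictionary layout of journals and records, the @name@ and @open_access@ keys, and the function names are assumptions introduced for the example only.

```python
UNKNOWN_REPOSITORY = "Unknown Repository"
ISSN_KEYS = ("issn", "eissn", "lissn")


def build_issn_index(journals):
    """Index the journal dataset by every ISSN variant each journal declares."""
    index = {}
    for journal in journals:
        for key in ISSN_KEYS:
            if journal.get(key):
                index[journal[key]] = journal
    return index


def enrich_hostedby(record, issn_index):
    """Set hostedby (and gold Open Access, when applicable) on every instance."""
    journal_info = record.get("journal") or {}
    journal = None
    for key in ISSN_KEYS:
        journal = issn_index.get(journal_info.get(key))
        if journal:
            break
    for instance in record.get("instances", []):
        if journal is None:
            instance["hostedby"] = UNKNOWN_REPOSITORY
            continue
        instance["hostedby"] = journal["name"]
        if journal.get("open_access"):
            instance["accessright"] = {"label": "Open Access", "route": "gold"}
    return record
```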