DOIBoost » History » Version 26
Claudio Atzori, 11/11/2021 09:52 AM
h1. DOIBoost

h2. DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID

The idea behind DOIBoost and its origin are described in the following paper (and related resources):

* La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11. Open Access version available at: https://doi.org/10.5281/zenodo.1441071

In short, the goal is to enrich the records available in Crossref with the information available in Unpaywall, Microsoft Academic Graph, and ORCID, intersecting all those datasets by DOI.

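The intersection by DOI can be pictured as a left join that keeps every Crossref record and adds fields from the other dataset when the DOIs match. The following is only a minimal sketch: the function names, the record layout (plain dictionaries with a @doi@ key) and the exact normalisation rules are assumptions made for illustration, not the actual DOIBoost code.

```python
def normalize_doi(doi):
    """Lower-case a DOI and strip a resolver prefix.

    The exact normalisation rules are an assumption made for this sketch.
    """
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def enrich_by_doi(crossref_records, other_records):
    """Left join on DOI: every Crossref record is kept, matches add fields."""
    by_doi = {normalize_doi(r["doi"]): r for r in other_records}
    enriched = []
    for record in crossref_records:
        merged = dict(record)
        match = by_doi.get(normalize_doi(record["doi"]))
        if match is not None:
            # Only add fields that the Crossref record does not already have.
            for key, value in match.items():
                merged.setdefault(key, value)
        enriched.append(merged)
    return enriched
```

Records without a match are returned unchanged, which mirrors the idea that Crossref is the base dataset being "boosted".
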
h3. Inputs

* *Crossref*: dump available to Crossref subscribers via the MetadataPlus service, updated once a month.
* *Microsoft Academic Graph*: version downloaded on 2021-02-15. We plan to take the latest version in December 2021, before MAG is retired.
* *ORCID*: baseline dump obtained on 2020-10-13, updated every week from the "ORCID public API":https://info.orcid.org/documentation/features/public-api/
* *Unpaywall*: public database snapshot downloaded in March 2021. Unpaywall updates it twice a year (https://unpaywall.org/products/snapshot)

The generation of DOIBoost consists of the following phases:

h3. 1 Filtering

Crossref records are discarded if they meet any of the following criteria:

* have a blank title
* have one of the following publishers: @"Test accounts"@, @"CrossRef Test Account"@
* have no authors with valid names, where valid means: not blank and different from all strings in this list: @List(",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na;")@
* have @"Addie Jackson"@ as author and @"Elsevier BV"@ as publisher (empirically, these are test records)
* do not have one of the following values in the @type@ field:
** @"book-section"@
** @"book"@
** @"book-chapter"@
** @"book-part"@
** @"book-series"@
** @"book-set"@
** @"book-track"@
** @"edited-book"@
** @"reference-book"@
** @"monograph"@
** @"journal-article"@
** @"dissertation"@
** @"other"@
** @"peer-review"@
** @"proceedings"@
** @"proceedings-article"@
** @"reference-entry"@
** @"report"@
** @"report-series"@
** @"standard"@
** @"standard-series"@
** @"posted-content"@
** @"dataset"@

Records with @type=dataset@ are mapped into OpenAIRE results of type dataset. All others are mapped as OpenAIRE results of type publication.

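The filtering criteria above can be sketched as a single predicate over a Crossref JSON record. This is an illustrative sketch, not the production code: it assumes the record is a dictionary with @title@ as a list of strings and @author@ as a list of objects with @given@ and @family@ keys; the way the full name is composed and the case-insensitive comparison are assumptions.

```python
INVALID_AUTHOR_NAMES = {
    ",", "none none", "none, none", "none &na;", "(:null)",
    "test test test", "test test", "test", "&na; &na;",
}
TEST_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}
ACCEPTED_TYPES = {
    "book-section", "book", "book-chapter", "book-part", "book-series",
    "book-set", "book-track", "edited-book", "reference-book", "monograph",
    "journal-article", "dissertation", "other", "peer-review", "proceedings",
    "proceedings-article", "reference-entry", "report", "report-series",
    "standard", "standard-series", "posted-content", "dataset",
}


def full_name(author):
    """Join given and family name (assumed composition of the full name)."""
    parts = [author.get("given"), author.get("family")]
    return " ".join(p.strip() for p in parts if p and p.strip())


def is_valid_name(name):
    # Case-insensitive comparison is an assumption of this sketch.
    return name != "" and name.lower() not in INVALID_AUTHOR_NAMES


def keep_record(record):
    """Return True if the Crossref record survives the filtering phase."""
    if not any(t and t.strip() for t in record.get("title") or []):
        return False  # blank title
    publisher = record.get("publisher") or ""
    if publisher in TEST_PUBLISHERS:
        return False
    names = [full_name(a) for a in record.get("author") or []]
    if not any(is_valid_name(n) for n in names):
        return False  # no author with a valid name
    if publisher == "Elsevier BV" and "Addie Jackson" in names:
        return False  # empirically identified test records
    return record.get("type") in ACCEPTED_TYPES


def result_type(record):
    """Datasets map to OpenAIRE datasets, everything else to publications."""
    return "dataset" if record.get("type") == "dataset" else "publication"
```
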
h4. Mapping Crossref properties into the OpenAIRE Research Graph

Properties in OpenAIRE results are set according to the logic described in the following table:

TODO: ensure we use the field names of the public dump

|_.OpenAIRE Result field path|_.Crossref path(s)|_.Notes|
| pid | doi, clinical-trial-number, alternative-id | the DOI is normalised and lower-cased |
| dateofcollection | indexed.datetime | |
| collectedfrom.name | | Default value "Crossref" |
| collectedfrom.id | | TODO Default value ID |
| publisher | publisher | |
| title | title | as main title |
| title | original-title, short-title | as alternative title |
| title | subtitle | as subtitle |
| description | abstract | |
| source | source | only if the record is not of type @book@ |
| source | @${container-title.head} ISBN: ${ISBN.head}@ | only if the record is of type @book@ |
| dateofacceptance | issued.datetime or, if not available, created.datetime | |
| relevantdate | created.datetime, posted.datetime, accepted.datetime, published-print, published-online | |
| subject | subject | with classid='keywords', i.e. no controlled vocabularies for Crossref subjects |
| author | author | if available, the sequence is mapped to rank and the ORCID is also mapped (as 'orcid_pending') |
| journal | | only for publications |
| journal.name | container-title.head | |
| journal.eissn | issn-type.value | if issn-type.type='electronic' |
| journal.issn | issn-type.value | if issn-type.type='print' |
| journal.vol | volume | |
| journal.sp | page | before '-' |
| journal.ep | page | after '-' |
| instance | | TODO One instance is created . . . |
| instance.license | license.URL | If there is a @license.content-version='vor'@, then this is used. Otherwise the first license entry is used. |
| instance.pid | | the list of pids as in the first row of this table |
| instance.refereed | | set to 'peerReviewed' only if @relation.has-review.id@ is not empty |
| instance.instancetype | subtype | mapped using the OpenAIRE vocabularies |
| instance.collectedfrom | | as in result.collectedfrom above |
| instance.dateofacceptance | | as in result.dateofacceptance above |
| instance.url | URL, link.URL | there may be different URLs in the same instance |
| instance.accessright.value | | based on license and dateofacceptance:
- UNKNOWN: if the license is blank
- OPEN ACCESS: if the license is a CC license, an ACS license or an APA license (considered OPEN also by Unpaywall, see the "Unpaywall FAQ":https://support.unpaywall.org/support/solutions/articles/44002063718-what-is-an-oa-license- for details), or if it is an OUP license and more than 12 months have passed since the publication date
- EMBARGO: if the license is an OUP license and fewer than 12 months have passed since the publication date
- CLOSED: if there is a license not covered by the previous cases |
| instance.accessright.openaccessroute | | only if instance.accessright.value = 'OPEN ACCESS'. Default is 'hybrid'. The route is fixed in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list. |

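Three of the rules in the table are easier to follow as code: the choice of the license, the split of @page@ into start and end page, and the derivation of the access right. The sketch below is illustrative only. The URL fragments used to recognise CC, ACS, APA and OUP licenses are hypothetical placeholders, and the function names and the @today@ parameter are assumptions introduced for this example.

```python
from datetime import date

# Hypothetical URL fragments used to recognise license families.
OPEN_MARKERS = ("creativecommons.org", "pubs.acs.org", "apa.org")
OUP_MARKER = "academic.oup.com"


def pick_license(licenses):
    """Prefer the entry with content-version 'vor', else the first entry."""
    for entry in licenses:
        if entry.get("content-version") == "vor":
            return entry.get("URL")
    return licenses[0].get("URL") if licenses else None


def split_pages(page):
    """Split a 'page' value such as '12-34' into (start page, end page)."""
    start, _, end = page.partition("-")
    return start.strip() or None, end.strip() or None


def access_right(license_url, published, today):
    """Derive the access right from the license and the publication date."""
    if not license_url or not license_url.strip():
        return "UNKNOWN"
    url = license_url.lower()
    if any(marker in url for marker in OPEN_MARKERS):
        return "OPEN ACCESS"
    if OUP_MARKER in url:
        try:
            embargo_end = published.replace(year=published.year + 1)
        except ValueError:  # published on 29 February
            embargo_end = published.replace(year=published.year + 1, day=28)
        return "OPEN ACCESS" if today >= embargo_end else "EMBARGO"
    return "CLOSED"
```

When the result is @OPEN ACCESS@, the route would start as 'hybrid' and be corrected in the later phases, as stated in the last row of the table.
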
h3. 2 Map Crossref links to projects/funders

Links to funding available in Crossref are mapped as funding relationships (@result@ -- @isProducedBy@ --> @project@) applying the following mapping:

| *funder* | *grant code* | *Link to* |
| DOI: {10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665}
or name: 'European Union’s Horizon 2020 research and innovation program' | series of 4-9 digits in @award@ | Link to H2020 project |
| DOI: {10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780} | series of 4-9 digits in @award@ | Link to FP7 project |
| DOI: 10.13039/501100000781 OR name: 'European Union's' | series of 4-9 digits in @award@ | Link to FP7 or H2020 project |
| DOI: 10.13039/100000001 | @award@ | Link to NSF project |
| DOI: 10.13039/501100001665 OR name: {'The French National Research Agency (ANR)', 'The French National Research Agency'} | @award@ | Link to ANR project |
| DOI: 10.13039/501100002341 | @award@ | Link to Academy of Finland project |
| DOI: 10.13039/501100001602 | @award@, removing the initial 'SFI' if present | Link to SFI project |
| DOI: 10.13039/501100000923 | @award@ | Link to ARC project |
| DOI: 10.13039/501100000038 | @award@ is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to NSERC (@unidentified@ project) |
| DOI: 10.13039/501100000155 | @award@ is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to SSHRC (@unidentified@ project) |
| DOI: 10.13039/501100000024 | @award@ is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to CIHR (@unidentified@ project) |
| DOI: 10.13039/501100002848 OR name: 'CONICYT, Programa de Formación de Capital Humano Avanzado' | @award@ | Link to CONICYT project |
| DOI: 10.13039/501100003448 | series of 4-9 digits in @award@ | Link to GSRT project |
| DOI: 10.13039/501100010198 | @award@ | Link to SGOV project |
| DOI: 10.13039/501100004564 | series of 4-9 digits in @award@ | Link to MESTD project |
| DOI: 10.13039/501100003407 | @award@ | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (@unidentified@ project) is also generated |
| DOI: {10.13039/501100006588, 10.13039/501100004488} | @award@, removing the 'Project No' and 'HRZZ' prefixes, if present | Link to HRZZ or MZOS project |
| DOI: 10.13039/501100006769 | @award@ | Link to Russian Science Foundation project |
| DOI: 10.13039/501100001711 | @award@ after '_' and before '/' | Link to SNSF project |
| DOI: 10.13039/501100004410 | @award@ | Link to TUBITAK project |
| DOI: 10.13039/100004440 OR name: 'Wellcome Trust Masters Fellowship' | @award@ | Link to the specific Wellcome Trust project and to the @unidentified@ project. |

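The grant code column describes small, funder-specific clean-up rules applied to the Crossref @award@ value. The sketch below illustrates four of them (NSF, SFI, SNSF, GSRT). The dispatch table, the function names and the award strings used in the usage example are made up for illustration, and taking the first run of 4-9 digits is one possible reading of "series of 4-9 digits".

```python
import re


def first_digit_run(award):
    """Return the first run of 4-9 digits found in the award, if any."""
    match = re.search(r"\d{4,9}", award)
    return match.group(0) if match else None


def strip_sfi_prefix(award):
    """Remove the initial 'SFI' if present."""
    return award[len("SFI"):] if award.startswith("SFI") else award


def snsf_code(award):
    """Keep what comes after '_' and before '/'."""
    after_underscore = award.split("_", 1)[1] if "_" in award else award
    return after_underscore.split("/", 1)[0]


# Funder DOI -> (funder label, award normalisation); a subset of the table.
AWARD_RULES = {
    "10.13039/100000001": ("NSF", lambda award: award),
    "10.13039/501100001602": ("SFI", strip_sfi_prefix),
    "10.13039/501100001711": ("SNSF", snsf_code),
    "10.13039/501100003448": ("GSRT", first_digit_run),
}


def map_award(funder_doi, award):
    """Return (funder, project code), or None if no link can be built."""
    rule = AWARD_RULES.get(funder_doi)
    if rule is None:
        return None
    funder, normalise = rule
    code = normalise(award.strip())
    return (funder, code) if code else None
```

For example, @map_award("10.13039/501100001711", "PP00P2_123456/1")@ yields the SNSF code @123456@ under these assumptions.
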
h3. 3 Intersect Crossref with Unpaywall by DOI (DOIBoost1)

The fields we consider from Unpaywall are:

* @is_oa@
* @best_oa_location@
* @oa_status@

The Crossref results that intersect by DOI with Unpaywall records are enriched with:

TODO: ensure we refer to JSON fields of the public dump

|_.OpenAIRE Result field path |_.Unpaywall field path |_.Notes |
| result.instance | | created only if @is_oa@ and a @best_oa_location@ is available |
| result.instance.collectedfrom.name | | default value "Unpaywall" |
| result.instance.collectedfrom.id | | default value TODO |
| result.instance.url | @best_oa_location@ | |
| result.instance.license | @best_oa_location.license@ | |
| result.instance.pid | @doi@ | |
| result.instance.accessright | | default value Open Access: we do not add instances if Unpaywall says there is no open version |
| result.instance.accessright.route | @oa_status@ | |

For the definition of Unpaywall's @oa_status@, refer to the "Unpaywall FAQ":https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-

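The table above can be read as a small function that builds one extra instance from an Unpaywall record, or nothing when no open version is reported. This is a sketch under assumptions: the output dictionary layout is invented for illustration, and reading the URL from the @url@ key of @best_oa_location@ is an assumption, since the table only names @best_oa_location@.

```python
def unpaywall_instance(record):
    """Build an OpenAIRE-like instance from an Unpaywall record, or None."""
    location = record.get("best_oa_location")
    if not record.get("is_oa") or not location:
        # No open version according to Unpaywall: no instance is added.
        return None
    return {
        "collectedfrom": {"name": "Unpaywall"},
        "url": location.get("url"),  # assumed sub-field of best_oa_location
        "license": location.get("license"),
        "pid": record.get("doi"),
        "accessright": {
            "value": "OPEN ACCESS",
            "route": record.get("oa_status"),
        },
    }
```
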
h3. 4 Intersect DOIBoost1 with ORCID (DOIBoost2)

The fields we consider from ORCID are:

* @doi@
* @authors@, a list of authors, each with optional @name@, @surname@, @creditName@, @oid@

|_.OpenAIRE field path|_.ORCID path|_.Notes|
| pid | doi | |
| author.name | capitalize(name) | only mapped if not blank |
| author.surname | capitalize(surname) | only mapped if not blank |
| author.fullname | | if name and surname are not blank, they are concatenated (capitalize(name) capitalize(surname)), otherwise we use the creditName |
| author.pid | oid | as confirmed ORCID identifier (in contrast to the 'orcid_pending' set from Crossref and Unpaywall) |

The records are enriched with the ORCID identifiers of their authors.

* if the number of authors from Crossref equals the number of authors from ORCID, we pick the list of authors with more PIDs and try to enrich it with the PIDs from the other list, based on the Jaro-Winkler distance between the authors' names, surnames, or fullnames, depending on which properties are available;
* if the numbers of authors differ, we take the longer list and try to enrich it with the PIDs from the other author list, based on the Jaro-Winkler distance between the authors' names, surnames, or fullnames, depending on which properties are available.

TODO: How do we ensure that if an author comes with an orcid_pending from Crossref and an orcid from ORCID, the latter wins?

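The author matching described above can be sketched with a plain implementation of the Jaro-Winkler similarity. This is an illustration, not the DOIBoost implementation: the 0.9 threshold, the use of the @fullname@ key only, the lower-casing before comparison and the function names are all assumptions, and the ORCID value in the usage example is a made-up placeholder.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def count_pids(authors):
    return sum(1 for a in authors if a.get("pid"))


def enrich_authors(list_a, list_b, threshold=0.9):
    """Pick the base list as described above and copy PIDs from the other."""
    if len(list_a) == len(list_b):
        a_is_base = count_pids(list_a) >= count_pids(list_b)
    else:
        a_is_base = len(list_a) > len(list_b)
    base, other = (list_a, list_b) if a_is_base else (list_b, list_a)
    candidates = [o for o in other if o.get("pid")]
    enriched = []
    for author in base:
        author = dict(author)
        if not author.get("pid") and candidates:
            name = author["fullname"].lower()
            best = max(candidates,
                       key=lambda o: jaro_winkler(name, o["fullname"].lower()))
            if jaro_winkler(name, best["fullname"].lower()) >= threshold:
                author["pid"] = best["pid"]
        enriched.append(author)
    return enriched
```

As a usage example, enriching a three-author Crossref list with a one-author ORCID list keeps the three Crossref authors and copies the identifier only onto the author whose name is close enough:

```python
crossref = [{"fullname": "Sandro La Bruzzo"}, {"fullname": "Paolo Manghi"},
            {"fullname": "Andrea Mannocci"}]
orcid = [{"fullname": "Paolo Manghi", "pid": "0000-0000-0000-0001"}]  # placeholder id
result = enrich_authors(crossref, orcid)
```
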
h3. 5 Intersect DOIBoost2 with Microsoft Academic Graph (DOIBoost3)

*Important Notes*

* Only papers with a DOI are considered
* Since MAG can contain multiple versions of an item for the same DOI, each with a different MAG PaperId, we only take one per DOI (the last one we process). We call this dataset @Papers_distinct@

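The construction of @Papers_distinct@ amounts to a "last one wins" de-duplication keyed by DOI. A minimal sketch, assuming each MAG paper is a dictionary with @Doi@ and @PaperId@ keys and that DOIs are compared case-insensitively (an assumption of this example):

```python
def papers_distinct(papers):
    """Keep one MAG paper per DOI: the last one encountered wins."""
    by_doi = {}
    for paper in papers:
        doi = paper.get("Doi")
        if doi and doi.strip():
            # A later paper with the same DOI overwrites the earlier one.
            by_doi[doi.strip().lower()] = paper
    return list(by_doi.values())
```

Which version is "the last one" depends on the processing order, so the selection is not tied to any property of the paper itself.
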
When mapping MAG records to the OpenAIRE Research Graph, we consider the following MAG tables:

* @PaperAbstractsInvertedIndex@: for the paper abstracts
* @Authors@: for the authors. The MAG data is pre-processed by grouping authors by PaperId
* @Affiliations@ and @PaperAuthorAffiliations@: to generate links between publications and organisations
* @Journals@ and @ConferenceInstances@: joined with @Papers_distinct@ to have the information about the venues where the paper was published
* TO BE REMOVED @PaperUrls@: to create one instance for the OpenAIRE publication
* @FieldsOfStudy@: to add subjects

The records are enriched with:

* abstracts
* MAG identifiers of authors
* affiliation relationships
* subjects (MAG FieldsOfStudy)
* conference or journal information (in the @journal@ field) TODO: or @container@, in case of the dump?
* [TO BE REMOVED] instances with URL from MAG

TODO: ensure we use the field names of the public dump

|_.OpenAIRE path|_.MAG table|_.MAG path(s)|_.Notes|
| pid | Papers_distinct | Doi | |
| originalId | Papers_distinct | PaperId | |
| title | Papers_distinct | PaperTitle | as main title |
| title | Papers_distinct | OriginalTitle | as alternative title |
| source | Papers_distinct | BookTitle | |
| dateofacceptance | Papers_distinct | Date | first 10 chars, if not blank |
| publisher | Papers_distinct, Journal | Publisher | |
| description | PaperAbstractsInvertedIndex | IndexedAbstract | |
| journal | ConferenceInstances, Journals | | |
| journal.name | | DisplayName | |
| journal.conferencePlace | | Location | |
| journal.conferenceDate | | StartDate and EndDate | Value created as the concatenation of the first 10 chars of each (separated by '-'), if both are not blank |
| journal.sp | | FirstPage | |
| journal.ep | | EndPage | |
| journal.issnPrinted | | Issn | |
| journal.vol | Papers_distinct | Volume | |
| journal.iss | Papers_distinct | Issue | |
| subject | FieldsOfStudy | subjects | All subjects from MAG are set with a dedicated marker in the classname/classid 'Microsoft Academic Graph Classification'/MAG.
We create one subject per DisplayName, per MainType and, if the MainType is in the format @x.y@, one subject also for the first token (i.e. @x@) |
| subject.value | | subjects.DisplayName, subjects.MainType, split(subjects.MainType, '.').head | All subjects from MAG are set with a dedicated marker in the classname/classid 'Microsoft Academic Graph Classification'/MAG |
| author | Authors, PaperAuthorAffiliations | | |
| author.rank | | sequenceNumber | |
| author.fullName | | DisplayName | if not blank |
| author.affiliation | | affiliation | if not null |
| author.pid | | AuthorId | MAG id of the author as URL. TO BE REMOVED? |
| instance | PaperUrls | | TO BE REMOVED. Currently maps to the MAG URL and to any URLs in SourceUrl |

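Two rows of the table involve a small amount of logic: the creation of subjects from a MAG field of study and the composition of @journal.conferenceDate@. The sketch below illustrates both. The input is assumed to be a dictionary with @DisplayName@ and @MainType@ keys, and the output layout and function names are invented for this example.

```python
MAG_CLASSID = "MAG"
MAG_CLASSNAME = "Microsoft Academic Graph Classification"


def mag_subjects(field_of_study):
    """One subject per DisplayName, per MainType and per MainType head."""
    values = []
    display_name = field_of_study.get("DisplayName")
    main_type = field_of_study.get("MainType")
    if display_name and display_name.strip():
        values.append(display_name)
    if main_type and main_type.strip():
        values.append(main_type)
        if "." in main_type:
            # MainType in the format x.y: also add the first token x.
            values.append(main_type.split(".")[0])
    return [
        {"value": value, "classid": MAG_CLASSID, "classname": MAG_CLASSNAME}
        for value in values
    ]


def conference_date(start_date, end_date):
    """Concatenate the first 10 chars of both dates, if both are not blank."""
    if start_date and start_date.strip() and end_date and end_date.strip():
        return f"{start_date[:10]}-{end_date[:10]}"
    return None
```
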
h3. 6 Enrich DOIBoost3 with hosting data sources (@hostedby@) and access right information

In this phase we intersect DOIBoost3 with a dataset composed of journals from OpenAIRE, Crossref, and the ISSN gold list. Each journal comes with its International Standard Serial Numbers (issn, eissn, lissn) and, when available, a flag that tells whether the journal is open access. The intersection is done on the basis of the International Standard Serial Numbers. The records with a matching @journal.[l|e]issn@ are enriched as follows:

* Each instance gains the @hostedby@ information corresponding to the journal
* If the journal is open access, the access rights of the instances are also set to "Open Access" with "gold" route (because, by construction, the journals we know to be open are from DOAJ or the Gold ISSN list)

The hostedby of records that do not match is set to the "Unknown Repository".
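
This last phase can be sketched as a lookup of the record's ISSNs in an index built from the journal dataset. The dictionary layouts, the @name@ and @open_access@ keys and the function names are assumptions made for this example; the sketch only illustrates the matching and patching rules described above.

```python
UNKNOWN_REPOSITORY = "Unknown Repository"
ISSN_KEYS = ("issn", "eissn", "lissn")


def build_issn_index(journals):
    """Index the journals by each of their ISSNs."""
    index = {}
    for journal in journals:
        for key in ISSN_KEYS:
            value = journal.get(key)
            if value:
                index[value.upper()] = journal
    return index


def apply_hostedby(record, issn_index):
    """Set hostedby (and gold Open Access, if applicable) on each instance."""
    record_journal = record.get("journal") or {}
    match = None
    for key in ISSN_KEYS:
        value = record_journal.get(key)
        if value and value.upper() in issn_index:
            match = issn_index[value.upper()]
            break
    for instance in record.get("instance", []):
        if match is None:
            instance["hostedby"] = UNKNOWN_REPOSITORY
            continue
        instance["hostedby"] = match["name"]
        if match.get("open_access"):
            instance["accessright"] = {"value": "OPEN ACCESS", "route": "gold"}
    return record
```
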