Alessia Bardi, 22/11/2021 11:02 AM
DOIBoost¶
- Table of contents
- DOIBoost
- DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID
- Inputs
- 1 Filtering
- 2 Mapping Crossref properties into the OpenAIRE Research Graph
- 2 Map Crossref links to projects/funders
- 3 Intersect Crossref with UnpayWall by DOI (DOIBoost1)
- 4 Intersect DOIBoost1 with ORCID (DOIBoost2)
- 5 Intersect DOIBoost2 with Microsoft Academic Graph (DOIBoost3)
- 6 Enrich DOIBoost3 with hosting data sources (hostedby) and access right information
DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID¶
The idea behind DOIBoost and its origin can be found in the paper (and related resources) at:
- La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11 . Open Access version available at: https://doi.org/10.5281/zenodo.1441071
In short, the goal is to enrich the records available in Crossref with what is available in Unpaywall, Microsoft Academic Graph (MAG), and ORCID, intersecting all those datasets by DOI. As a consequence, DOIBoost does not contain any record from MAG, Unpaywall, or ORCID whose DOI is not available in Crossref.
Each Crossref record is enriched with:
- ORCID identifiers of authors from ORCID
- Open Access instance (with OA color/route and license) from Unpaywall
- the following information from MAG:
- abstracts
- MAG identifiers of authors
- affiliation relationships
- subjects (MAG FieldsOfStudy)
- conference or journal information
The Open Access status is also set by intersecting the journal information of a record with the journal lists available from DOAJ and the Gold ISSN list.
TODO:
- In the summary, explain what is enriched by which input
- Clarify that the main source is Crossref and that we may lose records that are in the other sources if the corresponding DOI is not yet in Crossref
Inputs¶
- Crossref: dump available to Crossref subscribers via MetadataPlus service, updated once a month.
- Microsoft Academic Graph: version downloaded on 2021-02-15. We plan to take the latest version in Dec 2021, before MAG is retired.
- ORCID: baseline dump obtained on 2020-10-13, updated every week from the ORCID public API
- Unpaywall: public database snapshot downloaded in March 2021. Unpaywall updates it twice a year (https://unpaywall.org/products/snapshot)
The construction of the DOIBoost dataset consists of the following phases:
1 Filtering¶
Records in Crossref are ruled out if they meet any of the following criteria:
- have a blank title
- have one of the following publishers: "Test accounts", "CrossRef Test Account"
- have no authors with valid names, where valid means: not blank and different from all strings in this list: List(",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na;")
- have "Addie Jackson" as author and "Elsevier BV" as publisher (empirically we say they are test records)
- do not have one of the following values in the field type:
  - "book-section"
  - "book"
  - "book-chapter"
  - "book-part"
  - "book-series"
  - "book-set"
  - "book-track"
  - "edited-book"
  - "reference-book"
  - "monograph"
  - "journal-article"
  - "dissertation"
  - "other"
  - "peer-review"
  - "proceedings"
  - "proceedings-article"
  - "reference-entry"
  - "report"
  - "report-series"
  - "standard"
  - "standard-series"
  - "posted-content"
  - "dataset"
Records with type=dataset are mapped into OpenAIRE results of type dataset. All others are mapped as OpenAIRE results of type publication.
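The filtering criteria above can be sketched as a single predicate. This is a minimal illustration, not the DOIBoost implementation: it assumes each record is a dict shaped like Crossref JSON (`title` as a list of strings, `author` as a list of objects with `given` and `family`), and it assumes that author names are compared case-insensitively. The function names are hypothetical.

```python
from typing import Dict, List

TEST_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}

# Strings that do not count as a valid author name (from the list above).
INVALID_AUTHOR_NAMES = {
    ",", "none none", "none, none", "none &na;", "(:null)",
    "test test test", "test test", "test", "&na; &na;",
}

ACCEPTED_TYPES = {
    "book-section", "book", "book-chapter", "book-part", "book-series",
    "book-set", "book-track", "edited-book", "reference-book", "monograph",
    "journal-article", "dissertation", "other", "peer-review", "proceedings",
    "proceedings-article", "reference-entry", "report", "report-series",
    "standard", "standard-series", "posted-content", "dataset",
}


def author_names(record: Dict) -> List[str]:
    """Build 'given family' strings for every author of the record."""
    names = []
    for author in record.get("author") or []:
        parts = [author.get("given"), author.get("family")]
        names.append(" ".join(p for p in parts if p).strip())
    return names


def keep_record(record: Dict) -> bool:
    """Return True if the record survives all the filtering criteria."""
    names = author_names(record)
    if not any(t.strip() for t in record.get("title") or []):
        return False  # blank title
    if record.get("publisher") in TEST_PUBLISHERS:
        return False  # test publisher
    # Assumption: the comparison with the invalid names ignores case.
    if not any(n and n.lower() not in INVALID_AUTHOR_NAMES for n in names):
        return False  # no author with a valid name
    if "Addie Jackson" in names and record.get("publisher") == "Elsevier BV":
        return False  # empirically a test record
    return record.get("type") in ACCEPTED_TYPES


def openaire_result_type(record: Dict) -> str:
    """Crossref type 'dataset' maps to dataset, everything else to publication."""
    return "dataset" if record.get("type") == "dataset" else "publication"
```

A record is kept only when `keep_record` returns `True`; `openaire_result_type` is then applied to the survivors.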
2 Mapping Crossref properties into the OpenAIRE Research Graph¶
Properties in OpenAIRE results are set based on the logic described in the following table:
OpenAIRE Result field path | Crossref path(s) | Notes |
---|---|---|
id | doi | id in the form @doi_________::md5(doi) |
dateofcollection | indexed.datetime | |
lastupdatetimestamp | indexed.timestamp | |
type | type | dataset if the Crossref type is dataset, publication otherwise (based on the logic described above) |
originalId | doi, clinical-trial-number, alternative-id | |
pid | The scheme tells the type of PID, the value contains the actual value | |
pid.scheme | Default value: doi | |
pid.value | doi | The doi is normalised and lower-cased |
maintitle | title | |
subtitle | subtitle | |
author | author | if available the sequence is mapped to rank and the ORCID is also mapped |
author.name | author.given | |
author.surname | author.family | |
author.fullname | author.given author.family | |
author.rank | based on the order, starts from 1 | |
author.pid | only if the ORCID is available | |
author.pid.id.scheme | Default 'orcid_pending' (meaning that it is not an id confirmed by ORCID) | |
author.pid.id.value | author.ORCID | |
author.pid.provenance.provenance | Default 'Harvested' | |
author.pid.provenance.trust | Default '0.9' | |
description | abstract | |
subject | subject | with classid='keywords', i.e. no controlled vocabularies for Crossref subjects |
publicationdate | issued.datetime or, if not available, created.datetime | |
publisher | publisher | |
source | source | only if the record is not of type book |
source | concatenation of container-title.head "ISBN: " ISBN.head | only if the record is of type book |
container | It is set only for publications with information about the journal it was published in. | |
container.name | container-title.head | |
container.issnOnline | issn-type.value | if issn-type.type='electronic' |
container.issnPrinted | issn-type.value | if issn-type.type='print' |
container.vol | volume | |
container.sp | page | before '-' |
container.ep | page | after '-' |
instance | One instance is created with the DOI URL | |
instance.accessright | Values in instance.accessright.code and instance.accessright.label are set based on license and dateofacceptance: UNKNOWN if the license is blank; OPEN ACCESS if the license is a CC license or an ACS license or an APA license (considered OPEN also by Unpaywall, see Unpaywall FAQ for details) or if OUP license, but only after 12 months from the publication date; EMBARGO if OUP license, before 12 months from the publication date; CLOSED if there is a license not covered by the previous cases | |
instance.accessright.code | Code from the COAR vocabulary for access right | |
instance.accessright.label | One of: OPEN, RESTRICTED, CLOSED, EMBARGO | |
instance.accessright.scheme | Scheme that defines the code and label, i.e. the URL to the COAR vocabulary for access right | |
instance.accessright.openAccessRoute | only if instance.accessright.value = 'OPEN ACCESS'. Default is 'hybrid'. The route is fixed in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list. | |
instance.license | license.URL | If there is a license.content-version='vor' , then this is used. Otherwise the first license entry is used. |
instance.pid | The scheme tells the type of PID, the value contains the actual value | |
instance.pid.scheme | Default value: doi | |
instance.pid.value | doi | The doi is normalised and lower-cased |
instance.publicationdate | issued.datetime or, if not available, created.datetime | |
instance.refereed | set to 'peerReviewed' only if relation.has-review.id is not empty. UNKNOWN otherwise. | |
instance.type | subtype | mapped using the OpenAIRE vocabulary for result typologies |
instance.url | doi | Full URL of the DOI |
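
The access right rules listed for instance.accessright in the table above can be sketched as follows. This is only an illustration under stated assumptions: the URL fragments used to recognise CC, ACS, APA, and OUP licenses are hypothetical placeholders (the page does not say how licenses are recognised), and a license that is exactly 12 months old is treated as open.

```python
from datetime import date
from typing import Optional

# Hypothetical markers used to recognise license families from the license URL.
OPEN_LICENSE_MARKERS = ("creativecommons.org", "pubs.acs.org", "apa.org")
OUP_LICENSE_MARKER = "academic.oup.com"


def one_year_after(d: date) -> date:
    """Return the same day one year later (29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def access_right(license_url: Optional[str], publication_date: date, today: date) -> str:
    """Apply the UNKNOWN / OPEN ACCESS / EMBARGO / CLOSED rules from the table."""
    if not license_url or not license_url.strip():
        return "UNKNOWN"
    url = license_url.lower()
    if any(marker in url for marker in OPEN_LICENSE_MARKERS):
        return "OPEN ACCESS"
    if OUP_LICENSE_MARKER in url:
        # Assumption: the embargo ends exactly 12 months after publication.
        if today >= one_year_after(publication_date):
            return "OPEN ACCESS"
        return "EMBARGO"
    return "CLOSED"
```

Passing `today` explicitly keeps the function deterministic, which makes the 12-month OUP rule easy to check.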
All other fields of the Json_schema not mentioned in the table contain empty values.
All the records from Crossref are related to the datasource with name=Crossref and id=openaire____::081b82f96300b6a6e3d282bad31cb6e2
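A few of the simpler rows of the mapping table (id, pid.value, container.sp/ep, author rank and fullname) can be sketched in code. This is a hedged illustration: the exact DOI normalisation is not specified on this page, so trimming whitespace and lower-casing is an assumption, as is computing the MD5 over the normalised DOI and treating the leading `@` in the id pattern as markup residue. All function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def normalise_doi(doi: str) -> str:
    """Assumed normalisation: trim whitespace and lower-case."""
    return doi.strip().lower()


def openaire_id(doi: str) -> str:
    """Build an id in the form doi_________::md5(doi)."""
    digest = hashlib.md5(normalise_doi(doi).encode("utf-8")).hexdigest()
    return "doi_________::" + digest


def split_pages(page: str) -> Tuple[str, Optional[str]]:
    """container.sp is the part before '-', container.ep the part after it."""
    if "-" not in page:
        return page, None  # assumption: a single page has no end page
    start, end = page.split("-", 1)
    return start, end


def map_authors(crossref_authors: List[Dict]) -> List[Dict]:
    """Map Crossref authors, with rank based on the order, starting from 1."""
    mapped = []
    for rank, author in enumerate(crossref_authors, start=1):
        given, family = author.get("given"), author.get("family")
        entry = {
            "name": given,
            "surname": family,
            "fullname": " ".join(p for p in (given, family) if p),
            "rank": rank,
        }
        if author.get("ORCID"):  # pid is set only if the ORCID is available
            entry["orcid"] = author["ORCID"]
        mapped.append(entry)
    return mapped
```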
- map clinical-trial-number and alternative-id in alternateIdentifiers?
- Verify if Crossref has a property for language, country, container.issnLinking, container.iss, container.edition, container.conferenceplace and container.conferencedate
- Different approach to set the refereed field and improve its coverage?
2 Map Crossref links to projects/funders¶
Links to funding available in Crossref are mapped as funding relationships (result -- isProducedBy --> project) applying the following mapping:
funder | grant code | Link to |
---|---|---|
DOI: {10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665} or name: 'European Union’s Horizon 2020 research and innovation program' | series of 4-9 digits in award | Link to H2020 project |
DOI: {10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780} | series of 4-9 digits in award | Link to FP7 project |
DOI: 10.13039/501100000781 OR name: 'European Union's' | series of 4-9 digits in award | Link to FP7 or H2020 project |
DOI: 10.13039/100000001 | award | Link to NSF project |
DOI: 10.13039/501100001665 OR name: {'The French National Research Agency (ANR)', 'The French National Research Agency'} | award | Link to ANR project |
DOI: 10.13039/501100002341 | award | Link to Academy of Finland project |
DOI: 10.13039/501100001602 | award, removing the initial 'SFI' if present | Link to SFI project |
DOI: 10.13039/501100000923 | award | Link to ARC project |
DOI: 10.13039/501100000038 | award is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to NSERC (unidentified project) |
DOI: 10.13039/501100000155 | award is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to SSHRC (unidentified project) |
DOI: 10.13039/501100000024 | award is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to CIHR (unidentified project) |
DOI: 10.13039/501100002848 OR name: 'CONICYT, Programa de Formación de Capital Humano Avanzado' | award | Link to CONICYT project |
DOI: 10.13039/501100003448 | series of 4-9 digits in award | Link to GSRT project |
DOI: 10.13039/501100010198 | award | Link to SGOV project |
DOI: 10.13039/501100004564 | series of 4-9 digits in award | Link to MESTD project |
DOI: 10.13039/501100003407 | award | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (unidentified project) is also generated |
DOI: {10.13039/501100006588, 10.13039/501100004488} | award, removing 'Project No' and 'HRZZ' prefix, if present | Link to HRZZ or MZOS project |
DOI: 10.13039/501100006769 | award | Link to Russian Science Foundation project |
DOI: 10.13039/501100001711 | award after '_' and before '/' | Link to SNSF project |
DOI: 10.13039/501100004410 | award | Link to TUBITAK project |
DOI: 10.13039/100004440 or name: 'Wellcome Trust Masters Fellowship' | award | Link to Wellcome Trust specific project and to the unidentified project. |
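
The grant code rules in the table above amount to a small per-funder dispatch. The sketch below covers only a handful of funders and makes assumptions the page does not settle: when an award contains several runs of 4-9 digits, the first one is taken, and an SNSF award without '_' yields no code. The function names and the `EXTRACTORS` table are hypothetical.

```python
import re
from typing import Callable, Dict, Optional

DIGITS_4_TO_9 = re.compile(r"\d{4,9}")


def digits_code(award: str) -> Optional[str]:
    """Series of 4-9 digits in award (assumption: the first such run)."""
    match = DIGITS_4_TO_9.search(award)
    return match.group(0) if match else None


def sfi_code(award: str) -> str:
    """Award with the initial 'SFI' removed, if present."""
    return award[len("SFI"):] if award.startswith("SFI") else award


def snsf_code(award: str) -> Optional[str]:
    """Part of the award after '_' and before '/'."""
    _, separator, tail = award.partition("_")
    if not separator:
        return None  # assumption: no '_' means no usable code
    return tail.partition("/")[0]


def hrzz_code(award: str) -> str:
    """Award without the 'Project No' and 'HRZZ' prefixes, if present."""
    code = award.strip()
    for prefix in ("Project No", "HRZZ"):
        if code.startswith(prefix):
            code = code[len(prefix):].strip()
    return code


# Funder DOI -> extraction rule (a subset of the table, for illustration).
EXTRACTORS: Dict[str, Callable[[str], Optional[str]]] = {
    "10.13039/501100000781": digits_code,   # FP7 or H2020
    "10.13039/501100003448": digits_code,   # GSRT
    "10.13039/501100001602": sfi_code,      # SFI
    "10.13039/501100001711": snsf_code,     # SNSF
    "10.13039/501100006588": hrzz_code,     # HRZZ or MZOS
    "10.13039/100000001": lambda award: award,  # NSF: award used as is
}


def grant_code(funder_doi: str, award: str) -> Optional[str]:
    """Return the grant code for a known funder, None otherwise."""
    extractor = EXTRACTORS.get(funder_doi)
    return extractor(award) if extractor else None
```

Note that taking the first run of digits would pick `2020` out of an award such as `H2020-...`, so a production rule needs more care than this sketch shows.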
3 Intersect Crossref with UnpayWall by DOI (DOIBoost1)¶
The fields we consider from UnpayWall are:
- is_oa
- best_oa_location
- oa_status
The results of Crossref that intersect by DOI with UnpayWall records are enriched with one additional instance with the following properties:
OpenAIRE Result field path | Unpaywall field path | Notes |
---|---|---|
instance | created only if is_oa and a best_oa_location is available | |
instance.accessright | default value Open Access: we do not add instances if UnpayWall says there is no open version | |
instance.accessright.code | Open Access code from the COAR vocabulary for access right | |
instance.accessright.label | Always OPEN | |
instance.accessright.scheme | Scheme that defines the code and label, i.e. the URL to the COAR vocabulary for access right | |
instance.accessright.openAccessRoute | oa_status | |
instance.url | best_oa_location | |
instance.license | best_oa_location.license | |
instance.pid | The scheme tells the type of PID, the value contains the actual value | |
instance.pid.scheme | Default value: doi | |
instance.pid.value | doi | The doi is normalised and lower-cased |
For the definition of UnpayWall's oa_status refer to the Unpaywall FAQ.
The record will also feature a relation to the UnpayWall data source: name="UnpayWall", id=openaire____::8ac8380272269217cb09a928c8caa993.
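The creation of the additional instance can be sketched as below. It assumes Unpaywall records are dicts with the three fields listed above and that the URL is read from a `url` property of `best_oa_location`; the COAR code `c_abf2` for open access and the function names are assumptions added for illustration, not values taken from this page.

```python
from typing import Dict, Optional

COAR_OPEN_ACCESS_CODE = "c_abf2"  # assumed COAR concept id for "open access"


def unpaywall_instance(doi: str, unpaywall: Dict) -> Optional[Dict]:
    """Build the extra instance, or None if there is no open version."""
    location = unpaywall.get("best_oa_location")
    if not unpaywall.get("is_oa") or not location:
        return None
    return {
        "accessright": {
            "code": COAR_OPEN_ACCESS_CODE,
            "label": "OPEN",
            "openAccessRoute": unpaywall.get("oa_status"),
        },
        "url": location.get("url"),  # assumption: URL taken from the location
        "license": location.get("license"),
        "pid": {"scheme": "doi", "value": doi.strip().lower()},
    }


def enrich_with_unpaywall(record: Dict, unpaywall_by_doi: Dict[str, Dict]) -> Dict:
    """Append the Unpaywall instance to a Crossref-derived record, if any."""
    doi = record["pid"]["value"]
    match = unpaywall_by_doi.get(doi)
    instance = unpaywall_instance(doi, match) if match else None
    if instance is not None:
        record.setdefault("instance", []).append(instance)
    return record
```

Records whose DOI is missing from Unpaywall, or whose Unpaywall entry has no open version, are returned unchanged.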
4 Intersect DOIBoost1 with ORCID (DOIBoost2)¶
The fields we consider from ORCID are:
- doi
- authors, a list of authors, each with optional name, surname, creditName, oid
OpenAIRE field path | ORCID path | Notes |
---|---|---|
pid | doi | |
author.name | capitalize(name) | only mapped if not blank |
author.surname | capitalize(surname) | only mapped if not blank |
author.fullname | if name and surname are not blank, they are concatenated (capitalize(name) capitalize(surname)), otherwise we use the creditName | |
author.pid | only if the ORCID is available | |
author.pid.id.scheme | Default 'orcid' (meaning that it is confirmed by ORCID, in contrast to the 'orcid_pending' set from Crossref and Unpaywall) | |
author.pid.id.value | oid | |
author.pid.provenance.provenance | Default 'Harvested' | |
author.pid.provenance.trust | Default '0.9' | |
The current approach to merging the two author lists is:
- if the number of authors from Crossref equals the number of authors from ORCID, we pick the list with more PIDs and try to enrich it with the PIDs from the other list, based on the Jaro-Winkler distance between authors' names, surnames, or fullnames, depending on which properties are available;
- if the numbers of authors differ, we take the longest list and try to enrich it with the PIDs from the other list, using the same Jaro-Winkler matching;
- the list of authors from Crossref always "wins";
- the identifiers from ORCID "win".
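The first two rules above (choosing the base list, then borrowing PIDs from the other list by name similarity) can be sketched as follows. This is not the DOIBoost code: to stay within the Python standard library it swaps Jaro-Winkler for `difflib.SequenceMatcher`, compares only fullnames, and uses a hypothetical threshold of 0.9. It does not model the last two "win" rules.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Author = Dict[str, str]


def similarity(a: str, b: str) -> float:
    """Stand-in for Jaro-Winkler: difflib ratio on lower-cased names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def count_pids(authors: List[Author]) -> int:
    return sum(1 for author in authors if author.get("pid"))


def pick_base(crossref: List[Author], orcid: List[Author]) -> Tuple[List[Author], List[Author]]:
    """Return (base, other): more PIDs if same size, longest list otherwise."""
    if len(crossref) == len(orcid):
        crossref_first = count_pids(crossref) >= count_pids(orcid)
    else:
        crossref_first = len(crossref) > len(orcid)
    return (crossref, orcid) if crossref_first else (orcid, crossref)


def merge_authors(crossref: List[Author], orcid: List[Author], threshold: float = 0.9) -> List[Author]:
    """Enrich the base list with PIDs of sufficiently similar authors."""
    base, other = pick_base(crossref, orcid)
    candidates = [author for author in other if author.get("pid")]
    merged = [dict(author) for author in base]  # do not mutate the inputs
    for author in merged:
        if author.get("pid") or not candidates:
            continue
        best = max(candidates, key=lambda c: similarity(author["fullname"], c["fullname"]))
        if similarity(author["fullname"], best["fullname"]) >= threshold:
            author["pid"] = best["pid"]
    return merged
```

With a real Jaro-Winkler implementation only `similarity` would need to change; the selection of the base list stays the same.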
5 Intersect DOIBoost2 with Microsoft Academic Graph (DOIBoost3)¶
Important Notes
- Only papers with DOI are considered
- Since for the same DOI we have multiple versions of an item with different MAG PaperId, we only take one per DOI (the last one we process). We call this dataset Papers_distinct
The other MAG tables we use are:
- PaperAbstractsInvertedIndex: for the paper abstracts
- Authors: for the authors. The MAG data is pre-processed by grouping authors by PaperId
- Affiliations and PaperAuthorAffiliations: to generate links between publications and organisations
- Journals and ConferenceInstances: joined with Papers_distinct to have the information about the venues where the paper was published
- [TO BE REMOVED] PaperUrls: to create one instance for the OpenAIRE publication
- FieldsOfStudy: to add subjects
Each record is enriched with:
- abstracts
- MAG identifiers of authors
- affiliation relationships
- subjects (MAG FieldsOfStudy)
- conference or journal information (in the journal field) TODO: or container, in case of the dump?
- [TO BE REMOVED] instances with URL from MAG
TODO: ensure we use the field names of the public dump
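The two pre-processing steps mentioned in the notes above (one paper per DOI, authors grouped by PaperId) can be sketched as below. It assumes MAG rows are available as dicts keyed by column name; the `AuthorSequenceNumber` column name and the lower-casing of the DOI used as key are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def papers_distinct(papers: Iterable[Dict]) -> Dict[str, Dict]:
    """Keep one paper per DOI: the last one processed wins."""
    by_doi: Dict[str, Dict] = {}
    for paper in papers:
        doi = (paper.get("Doi") or "").strip().lower()
        if doi:  # only papers with DOI are considered
            by_doi[doi] = paper
    return by_doi


def group_authors_by_paper(rows: Iterable[Dict]) -> Dict[int, List[Dict]]:
    """Group author rows by PaperId, ordered by their sequence number."""
    grouped: Dict[int, List[Dict]] = defaultdict(list)
    for row in rows:
        grouped[row["PaperId"]].append(row)
    for authors in grouped.values():
        authors.sort(key=lambda row: row["AuthorSequenceNumber"])
    return dict(grouped)
```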
OpenAIRE path | MAG table | MAG path(s) | Notes |
---|---|---|---|
pid | Papers_distinct | Doi | |
pid.scheme | Default value: doi | ||
pid.value | doi | The doi is normalised and lower-cased | |
originalId | Papers_distinct | PaperId | |
title | Papers_distinct | PaperTitle | as main title |
title | Papers_distinct | OriginalTitle | as alternative title |
source | Papers_distinct | BookTitle | |
dateofacceptance | Papers_distinct | Date | first 10 chars, if not blank |
publisher | Papers_distinct, Journal | Publisher | |
description | PaperAbstractsInvertedIndex | IndexedAbstract | |
journal | ConferenceInstances, Journals | ||
journal.name | DisplayName | ||
journal.conferencePlace | Location | ||
journal.conferenceDate | StartDate and EndDate | Values created as concatenation of the first 10 chars of each (separated by '-'), if both are not blank | |
journal.sp | FirstPage | ||
journal.ep | EndPage | ||
journal.issnPrinted | Issn | ||
journal.vol | Papers_distinct | Volume | |
journal.iss | Papers_distinct | Issue | |
subject | FieldsOfStudy | subjects | All subjects from MAG are set with a dedicated marker in the classname/classid 'Microsoft Academic Graph Classification'/MAG. We create one subject per DisplayName, per MainType and, if the MainType is in the format x.y, one subject also for the first token (i.e. x) |
subject.value | subjects.DisplayName, subjects.MainType, split(subjects.MainType, '.').head | All subjects from MAG are set with a dedicated marker in the classname/classid 'Microsoft Academic Graph Classification'/MAG | |
author | Authors, PaperAuthorAffiliations | ||
author.rank | sequenceNumber | ||
author.fullName | DisplayName | if not blank | |
author.affiliation | affiliation | if not null | |
author.pid | AuthorId | MAG id of the author as URL. TO BE REMOVED? | |
instance | PaperUrls | TO BE REMOVED. Currently maps to the MAG URL and to any URLs in SourceUrl | |
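
Two of the rules in the table above, subject generation and the conferenceDate value, can be sketched as follows. The input shapes (a dict with `DisplayName` and `MainType`, dates as strings) and the function names are assumptions for illustration; whether duplicate subject values are removed is not stated on this page, so the sketch keeps them all.

```python
from typing import Dict, List, Optional

MAG_CLASSID = "MAG"
MAG_CLASSNAME = "Microsoft Academic Graph Classification"


def subject_values(field_of_study: Dict) -> List[str]:
    """One value per DisplayName, per MainType, and per first token of x.y."""
    values = []
    display_name = field_of_study.get("DisplayName")
    main_type = field_of_study.get("MainType")
    if display_name:
        values.append(display_name)
    if main_type:
        values.append(main_type)
        if "." in main_type:
            values.append(main_type.split(".")[0])
    return values


def mag_subjects(field_of_study: Dict) -> List[Dict]:
    """Wrap every value with the dedicated MAG classid/classname marker."""
    return [
        {"value": value, "classid": MAG_CLASSID, "classname": MAG_CLASSNAME}
        for value in subject_values(field_of_study)
    ]


def conference_date(start: Optional[str], end: Optional[str]) -> Optional[str]:
    """First 10 chars of StartDate and EndDate joined by '-', if both are set."""
    if not start or not start.strip() or not end or not end.strip():
        return None
    return start[:10] + "-" + end[:10]
```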
6 Enrich DOIBoost3 with hosting data sources (hostedby) and access right information¶
In this phase, we intersect DOIBoost3 with a dataset composed of journals from OpenAIRE, Crossref, and the ISSN gold list. Each journal comes with its International Standard Serial Numbers (issn, eissn, lissn) and, when available, a flag that tells if the journal is open access. The intersection is done on the basis of the International Standard Serial Numbers. The records with a matching journal.[l|e]issn are enriched as follows:
- Each instance gains the hostedby information corresponding to the journal
- If the journal is open access, the access rights of the instances are also set to "Open Access" with "gold" route (because, by construction, the journals we know are open are from DOAJ or the Gold ISSN list)
The hostedby of records that do not match is set to the "Unknown Repository".
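The ISSN-based enrichment described above can be sketched as an index lookup. The field names used here (`issn`, `eissn`, `lissn`, `name`, `openAccess` on journals; `issnPrinted`, `issnOnline`, `issnLinking` on the record's journal) and the function names are assumptions for illustration, not the actual DOIBoost schema.

```python
from typing import Dict, Iterable, Optional

UNKNOWN_REPOSITORY = "Unknown Repository"


def build_issn_index(journals: Iterable[Dict]) -> Dict[str, Dict]:
    """Index every journal under each of its ISSNs."""
    index: Dict[str, Dict] = {}
    for journal in journals:
        for key in ("issn", "eissn", "lissn"):
            if journal.get(key):
                index[journal[key]] = journal
    return index


def find_journal(record: Dict, index: Dict[str, Dict]) -> Optional[Dict]:
    """Return the first journal matching any ISSN of the record."""
    journal = record.get("journal") or {}
    for key in ("issnPrinted", "issnOnline", "issnLinking"):
        issn = journal.get(key)
        if issn and issn in index:
            return index[issn]
    return None


def enrich_hostedby(record: Dict, index: Dict[str, Dict]) -> Dict:
    """Set hostedby on every instance; open journals also force gold OA."""
    match = find_journal(record, index)
    for instance in record.get("instance", []):
        if match is None:
            instance["hostedby"] = UNKNOWN_REPOSITORY
            continue
        instance["hostedby"] = match["name"]
        if match.get("openAccess"):
            instance["accessright"] = {"label": "OPEN", "openAccessRoute": "gold"}
    return record
```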