Claudio Atzori, 11/11/2021 05:39 PM

DOIBoost¶

Table of contents
DOIBoost
- DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID

DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID¶

The idea behind DOIBoost and its origin can be found in the paper (and related resources) at:

La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11 . Open Access version available at: https://doi.org/10.5281/zenodo.1441071

In short, the goal is to enrich the records available in Crossref with the information available in Unpaywall, Microsoft Academic Graph, and ORCID, intersecting all those datasets by DOI.

Inputs¶

Crossref: dump available to Crossref subscribers via the MetadataPlus service, updated once a month.
Microsoft Academic Graph: version downloaded on 2021-02-15. We plan to take the latest version in Dec 2021, before MAG is retired.
ORCID: baseline dump obtained on 2020-10-13, updated every week from the ORCID public API.
Unpaywall: public database snapshot downloaded in March 2021. Unpaywall updates it twice a year (https://unpaywall.org/products/snapshot).

The construction of the DOIBoost dataset consists of the following phases:

1 Filtering¶

Crossref records are discarded if they meet any of the following criteria:

have a blank title
have one of the following publishers: "Test accounts", "CrossRef Test Account"
have no authors with valid names, where valid means: not blank and different from all strings in this list: List(",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na;")
have "Addie Jackson" as author and "Elsevier BV" as publisher (empirically, these are test records)
do not have one of the following values in the field type:
- "book-section"
- "book"
- "book-chapter"
- "book-part"
- "book-series"
- "book-set"
- "book-track"
- "edited-book"
- "reference-book"
- "monograph"
- "journal-article"
- "dissertation"
- "other"
- "peer-review"
- "proceedings"
- "proceedings-article"
- "reference-entry"
- "report"
- "report-series"
- "standard"
- "standard-series"
- "posted-content"
- "dataset"

Records with type=dataset are mapped into OpenAIRE results of type dataset. All others are mapped as OpenAIRE results of type publication.
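
The filtering rules above can be sketched as a single predicate. This is a minimal illustration, not the DOIBoost implementation: the field names (`title`, `publisher`, `author`, `given`, `family`, `type`) follow the Crossref JSON record layout, and the way author names are joined and lowercased before comparison is an assumption.

```python
# Sketch of the Crossref filtering phase (assumed field names, see lead-in).
INVALID_AUTHOR_NAMES = {",", "none none", "none, none", "none &na;", "(:null)",
                        "test test test", "test test", "test", "&na; &na;"}
TEST_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}
ACCEPTED_TYPES = {
    "book-section", "book", "book-chapter", "book-part", "book-series",
    "book-set", "book-track", "edited-book", "reference-book", "monograph",
    "journal-article", "dissertation", "other", "peer-review", "proceedings",
    "proceedings-article", "reference-entry", "report", "report-series",
    "standard", "standard-series", "posted-content", "dataset",
}


def author_name(author):
    """Join given and family name into one lowercase string (assumption)."""
    parts = [author.get("given") or "", author.get("family") or ""]
    return " ".join(p.strip() for p in parts if p.strip()).lower()


def keep_record(record):
    """Return True if the Crossref record survives all filtering criteria."""
    titles = record.get("title") or []
    if not any(t.strip() for t in titles):
        return False  # blank title
    publisher = record.get("publisher")
    if publisher in TEST_PUBLISHERS:
        return False  # test publisher
    names = [author_name(a) for a in record.get("author") or []]
    if not any(n and n not in INVALID_AUTHOR_NAMES for n in names):
        return False  # no author with a valid name
    if "addie jackson" in names and publisher == "Elsevier BV":
        return False  # empirically a test record
    return record.get("type") in ACCEPTED_TYPES


def result_type(record):
    """Datasets map to OpenAIRE datasets, everything else to publications."""
    return "dataset" if record.get("type") == "dataset" else "publication"
```

A record is kept only when `keep_record` returns True; `result_type` then selects the OpenAIRE result type.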

2 Mapping Crossref properties into the OpenAIRE Research Graph¶

Properties in OpenAIRE results are set based on the logic described in the following table:

TODO: ensure we use the field names of the public dump

OpenAIRE Result field path	Crossref path(s)	Notes
pid	doi, clinical-trial-number, alternative-id	the DOI is normalised and lowercased. Is it correct to map alternative-id(s) as PIDs?
dateofcollection	indexed.datetime
collectedfrom.name		Default value "Crossref"
collectedfrom.id		TODO Default value ID
publisher	publisher
title	title	as main title
title	original-title, short-title	as alternative title
title	subtitle	as subtitle
description	abstract
source	source	only if the record is not of type `book`
source	`${container-title.head} ISBN: ${ISBN.head}`	only if the record is of type `book`
dateofacceptance	issued.datetime or, if not available, created.datetime
relevantdate	created.datetime posted.datetime accepted.datetime published-print published-online
subject	subject	with classid='keywords', i.e. no controlled vocabularies for Crossref subjects
author	author	if available the sequence is mapped to rank and the ORCID is also mapped (as 'orcid_pending')
journal		only for publications
journal.name	container-title.head
journal.eissn	issn-type.value	if issn-type.type='electronic'
journal.issn	issn-type.value	if issn-type.type='print'
journal.vol	volume
journal.sp	page	before '-'
journal.ep	page	after '-'
instance		TODO One instance is created . . .
instance.license	license.URL	If there is a `license.content-version='vor'`, then this is used. Otherwise the first license entry is used.
instance.pid		the list of pids as in the first row of this table
instance.refereed		set to 'peerReviewed' only if `relation.has-review.id` is not empty
instance.instancetype	subtype	mapped using the OpenAIRE vocabularies
instance.collectedfrom		as in result.collectedfrom above
instance.dateofacceptance		as in result.dateofacceptance above
instance.url	URL, link.URL	there may be different URLs in the same instance
instance.accessright.value		based on license and dateofacceptance. UNKNOWN: the license is blank. OPEN ACCESS: the license is a CC, ACS, or APA license (considered OPEN also by Unpaywall, see Unpaywall FAQ for details), or an OUP license, but only after 12 months from the publication date. EMBARGO: OUP license, before 12 months from the publication date. CLOSED: there is a license not covered by the previous cases.
instance.accessright.openaccessroute		only if instance.accessright.value = 'OPEN ACCESS'. Default is 'hybrid'. The route is fixed in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list.
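
Three of the rules in the table above (license selection, page splitting, and access right derivation) can be sketched as follows. This is only an illustration under stated assumptions: the URL substrings used to recognise CC, ACS, APA, and OUP licenses are hypothetical placeholders, and "after 12 months" is read here as "at least 12 full months".

```python
from datetime import date

# Hypothetical substrings used to recognise license families (assumptions).
OPEN_LICENSE_MARKERS = ("creativecommons.org", "pubs.acs.org", "apa.org")
OUP_MARKER = "academic.oup.com"


def pick_license(licenses):
    """Prefer the entry with content-version 'vor', else the first entry."""
    for lic in licenses:
        if lic.get("content-version") == "vor":
            return lic.get("URL")
    return licenses[0].get("URL") if licenses else None


def split_pages(page):
    """Return (start page, end page): the text before and after the first '-'."""
    if not page:
        return None, None
    start, sep, end = page.partition("-")
    end_page = end.strip() or None if sep else None
    return start.strip() or None, end_page


def months_between(start, end):
    """Number of full months elapsed between two dates."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    return months


def access_right(license_url, published, today):
    """Derive the access right value from the license and publication date."""
    if not license_url or not license_url.strip():
        return "UNKNOWN"
    url = license_url.lower()
    if any(marker in url for marker in OPEN_LICENSE_MARKERS):
        return "OPEN ACCESS"
    if OUP_MARKER in url:
        embargo_over = months_between(published, today) >= 12
        return "OPEN ACCESS" if embargo_over else "EMBARGO"
    return "CLOSED"
```

Passing `today` explicitly keeps the embargo check deterministic, which makes the rule easy to test.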

2 Map Crossref links to projects/funders¶

Links to funding available in Crossref are mapped as funding relationships (result -- isProducedBy --> project) applying the following mapping:

funder	grant code	Link to
DOI: {10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665} or name: 'European Union’s Horizon 2020 research and innovation program'	series of 4-9 digits in `award`	Link to H2020 project
DOI: {10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780}	series of 4-9 digits in `award`	Link to FP7 project
DOI: 10.13039/501100000781 OR name: 'European Union's'	series of 4-9 digits in `award`	Link to FP7 or H2020 project
DOI: 10.13039/100000001	`award`	Link to NSF project
DOI: 10.13039/501100001665 OR name: {'The French National Research Agency (ANR)', 'The French National Research Agency'}	`award`	Link to ANR project
DOI: 10.13039/501100002341	`award`	Link to Academy of Finland project
DOI: 10.13039/501100001602	`award`, removing the initial 'SFI' if present	Link to SFI project
DOI: 10.13039/501100000923	`award`	Link to ARC project
DOI: 10.13039/501100000038	`award` is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE	Link to NSERC (`unidentified` project)
DOI: 10.13039/501100000155	`award` is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE	Link to SSHRC (`unidentified` project)
DOI: 10.13039/501100000024	`award` is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE	Link to CIHR (`unidentified` project)
DOI: 10.13039/501100002848 OR name: 'CONICYT, Programa de Formación de Capital Humano Avanzado'	`award`	Link to CONICYT project
DOI: 10.13039/501100003448	series of 4-9 digits in `award`	Link to GSRT project
DOI: 10.13039/501100010198	`award`	Link to SGOV project
DOI: 10.13039/501100004564	series of 4-9 digits in `award`	Link to MESTD project
DOI: 10.13039/501100003407	`award`	Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (`unidentified` project) is also generated
DOI: {10.13039/501100006588, 10.13039/501100004488}	`award`, removing 'Project No' and 'HRZZ' prefix, if present	Link to HRZZ or MZOS project
DOI: 10.13039/501100006769	`award`	Link to Russian Science Foundation project
DOI: 10.13039/501100001711	`award` after '_' and before '/'	Link to SNSF project
DOI: 10.13039/501100004410	`award`	Link to TUBITAK project
DOI: 10.13039/100004440 or name: 'Wellcome Trust Masters Fellowship'	`award`	Link to Wellcome Trust specific project and to the `unidentified` project.
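
The grant code rules in the table above can be illustrated with a few small extractors. This sketch covers only four funders, and the helper names, the handling of awards that lack the expected delimiters, and the sample award strings are all assumptions made for illustration.

```python
import re

# A run of 4 to 9 digits that is not part of a longer digit run.
DIGITS_4_TO_9 = re.compile(r"(?<!\d)\d{4,9}(?!\d)")


def grant_digits(award):
    """Return the first series of 4-9 digits found in the award, if any."""
    match = DIGITS_4_TO_9.search(award)
    return match.group(0) if match else None


def sfi_code(award):
    """Remove the initial 'SFI' if present (whitespace stripping is assumed)."""
    return award[3:].strip() if award.startswith("SFI") else award


def snsf_code(award):
    """Keep the part of the award after '_' and before '/'."""
    after = award.split("_", 1)[1] if "_" in award else award
    return after.split("/", 1)[0]


# Funder DOI -> (funder label, extractor). Only a subset of the table.
EXTRACTORS = {
    "10.13039/100000001": ("NSF", lambda award: award),
    "10.13039/501100001602": ("SFI", sfi_code),
    "10.13039/501100001711": ("SNSF", snsf_code),
    "10.13039/501100003448": ("GSRT", grant_digits),
}


def project_links(funder_doi, awards):
    """Return (funder, project code) pairs for a Crossref funder entry."""
    entry = EXTRACTORS.get(funder_doi)
    if entry is None:
        return []
    funder, extract = entry
    codes = (extract(award) for award in awards)
    return [(funder, code) for code in codes if code]
```

Each pair returned by `project_links` would correspond to one `result -- isProducedBy --> project` relationship.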

3 Intersect Crossref with Unpaywall by DOI (DOIBoost1)¶

The fields we consider from Unpaywall are:

is_oa
best_oa_location
oa_status

The results of Crossref that intersect by DOI with Unpaywall records are enriched with:

TODO: ensure we refer to json fields of the public dump

OpenAIRE Result field path	Unpaywall field path	Notes
result.instance		created only if `is_oa` and a `best_oa_location` is available
result.instance.collectedfrom.name		default value "Unpaywall"
result.instance.collectedfrom.id		default value TODO
result.instance.url	`best_oa_location`
result.instance.license	`best_oa_location.license`
result.instance.pid	`doi`
result.instance.accessright		default value Open Access: we do not add instances if Unpaywall says there is no open version
result.instance.accessright.route	`oa_status`

For the definition of Unpaywall's oa_status refer to the Unpaywall FAQ
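
A minimal sketch of this enrichment step is shown below. It assumes each Unpaywall row is a dictionary and that the URL is read from the `url` sub-field of `best_oa_location`; the function names and the in-memory dictionary keyed by DOI are illustrative, not the actual DOIBoost job.

```python
def unpaywall_instance(row):
    """Build an instance from an Unpaywall row, or None if nothing is open."""
    location = row.get("best_oa_location")
    if not row.get("is_oa") or not location:
        return None  # no instance when Unpaywall reports no open version
    return {
        "collectedfrom": {"name": "Unpaywall"},
        "url": location.get("url"),          # assumed sub-field name
        "license": location.get("license"),
        "pid": row.get("doi"),
        "accessright": {"value": "Open Access", "route": row.get("oa_status")},
    }


def enrich_with_unpaywall(crossref_by_doi, unpaywall_rows):
    """Append Unpaywall instances to the Crossref records with the same DOI."""
    for row in unpaywall_rows:
        doi = (row.get("doi") or "").lower()
        record = crossref_by_doi.get(doi)
        instance = unpaywall_instance(row)
        if record is not None and instance is not None:
            record.setdefault("instance", []).append(instance)
    return crossref_by_doi
```

Records without a matching Unpaywall row are left untouched, so the output of this step still contains every record that passed the Crossref phases.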

4 Intersect DOIBoost1 with ORCID (DOIBoost2)¶

The fields we consider from ORCID are:

doi
authors, a list of authors, each with optional name, surname, creditName, oid

OpenAIRE field path	ORCID path	Notes
pid	doi
author.name	capitalize(name)	only mapped if not blank
author.surname	capitalize(surname)	only mapped if not blank
author.fullname		if name and surname are not blank, they are concatenated (capitalize(name) capitalize(surname)), otherwise we use the creditName
author.pid	oid	as confirmed ORCID identifier (in contrast to the 'orcid_pending' set from Crossref and Unpaywall)

The records are enriched with the ORCID identifiers of their authors.

if the number of authors from Crossref equals the number of authors from ORCID, then we pick the list of authors with more PIDs and try to enrich it with the PIDs from the other list, based on the Jaro-Winkler distance on the authors' names, surnames, or fullnames, depending on which properties are available;
if the numbers of authors differ, then we take the longer list and try to enrich it with the PIDs from the other author list, based on the Jaro-Winkler distance on the authors' names, surnames, or fullnames, depending on which properties are available
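
The two rules above can be sketched as follows. This is a simplified illustration: it compares only the `fullname` property, the 0.9 similarity threshold and the function names are assumptions, and the Jaro-Winkler similarity is a textbook implementation rather than the library used by DOIBoost.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i, ch in enumerate(s1):
        if used1[i]:
            while not used2[k]:
                k += 1
            out_of_order += ch != s2[k]
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def merge_authors(crossref_authors, orcid_authors, threshold=0.9):
    """Pick the base author list and copy PIDs from the other list into it."""
    def pid_count(authors):
        return sum(1 for a in authors if a.get("pid"))

    if len(crossref_authors) == len(orcid_authors):
        crossref_is_base = pid_count(crossref_authors) >= pid_count(orcid_authors)
    else:
        crossref_is_base = len(crossref_authors) > len(orcid_authors)
    base, other = ((crossref_authors, orcid_authors) if crossref_is_base
                   else (orcid_authors, crossref_authors))
    donors = [a for a in other if a.get("pid")]
    enriched = [dict(a) for a in base]
    for author in enriched:
        if author.get("pid") or not donors:
            continue
        name = author.get("fullname", "").lower()
        scored = [(jaro_winkler(name, d.get("fullname", "").lower()), d) for d in donors]
        score, best = max(scored, key=lambda pair: pair[0])
        if score >= threshold:
            author["pid"] = best["pid"]
    return enriched
```

The input lists are not modified; `merge_authors` returns copies of the base list's authors with any matched PIDs filled in.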

TODO: How do we ensure that, if an author comes with an orcid_pending from Crossref and an orcid from ORCID, the latter wins?

5 Intersect DOIBoost2 with Microsoft Academic Graph (DOIBoost3)¶

Important Notes

Only papers with DOI are considered
Since the same DOI can have multiple versions of an item, each with a different MAG PaperId, we take only one per DOI (the last one we process). We call this dataset Papers_distinct
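
A minimal sketch of how Papers_distinct could be derived is shown below. It assumes the MAG Papers rows are dictionaries with `Doi` and `PaperId` keys, and that DOIs are compared case-insensitively; both are assumptions for illustration.

```python
def papers_distinct(papers):
    """Keep one MAG paper per DOI: the last one encountered wins."""
    by_doi = {}
    for paper in papers:
        doi = paper.get("Doi")
        if not doi:
            continue  # only papers with a DOI are considered
        by_doi[doi.lower()] = paper  # later rows overwrite earlier ones
    return list(by_doi.values())
```

Which PaperId survives therefore depends on the processing order, as noted above.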

When mapping MAG records to the OpenAIRE Research Graph, we consider the following MAG tables:

PaperAbstractsInvertedIndex: for the paper abstracts
Authors: for the authors. The MAG data is pre-processed by grouping authors by PaperId
Affiliations and PaperAuthorAffiliations: to generate links between publications and organisations
Journals and ConferenceInstances: joined with Papers_distinct to have the information about the venues where the paper was published
TO BE REMOVED PaperUrls: to create one instance for the OpenAIRE publication
FieldsOfStudy: to add subjects

The records are enriched with:

abstracts
MAG identifiers of authors
affiliation relationships
subjects (MAG FieldsOfStudy)
conference or journal information (in the journal field) TODO: or container, in case of the dump?
[TO BE REMOVED] instances with URL from MAG

TODO: ensure we use the field names of the public dump

OpenAIRE path	MAG table	MAG path(s)	Notes
pid	Papers_distinct	Doi
originalId	Papers_distinct	PaperId
title	Papers_distinct	PaperTitle	as main title
title	Papers_distinct	OriginalTitle	as alternative title
source	Papers_distinct	BookTitle
dateofacceptance	Papers_distinct	Date	first 10 chars, if not blank
publisher	Papers_distinct, Journal	Publisher
description	PaperAbstractsInvertedIndex	IndexedAbstract
journal	ConferenceInstances, Journals
journal.name		DisplayName
journal.conferencePlace		Location
journal.conferenceDate		StartDate and EndDate	Values created as the concatenation of the first 10 chars of each (separated by '-'), if both are not blank
journal.sp		FirstPage
journal.ep		EndPage
journal.issnPrinted		Issn
journal.vol	Papers_distinct	Volume
journal.iss	Papers_distinct	Issue
subject	FieldsOfStudy	subjects	All subjects from MAG are set with a dedicated marker in the classname/classid 'Microsoft Academic Graph Classification'/MAG. We create one subject per DisplayName, per MainType and, if the MainType is in the format `x.y`, one subject also for the first token (i.e. `x`)
subject.value		subjects.DisplayName, subjects.MainType, split(subjects.MainType, '.').head	All subjects from MAG are set with a dedicated marker in the classname/classid 'Microsoft Academic Graph Classification'/MAG
author	Authors, PaperAuthorAffiliations
author.rank		sequenceNumber
author.fullName		DisplayName	if not blank
author.affiliation		affiliation	if not null
author.pid		AuthorId	MAG id of the author as URL. TO BE REMOVED?
instance	PaperUrls		TO BE REMOVED. Currently maps to the MAG URL and to any URLs in SourceUrl
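
Three of the MAG mappings in the table above can be sketched as small helpers. This is an illustration under assumptions: the abstract is assumed to be a JSON string with `IndexLength` and `InvertedIndex` keys, and the function names are hypothetical.

```python
import json


def abstract_from_inverted_index(indexed_abstract):
    """Rebuild the abstract text from a MAG inverted index (JSON string)."""
    data = json.loads(indexed_abstract)
    words = [""] * data["IndexLength"]
    for word, positions in data["InvertedIndex"].items():
        for position in positions:
            words[position] = word
    return " ".join(w for w in words if w)


def subject_values(display_name, main_type):
    """One subject per DisplayName, per MainType, and per first MainType token."""
    values = [display_name]
    if main_type:
        values.append(main_type)
        if "." in main_type:
            values.append(main_type.split(".")[0])
    return values


def conference_date(start_date, end_date):
    """Concatenate the first 10 chars of each date, if both are not blank."""
    if start_date and end_date:
        return f"{start_date[:10]}-{end_date[:10]}"
    return None
```

Each value returned by `subject_values` would become one subject carrying the 'Microsoft Academic Graph Classification'/MAG marker.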

6 Enrich DOIBoost3 with hosting data sources (`hostedby`) and access right information¶

In this phase, we intersect DOIBoost3 with a dataset composed of journals from OpenAIRE, Crossref, and the ISSN gold list. Each journal comes with its International Standard Serial Numbers (issn, eissn, lissn) and, when available, a flag that tells if the journal is open access. The intersection is done on the basis of the International Standard Serial Numbers. The records with a journal.[l|e]issn that match are enriched as follows:

Each instance gains the hostedby information corresponding to the journal
If the journal is open access, the access rights of the instances are also set to "Open Access" with "gold" route (because by construction, the journals we know are open are from DOAJ or Gold ISSN list)

The hostedby of records that do not match is set to the "Unknown Repository".
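
This phase can be sketched as a lookup by ISSN. The dictionary layout of journals and records (`issn`, `eissn`, `lissn`, `name`, `open_access`) is an assumption made for illustration and does not reflect the actual OpenAIRE data model.

```python
UNKNOWN_REPOSITORY = "Unknown Repository"
ISSN_KEYS = ("issn", "eissn", "lissn")


def build_issn_index(journals):
    """Index journals by each of their serial numbers."""
    index = {}
    for journal in journals:
        for key in ISSN_KEYS:
            if journal.get(key):
                index[journal[key]] = journal
    return index


def apply_hostedby(record, issn_index):
    """Set hostedby, and gold Open Access when the journal is flagged open."""
    journal = record.get("journal") or {}
    match = None
    for key in ISSN_KEYS:
        match = issn_index.get(journal.get(key))
        if match is not None:
            break
    for instance in record.get("instance", []):
        if match is None:
            instance["hostedby"] = UNKNOWN_REPOSITORY
            continue
        instance["hostedby"] = match["name"]
        if match.get("open_access"):
            instance["accessright"] = {"value": "Open Access", "route": "gold"}
    return record
```

Building the index once and reusing it for every record keeps the per-record work to a few dictionary lookups.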
