Alessia Bardi, 10/11/2021 01:00 PM
DOIBoost¶
DOIBoost: Crossref, Unpaywall, Microsoft Academic Graph, ORCID¶
The idea behind DOIBoost and its origin can be found in the paper (and related resources) at:
- La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11 . Open Access version available at: https://doi.org/10.5281/zenodo.1441071
In short, the goal is to enrich the records available in Crossref with the information available in Unpaywall, Microsoft Academic Graph, and ORCID, by intersecting all those datasets by DOI.
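As a rough illustration of what "intersecting by DOI" could look like, here is a minimal Python sketch. It is not the DOIBoost implementation: the record shape (plain dictionaries with a `doi` key) and the function names `normalize_doi` and `enrich_by_doi` are assumptions made for this example. The only fact relied on is that DOIs are case-insensitive, so they are compared lowercased.

```python
def normalize_doi(doi):
    """DOIs are case-insensitive: compare them stripped and lowercased."""
    return doi.strip().lower()


def enrich_by_doi(crossref_records, other_records, fields):
    """Keep every Crossref record; copy `fields` from the record with the same DOI.

    Hypothetical helper: records are assumed to be dicts with a "doi" key.
    """
    other_by_doi = {normalize_doi(r["doi"]): r for r in other_records}
    enriched = []
    for record in crossref_records:
        merged = dict(record)  # copy, so the input record is left untouched
        match = other_by_doi.get(normalize_doi(record["doi"]))
        if match is not None:
            for field in fields:
                if field in match:
                    merged[field] = match[field]
        enriched.append(merged)
    return enriched
```

Crossref records without a match are kept unchanged, which reflects the idea that Crossref is the base dataset and the other sources only add information.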
Inputs¶
- Crossref: dump available to Crossref subscribers via MetadataPlus service, updated once a month.
- Microsoft Academic Graph: version downloaded on 2021-02-15. We plan to take the latest version in December 2021, before MAG is retired.
- ORCID: baseline dump obtained on XX/XX/XXXX from URL, regularly updated every XX XX from the ORCID API available at URL
- Unpaywall: dump published on XX/XX/XXXX.
The generation of DOIBoost consists of the following phases:
1 Filter out Crossref records that¶
- have blank title
- have one of the following publishers: "Test accounts", "CrossRef Test Account"
- have no authors with valid names, where valid means: not blank and different from all strings in this list:
List(",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na;")
- have "Addie Jackson" as author and "Elsevier BV" as publisher (empirically, these appear to be test records)
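The filtering rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual DOIBoost code: the record shape (a dictionary with `title`, `publisher`, and `authors` as a list of full-name strings), the names `has_valid_name` and `keep_record`, and the case-insensitive comparison against the invalid-name list are all assumptions.

```python
# Invalid author names, copied from the list above.
INVALID_NAMES = {",", "none none", "none, none", "none &na;", "(:null)",
                 "test test test", "test test", "test", "&na; &na;"}
TEST_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}


def has_valid_name(author_name):
    """A name is valid if it is not blank and not in the invalid list.

    Comparing in lowercase is an assumption of this sketch.
    """
    name = (author_name or "").strip()
    return bool(name) and name.lower() not in INVALID_NAMES


def keep_record(record):
    """Return False for records that phase 1 would filter out."""
    title = (record.get("title") or "").strip()
    if not title:
        return False  # blank title
    publisher = record.get("publisher") or ""
    if publisher in TEST_PUBLISHERS:
        return False  # known test publisher
    authors = record.get("authors") or []
    if not any(has_valid_name(a) for a in authors):
        return False  # no author with a valid name
    if publisher == "Elsevier BV" and "Addie Jackson" in authors:
        return False  # empirically identified test records
    return True
```

A call such as `[r for r in records if keep_record(r)]` would then yield the records that proceed to the next phase.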
2 Map Crossref links to projects/funders¶
Links to funding available in Crossref are mapped as funding relationships (result -- isProducedBy --> project) by applying the following mapping:
funder | grant code | Link to |
DOI: {10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665} OR name: 'European Union’s Horizon 2020 research and innovation program' | series of 4-9 digits in award | Link to H2020 project |
DOI: {10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780} | series of 4-9 digits in award | Link to FP7 project |
DOI: 10.13039/501100000781 OR name: 'European Union's' | series of 4-9 digits in award | Link to FP7 or H2020 project |
DOI: 10.13039/100000001 | award | Link to NSF project |
DOI: 10.13039/501100001665 OR name: {'The French National Research Agency (ANR)', 'The French National Research Agency'} | award | Link to ANR project |
DOI: 10.13039/501100002341 | award | Link to Academy of Finland project |
DOI: 10.13039/501100001602 | award, removing the initial 'SFI' if present | Link to SFI project |
DOI: 10.13039/501100000923 | award | Link to ARC project |
DOI: 10.13039/501100000038 | award (ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE) | Link to NSERC (unidentified project) |
DOI: 10.13039/501100000155 | award (ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE) | Link to SSHRC (unidentified project) |
DOI: 10.13039/501100000024 | award (ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE) | Link to CIHR (unidentified project) |
DOI: 10.13039/501100002848 OR name: 'CONICYT, Programa de Formación de Capital Humano Avanzado' | award | Link to CONICYT project |
DOI: 10.13039/501100003448 | series of 4-9 digits in award | Link to GSRT project |
DOI: 10.13039/501100010198 | award | Link to SGOV project |
DOI: 10.13039/501100004564 | series of 4-9 digits in award | Link to MESTD project |
DOI: 10.13039/501100003407 | award | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (unidentified project) is also generated |
DOI: {10.13039/501100006588, 10.13039/501100004488} | award, removing 'Project No' and 'HRZZ' prefix, if present | Link to HRZZ or MZOS project |
DOI: 10.13039/501100006769 | award | Link to Russian Science Foundation project |
DOI: 10.13039/501100001711 | award after '_' and before '/' | Link to SNSF project |
DOI: 10.13039/501100004410 | award | Link to TUBITAK project |
DOI: 10.13039/100004440 OR name: 'Wellcome Trust Masters Fellowship' | award | Link to Wellcome Trust specific project and to the unidentified project. |
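To make the "grant code" column more concrete, the following Python sketch shows how a few of the extraction rules from the table might be written. It is an illustration only: the function names and the `EXTRACTORS` dispatch table are hypothetical. The mapping does not say what to do with digit runs longer than 9 digits or with SNSF awards lacking '_'; this sketch rejects the former and falls back to the whole award for the latter.

```python
import re

# A run of 4 to 9 digits that is not part of a longer digit run (assumption).
DIGITS_4_TO_9 = re.compile(r"(?<!\d)\d{4,9}(?!\d)")


def digits_code(award):
    """'series of 4-9 digits in award' (H2020, FP7, GSRT, MESTD rows)."""
    match = DIGITS_4_TO_9.search(award)
    return match.group(0) if match else None


def sfi_code(award):
    """'award, removing the initial SFI if present' (SFI row)."""
    award = award.strip()
    return award[3:].strip() if award.startswith("SFI") else award


def snsf_code(award):
    """'award after _ and before /' (SNSF row)."""
    after = award.split("_", 1)[1] if "_" in award else award
    return after.split("/", 1)[0]


# Hypothetical dispatch table keyed by funder DOI (subset of the mapping above).
EXTRACTORS = {
    "10.13039/501100003448": digits_code,  # GSRT
    "10.13039/501100001602": sfi_code,     # SFI
    "10.13039/501100001711": snsf_code,    # SNSF
    "10.13039/100000001": str.strip,       # NSF: award used as is
}


def project_code(funder_doi, award):
    """Return the project code for a known funder, or None otherwise."""
    extractor = EXTRACTORS.get(funder_doi)
    return extractor(award) if extractor else None
```

Keeping one small function per rule and a lookup table per funder DOI mirrors the structure of the mapping table, so adding a funder means adding one row.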
3 Intersect Crossref with Unpaywall by DOI (DOIBoost1)¶
The records are enriched with:
- TODO: AUTHORS?
- one instance with:
  - the best_oa_location of Unpaywall
  - color set as follows: green if the host is a repository; gold if the host is publisher and the journal is open access; hybrid if the host is publisher, the journal is not open access but there is a license; bronze if no license is available.
- the
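The color rules of this phase can be read as a short decision chain. The Python sketch below is one possible reading, not the DOIBoost code: it assumes an Unpaywall record is a dictionary whose `best_oa_location` carries `host_type` and `license`, and which has a top-level `journal_is_oa` flag. These field names, and the function name `oa_color`, are assumptions of this example.

```python
def oa_color(unpaywall_record):
    """Derive the open access color from an Unpaywall record (assumed shape)."""
    location = unpaywall_record.get("best_oa_location")
    if not location:
        return None  # no open access location, so no instance to color
    if location.get("host_type") == "repository":
        return "green"   # hosted by a repository
    if unpaywall_record.get("journal_is_oa"):
        return "gold"    # publisher-hosted, open access journal
    if location.get("license"):
        return "hybrid"  # publisher-hosted, journal not open access, licensed
    return "bronze"      # publisher-hosted, no license available
```

The order of the checks matters: the repository test comes first, so a repository copy is green regardless of the journal's status.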
4 Intersect DOIBoost1 with ORCID (DOIBoost2)¶
The records are enriched with the ORCID identifiers of their authors.
5 Intersect DOIBoost2 with Microsoft Academic Graph (DOIBoost3)¶
The records are enriched with:
- abstracts
- MAG identifiers of authors
- affiliation relationships
- subjects (MAG FieldsOfStudy)
- conference or journal information (in the journal field) TODO: or container, in case of the dump?
- [TO BE REMOVED] instances with URL from MAG
6 Enrich DOIBoost3 with hosting data sources (hostedby) and access right information¶
In this phase we intersect DOIBoost3 with a dataset composed of journals from OpenAIRE, Crossref, and the ISSN gold list. Each journal comes with its International Standard Serial Numbers (issn, eissn, lissn) and, when available, a flag that indicates whether the journal is open access. The intersection is done on the basis of the International Standard Serial Numbers. The records with a journal.[l|e]issn that match are enriched as follows:
- Each instance gains the hostedby information.
- If the journal is open access, the access rights of the instances are also set to "Open Access" with "gold" route.
The hostedby of records that do not match is set to the "Unknown Repository".
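A minimal Python sketch of this phase is given below, under several assumptions that are not taken from the DOIBoost code: journals and records are plain dictionaries, a journal has `name` and `open_access` keys, a record has a `journal` dictionary and a list of `instances`, and the field names `access_right` and `route` are invented for the example.

```python
UNKNOWN_REPOSITORY = "Unknown Repository"
ISSN_KEYS = ("issn", "eissn", "lissn")


def build_issn_index(journals):
    """Index journals by each of their ISSNs (normalized to upper case)."""
    index = {}
    for journal in journals:
        for key in ISSN_KEYS:
            value = journal.get(key)
            if value:
                index[value.strip().upper()] = journal
    return index


def enrich_hosting(record, issn_index):
    """Set hostedby on every instance; mark gold open access when applicable."""
    journal = record.get("journal") or {}
    match = None
    for key in ISSN_KEYS:
        value = journal.get(key)
        if value and value.strip().upper() in issn_index:
            match = issn_index[value.strip().upper()]
            break
    enriched = dict(record)
    instances = []
    for instance in record.get("instances", []):
        instance = dict(instance)  # copy, so the input is left untouched
        instance["hostedby"] = match["name"] if match else UNKNOWN_REPOSITORY
        if match and match.get("open_access"):
            instance["access_right"] = "Open Access"
            instance["route"] = "gold"
        instances.append(instance)
    enriched["instances"] = instances
    return enriched
```

Building the index once and looking up each record's ISSNs keeps the matching a simple dictionary lookup per record.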