OpenAIRE entity identifier and PID mapping policy » History » Version 3

Claudio Atzori, 07/12/2021 02:21 PM

1 1 Alessia Bardi
h1. OpenAIRE entity identifier and PID mapping policy

(copied from https://docs.google.com/document/d/1PnvZpmhbanJu3AeOT-zdIyMKIHoGKC4_Z0UtDFDZAeM/edit#)

OpenAIRE assigns an internal identifier to each object it collects.
By default, the internal identifier is generated as @sourcePrefix::md5(localId)@, where:

* @sourcePrefix@ is a 12-character namespace prefix assigned to the data source at registration time
* @localId@ is the identifier assigned to the object by the data source
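
The default scheme above can be sketched as follows. This is a minimal illustration, assuming that the MD5 digest is computed over the UTF-8 bytes of @localId@ and rendered as lowercase hexadecimal (neither detail is stated here). The function name, the prefix and the local identifier are hypothetical.

```python
import hashlib

PREFIX_LENGTH = 12  # namespace prefixes are 12 characters long


def default_identifier(source_prefix: str, local_id: str) -> str:
    """Build sourcePrefix::md5(localId) for a collected object."""
    if len(source_prefix) != PREFIX_LENGTH:
        raise ValueError(
            f"prefix must be {PREFIX_LENGTH} characters: {source_prefix!r}"
        )
    # Assumption: UTF-8 encoding in, lowercase hex digest out.
    digest = hashlib.md5(local_id.encode("utf-8")).hexdigest()
    return f"{source_prefix}::{digest}"


# Hypothetical prefix and local identifier, for illustration only.
example_prefix = "example".ljust(PREFIX_LENGTH, "_")
print(default_identifier(example_prefix, "oai:example.org:1234"))
```

Because the result depends only on the prefix and the local identifier, any change to @localId@ at the source produces a different OpenAIRE identifier for the same object.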

After years of operation, we have observed that:

* @localId@ values are unstable
* objects can disappear from sources
* PIDs provided by sources that are not PID agencies (authoritative sources for a specific type of PID) are often wrong (e.g. a pre-print carrying the DOI of the published version, or DOIs with typos)

Therefore, when a record is collected from an authoritative source:

* the identity of the record is built from the PID, e.g. @pidTypePrefix::md5(lowercase(doi))@;
* the PID is added in a @pid@ element of the data model.

When a record is collected from a source that is not authoritative for any type of PID:

* the identity of the record is generated as usual from the local identifier;
* the PID, if available, is added as an @alternateIdentifier@.
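
The two cases above can be sketched as follows. This is an illustration under stated assumptions, not the OpenAIRE implementation: the @Record@ class, the @is_authoritative@ flag and the function names are hypothetical, MD5 is assumed to take UTF-8 bytes and produce lowercase hex, and lowercasing (shown in the text only for DOIs) is applied to whatever PID is passed in. The DOI and local identifier are made up.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


def md5_hex(value: str) -> str:
    # Assumption: UTF-8 bytes in, lowercase hex out.
    return hashlib.md5(value.encode("utf-8")).hexdigest()


@dataclass
class Record:
    """Hypothetical, simplified record; not the OpenAIRE data model."""
    identifier: str
    pid: List[str] = field(default_factory=list)
    alternate_identifier: List[str] = field(default_factory=list)


def build_record(source_prefix: str, local_id: str, is_authoritative: bool,
                 pid_type_prefix: Optional[str] = None,
                 pid_value: Optional[str] = None) -> Record:
    if is_authoritative and pid_type_prefix and pid_value:
        # Authoritative source: the identity is derived from the PID,
        # and the PID itself is stored in the pid element.
        identifier = f"{pid_type_prefix}::{md5_hex(pid_value.lower())}"
        return Record(identifier, pid=[pid_value])
    # Non-authoritative source: the identity is derived from the local
    # identifier; a PID, if present, is only an alternate identifier.
    record = Record(f"{source_prefix}::{md5_hex(local_id)}")
    if pid_value:
        record.alternate_identifier.append(pid_value)
    return record


source_prefix = "example".ljust(12, "_")  # hypothetical data source prefix
doi_prefix = "doi".ljust(12, "_")         # DOI prefix, padded to 12 chars

authoritative = build_record(source_prefix, "rec-1", True, doi_prefix, "10.1234/ABC")
other = build_record(source_prefix, "rec-1", False, pid_value="10.1234/ABC")
print(authoritative.identifier, authoritative.pid)
print(other.identifier, other.alternate_identifier)
```

In this sketch the same DOI yields a PID-based identifier only when it comes from an authoritative source; from any other source it is carried along without influencing the identity of the record.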

As of November 2021, the following data sources are used as "PID authorities":

|_. PID Type |_. Prefix (12 chars) |_. Authority |
| doi | @doi_________@ | Crossref, DataCite, Zenodo |
| pmc | @pmc_________@ | Europe PubMed Central, PubMed Central |
| pmid | @pmid________@ | Europe PubMed Central, PubMed Central |
| arXiv | @arXiv_______@ | arXiv.org e-Print Archive |
| handle | @handle______@ | any repository |
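
The prefixes in the table all appear to follow one pattern: the PID type name, right-padded with underscores to 12 characters. A minimal sketch of that pattern is shown here; the padding rule is inferred from the table rather than stated as policy, and the helper name is hypothetical.

```python
PREFIX_LENGTH = 12

# PID types listed in the table above.
PID_TYPES = ["doi", "pmc", "pmid", "arXiv", "handle"]


def pid_type_prefix(pid_type: str) -> str:
    """Pad a PID type name with underscores up to 12 characters."""
    if len(pid_type) > PREFIX_LENGTH:
        raise ValueError(f"PID type longer than {PREFIX_LENGTH} chars: {pid_type!r}")
    return pid_type.ljust(PREFIX_LENGTH, "_")


PREFIXES = {pid_type: pid_type_prefix(pid_type) for pid_type in PID_TYPES}

for pid_type, prefix in PREFIXES.items():
    print(f"{pid_type:<8}{prefix}")
```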

TODO: what happens for records with both pmc and pmid? Does pmc win?

OpenAIRE also performs duplicate identification (see the dedicated section for details).
All duplicates are merged into a "representative record", which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot reuse the identifier of one of the aggregated records).
The following strategy is applied to generate the OpenAIRE identifier of a representative record, so that it is as stable as possible:

TODO