Revision 1 (Alessia Bardi, 05/11/2021 03:54 PM) → Revision 2/3 (Claudio Atzori, 09/11/2021 03:06 PM)

h1. OpenAIRE entity identifier and PID mapping policy 

 (copied from https://docs.google.com/document/d/1PnvZpmhbanJu3AeOT-zdIyMKIHoGKC4_Z0UtDFDZAeM/edit#) 

 OpenAIRE assigns an internal identifier to each object it collects.  
 By default, the internal identifier is generated as @sourcePrefix::md5(localId)@ where  
 * @sourcePrefix@ is a namespace prefix of 12 chars assigned to the data source at registration time 
 * @localId@ is the identifier assigned to the object by the data source 
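
 The default scheme above can be sketched as follows. This is only an illustration: the function name @default_identifier@, the UTF-8 encoding of the local identifier and the use of the hexadecimal MD5 digest are assumptions, not details confirmed by this page.

```python
import hashlib


def default_identifier(source_prefix: str, local_id: str) -> str:
    """Build an identifier shaped like sourcePrefix::md5(localId).

    Assumptions (not stated in the policy): the local identifier is
    encoded as UTF-8 and the MD5 digest is rendered as lowercase hex.
    """
    if len(source_prefix) != 12:
        # The policy says the namespace prefix is 12 characters long.
        raise ValueError("sourcePrefix must be exactly 12 characters")
    digest = hashlib.md5(local_id.encode("utf-8")).hexdigest()
    return f"{source_prefix}::{digest}"


if __name__ == "__main__":
    # Both the prefix and the local identifier below are made-up examples.
    print(default_identifier("example_____", "oai:example.org:1234"))
```

 Because the hash depends only on @localId@, any change of the local identifier at the source yields a different internal identifier.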

 After years of operation, we can say that: 
 * @localId@ values are unstable 
 * objects can disappear from sources 
 * PIDs provided by sources that are not PID agencies (authoritative sources for a specific type of PID) are often wrong (e.g. a pre-print carrying the DOI of the published version, or DOIs with typos) 

 Therefore, when the record is collected from an authoritative source: 
 * the identifier of the record is generated from the PID, like @pidTypePrefix::md5(lowercase(doi))@ 
 * the PID is added in a @pid@ element of the data model. 

 When the record is collected from a source which is not authoritative for any type of PID: 
 * the identifier of the record is generated as usual from the local identifier; 
 * the PID, if available, is added as an @alternateIdentifier@. 
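
 The two cases above can be sketched as a single decision. This is a minimal sketch under stated assumptions: the function @assign_identity@, its parameters and the dictionary layout are hypothetical, and lowercasing is applied to every PID value although the policy only shows it for DOIs.

```python
import hashlib
from typing import Optional


def md5_hex(value: str) -> str:
    # Assumption: UTF-8 encoding and lowercase hexadecimal digest.
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def assign_identity(
    source_prefix: str,
    local_id: str,
    pid_type_prefix: Optional[str] = None,
    pid_value: Optional[str] = None,
    authoritative: bool = False,
) -> dict:
    """Return a hypothetical record skeleton with id, pid and alternateIdentifier."""
    record = {"id": None, "pid": [], "alternateIdentifier": []}
    if authoritative and pid_type_prefix and pid_value:
        # Authoritative source: the identifier is derived from the PID,
        # and the PID goes into the pid element.
        record["id"] = f"{pid_type_prefix}::{md5_hex(pid_value.lower())}"
        record["pid"].append(pid_value)
    else:
        # Non-authoritative source: the identifier is derived from the
        # local identifier, and the PID (if any) is only an alternate identifier.
        record["id"] = f"{source_prefix}::{md5_hex(local_id)}"
        if pid_value:
            record["alternateIdentifier"].append(pid_value)
    return record


if __name__ == "__main__":
    # All values below are made-up examples.
    print(assign_identity("example_____", "oai:x:1", "doi_________", "10.1234/ABC", True))
    print(assign_identity("example_____", "oai:x:1", "doi_________", "10.1234/ABC", False))
```

 With this layout, the same DOI collected from an authoritative and a non-authoritative source produces two records with different identifiers, and only the authoritative one carries the DOI in @pid@.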

 As of November 2021, the following data sources are used as "PID authorities":  

 |_. PID Type |_. Prefix (12 chars) |_. Authority | 
 | doi | @doi_________@ | Crossref, DataCite, Zenodo | 
 | pmc | @pmc_________@ | Europe PubMed Central, PubMed Central | 
 | pmid | @pmid________@ | Europe PubMed Central, PubMed Central | 
 | arXiv | @arXiv_______@ | arXiv.org e-Print Archive | 
 | handle | @handle______@ | any repository | 
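
 The prefixes in the table appear to be the PID type name right-padded with underscores to 12 characters. A small sketch of that observation follows; the helper @pid_type_prefix@ and the @PID_AUTHORITIES@ mapping are illustrative names, and the padding rule is inferred from the table rather than stated by the policy.

```python
# Authorities per PID type, transcribed from the table above.
PID_AUTHORITIES = {
    "doi": ["Crossref", "DataCite", "Zenodo"],
    "pmc": ["Europe PubMed Central", "PubMed Central"],
    "pmid": ["Europe PubMed Central", "PubMed Central"],
    "arXiv": ["arXiv.org e-Print Archive"],
    "handle": ["any repository"],
}


def pid_type_prefix(pid_type: str, width: int = 12, fill: str = "_") -> str:
    """Right-pad the PID type with underscores up to the prefix width.

    Assumption: the padding rule is inferred from the table, where every
    prefix is the PID type followed by underscores up to 12 characters.
    """
    if len(pid_type) > width:
        raise ValueError(f"PID type longer than {width} characters: {pid_type!r}")
    return pid_type.ljust(width, fill)


if __name__ == "__main__":
    for pid_type in PID_AUTHORITIES:
        print(pid_type, pid_type_prefix(pid_type))
```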

 TODO: WHAT HAPPENS FOR RECORDS WITH BOTH pmc and pmid? pmc wins? 

 OpenAIRE also performs duplicate identification (see the dedicated section for details). 
 All duplicates are "merged" into a "representative record", which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot reuse the identifier of one of the merged records). 
 The following strategy is applied to generate the OpenAIRE identifier of a representative record, so that it is as stable as possible: 

 TODO