Revision 1 (Alessia Bardi, 05/11/2021 03:54 PM) → Revision 2/3 (Claudio Atzori, 09/11/2021 03:06 PM)
h1. OpenAIRE entity identifier and PID mapping policy

(copied from https://docs.google.com/document/d/1PnvZpmhbanJu3AeOT-zdIyMKIHoGKC4_Z0UtDFDZAeM/edit#)

OpenAIRE assigns an internal identifier to each object it collects. By default, the internal identifier is generated as @sourcePrefix::md5(localId)@ where

* @sourcePrefix@ is a namespace prefix of 12 chars assigned to the data source at registration time
* @localId@ is the identifier assigned to the object by the data source

After years of operation, we can say that:

* @localId@ values are unstable
* objects can disappear from sources
* PIDs provided by sources that are not PID agencies (authoritative sources for a specific type of PID) are often wrong (e.g. pre-print with the DOI of the published version, DOIs with typos)

Therefore, when the record is collected from an authoritative source:

* the identifier of the record is built from the PID, like @pidTypePrefix::md5(lowercase(doi))@
* the PID is added in a @pid@ element of the data model.

When the record is collected from a source which is not authoritative for any type of PID:

* the identifier of the record is built as usual from the local identifier;
* the PID, if available, is added as an @alternateIdentifier@

As of November 2021, the following data sources are used as "PID authorities":

|_. PID Type |_. Prefix (12 chars) |_. Authority |
| doi | @doi_________@ | Crossref, Datacite, Zenodo |
| pmc | @pmc_________@ | Europe PubMed Central, PubMed Central |
| pmid | @pmid________@ | Europe PubMed Central, PubMed Central |
| arXiv | @arXiv_______@ | arXiv.org e-Print Archive |
| handle | @handle______@ | any repository |

TODO: WHAT HAPPENS FOR RECORDS WITH BOTH pmc and pmid? pmc wins?

OpenAIRE also performs duplicate identification (see dedicated section for details). All duplicates are "merged" together in a "representative record", which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated records).

The following strategy is applied to generate the OpenAIRE identifier of a representative record, to ensure it is as stable as possible: TODO
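
The identifier rules for collected records (not the representative-record strategy, which is still TODO) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the OpenAIRE implementation: the function names, the example source prefix @od______1234@ and the sample local identifier are hypothetical. The policy shows lowercasing only for DOIs, so the sketch lowercases DOIs and leaves other PID types unchanged, which is an assumption. Deciding whether a source is authoritative for a PID type is left to the caller.

```python
import hashlib
from typing import Optional, Tuple

PREFIX_LENGTH = 12

# PID type -> 12-char prefix, copied from the "PID authorities" table.
PID_PREFIXES = {
    "doi": "doi_________",
    "pmc": "pmc_________",
    "pmid": "pmid________",
    "arXiv": "arXiv_______",
    "handle": "handle______",
}


def md5_hex(value: str) -> str:
    """Return the hexadecimal MD5 digest of a UTF-8 encoded string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def identifier_from_local_id(source_prefix: str, local_id: str) -> str:
    """Default rule: sourcePrefix::md5(localId)."""
    if len(source_prefix) != PREFIX_LENGTH:
        raise ValueError(f"source prefix must be {PREFIX_LENGTH} chars")
    return f"{source_prefix}::{md5_hex(local_id)}"


def identifier_from_pid(pid_type: str, pid_value: str) -> str:
    """Authoritative-source rule: pidTypePrefix::md5(pid)."""
    prefix = PID_PREFIXES[pid_type]  # KeyError for unknown PID types
    # Assumption: only DOIs are lowercased, as in the policy's example;
    # normalisation of other PID types is not specified there.
    normalized = pid_value.lower() if pid_type == "doi" else pid_value
    return f"{prefix}::{md5_hex(normalized)}"


def record_identifier(
    source_prefix: str,
    local_id: str,
    authoritative_pid: Optional[Tuple[str, str]] = None,
) -> str:
    """Pick the rule: PID-based if the caller passes a (type, value) pair
    from a source that is authoritative for it, local-id-based otherwise."""
    if authoritative_pid is not None:
        pid_type, pid_value = authoritative_pid
        return identifier_from_pid(pid_type, pid_value)
    return identifier_from_local_id(source_prefix, local_id)


if __name__ == "__main__":
    # Hypothetical repository record: falls back to the local identifier.
    print(record_identifier("od______1234", "oai:example.org:42"))
    # Hypothetical record from a DOI authority: identifier built from the DOI.
    print(record_identifier("od______1234", "ignored", ("doi", "10.1234/ABC")))
```

With this shape, two authoritative records that carry the same DOI in different letter case get the same identifier, while a record from a non-authoritative source keeps an identifier tied to its local id and would expose the DOI only as an @alternateIdentifier@.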