Task 8.2 LOD services

Leader: UBONN. Participants: ARC, CNR

D8.2 LOD Services

  • OpenAIRE aims to increase its technical interoperability, engage with additional user communities, and explore synergies with, and add value to, related open content initiatives (e.g. Open Educational Resources). As a major step towards this goal, we are now exposing all data from the OpenAIRE Information Space in a way that ensures maximum reusability for developers of third-party applications and services. Such applications and services may include statistical analyses beyond those in the scope of OpenAIRE2020 itself, efforts to aggregate OpenAIRE data with other data such as research data, or tools that support scientific writing and communication, e.g. bibliography managers. Our target audience also comprises end users with sufficient technical skills, whereas the existing OpenAIRE web portal addresses a non-technical audience.
  • We achieve maximum reusability by following best practices for publishing data in a self-describing way on the Web under an open license, i.e. as Linked Open Data (LOD), and by offering three different ways to access the data: 1. exploring data records about individual entities in the OpenAIRE Information Space, with the possibility to follow links to related entities, 2. downloading an all-in-one data dump, 3. interactively querying the data. We are planning to add further services, e.g. for visual exploration.
  • Our focus so far has been on laying the foundations for publishing the OpenAIRE data as LOD: 1. implementing the OpenAIRE data model as a linked data vocabulary (also called an RDF schema), 2. mapping all entities from the OpenAIRE Information Space to RDF resources, i.e. to the graph-based model of Linked Data, 3. publishing the data in the three ways introduced above. Immediate next steps, currently in preparation, include 4. automating the mapping, and 5. interlinking the OpenAIRE LOD with other Linked Open Datasets to further enhance its value.
    1. Vocabulary specification: Given the rich OpenAIRE data model, the main challenges were to identify the most suitable vocabularies for reuse and to define our own OpenAIRE-specific vocabulary terms for fields not covered by existing Linked Data vocabularies.
    2. Mapping existing data sources to RDF: The data of the OpenAIRE Information Space is available in three source formats: HBase, XML and CSV. A translation to RDF can be implemented from each of them. We aimed to choose the one source format that allows for the best translation in terms of performance and of the maintainability of the mapping definition.
    3. Automating the mapping: The mapping of the most recent OpenAIRE data to LOD is not yet automated. However, having chosen a high-performance mapping implementation and deployed it on the OpenAIRE server infrastructure, we are prepared to automate the process so that up-to-date LOD is generated every day.
    4. Interlinking the OpenAIRE LOD to related datasets: There exist multiple other Linked Open Datasets covering the same domain as OpenAIRE, or closely related domains. We are planning to interlink the OpenAIRE LOD with these datasets to enable more comprehensive information retrieval tasks.
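
The first two steps above can be illustrated with a minimal sketch. It is not the project's actual mapping: the record fields, the `BASE` and `OAV` namespaces and the `accessMode` term are hypothetical placeholders, and only the Dublin Core Terms namespace is a real, published vocabulary. The sketch reuses an existing term where one fits, falls back to an own term otherwise, and serializes the result as N-Triples.

```python
# Only DCTERMS is a real vocabulary; OAV and BASE are hypothetical placeholders.
DCTERMS = "http://purl.org/dc/terms/"
OAV = "http://example.org/openaire/vocab#"      # stand-in for OpenAIRE-specific terms
BASE = "http://example.org/openaire/resource/"  # stand-in for entity URIs

# Field -> predicate table: reuse an existing term where one fits,
# fall back to an own term otherwise. Field names are assumed.
FIELD_TO_PREDICATE = {
    "title": DCTERMS + "title",
    "date": DCTERMS + "issued",
    "access_mode": OAV + "accessMode",
}


def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def record_to_ntriples(record):
    """Translate one flat record (a dict) into a list of N-Triples lines.

    Assumes record["id"] is already safe to embed in a URI.
    """
    subject = "<{}{}>".format(BASE, record["id"])
    lines = []
    for field, predicate in FIELD_TO_PREDICATE.items():
        value = record.get(field)
        if value:  # skip missing or empty fields
            lines.append('{} <{}> "{}" .'.format(
                subject, predicate, escape_literal(value)))
    return lines


example = {"id": "pub-1", "title": 'A "quoted" title', "date": "2015-06-01"}
for line in record_to_ntriples(example):
    print(line)
```

Keeping the field-to-predicate table separate from the serialization code is one way to keep the mapping definition maintainable when the data model or the chosen vocabularies change.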

Subtasks

RDF production:

  • map all metadata objects in the OpenAIRE Information Space onto suitable standard vocabularies (e.g. Dublin Core, SIOC, EDM, CERIF LD)
  • make these metadata objects available as Linked Open Data, in the form of data dumps published at regular intervals, with incremental updates published more frequently
  • liaise with all relevant communities (PSI, DBpedia, LOD, W3C SWEO etc.) for leverage and outreach to additional stakeholders and multipliers (UBONN and ARC)
  • precondition: CNR will provide technical support for synchronizing the content of the OpenAIRE Information Space with the LOD services and vice versa, in case content can be moved from the enriched LOD representation back to the OpenAIRE Information Space.
  • expected outcome: OpenAIRE will increase its technical interoperability, engage with additional user communities, and explore synergies with, and add value to, related open content initiatives (e.g. Open Educational Resources).
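
One possible way to derive the incremental updates mentioned above is to compare two consecutive dumps line by line. The sketch below is an assumption, not the deployed mechanism: it presumes N-Triples dumps without blank nodes and with a stable serialization, so that equal triples are equal strings. The function name `diff_dumps` is hypothetical.

```python
def diff_dumps(old_lines, new_lines):
    """Return (added, removed) triples between two N-Triples dumps.

    Line-based comparison: only valid if both dumps serialize the same
    triple identically and contain no blank nodes (an assumption).
    """
    old = {line.strip() for line in old_lines if line.strip()}
    new = {line.strip() for line in new_lines if line.strip()}
    return sorted(new - old), sorted(old - new)


# Made-up example triples, for illustration only.
yesterday = [
    '<http://example.org/r/1> <http://purl.org/dc/terms/title> "Old title" .',
    '<http://example.org/r/2> <http://purl.org/dc/terms/title> "Unchanged" .',
]
today = [
    '<http://example.org/r/1> <http://purl.org/dc/terms/title> "New title" .',
    '<http://example.org/r/2> <http://purl.org/dc/terms/title> "Unchanged" .',
]

added, removed = diff_dumps(yesterday, today)
print("added:", added)
print("removed:", removed)
```

For dumps too large to hold in memory, the same idea could be applied to sorted files instead of in-memory sets.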

Interlinking:

  • link OpenAIRE LOD objects with other Linked Data resources such as:
    DBLP
    DBLP++
    CEUR-WS.org
    CiteSeer
    SWDF
    DBpedia
    BibBase
    ACM
    IEEE
    BNB
    COLINDA
    GeoNames
    ePrints
    The European Library
    B3Kat
    CORE
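
As an illustration of what interlinking could look like, the sketch below matches entities from two datasets on a shared identifier and emits `owl:sameAs` links. Matching on DOIs is an assumption made for this example; the task description does not prescribe a linking method. All entity URIs are made up, and `normalize_doi` and `link_by_doi` are hypothetical helper names.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

DOI_PREFIXES = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(doi):
    """Lower-case a DOI and strip a common resolver or scheme prefix."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def link_by_doi(local, remote):
    """Emit owl:sameAs triples for entities that share a DOI.

    local, remote: dicts mapping an entity URI to its DOI string.
    """
    remote_index = {}
    for uri, doi in remote.items():
        remote_index.setdefault(normalize_doi(doi), []).append(uri)
    links = []
    for uri, doi in local.items():
        for target in remote_index.get(normalize_doi(doi), []):
            links.append("<{}> <{}> <{}> .".format(uri, OWL_SAME_AS, target))
    return links


# Made-up URIs and DOIs, for illustration only.
ours = {"http://example.org/openaire/resource/pub-1": "doi:10.1000/ABC"}
theirs = {"http://example.net/other/rec-42": "https://doi.org/10.1000/abc"}
for triple in link_by_doi(ours, theirs):
    print(triple)
```

Exact identifier matching only covers records that carry such identifiers; datasets without them would need fuzzier matching, e.g. on titles and author names.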

Task Timeline (Including Deliverables & Milestones)

  • M6: D8.2 LOD Services. The deliverable will describe the technical deployment of the LOD services and their integration with the OpenAIRE Information Space, in terms of data (mappings from the OpenAIRE data model to LOD structure and standard vocabularies) and workflows. [UBONN, R]
  • M12, M24, M36: M8.1 LOD services. The service software will be released in three stages, in order to match the three main releases of the OpenAIRE data model.

Areas of priority (where to concentrate first)

  1. Specify mapping from the OpenAIRE data model to LOD vocabularies
  2. Explore technical ways of producing LOD:
    1. From the CSV so far produced as intermediate files for generating statistics (potential issue: might not contain sufficiently complete information)
    2. From CSV data generated by a modified implementation, which keeps all required information (potential issue: not efficient to process with off-the-shelf CSV→RDF tools because of redundancies in the data)
    3. Directly from HBase, using a Map/Reduce job similar to the one generating the above CSV (potential issue: harder to adapt w.r.t. data model changes or new vocabularies; can’t reuse off-the-shelf tools)
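
The redundancy issue noted for the second option can be made concrete with a small sketch. It assumes a denormalized CSV with one row per (publication, author) pair, so the title repeats on every row; the column names, the `BASE` namespace and the function names are hypothetical. Duplicate triples are dropped with an in-memory set, which would not scale to a full dump without a sort-based or distributed variant.

```python
import csv
import io

DCTERMS = "http://purl.org/dc/terms/"
BASE = "http://example.org/openaire/resource/"  # hypothetical entity namespace


def literal(value):
    """Quote a string as an N-Triples literal."""
    return '"{}"'.format(value.replace("\\", "\\\\").replace('"', '\\"'))


def csv_to_triples(csv_text):
    """Convert denormalized CSV rows to N-Triples, dropping duplicates.

    Assumed columns: id, title, author (one row per author).
    """
    seen = set()
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = "<{}{}>".format(BASE, row["id"])
        candidates = [
            "{} <{}title> {} .".format(subject, DCTERMS, literal(row["title"])),
            "{} <{}creator> {} .".format(subject, DCTERMS, literal(row["author"])),
        ]
        for triple in candidates:
            if triple not in seen:  # the title repeats on every author row
                seen.add(triple)
                triples.append(triple)
    return triples


data = "id,title,author\npub-1,Linked Data,Alice\npub-1,Linked Data,Bob\n"
for triple in csv_to_triples(data):
    print(triple)
```

Here two input rows yield three triples rather than four, because the repeated title is emitted only once.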

Discussion (https://issue.openaire.research-infrastructures.eu/issues/1089)

Foreseen Integration with other Work Packages and Tasks

  • T4.4 (Guidelines for Data providers and OpenAIRE service APIs): Discuss the possibility of importing data that exists natively as LOD into the OpenAIRE Information Space.
  • T9.2 (Statistics, reporting and visualization services): Some extensions to these services could potentially be implemented in a straightforward way as SPARQL queries over the LOD, if that’s sufficiently scalable.
  • T10.4 (Scholarly communication network analysis): By the same argument this could also be done on top of the LOD graph.
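
To give an idea of a statistics extension expressed as a SPARQL query, the sketch below counts publications per year and builds a request URL following the SPARQL 1.1 Protocol (a GET request with a `query` parameter). The endpoint URL is a placeholder, and the use of `dcterms:issued` for publication dates is an assumption about the mapping, not a confirmed modelling decision. No request is sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # placeholder, not a real service URL

# Assumes publication dates are exposed as dcterms:issued literals
# whose string form starts with a four-digit year.
QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?year (COUNT(?pub) AS ?publications)
WHERE {
  ?pub dcterms:issued ?date .
  BIND (SUBSTR(STR(?date), 1, 4) AS ?year)
}
GROUP BY ?year
ORDER BY ?year
"""


def build_request_url(endpoint, query):
    """Build a SPARQL Protocol GET URL carrying the query string."""
    return endpoint + "?" + urlencode({"query": query})


print(build_request_url(ENDPOINT, QUERY))
```

Whether such aggregate queries perform acceptably depends on the size of the dataset and on the triple store, which is the scalability caveat raised for T9.2.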

Communication Strategy: when and how to raise awareness among the consortium of updates in the task

  • Around M3/M4, when the mapping to LOD vocabularies has been specified: “These are the vocabularies we consider reasonable; any comments?”
  • A first time before M6 and once more in the second half of Year 1: ask partners to check the LOD that we have been able to produce so far from the metadata in the OpenAIRE Information Space. “Is it correct/complete?”
  • Some time in Year 2 (once all of our metadata have been mapped to LOD): discuss candidate LOD datasets to which we would like to identify links (starting from the list given in the description of this task).