Paolo Manghi, 27/02/2015 12:25 PM

h1. D6.1 OpenAIRE Specification and Release Plan (v1, 28th of February 2015)

{{toc}}

h2. Introduction

The aim of this document is to explain in detail how the technical partners will accomplish the overall software release plan set out in the DoW (Figure 1). To this end, it illustrates the plan for the design, development, testing, and integration into production of the infrastructure services to be delivered in WP6, WP8, WP9, and WP10 (Figure 2; note that WP7 will not contribute to the operational services). The plan’s technical activities will be supervised and led by CNR and carried out by the technical partners CNR, CERN, UoA, ARC, ICM, UBONN, and UNIBI, in synergy with the scientific partners EBI, JISC, COUPERIN, DANS, and UMINHO.

p=. !{width:60%}planTHreePhases.png!
_Figure 1 OpenAIRE2020 technical and training programme timeline overview._

The OpenAIREplus infrastructure service software will be delivered in three releases corresponding to the milestones illustrated in Table 1. Each release will follow a pilot › testing › integration › deployment cycle that lasts 2 months, and will incrementally include new services developed in WP8, WP9, and WP10. The plan describes the parallel activities envisaged to achieve these objectives. To help interpret this structure and roadmap, the next section gives an overview of the functional requirements of the OpenAIRE infrastructure in OpenAIRE2020, expressed in terms of the new end-user interaction scenarios and data management issues.

p=. _Table 1 - OpenAIREplus: Deliverables and Milestones_

|_Date_|_Phase_|_Milestone/deliverable_|_Services/functionalities_|
| M5 | OpenAIREplus data model definition | D8.1 | |
| M14 | Release 1 | M6.2 | The release will incorporate all services that are production ready by M12: LOD v1, Research Impact v1, Data Flow Monitoring v1, Innovative portal services v1, Stats and visualization v1, Zenodo v2, Visual workflow editor |
| M26 | Release 2 | M6.4 | The release will incorporate all services that are production ready by M24: LOD v2, Research Impact v2, Data Flow Monitoring v2, Innovative portal services v1, Stats and visualization v1, Zenodo v2, data-citation interlinking prototype, anonymization service, clustering algorithms, entity resolution algorithms |
| M38 | Release 3 | M6.5 | The release will incorporate all services that are production ready by M36: LOD v3, Research Impact v3, Data Flow Monitoring v3, Innovative portal services demonstration, OA Broker Service, Mining training algorithms, Usage Statistics Services, Map of academic relationships |
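The pairing in Table 1 follows a simple pattern: each release milestone falls two months after the cut-off by which its services must be production ready, which matches the 2-month release cycle. A minimal Python sketch of that arithmetic follows; the names @CYCLE_MONTHS@ and @release_month@ are illustrative and not part of the plan.

```python
# Illustrative only: project months are counted as in Table 1 (M12, M14, ...).
CYCLE_MONTHS = 2  # length of one pilot/testing/integration/deployment cycle


def release_month(cutoff_month):
    """Project month of a release, given the production-ready cut-off month."""
    return cutoff_month + CYCLE_MONTHS


# Cut-off months as stated in the last column of Table 1.
cutoffs = {"Release 1": 12, "Release 2": 24, "Release 3": 36}
schedule = {name: release_month(month) for name, month in cutoffs.items()}

for name, month in schedule.items():
    print(f"{name}: cut-off M{cutoffs[name]}, released at M{month}")
```

The computed months can be checked against the first column of Table 1.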

p=. !tableWPs.png!
_Figure 2 – Research Joint Activities: WP6, WP7, and WP8_

h2. Functional Requirements

This section presents the main functionalities to be realized in OpenAIRE2020. To this end, it first introduces the current status of the OpenAIRE infrastructure, on top of which OpenAIRE2020 will provide extensions.

h3. OpenAIRE infrastructure today (28th of February 2015)

The OpenAIRE system implements an advanced data infrastructure, whose purpose is (i) to enable the collection of heterogeneous data, (ii) to aggregate and process such data into a uniform information space, and (iii) to offer access to that information space.

*Collection of content from data sources* OpenAIRE collects and integrates the following types of data source:

* Data and publication repositories: metadata records from institutional repositories, data repositories, and CRIS systems that comply with the OpenAIRE guidelines (http://guidelines.openaire.eu) and with the OpenAIRE data acquisition policies (https://www.openaire.eu/openaire-s-content-acquisition-policy/document-details).
* Entity registries: metadata records from entity registries (authority list maintainers):
** Project funding information from the following databases: EC CORDA database (FP7 projects), Wellcome Trust, and FCP (Portugal Ministry), which include information about projects and participants.
** Publication repository information from the OpenDOAR directory.
** Data repository information from the re3data.org directory.
* Document files: publication files, collected for mining from data sources that authorize this process.
* Human claims: end users can “claim”, i.e. ingest, publication metadata records (via DOI or ORCID identifier), together with access rights and references to project funding (or EGI initiative). End users can also claim relationships between publications and datasets.
* Zenodo: the OpenAIRE infrastructure supports the Zenodo repository, whose purpose is to offer publication and data repository management functionalities to authors who do not have a repository of reference and are willing to respect the OA mandates of the EC. Zenodo is registered in the infrastructure as an OpenAIRE-compliant data and publication repository, hence all its content is aggregated in the information space.
* Validation: to be regularly aggregated into the OpenAIRE information space, data sources must export metadata according to the indications provided by the guidelines. The infrastructure offers validation services capable of evaluating the degree of conformance of a repository and returning useful feedback. Once the data source is validated, its managers can register it with the OpenAIRE infrastructure and make it available for data collection.

*Cleaning and enrichment of aggregated content*

* Cleaning: the infrastructure operates a metadata harmonization framework that supports data curators in defining mapping rules between records collected from the data sources described above and the graph structure of the information space. Incoming records are mapped into sub-graphs and their properties are harmonized to match the OpenAIRE data model semantics (e.g., vocabulary conversion, format conversion).
* Mining: the infrastructure operates a mining framework that takes as input the OpenAIRE information space and the related documents in order to identify missing properties and relationships about the entities. Tools allow for (i) extracting text from several document formats (PDF, HTML, XML) and (ii) running a set of inference modules, which can be dynamically plugged in or modified to match new configurations. Current modules include: classification, citation, similarity, publication-dataset relationships, and publication-project relationships.
* Statistics: the infrastructure operates services for the calculation and caching of elaborate statistics over the graph, in order to offer immediate access to such numbers. Statistics cover Open Access as well as impact, measured in publications per project, funding scheme, and research initiative.
* De-duplication: the infrastructure operates de-duplication services that allow data curators to define and refine algorithms for the identification of duplicates across tens of millions of publications, organizations, and authors. The algorithms also take care of the redistribution of relationships within the graph.
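As an illustration of what the redistribution of relationships can mean in practice, the sketch below merges records judged to be duplicates and re-points their relationships to a single representative. It is a toy under stated assumptions, not the OpenAIRE algorithm: duplicates are detected by a normalized title key (an assumption), and the record identifiers, relation names, and the @merge_duplicates@ helper are hypothetical.

```python
from collections import defaultdict


def normalize(title):
    """Crude matching key (assumption): lowercase, keep only letters and digits."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def merge_duplicates(records, relations):
    """Group records by key, pick one representative per group, re-point relations.

    records:   dict mapping record id -> title
    relations: set of (source_id, relation_name, target_id) triples
    """
    groups = defaultdict(list)
    for rec_id, title in records.items():
        groups[normalize(title)].append(rec_id)

    # The smallest id in each group acts as the representative.
    representative = {}
    for ids in groups.values():
        rep = min(ids)
        for rec_id in ids:
            representative[rec_id] = rep

    # Re-point both ends of every relation; ids not in `records` are kept as-is.
    merged = {
        (representative.get(s, s), name, representative.get(t, t))
        for s, name, t in relations
    }
    return representative, merged


# Hypothetical sample data.
records = {
    "pub:1": "Open Access in Europe",
    "pub:2": "Open access in Europe.",
    "pub:3": "Linked Data for Research",
}
relations = {
    ("pub:1", "fundedBy", "proj:A"),
    ("pub:2", "cites", "pub:3"),
}
representative, merged = merge_duplicates(records, relations)
print(sorted(merged))
```

Here the two near-identical titles collapse onto @pub:1@, which then carries both the funding and the citation relationship. A production system would need far more robust matching than a title key.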

*Access to the information space* As to data access, the data infrastructure currently offers a portal (www.openaire.eu) and APIs (http://api.openaire.eu):

* Discovery: the portal implements advanced search and browsing mechanisms over the graph-structured information space; it also gives access to statistics by funder and for EGI (European Grid Infrastructure).
* Claims: end users can manage the set of their claim actions.
* Registration and validation of data sources: data source managers can register their repository, CRIS system, journal, or aggregator with the infrastructure and validate their content with respect to the guidelines.
* APIs: APIs allow for bulk, selective, or random access to the OpenAIRE information space. Specifically: HTTP REST queries and OAI-PMH APIs.
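For readers unfamiliar with OAI-PMH, the sketch below shows how a bulk-harvesting request is typically formed. The verb @ListRecords@ and the arguments @metadataPrefix@ and @resumptionToken@ come from the OAI-PMH protocol itself. The endpoint path in @BASE_URL@ is an assumption, since this document only gives the API host, and @list_records_url@ and the sample token are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the exact OAI-PMH endpoint path is not given in this document.
BASE_URL = "http://api.openaire.eu/oai_pmh"


def list_records_url(metadata_prefix="oai_dc", resumption_token=None, base_url=BASE_URL):
    """Build an OAI-PMH ListRecords request URL.

    Per the OAI-PMH protocol, resumptionToken is an exclusive argument:
    when it is present, no other argument besides the verb is sent.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


first_page = list_records_url()
next_page = list_records_url(resumption_token="abc 123")  # hypothetical token
print(first_page)
print(next_page)
```

A harvester would fetch the first URL, read the resumption token from the response, and repeat with the second form until no token is returned.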

p=. !{width:70%}OpenAIRE_InitialArchitecture.png!
Figure 3 - OpenAIRE high-level architecture as of today

h3. OpenAIRE infrastructure in OpenAIRE2020

OpenAIRE2020 will extend the OpenAIRE infrastructure with a number of functionalities enhancing the way the information space is populated and accessed. By the end of the project, the architecture in Figure 3 will have evolved into the architecture in Figure 4.

p=. !{width:70%}OpenAIRE_Feb2018Architecture.png!
Figure 4 - OpenAIRE high-level architecture: end of OpenAIRE2020

The project plan groups the realisation of new functionalities into the following meta-actions and focused actions:

*Action 1 (WP8) Information Space management* This action includes the realization of services for:

* Data model update: upgrading the OpenAIRE data model and information space to include further information about licenses (IPRs), software entities, national funding schemes, and data management plans (DMPs)
* Linked Open Data Management
* Research Impact Services
* Data flow monitoring
* Data-Literature Interlinking Prototype (from WP7, but closely connected to information space activities).
* Usage statistics

*Action 2 (WP9) Portal and end-user services*

* Open Access Repository Broker
* Portal: statistics and visualization tools
* Innovative portal services
* Data Anonymization
* Zenodo repository

*Action 3 (WP10) Information mining*

* New Mining Algorithms: clustering, entity resolution
* Training algorithms
* Map of academia

The next section describes the detailed plan for the first 12 months of the project concerning the three actions above.

h2. Release Plan: first year (Feb 2015 - Feb 2016)

The release plan is organized into a number of actions whose aim is to accomplish the main actions illustrated in the previous section. These actions have been identified based on the lines of action to be followed by subgroups of partners in the development of infrastructural services. The first project milestone (see Table 1) will be delivered as a combined and parallel effort of these actions in order to achieve the first-year architecture illustrated in Figure 6.

p=. !{width:70%}OpenAIRE_Feb2016Architecture.png!
Figure 6 - OpenAIRE high-level architecture: February 2016

*Dependencies* There are no particular dependencies among project tasks, which notably simplifies the activities and lets partners work in parallel with few conflicts. On the other hand, all services inherently depend on the data model and on its implementation by the information space services, which mainly include the HBase cluster used for information space graph population and de-duplication. Any change affecting the data model or its “implementation” will inherently affect the services that depend on them, such as portal services, information mining services, and export services (OAI-PMH, statistics, LOD, OA repository broker, etc.). The OpenAIRE data model will be consolidated and released at Month 4; further upgrades will follow at M18 and M30. Each of these releases will serve as a starting point for revising the actions, so that the services can be realigned with any changes in the schema.
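The dependency structure described above can be pictured as a small graph. The sketch below is an illustration, not a description of the actual deployment: the component names paraphrase the text, the edge set is an assumption, and @affected_by@ is a hypothetical helper that lists everything downstream of a changed component.

```python
from collections import deque

# Assumed edges: component -> components that depend on it directly.
DEPENDENTS = {
    "data model": ["information space services"],
    "information space services": [
        "portal services",
        "information mining services",
        "export services",
    ],
    "export services": ["OAI-PMH", "statistics", "LOD", "OA repository broker"],
}


def affected_by(changed):
    """Breadth-first walk returning every component downstream of `changed`."""
    seen = {changed}
    queue = deque([changed])
    order = []
    while queue:
        for dependent in DEPENDENTS.get(queue.popleft(), []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


impacted = affected_by("data model")
print(impacted)
```

Under these assumed edges, a data model change reaches every service in the graph, which is why the data model releases act as realignment points for the other actions.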

*Development, BETA, production infrastructures* The plan described below is made possible by the activities carried out in WP6, which cover the overall software life cycle: design, development, testing, integration, packaging, and deployment into production. Figure 5 illustrates the development-to-production scenario. CNR is responsible for the installation and (on-demand) maintenance of development infrastructures, where OpenAIRE2020 technical partners can develop, test, and integrate their services, independently or in synergy. ICM is responsible for the maintenance of the OpenAIRE production and BETA infrastructures, ensuring 24/7 operation. Tested services are first deployed into the BETA infrastructure, which is meant to emulate the production environment for final service integration testing and, for new portal services, possible end-user validation. The BETA infrastructure can reuse the inference and de-duplication clusters of the production infrastructure, thereby emulating the “on-line” conditions met by the production system. As such, the BETA infrastructure is not meant to be used as a development/testing environment, and should contain only the upgrades ready for the next release.

p=. !infrasArch.png!
_Figure 5 OpenAIRE development, BETA and production infrastructure_

The next 12 months are organised into the following actions, which correspond to the partial or full accomplishment of the tasks described in the DoW document. The detailed plan for each task can be found on the linked wiki page.

h3. Action 1 (WP8) Information Space management

* [[D8.1 OpenAIRE data model]] (Task leader: CNR): available
* [[D8.2 LOD services]] (Task leader: UNIBONN): available
* [[D8.3 Research impact services]] (Task leader: ARC): available
* [[D8.4 Data flows and dynamics monitoring services]] (Task leader: CNR): available

p=. _Table 2 - Deliverables and milestones in action 1: first 12 months_

|_Date_|_Document type_|_Document number_|_Title_|
| Apr 2015 | D | D8.1 | OpenAIRE data model |
| Jun 2015 | D | D8.2 | LOD services |
| Jul 2015 | D | D8.3 | Research impact services |
| Sep 2015 | D | D8.4 | Data flows and dynamics monitoring services |
| Dec 2015 | M | M8.1 | LOD services |
| Dec 2015 | M | M8.2 | Research impact services |
| Dec 2015 | M | M8.3 | Data flows and dynamics monitoring services |

h3. Action 2 (WP9) Portal and end-user services

* [[D9.1 OpenAIRE portal and services]] (Task leader: UoA): to be provided
* [[D9.2 Stats, reporting, and visualization services]] (Task leader: UoA): to be provided
* [[D9.3 Zenodo service]] (Task leader: CERN): to be provided

p=. _Table 3 - Deliverables and milestones in action 2: first 12 months_

|_Date_|_Document type_|_Document number_|_Title_|
| Jun 2015 | D | D9.1 | OpenAIRE portal and services |
| Aug 2015 | D | D9.2 | Stats, reporting, and visualization services |
| Jun 2015 | D | D9.3 | Zenodo service |
| Dec 2015 | M | M9.1 | OpenAIRE portal and services |
| Dec 2015 | M | M9.2 | Stats, reporting, and visualization services |
| Dec 2015 | M | M9.3 | Zenodo service |

h3. Action 3 (WP10) Information mining

* [[D10.1 Visual workflow editor]] (Task leader: ICM): available

p=. _Table 4 - Deliverables and milestones in action 3: first 12 months_

|_Date_|_Document type_|_Document number_|_Title_|
| Aug 2015 (error in DoW: October 2015) | D | D10.1 | Visual workflow editor |
| Dec 2015 | M | M10.1 | Visual workflow editor |

h2. Release Plan: second year (Feb 2016 - Feb 2017)

To be defined.

h2. Release Plan: third year (Feb 2017 - Feb 2018)

To be defined.