1
|
<div class="about">
|
2
|
<div class="uk-section">
|
3
|
<div class="uk-margin-large-left uk-margin-medium-bottom">
|
4
|
<breadcrumbs [breadcrumbs]="breadcrumbs"></breadcrumbs>
|
5
|
</div>
|
6
|
<div class="firstBackground">
|
7
|
<div class="uk-container">
|
8
|
<h2 class="uk-text-center">About</h2>
|
9
|
<div class="uk-flex uk-flex-center">
|
10
|
<div class="uk-padding-small uk-width-4-5@m">
|
11
|
<p>
|
12
|
Open Science is gradually becoming the modus operandi in research practices, affecting the way researchers
collaborate, publish, discover, and access scientific knowledge.
Scientists are increasingly publishing research results beyond the article, sharing all scientific
products (metadata and files) generated during an experiment, such as datasets, software, and experiments.
They publish in scholarly communication data sources (e.g. institutional repositories, data archives,
software repositories), rely where possible on persistent identifiers (e.g. DOI, ORCID, Grid.ac, PDBs),
and specify semantic links to other research products (e.g. supplementedBy, citedBy, versionOf) and, possibly,
to projects and/or their funders.
By following such practices, scientists are implicitly constructing the Global Open Science Graph, where
by "graph" we mean a collection of objects interlinked by semantic relationships.
|
22
|
<br><br>
|
23
|
The OpenAIRE Research Graph includes metadata and links between scientific products (e.g. literature,
datasets, software, and "other research products"), organizations, funders, funding streams, projects,
communities, and (provenance) data sources. The details of the <a
href="https://zenodo.org/record/2643199#.XOqdstMzZ24" target="_blank">graph data model</a> can be found
on Zenodo.
|
28
|
<br><br>
|
29
|
The Graph is obtained as an aggregation of the metadata and links collected from ~70,000
trusted sources, further enriched with metadata and links provided by:</p>
|
31
|
<ul class="portal-circle">
|
32
|
<li class="uk-margin-bottom">OpenAIRE end-users, e.g. researchers, project administrators, data curators
|
33
|
providing links from scientific products to projects, funders, communities, or other products;
|
34
|
</li>
|
35
|
<li class="uk-margin-bottom">OpenAIRE full-text mining algorithms applied to ~10 million Open Access article
full-texts;
|
37
|
</li>
|
38
|
<li>Research infrastructure scholarly services, bridged to the graph via OpenAIRE, exposing metadata of
|
39
|
products such as research workflows, experiments, research objects, software, etc.
|
40
|
</li>
|
41
|
</ul>
|
42
|
</div>
|
43
|
</div>
|
44
|
</div>
|
45
|
</div>
|
46
|
</div>
|
47
|
<div id="architecture" class="uk-container uk-section">
|
48
|
<div class="uk-padding-small">
|
49
|
<h2 class="uk-text-center">Architecture</h2>
|
50
|
<div class="uk-flex uk-flex-center">
|
51
|
<div class="uk-width-4-5@m">
|
52
|
<h3 class="uk-margin-medium-top portal-color">How we build it</h3>
|
53
|
<div>
|
54
|
<p>
|
55
|
OpenAIRE collects metadata records from more than 70K scholarly communication sources from all over the
world, including Open Access institutional repositories, data archives, and journals.
All the metadata records (i.e. descriptions of research products) are gathered in a data lake, together
with records from Crossref, Unpaywall, ORCID, Grid.ac, and information about projects provided by national
and international funders.
Dedicated inference algorithms applied to metadata and to the full-texts of Open Access publications enrich
the content of the data lake with links between research results and projects, author affiliations, subject
classification, and links to entries from domain-specific databases.
Duplicated organisations and results are identified and merged to obtain an open, trusted, public
resource enabling explorations of the scholarly communication landscape like never before.
|
68
|
</p>
|
69
|
|
70
|
</div>
|
71
|
</div>
|
72
|
</div>
|
73
|
<div class="uk-flex uk-flex-center uk-inline uk-margin-medium-top">
|
74
|
<img [src]="'assets/graph-assets/about/architecture/'+architectureImage"
|
75
|
class="uk-width-4-5 architecture-image">
|
76
|
|
77
|
<a class="uk-position-absolute uk-transform-center" style="left: 27%; top: 48%"
|
78
|
(click)="goTo('tabs_card'); changeTab(0)"
|
79
|
(mouseenter)="architectureImage = 'aggregation_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
|
80
|
<img [class]="(architectureImage == 'aggregation_hover.png' ? 'uk-invisible' : '')+' marker-gif'"
|
81
|
src="assets/graph-assets/about/architecture/marker.gif" alt="action point aggregation">
|
82
|
</a>
|
83
|
<a class="uk-position-absolute uk-transform-center" style="left: 47%; top: 48%"
|
84
|
(click)="goTo('tabs_card'); changeTab(1)"
|
85
|
(mouseenter)="architectureImage = 'deduplication_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
|
86
|
<img [class]="(architectureImage == 'deduplication_hover.png' ? 'uk-invisible' : '')+' marker-gif'"
|
87
|
src="assets/graph-assets/about/architecture/marker.gif" alt="action point deduplication">
|
88
|
</a>
|
89
|
<a class="uk-position-absolute uk-transform-center" style="left: 58%; top: 48%"
|
90
|
(click)="goTo('tabs_card'); changeTab(2)"
|
91
|
(mouseenter)="architectureImage = 'enrichment_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
|
92
|
<img [class]="(architectureImage == 'enrichment_hover.png' ? 'uk-invisible' : '')+' marker-gif'"
|
93
|
src="assets/graph-assets/about/architecture/marker.gif" alt="action point enrichment">
|
94
|
</a>
|
95
|
<a class="uk-position-absolute uk-transform-center" style="left: 70%; top: 48%"
|
96
|
(click)="goTo('tabs_card'); changeTab(3)"
|
97
|
(mouseenter)="architectureImage = 'post_cleaning_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
|
98
|
<img [class]="(architectureImage == 'post_cleaning_hover.png' ? 'uk-invisible' : '')+' marker-gif'"
|
99
|
src="assets/graph-assets/about/architecture/marker.gif" alt="action point post cleaning">
|
100
|
</a>
|
101
|
<a class="uk-position-absolute uk-transform-center" style="left: 76%; top: 32%"
|
102
|
(click)="goTo('tabs_card'); changeTab(4)"
|
103
|
(mouseenter)="architectureImage = 'indexing_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
|
104
|
<img [class]="(architectureImage == 'indexing_hover.png' ? 'uk-invisible' : '')+' marker-gif'"
|
105
|
src="assets/graph-assets/about/architecture/marker.gif" alt="action point indexing">
|
106
|
</a>
|
107
|
<a class="uk-position-absolute uk-transform-center" style="left: 76%; top: 72%"
|
108
|
(click)="goTo('tabs_card'); changeTab(5)"
|
109
|
(mouseenter)="architectureImage = 'stats_analysis_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
|
110
|
<img [class]="(architectureImage == 'stats_analysis_hover.png' ? 'uk-invisible' : '')+' marker-gif'"
|
111
|
src="assets/graph-assets/about/architecture/marker.gif" alt="action point stats analysis">
|
112
|
</a>
|
113
|
</div>
|
114
|
<div id="tabs_card"
|
115
|
class="uk-margin-xlarge-top uk-padding-small">
|
116
|
<div class="uk-card uk-card-default uk-card-body architecture-card">
|
117
|
<ul #tabs uk-tab class="uk-tab">
|
118
|
<li><a>Aggregation</a></li>
|
119
|
<li><a>Deduplication</a></li>
|
120
|
<li><a>Enrichment</a></li>
|
121
|
<li><a>Post-Cleaning</a></li>
|
122
|
<li><a>Indexing</a></li>
|
123
|
<li><a>Stats Analysis</a></li>
|
124
|
</ul>
|
125
|
|
126
|
<ul class="uk-switcher uk-margin">
|
127
|
<li>
|
128
|
<!-- uk-grid-->
|
129
|
<div class=" uk-margin-large-top uk-text-small">
|
130
|
<!-- <div class="uk-width-3-5@m">-->
|
131
|
<img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
|
132
|
src="assets/graph-assets/about/architecture/aggregation.png" alt="Aggregation">
|
133
|
<div
|
134
|
[class]="'uk-margin-bottom uk-margin-medium-right '+(aggregationReadMore ? '' : 'lines-18 multi-line-ellipsis')">
|
135
|
<div>
|
136
|
OpenAIRE collects metadata records from a variety of content providers as described in
|
137
|
<a href="https://www.openaire.eu/aggregation-and-content-provision-workflows" target="_blank">https://www.openaire.eu/aggregation-and-content-provision-workflows</a>.
|
138
|
<br><br>
|
139
|
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content
providers compliant with the
<a href="https://guidelines.openaire.eu" target="_blank">OpenAIRE guidelines</a>
and from entity registries (i.e. data sources offering authoritative lists of entities, like OpenDOAR,
re3data, DOAJ, and funder databases).
After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is
used to generate the final OpenAIRE Research Graph that you can access from the OpenAIRE portal and
the APIs.
|
148
|
<br><br>
|
149
|
The transformation process includes the application of cleaning functions whose goal is to ensure that
values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable,
to a common controlled vocabulary.
The controlled vocabularies used for cleansing are accessible at
<a href="http://api.openaire.eu/vocabularies" target="_blank">http://api.openaire.eu/vocabularies</a>.
Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms.
If a synonym is found as a field value, the value is replaced with the corresponding term.
In addition, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources that
are too big to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref,
ORCID, Microsoft Academic Graph, and Unpaywall), and ScholeXplorer, one of the Scholix hubs offering a
large set of links between research literature and data.
|
162
|
</div>
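The cleaning step described above can be sketched in a few lines of Python. This is only an illustration, not the OpenAIRE implementation: the vocabulary entries, the helper names (build_synonym_index, clean_value, clean_date) and the accepted input date formats are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical vocabulary: each term has one code, one label and a set of synonyms.
# The entries are illustrative and are not taken from the OpenAIRE vocabularies.
VOCABULARY = [
    dict(code="0001", label="Article", synonyms=["journal article", "research article"]),
    dict(code="0021", label="Dataset", synonyms=["data set", "research data"]),
]


def build_synonym_index(vocabulary):
    """Map every lower-cased label and synonym to its controlled term."""
    index = dict()
    for term in vocabulary:
        for name in [term["label"]] + term["synonyms"]:
            index[name.strip().lower()] = term
    return index


def clean_value(value, index):
    """Return the controlled label if the value is a known synonym, else the value unchanged."""
    term = index.get(value.strip().lower())
    return term["label"] if term else value


def clean_date(value, input_formats=("%Y-%m-%d", "%d/%m/%Y", "%Y")):
    """Harmonise a date string to YYYY-MM-dd; return None when no assumed format matches."""
    for fmt in input_formats:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None
```

Values that match no synonym are left untouched in this sketch, so a later cleansing pass can pick them up once the vocabulary has been extended.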
|
163
|
</div>
|
164
|
<div *ngIf="!aggregationReadMore" class="uk-width-3-5@m uk-text-center clickable"
|
165
|
(click)="aggregationReadMore = true">
|
166
|
<a class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
|
167
|
</div>
|
168
|
<div *ngIf="aggregationReadMore" class="uk-width-3-5@m uk-text-center clickable"
|
169
|
(click)="aggregationReadMore = false">
|
170
|
<a class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
|
171
|
</div>
|
172
|
<!-- </div>-->
|
173
|
<!-- <div class="uk-width-expand">-->
|
174
|
<!-- <img src="assets/graph-assets/about/architecture/aggregation.png">-->
|
175
|
<!-- </div>-->
|
176
|
</div>
|
177
|
</li>
|
178
|
<li>
|
179
|
<div class="uk-grid">
|
180
|
<!-- <div class="uk-width-3-5@m">-->
|
181
|
<div class="uk-margin-bottom uk-margin-medium-right uk-text-small">
|
182
|
<ul class="uk-subnav button-tab" uk-switcher>
|
183
|
<li><a>Clustering</a></li>
|
184
|
<li><a>Matching & Election</a></li>
|
185
|
</ul>
|
186
|
|
187
|
<ul class="uk-switcher uk-margin align-list">
|
188
|
<li>
|
189
|
<img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
|
190
|
src="assets/graph-assets/about/architecture/deduplication.svg" alt="Deduplication">
|
191
|
<div
|
192
|
[class]="'uk-margin-bottom uk-margin-medium-right uk-text-small '+(dedupClusteringReadMore ? '' : 'lines-18 multi-line-ellipsis')">
|
193
|
<div>
|
194
|
<div>
|
195
|
Clustering is a common heuristic used to avoid the N x N complexity of matching all
pairs of objects to identify the equivalent ones.
The challenge is to identify a clustering function that maximizes the chance of comparing only
records that may lead to a match, while minimizing the number of equivalent records that are
never compared.
Since the equivalence function is to some extent tolerant of minor errors (e.g. swapped
characters in the title, or small differences in letters), the clustering function must be
neither too precise (e.g. a hash of the title) nor too flexible (e.g. random n-grams of the
title).
On the other hand, in practice the equality of two records can sometimes only be
determined by their PIDs (e.g. DOI), as the metadata properties are very different across
different versions and no clustering function will ever bring them into the same cluster.
To meet these requirements, OpenAIRE clustering for products works with two functions:
|
209
|
</div>
|
210
|
|
211
|
<ul class="portal-circle">
|
212
|
<li>
|
213
|
<div>DOI-based function: the function uses the DOI as the key when it is provided as part of the record
properties;
|
215
|
</div>
|
216
|
</li>
|
217
|
<li>
|
218
|
<div>
|
219
|
Title-based function: the function generates a key that depends on
(i) the number of significant words in the title (after normalization, stemming, etc.),
(ii) modulo 10 of the number of characters of such words, and
(iii) a string obtained as an alternation of the functions prefix(3) and suffix(3) (and vice
versa) of the first 3 words (2 words if the title only has 2). For example, the title
“Entity deduplication in big data graphs for scholarly communication” becomes “entity
deduplication big data graphs scholarly communication” with the two keys “7.1entionbig” and
“7.1itydedbig” (where 1 is modulo 10 of 54 characters of the normalized title).
|
231
|
</div>
|
232
|
</li>
|
233
|
</ul>
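Part (iii) of the title-based key can be illustrated with a short Python sketch. The function name alternating_affix_keys is hypothetical, the sketch assumes the title has already been normalized and reduced to its significant words, and it deliberately leaves out the word-count and modulo prefix of the key.

```python
def alternating_affix_keys(words, affix_len=3, max_words=3):
    """Build the two alternating prefix/suffix strings from the first words of a title.

    The first string starts with a prefix (prefix, suffix, prefix, ...),
    the second starts with a suffix (suffix, prefix, suffix, ...).
    """
    selected = words[:max_words]
    prefix_first = []
    suffix_first = []
    for position, word in enumerate(selected):
        prefix, suffix = word[:affix_len], word[-affix_len:]
        if position % 2 == 0:
            prefix_first.append(prefix)
            suffix_first.append(suffix)
        else:
            prefix_first.append(suffix)
            suffix_first.append(prefix)
    return "".join(prefix_first), "".join(suffix_first)


significant_words = "entity deduplication big data graphs scholarly communication".split()
keys = alternating_affix_keys(significant_words)
```

For the example title, the two strings are "entionbig" and "itydedbig", which are the trailing parts of the two keys quoted in the text.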
|
234
|
<div>
|
235
|
To give an idea, this configuration generates around 77 million blocks, which we limited to 200
records each (only 15K blocks are affected by the cut), and entails 260 billion matches. Matches in
a block are performed using a “sliding window” set to 80 records. The records are sorted
lexicographically on a normalized version of their titles. The first record is matched against
the 80 records that follow it, then the second, and so on, for an N log N complexity.
|
242
|
</div>
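The sliding-window comparison inside a block can be sketched as below. The Record tuple, the normalize helper and the choice of which 200 records survive the cut are assumptions made for illustration; only the window size (80) and the block cap (200) come from the description above.

```python
from collections import namedtuple

# Hypothetical minimal record: an identifier and a title.
Record = namedtuple("Record", ["id", "title"])

MAX_BLOCK_SIZE = 200  # blocks are capped at 200 records
WINDOW_SIZE = 80      # each record is compared with the 80 records that follow it


def normalize(title):
    """Assumed normalization: lower-case and collapse whitespace."""
    return " ".join(title.lower().split())


def candidate_pairs(block, window=WINDOW_SIZE, max_size=MAX_BLOCK_SIZE):
    """Yield the pairs of records that are compared inside one block."""
    # Sort lexicographically on the normalized title, then apply the cap
    # (keeping the first records after sorting is an assumption of this sketch).
    ordered = sorted(block, key=lambda r: normalize(r.title))[:max_size]
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:i + 1 + window]:
            yield left, right
```

Sorting costs O(N log N), and the window adds at most 80 comparisons per record, so the number of comparisons grows linearly with the block size instead of quadratically.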
|
243
|
</div>
|
244
|
</div>
|
245
|
<div *ngIf="!dedupClusteringReadMore" class="uk-width-3-5@m uk-text-center clickable"
|
246
|
(click)="dedupClusteringReadMore = true">
|
247
|
<a class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
|
248
|
</div>
|
249
|
<div *ngIf="dedupClusteringReadMore" class="uk-width-3-5@m uk-text-center clickable"
|
250
|
(click)="dedupClusteringReadMore = false">
|
251
|
<a class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
|
252
|
</div>
|
253
|
</li>
|
254
|
<li>
|
255
|
<img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
|
256
|
src="assets/graph-assets/about/architecture/deduplication.svg" alt="Deduplication">
|
257
|
<div
|
258
|
[class]="'uk-margin-bottom uk-margin-medium-right uk-text-small '+(dedupMatchingAndElectionReadMore ? '' : 'lines-18 multi-line-ellipsis')">
|
259
|
<div>
|
260
|
<div>
|
261
|
Once the clusters have been built, the algorithm proceeds with the comparisons.
|
262
|
Comparisons are driven by a decision tree that:
|
263
|
</div>
|
264
|
<ul class="uk-list">
|
265
|
<li class="uk-margin-small-bottom">
|
266
|
<div>
|
267
|
<span class="portal-color">1.</span> Tries to capture equivalence via PIDs: if records
|
268
|
share
|
269
|
a PID then they are equivalent
|
270
|
</div>
|
271
|
</li>
|
272
|
<li class="uk-margin-small-bottom">
|
273
|
<div>
|
274
|
<span class="portal-color">2.</span> Tries to capture difference:
|
275
|
</div>
|
276
|
<ul class="uk-list">
|
277
|
<li class="uk-margin-small-bottom">
|
278
|
<div>
|
279
|
<span class="portal-color">a.</span>
|
280
|
If record titles contain different “numbers” then they are different (this rule is
debatable and should be fine-tuned);
|
282
|
</div>
|
283
|
</li>
|
284
|
<li class="uk-margin-small-bottom">
|
285
|
<div>
|
286
|
<span class="portal-color">b.</span>
|
287
|
If records contain a different number of authors then they are different;
|
288
|
</div>
|
289
|
</li>
|
290
|
<li class="uk-margin-small-bottom">
|
291
|
<div>
|
292
|
<span class="portal-color">c.</span>
|
293
|
Note that different PIDs do not imply different records, as different versions may
|
294
|
have
|
295
|
different PIDs.
|
296
|
</div>
|
297
|
</li>
|
298
|
</ul>
|
299
|
</li>
|
300
|
<li>
|
301
|
<div><span class="portal-color">3.</span> Measures equivalence:</div>
|
302
|
<ul class="uk-list portal-circle">
|
303
|
<li>
|
304
|
<div>
|
305
|
The titles of the two records are normalised and compared for similarity by applying
the Levenshtein distance algorithm.
The algorithm returns a number in the range [0,1], where 0 means “very different” and
1 means “equal”.
If the returned value is greater than or equal to 0.99, the two records are identified as
duplicates.
|
313
|
</div>
|
314
|
</li>
|
315
|
<li>
|
316
|
<div>Dates are not considered for equivalence matching because different versions of the
same record should be merged and may be published on different dates, e.g. the pre-print
and the published version of an article.
|
319
|
</div>
|
320
|
</li>
|
321
|
</ul>
|
322
|
</li>
|
323
|
</ul>
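The three steps of the decision tree can be sketched in Python as follows. This is a simplified illustration under stated assumptions: records are plain dicts with "pids", "title" and "authors" keys, titles are assumed to be already normalised, and the edit distance is scaled to [0,1] by dividing by the length of the longer title, which is one common choice and not necessarily the one used by OpenAIRE.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def title_similarity(a, b):
    """Scale the edit distance to [0, 1], where 1 means equal (assumed scaling)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def numbers_in(title):
    """Return the sorted purely numeric tokens of a title."""
    return sorted(token for token in title.split() if token.isdigit())


def are_duplicates(r1, r2, threshold=0.99):
    """Apply the three steps of the decision tree to two records given as dicts."""
    if set(r1["pids"]) & set(r2["pids"]):
        return True   # step 1: a shared PID means equivalence
    if numbers_in(r1["title"]) != numbers_in(r2["title"]):
        return False  # step 2a: different numbers in the titles
    if len(r1["authors"]) != len(r2["authors"]):
        return False  # step 2b: different number of authors
    return title_similarity(r1["title"], r2["title"]) >= threshold  # step 3
```

Note that different PIDs alone never cause a rejection in this sketch, in line with rule 2c.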
|
324
|
<div>
|
325
|
Once the equivalence relationships between pairs of records are set, the groups of equivalent
|
326
|
records are obtained (transitive closure, i.e. “mesh”).
|
327
|
From such sets a new representative object is obtained, which inherits all properties from the
|
328
|
merged records and keeps track of their provenance.
|
329
|
The ID of the record is obtained by appending the prefix “dedup_” to the MD5 of the first ID
|
330
|
(given their lexicographical ordering).
|
331
|
A new, more stable function to generate the ID is under development, which exploits the DOI
|
332
|
when
|
333
|
one of the records to be merged includes a Crossref or a DataCite record.
|
334
|
</div>
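The grouping and election steps can be sketched as follows. The union-find structure is one standard way to compute a transitive closure and is used here as an assumption, as are the function names and the use of the hexadecimal MD5 digest of the UTF-8 encoded identifier.

```python
import hashlib


def group_equivalent(pairs):
    """Compute the transitive closure of pairwise matches with a union-find structure."""
    parent = dict()

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = dict()
    for record_id in list(parent):
        groups.setdefault(find(record_id), []).append(record_id)
    return [sorted(members) for members in groups.values()]


def representative_id(member_ids):
    """Prefix "dedup_" to the MD5 of the lexicographically first member ID."""
    first = sorted(member_ids)[0]
    return "dedup_" + hashlib.md5(first.encode("utf-8")).hexdigest()
```

Because the identifier depends on the first member ID, adding a record that sorts before the others changes the identifier of the whole group, which illustrates why a more stable function is desirable.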
|
335
|
</div>
|
336
|
</div>
|
337
|
<div *ngIf="!dedupMatchingAndElectionReadMore" class="uk-width-3-5@m uk-text-center clickable"
|
338
|
(click)="dedupMatchingAndElectionReadMore = true">
|
339
|
<a class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
|
340
|
</div>
|
341
|
<div *ngIf="dedupMatchingAndElectionReadMore" class="uk-width-3-5@m uk-text-center clickable"
|
342
|
(click)="dedupMatchingAndElectionReadMore = false">
|
343
|
<a class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
|
344
|
</div>
|
345
|
</li>
|
346
|
</ul>
|
347
|
</div>
|
348
|
<!-- </div>-->
|
349
|
<!-- <div class="uk-width-expand">-->
|
350
|
<!-- <img src="assets/graph-assets/about/architecture/deduplication.svg">-->
|
351
|
<!-- </div>-->
|
352
|
</div>
|
353
|
</li>
|
354
|
<li>
|
355
|
<div class="uk-grid">
|
356
|
<!-- <div class="uk-width-3-5@m">-->
|
357
|
<div class="uk-margin-bottom uk-margin-medium-right uk-text-small">
|
358
|
<ul class="uk-subnav button-tab uk-grid uk-grid-small" uk-switcher>
|
359
|
<li><a>General</a></li>
|
360
|
<li><a>Mining</a></li>
|
361
|
<li><a>Bulk tagging / Deduction</a></li>
|
362
|
<li><a>Propagation</a></li>
|
363
|
</ul>
|
364
|
|
365
|
<ul class="uk-switcher uk-margin">
|
366
|
<li>
|
367
|
<img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
|
368
|
src="assets/graph-assets/about/architecture/enrichment.svg" alt="Enrichment">
|
369
|
<div class="uk-margin-bottom uk-margin-medium-right uk-text-small">
|
370
|
<p>
|
371
|
The aggregation processes run continuously and apply the vocabularies as they are at a given
moment in time.
A vocabulary may change after the aggregation of a data source has finished,
in which case the aggregated content does not reflect the current status of the controlled vocabularies.
<br><br>
In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes
applied to the raw and to the de-duplicated graph may introduce values that do not comply with
the current status of the OpenAIRE controlled vocabularies.
For these reasons, we included a final cleansing step at the end of the workflow
materialisation.
The output of the final cleansing step is the final version of the OpenAIRE Research Graph.
|
386
|
</p>
|
387
|
</div>
|
388
|
</li>
|
389
|
<li>
|
390
|
<img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
|
391
|
src="assets/graph-assets/about/architecture/enrichment.svg" alt="Enrichment">
|
392
|
<div
|
393
|
[class]="'uk-margin-bottom uk-margin-medium-right uk-text-small '+(enrichmentMiningReadMore ? '' : 'lines-18 multi-line-ellipsis')">
|
394
|
<div>
|
395
|
<div>
|
396
|
The OpenAIRE Research Graph is enriched by links mined by OpenAIRE’s full-text mining
algorithms, which scan the plain text of publications for funding information, references to datasets,
software URIs, accession numbers of bio-entities, and EPO patent mentions.
Custom mining modules also link research objects to specific research communities, initiatives
and infrastructures.
In addition, other inference modules provide content-based document classification, document
similarity, citation matching, and author affiliation matching.
|
404
|
<br><br>
|
405
|
<span class="portal-color">Project mining</span>
|
406
|
in OpenAIRE text-mines the full-texts of publications in order to extract matches to funding
project codes/IDs.
The mining algorithm uses
(i) the grant identifier, and
(ii) the project acronym (if available) of each project.
The mining algorithm:
(1) preprocesses/normalizes the full-texts using several functions, which depend on the
characteristics of each funder (i.e., the format of the grant identifiers), such as stopword
and/or punctuation removal, tokenization, stemming, and conversion to lowercase;
(2) matches grant identifiers against the normalized text using database
techniques; and
(3) validates and cleans the results by looking at the
context around the matched ID for relevant metadata and positive or negative words/phrases, in
order to calculate a confidence value for each publication-to-project link.
A confidence threshold is set to favour high accuracy while minimising false positives, such
as matches with page or report numbers, post/zip codes, parts of telephone numbers, DOIs or
URLs, and accession numbers.
The algorithm also applies rules for disambiguating results, as different funders can share
identical project IDs; for example, grant number 633172 could refer to the H2020 project EuroMix
but also to the Australian-funded NHMRC project “Brain activity (EEG) analysis and brain imaging
techniques to measure the neurobiological effects of sleep apnea”.
Project mining was the first Text & Data Mining (TDM) service of OpenAIRE.
Performance results vary from funder to funder, but precision is higher than 98% for all
funders and 99.5% for EC projects.
Recall is higher than 95% (99% for EC projects) when projects are properly acknowledged using
project/grant IDs.
|
434
|
<br><br>
|
435
|
<span class="portal-color">Dataset extraction</span>
|
436
|
runs on the full-texts of publications as described in “High-Pass Text Filtering for Citation
Matching”, TPDL 2017 [1].
In particular, we search for citations to datasets using their DOIs, titles and other metadata
(i.e., dates, creator names, publishers, etc.).
We extract parts of the text which look like citations and search for datasets using database
join and pattern matching techniques.
Based on the experiments described in the paper, the precision of the dataset extraction module is
98.5% and recall is 97.4%, but these values are probably overestimated since they do not take into
account corruptions that may take place during PDF-to-text extraction.
They were calculated on the extracted full-texts of small samples from PubMed and arXiv.
|
446
|
<br><br>
|
447
|
<span class="portal-color">Software extraction</span>
|
448
|
also runs on parts of the text which look like citations.
We search the citations for links to software in open software repositories, specifically
GitHub, SourceForge, Bitbucket and the Google Code archive.
After that, we search for links that are included in Software Heritage (SH,
https://www.softwareheritage.org) and return the permanent URL that SH provides for each
software project.
We also enrich this content with user names, titles and descriptions of the software projects
using web mining techniques.
Since software mining is based on URL matching, our precision is 100% (we return a software
link only if we find it in the text and there is no need to disambiguate).
The recall rate cannot be calculated for this mining task.
Although we apply all the necessary normalizations to the URLs in order to overcome the usual
issues (e.g., http or https, presence or absence of www, lower/upper case), we cannot count the cases
where software is mentioned by name rather than by a link to one of the supported software
repositories.
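The URL normalizations mentioned above (http or https, presence of www, letter case) can be sketched as follows. The host list, the regular expression and the function names are assumptions made for this example, and the Software Heritage lookup is not covered.

```python
import re
from urllib.parse import urlsplit

# Hosts of the repositories named in the text; the exact host list is an assumption.
SUPPORTED_HOSTS = ("github.com", "sourceforge.net", "bitbucket.org", "code.google.com")

# Simplified pattern for links inside citation-like text.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s)\]]+", re.IGNORECASE)


def normalize_url(url):
    """Drop the scheme, a leading "www.", the letter case and trailing punctuation."""
    url = url.strip().rstrip(".,;")
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.lower().rstrip("/")


def software_links(citation_text):
    """Return the normalized links to supported repositories found in a citation."""
    found = []
    for match in URL_PATTERN.findall(citation_text):
        normalized = normalize_url(match)
        host = normalized.split("/", 1)[0]
        if host in SUPPORTED_HOSTS and normalized not in found:
            found.append(normalized)
    return found
```

Two spellings of the same link, such as one with https and www in upper case and one with http and a trailing slash, collapse to a single normalized value, so a link is reported only once per citation.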
|
465
|
<br><br>
|
466
|
<span class="portal-color">For the extraction of bio-entities</span>, we focus on Protein Data
Bank (PDB) entries.
We have downloaded the database of PDB codes and we update it regularly.
We search the whole full-text of each publication for references to PDB codes.
We apply disambiguation rules (e.g., some PDB codes coincide with antibody codes, among
other issues) so that we return valid results.
Current precision is 98%.
Although it is risky to mention recall rates since these are usually overestimated, we have
calculated a recall rate of 98% using small samples of PubMed publications.
Moreover, our technique is able to identify about 30% more links to proteins than the ones
that are tagged in PubMed XML files.
|
479
|
<br><br>
|
480
|
<span class="portal-color">Other text-mining modules</span> include mining for links to EPO
patents, and custom mining modules for linking research objects to specific research
communities, initiatives and infrastructures, e.g. the COVID-19 mining module.
Apart from text-mining modules, OpenAIRE also provides a document classification service that
analyses the free text of the publication abstracts.
The purpose of the document classification module is to assign one or
more predefined content classes to a scientific text.
In OpenAIRE, the currently used taxonomies are arXiv, MeSH (Medical Subject Headings), ACM and
DDC (Dewey Decimal Classification, or Dewey Decimal System).
|
490
|
<br><br>
|
491
|
<hr>
|
492
|
[1] Foufoulas, Y., Stamatogiannakis, L., Dimitropoulos, H., & Ioannidis, Y. (2017, September).
High-Pass Text Filtering for Citation Matching.
In International Conference on Theory and Practice of Digital Libraries (pp. 355-366).
Springer, Cham.
|
497
|
</div>
|
498
|
</div>
|
499
|
</div>
|
500
|
<div *ngIf="!enrichmentMiningReadMore" class="uk-width-3-5@m uk-text-center clickable"
|
501
|
(click)="enrichmentMiningReadMore = true">
|
502
|
<a class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
|
503
|
</div>
|
504
|
<div *ngIf="enrichmentMiningReadMore" class="uk-width-3-5@m uk-text-center clickable"
|
505
|
(click)="enrichmentMiningReadMore = false">
|
506
|
<a class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
|
507
|
</div>
|
508
|
</li>
|
509
|
<li>
|
510
|
<img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
|
511
|
src="assets/graph-assets/about/architecture/enrichment.svg" alt="Enrichment">
|
512
|
<div class="uk-margin-bottom uk-margin-medium-right uk-text-small">
|
513
|
The Deduction process (also known as “bulk tagging”) enriches each record with new information
|
514
|
that
|
515
|
can be derived from the existing property values.
|
516
|
<br><br>
|
517
|
As of September 2020, three procedures are in place to relate a research product to a research
|
518
|
initiative, infrastructure (RI) or community (RC) based on:
|
519
|
<ul class="portal-circle">
|
520
|
<li>subjects (2.7M results tagged)</li>
|
521
|
<li>Zenodo community (16K results tagged)</li>
|
522
|
<li>the data source it comes from (250K results tagged)</li>
|
523
|
</ul>
|
524
|
The lists of subjects, Zenodo communities and data sources used to enrich the products are defined
by the managers of the community gateway or infrastructure monitoring dashboard associated with the
RC/RI.
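The three tagging criteria can be sketched as a simple rule lookup. The record fields, the structure of the rules and the community identifiers below are hypothetical and only illustrate the idea.

```python
def bulk_tag(record, community_rules):
    """Return the IDs of the communities a record is tagged with, in sorted order.

    record: dict with "subjects", "zenodo_communities" and "datasource" keys.
    community_rules: dict mapping a community ID to the subjects, Zenodo
    communities and data sources chosen by the managers of that community.
    """
    tags = set()
    for community_id, rules in community_rules.items():
        by_subject = set(record.get("subjects", [])) & set(rules.get("subjects", []))
        by_zenodo = (set(record.get("zenodo_communities", []))
                     & set(rules.get("zenodo_communities", [])))
        by_source = record.get("datasource") in rules.get("datasources", [])
        if by_subject or by_zenodo or by_source:
            tags.add(community_id)
    return sorted(tags)
```

A record is tagged as soon as one of the three criteria matches, so the criteria act as alternatives and not as a conjunction in this sketch.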
|
528
|
</div>
|
529
|
</li>
|
530
|
<li>
|
531
|
<img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
|
532
|
src="assets/graph-assets/about/architecture/enrichment.svg" alt="Enrichment">
|
533
|
<div
|
534
|
[class]="'uk-margin-bottom uk-margin-medium-right uk-text-small '+(enrichmentPropagationReadMore ? '' : 'lines-18 multi-line-ellipsis')">
|
535
|
<div>
|
536
|
<div>
|
537
|
This process “propagates” properties and links from one product to another if between the two
|
538
|
there is a “strong” semantic relationship.
|
539
|
<br><br>
|
540
|
As of September 2020, the following procedures are in place:
|
541
|
<ul class="portal-circle">
|
542
|
<li>
|
543
|
Propagation of the property “country” to results from institutional repositories:
e.g. a publication collected from an institutional repository maintained by an Italian
university will be enriched with the property “country = IT”.
|
546
|
</li>
|
547
|
<li>
|
548
|
Propagation of links to projects: e.g. a publication linked to project P “is supplemented by”
a dataset D.
Dataset D will get the link to project P.
The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
|
553
|
</li>
|
554
|
<li>
|
555
|
Propagation of related community/infrastructure/initiative from organizations to products
via affiliation relationships: e.g. a publication has an author affiliated with
organization O, and the manager of the community gateway C has declared that all the outputs of O
are relevant for community C.
The publication is then tagged as relevant for C.
|
561
|
</li>
|
562
|
<li>
|
563
|
Propagation of related community/infrastructure/initiative to related products: e.g.
a publication associated with community C is supplemented by a dataset D.
Dataset D will get the association to C.
The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
|
567
|
</li>
|
568
|
<li>
|
569
|
Propagation of ORCID identifiers to related products, if the products have the same
authors:
e.g. a publication has ORCIDs for its authors and is supplemented by a dataset D. Dataset D
has the same authors as the publication. The authors of D are enriched with the ORCIDs available
in the publication.
The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
|
577
|
</li>
|
578
|
</ul>
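The propagation of project links can be sketched as follows. The data structures and the function name are assumptions for illustration, and the sketch copies links only from the source to the target of each relation, in a single pass.

```python
SUPPLEMENT_RELATIONS = ("isSupplementedBy", "supplements")


def propagate_project_links(project_links, relations):
    """Copy project links across "isSupplementedBy" and "supplements" relations.

    project_links: dict mapping a product ID to a set of project IDs.
    relations: iterable of (source_id, relation_name, target_id) triples.
    Returns a new dict; only links present before propagation are copied,
    so a single pass does not chain across several hops.
    """
    enriched = dict((pid, set(projects)) for pid, projects in project_links.items())
    for source, relation, target in relations:
        if relation not in SUPPLEMENT_RELATIONS:
            continue
        inherited = project_links.get(source, set())
        if inherited:
            enriched.setdefault(target, set()).update(inherited)
    return enriched
```

The same pattern applies to the propagation of community associations: only the propagated property changes, while the relations that are followed stay the same.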
|
579
|
</div>
|
580
|
</div>
|
581
|
</div>
|
582
|
<div *ngIf="!enrichmentPropagationReadMore" class="uk-width-3-5@m uk-text-center clickable"
|
583
|
(click)="enrichmentPropagationReadMore = true">
|
584
|
<a class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
|
585
|
</div>
|
586
|
<div *ngIf="enrichmentPropagationReadMore" class="uk-width-3-5@m uk-text-center clickable"
|
587
|
(click)="enrichmentPropagationReadMore = false">
|
588
|
<a class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
|
589
|
</div>
|
590
|
</li>
|
591
|
</ul>
|
592
|
</div>
|
593
|
<!-- </div>-->
|
594
|
<!-- <div class="uk-width-expand">-->
|
595
|
<!-- <img src="assets/graph-assets/about/architecture/enrichment.svg">-->
|
596
|
<!-- </div>-->
|
597
|
</div>
|
598
|
</li>
<li>
<div class="uk-text-small uk-margin-large-top">
<!-- <div class="uk-width-3-5@m">-->
<img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
src="assets/graph-assets/about/architecture/post_cleaning.svg" alt="Post Cleaning">
<div class="uk-margin-bottom uk-margin-medium-right">
<p>
Lorem ipsum...
</p>
</div>
<!-- </div>-->
<!-- <div class="uk-width-expand">-->
<!-- <img src="assets/graph-assets/about/architecture/post_cleaning.svg">-->
<!-- </div>-->
</div>
</li>
<li>
<div class="uk-text-small uk-margin-large-top">
<!-- <div class="uk-width-3-5@m">-->
<img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
src="assets/graph-assets/about/architecture/indexing.svg" alt="Indexing">
<div class="uk-margin-bottom uk-margin-medium-right">
<p>
The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the
OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party
applications and organizations, such as:
</p>
<ul class="portal-circle">
<li class="uk-margin-small-bottom">
<span class="portal-color">EOSC</span>:
the OpenAIRE Research Graph APIs and Portals will offer the EOSC an Open Science Resource
Catalogue, keeping an up-to-date map of all research results (publications, datasets, software),
services, organizations, projects, and funders in Europe and beyond.
</li>
<li class="uk-margin-small-bottom">
<span class="portal-color">DSpace & EPrints</span>
repositories can install the OpenAIRE plugin to expose OpenAIRE-compliant metadata records via their
OAI-PMH endpoint and offer researchers the possibility to link their deposits to the funding
project by selecting it from the list of projects provided by OpenAIRE.
</li>
<li>
<span class="portal-color">EC participant portal (Sygma - System for Grant Management)</span>
uses the OpenAIRE API in the “Continuous Reporting” section.
Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in
the OpenAIRE Research Graph that are linked to the project.
The user can select the research products from the list and easily compile the continuous reporting
data of the project.
</li>
</ul>
</div>
<!-- </div>-->
<!-- <div class="uk-width-expand">-->
<!-- <img src="assets/graph-assets/about/architecture/indexing.svg">-->
<!-- </div>-->
</div>
</li>
<li>
<div class="uk-text-small uk-margin-large-top">
<!-- <div class="uk-width-3-5@m">-->
<img
class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image uk-padding-large uk-padding-remove-top uk-padding-remove-horizontal"
src="assets/graph-assets/about/architecture/stats_analysis.svg" alt="Stats Analysis">
<div class="uk-margin-bottom uk-margin-medium-right">
<p>
The OpenAIRE Research Graph is also processed by a pipeline that extracts statistics and produces
the charts for funders, research initiatives, infrastructures, and policy makers that you can see on
MONITOR.
Based on the information available in the graph, OpenAIRE provides a set of indicators for monitoring
funding and research impact and the uptake of Open Science publishing practices,
such as Open Access publishing of publications and datasets, availability of interlinks between
research products, and availability of post-print versions in institutional or thematic Open Access
repositories.
</p>
</div>
<!-- </div>-->
<!-- <div class="uk-width-expand">-->
<!-- <img src="assets/graph-assets/about/architecture/stats_analysis.svg">-->
<!-- </div>-->
</div>
</li>
</ul>
</div>
</div>
<div class="uk-padding-small uk-margin-top">
<h6>References</h6>
<ul class="uk-text-small portal-circle">
<li>
<a href="https://aka.ms/msracad" target="_blank">Microsoft Academic Graph</a>,
which is made available under the ODC Attribution License.
For more information on Microsoft Academic Graph, please also see the
<a href="https://docs.microsoft.com/en-us/academic-services/graph/resources-faq" target="_blank">FAQ</a>.
</li>
<li>
<a href="https://www.openaire.eu/aggregation-and-content-provision-workflows" target="_blank">https://www.openaire.eu/aggregation-and-content-provision-workflows</a>
</li>
</ul>
</div>
</div>
</div>
<div id="metrics" class="uk-container uk-container-large uk-section">
<div class="uk-padding-small">
<h2 class="uk-text-center">Data & Metrics</h2>
<h4 class="uk-text-center uk-margin-medium-top portal-color">Coming soon...</h4>
<!-- <div>-->
<!-- <h3 class="uk-margin-medium-top portal-color">Data</h3>-->
<!-- <div></div>-->
<!-- </div>-->
<!-- <div>-->
<!-- <h3 class="uk-margin-medium-top portal-color">Metrics</h3>-->
<!-- <div></div>-->
<!-- </div>-->
</div>
</div>
<div id="infrastructure" class="uk-container uk-container-large uk-section">
<div class="uk-padding-small">
<h2 class="uk-text-center">Infrastructure</h2>
<div>
<div class="uk-flex uk-flex-center">
<p class="uk-width-3-4@m uk-padding-small">
The OpenAIRE graph runs on a wide variety of hardware and software. As of December 2019, the
hardware infrastructure is as follows:
</p>
</div>
<img src="assets/graph-assets/about/infrastructure.png">
</div>
</div>
</div>
<div id="team" class="uk-container uk-container-large uk-section">
<div class="uk-padding-small">
<h2 class="uk-text-center">Team</h2>
<div>
<div class="uk-margin-bottom">
<div class="uk-flex uk-flex-middle uk-grid" uk-grid="">
<div class="uk-text-center uk-width-1-1@s uk-width-1-3@m uk-first-column">
<img src="assets/graph-assets/about/team.svg">
</div>

<div class="uk-text-center">
<div class="uk-margin-medium-bottom">
Key team members contributing to the Research Graph
</div>
<div>
<a class="uk-button portal-button" target="_blank" href="https://www.openaire.eu/research-graph-team">
Meet the team
<span class="space">
<svg height="16" viewBox="0 0 16 16" width="16" xmlns="http://www.w3.org/2000/svg">
<path
d="M12.578,4.244,5.667,11.155V8.167A.833.833,0,1,0,4,8.167v5A.832.832,0,0,0,4.833,14h5a.833.833,0,0,0,0-1.667H6.845l6.911-6.911a.833.833,0,1,0-1.178-1.178h0Z"
fill="#fff" id="arrow-down-left2" transform="translate(7.071 19.799) rotate(-135)">
</path>
</svg>
</span>
</a>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>