<div class="about">
  <div class="uk-section">
    <div class="uk-margin-large-left uk-margin-medium-bottom">
      <breadcrumbs [breadcrumbs]="breadcrumbs"></breadcrumbs>
    </div>
    <div class="firstBackground">
      <div class="uk-container">
        <h2 class="uk-text-center">About</h2>
        <div class="uk-flex uk-flex-center">
          <div class="uk-padding-small uk-width-4-5@m">
            <p>
              Open Science is gradually becoming the modus operandi in research practices, affecting the way researchers
              collaborate, publish, discover, and access scientific knowledge.
              Scientists are increasingly publishing research results beyond the article, to share all scientific
              products (metadata and files) generated during an experiment, such as datasets, software, and experiments.
              They publish in scholarly communication data sources (e.g. institutional repositories, data archives,
              software repositories), rely where possible on persistent identifiers (e.g. DOI, ORCID, Grid.ac, PDBs),
              specify semantic links to other research products (e.g. supplementedBy, citedBy, versionOf), and possibly
              to projects and/or the respective funders.
              By following such practices, scientists are implicitly constructing the Global Open Science Graph, where
              by "graph" we mean a collection of objects interlinked by semantic relationships.
              <br><br>
              The OpenAIRE Research Graph includes metadata and links between scientific products (e.g. literature,
              datasets, software, and "other research products"), organizations, funders, funding streams, projects,
              communities, and (provenance) data sources. The details of the <a
                href="https://zenodo.org/record/2643199#.XOqdstMzZ24" target="_blank">graph data model</a> can be found
              on Zenodo.
              <br><br>
              The Graph is obtained as an aggregation of the metadata and links collected from ~70,000
              trusted sources, further enriched with metadata and links provided by:</p>
            <ul class="portal-circle">
              <li class="uk-margin-bottom">OpenAIRE end-users, e.g. researchers, project administrators, and data curators
                providing links from scientific products to projects, funders, communities, or other products;
              </li>
              <li class="uk-margin-bottom">OpenAIRE full-text mining algorithms, run over ~10 million Open Access article
                full-texts;
              </li>
              <li>Research infrastructure scholarly services, bridged to the graph via OpenAIRE, exposing metadata of
                products such as research workflows, experiments, research objects, software, etc.
              </li>
            </ul>
          </div>
        </div>
      </div>
    </div>
  </div>
  <div id="architecture" class="uk-container uk-section">
    <div class="uk-padding-small">
      <h2 class="uk-text-center">Architecture</h2>
      <div class="uk-flex uk-flex-center">
        <div class="uk-width-4-5@m">
          <h3 class="uk-margin-medium-top portal-color">How we build it</h3>
          <div>
            <p>
              OpenAIRE collects metadata records from more than 70K scholarly communication sources from all over the
              world, including Open Access institutional repositories, data archives, and journals.
              All the metadata records (i.e. descriptions of research products) are gathered in a data lake, together
              with records from Crossref, Unpaywall, ORCID, Grid.ac, and information about projects provided by national
              and international funders.
              Dedicated inference algorithms applied to metadata and to the full-texts of Open Access publications
              enrich the content of the data lake with links between research results and projects, author affiliations,
              subject classification, and links to entries from domain-specific databases.
              Duplicated organisations and results are identified and merged together to obtain an open, trusted, public
              resource enabling explorations of the scholarly communication landscape like never before.
            </p>

          </div>
        </div>
      </div>
      <div class="uk-flex uk-flex-center uk-inline uk-margin-medium-top">
        <img [src]="'assets/graph-assets/about/architecture/'+architectureImage"
             class="uk-width-4-5 architecture-image">

        <a class="uk-position-absolute uk-transform-center" style="left: 27%; top: 48%"
           (click)="goTo('tabs_card'); changeTab(0)"
           (mouseenter)="architectureImage = 'aggregation_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
          <img [class]="(architectureImage == 'aggregation_hover.png' ? 'uk-invisible' : '')+' marker-gif'"
               src="assets/graph-assets/about/architecture/marker.gif" alt="action point aggregation">
        </a>
        <a class="uk-position-absolute uk-transform-center" style="left: 47%; top: 48%"
           (click)="goTo('tabs_card'); changeTab(1)"
           (mouseenter)="architectureImage = 'deduplication_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
          <img [class]="(architectureImage == 'deduplication_hover.png' ? 'uk-invisible' : '')+' marker-gif'"
               src="assets/graph-assets/about/architecture/marker.gif" alt="action point deduplication">
        </a>
        <a class="uk-position-absolute uk-transform-center" style="left: 58%; top: 48%"
           (click)="goTo('tabs_card'); changeTab(2)"
           (mouseenter)="architectureImage = 'enrichment_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
          <img [class]="(architectureImage == 'enrichment_hover.png' ? 'uk-invisible' : '')+' marker-gif'"
               src="assets/graph-assets/about/architecture/marker.gif" alt="action point enrichment">
        </a>
        <a class="uk-position-absolute uk-transform-center" style="left: 70%; top: 48%"
           (click)="goTo('tabs_card'); changeTab(3)"
           (mouseenter)="architectureImage = 'post_cleaning_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
          <img [class]="(architectureImage == 'post_cleaning_hover.png' ? 'uk-invisible' : '')+' marker-gif'"
               src="assets/graph-assets/about/architecture/marker.gif" alt="action point post cleaning">
        </a>
        <a class="uk-position-absolute uk-transform-center" style="left: 76%; top: 32%"
           (click)="goTo('tabs_card'); changeTab(4)"
           (mouseenter)="architectureImage = 'indexing_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
          <img [class]="(architectureImage == 'indexing_hover.png' ? 'uk-invisible' : '')+' marker-gif'"
               src="assets/graph-assets/about/architecture/marker.gif" alt="action point indexing">
        </a>
        <a class="uk-position-absolute uk-transform-center" style="left: 76%; top: 72%"
           (click)="goTo('tabs_card'); changeTab(5)"
           (mouseenter)="architectureImage = 'stats_analysis_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
          <img [class]="(architectureImage == 'stats_analysis_hover.png' ? 'uk-invisible' : '')+' marker-gif'"
               src="assets/graph-assets/about/architecture/marker.gif" alt="action point stats analysis">
        </a>
      </div>
      <div id="tabs_card"
           class="uk-margin-xlarge-top uk-padding-small">
        <div class="uk-card uk-card-default uk-card-body architecture-card">
          <ul #tabs uk-tab class="uk-tab">
            <li><a>Aggregation</a></li>
            <li><a>Deduplication</a></li>
            <li><a>Enrichment</a></li>
            <li><a>Post-Cleaning</a></li>
            <li><a>Indexing</a></li>
            <li><a>Stats Analysis</a></li>
          </ul>

          <ul class="uk-switcher uk-margin">
            <li>
              <!--            uk-grid-->
              <div class=" uk-margin-large-top uk-text-small">
                <!--              <div class="uk-width-3-5@m">-->
                <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                     src="assets/graph-assets/about/architecture/aggregation.png" alt="Aggregation">
                <div
                    [class]="'uk-margin-bottom uk-margin-medium-right '+(aggregationReadMore ? '' : 'lines-18 multi-line-ellipsis')">
                  <div>
                    OpenAIRE collects metadata records from a variety of content providers as described in
                    <a href="https://www.openaire.eu/aggregation-and-content-provision-workflows" target="_blank">https://www.openaire.eu/aggregation-and-content-provision-workflows</a>.
                    <br><br>
                    OpenAIRE aggregates metadata records describing objects of the research life-cycle from content
                    providers compliant with the
                    <a href="https://guidelines.openaire.eu" target="_blank">OpenAIRE guidelines</a>
                    and from entity registries (i.e. data sources offering authoritative lists of entities, like OpenDOAR,
                    re3data, DOAJ, and funder databases).
                    After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is
                    used to generate the final OpenAIRE Research Graph that you can access from the OpenAIRE portal and
                    the APIs.
                    <br><br>
                    The transformation process includes the application of cleaning functions whose goal is to ensure that
                    values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever
                    applicable, to a common controlled vocabulary.
                    The controlled vocabularies used for cleansing are accessible at
                    <a href="http://api.openaire.eu/vocabularies" target="_blank">http://api.openaire.eu/vocabularies</a>.
                    Each vocabulary features a set of controlled terms, each with one code, one label, and a set of
                    synonyms.
                    If a synonym is found as a field value, the value is replaced with the corresponding term.
                    In addition, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources
                    that are too big to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref,
                    ORCID, Microsoft Academic Graph, and Unpaywall), and ScholeXplorer, one of the Scholix hubs offering a
                    large set of links between research literature and data.
                  </div>
                </div>
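                <!--
                Editorial sketch, kept in a comment so the rendered page is unchanged. The
                synonym-to-term cleaning described above could look roughly like the Python
                below. The Term class, the helper names, the two-term sample vocabulary, and
                the case-insensitive lookup are assumptions made for illustration; the actual
                OpenAIRE cleaning functions and vocabulary contents may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """One controlled term: a code, a label, and its known synonyms."""
    code: str
    label: str
    synonyms: frozenset = field(default_factory=frozenset)


def build_synonym_index(terms):
    """Map every lower-cased code, label, and synonym to its controlled term."""
    index = {}
    for term in terms:
        for name in {term.code, term.label, *term.synonyms}:
            index[name.strip().lower()] = term
    return index


def clean_value(value, index):
    """Return the controlled code for a known synonym; leave other values unchanged."""
    term = index.get(value.strip().lower())
    return term.code if term is not None else value


# Hypothetical vocabulary, invented for this example.
access_rights = [
    Term("OPEN", "Open Access",
         frozenset({"open", "openaccess", "info:eu-repo/semantics/openAccess"})),
    Term("CLOSED", "Closed Access", frozenset({"closed", "closedaccess"})),
]
index = build_synonym_index(access_rights)
```

                Calling clean_value("OpenAccess", index) returns the code of the matching
                term, while a value with no known synonym is returned unchanged.
                -->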
                <div *ngIf="!aggregationReadMore" class="uk-width-3-5@m uk-text-center clickable"
                     (click)="aggregationReadMore = true">
                  <a class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
                </div>
                <div *ngIf="aggregationReadMore" class="uk-width-3-5@m uk-text-center clickable"
                     (click)="aggregationReadMore = false">
                  <a class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
                </div>
                <!--              </div>-->
                <!--              <div class="uk-width-expand">-->
                <!--                <img src="assets/graph-assets/about/architecture/aggregation.png">-->
                <!--              </div>-->
              </div>
            </li>
            <li>
              <div class="uk-grid">
                <!--              <div class="uk-width-3-5@m">-->
                <div class="uk-margin-bottom uk-margin-medium-right uk-text-small">
                  <ul class="uk-subnav button-tab" uk-switcher>
                    <li><a>Clustering</a></li>
                    <li><a>Matching & Election</a></li>
                  </ul>

                  <ul class="uk-switcher uk-margin align-list">
                    <li>
                      <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                           src="assets/graph-assets/about/architecture/deduplication.svg" alt="Deduplication">
                      <div
                          [class]="'uk-margin-bottom uk-margin-medium-right uk-text-small '+(dedupClusteringReadMore ? '' : 'lines-18 multi-line-ellipsis')">
                        <div>
                          <div>
                            Clustering is a common heuristic used to overcome the N x N complexity required to match all
                            pairs of objects to identify the equivalent ones.
                            The challenge is to identify a clustering function that maximizes the chance of comparing only
                            records that may lead to a match, while minimizing the number of equivalent records that
                            are never compared.
                            Since the equivalence function is to some extent tolerant of minimal errors (e.g. switching of
                            characters in the title, or minimal difference in letters), we need the clustering function to be
                            not too precise (e.g. a hash of the title), but also not too flexible (e.g. random ngrams of the
                            title).
                            On the other hand, reality tells us that in some cases equality of two records can only be
                            determined by their PIDs (e.g. DOI), as the metadata properties are very different across
                            different versions and no clustering function will ever bring them into the same cluster.
                            To meet these requirements, OpenAIRE clustering for products works with two functions:
                          </div>

                          <ul class="portal-circle">
                            <li>
                              <div>DOI: the function returns the DOI as the key when it is provided as part of the record
                                properties;
                              </div>
                            </li>
                            <li>
                              <div>
                                Title-based function: the function generates a key that depends on
                                (i) the number of significant words in the title (after normalization, stemming, etc.),
                                (ii) modulo 10 of the number of characters of such words, and
                                (iii) a string obtained as an alternation of the functions prefix(3) and suffix(3) (and
                                vice versa) of the first 3 words (2 words if the title only has 2). For example, the title
                                “Entity deduplication in big data graphs for scholarly communication” becomes “entity
                                deduplication big data graphs scholarly communication” with the two keys “7.1entionbig” and
                                “7.1itydedbig”
                                (where 1 is modulo 10 of 54 characters of the normalized title).
                              </div>
                            </li>
                          </ul>
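                          <!--
                          Editorial sketch, kept in a comment so the rendered page is unchanged. It
                          illustrates only part (iii) of the title-based function as worded above: the
                          alternation of prefix(3) and suffix(3) over the first three significant
                          words. The function name and parameters are hypothetical, the input is
                          assumed to be already normalized, and the production clustering code may
                          differ.

```python
def affix_chains(words, n=3, max_words=3):
    """Return the two alternating prefix/suffix strings for the leading words.

    The first chain starts with a prefix (prefix, suffix, prefix, ...);
    the second is its mirror image (suffix, prefix, suffix, ...).
    """
    head = words[:max_words]  # uses 2 words when the title only has 2
    prefix_first = "".join(
        w[:n] if i % 2 == 0 else w[-n:] for i, w in enumerate(head)
    )
    suffix_first = "".join(
        w[-n:] if i % 2 == 0 else w[:n] for i, w in enumerate(head)
    )
    return prefix_first, suffix_first


# Significant words of the example title, already normalized.
words = ["entity", "deduplication", "big", "data", "graphs",
         "scholarly", "communication"]
chains = affix_chains(words)
```

                          For the example title the two chains are "entionbig" and "itydedbig",
                          the trailing parts of the two keys quoted in the text.
                          -->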
                          <div>
                            To give an idea, this configuration generates around 77 million blocks, which we limited to 200
                            records each (only 15K blocks are affected by the cut), and entails 260 billion matches.
                            Matches in a block are performed using a “sliding window” set to 80 records. The records are
                            sorted lexicographically on a normalized version of their titles. The first record is matched
                            against the 80 following ones, then the second, etc., for an N log N complexity.
                          </div>
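                          <!--
                          Editorial sketch, kept in a comment so the rendered page is unchanged. The
                          sliding-window comparison inside one block could be enumerated as below. The
                          function name is hypothetical and the titles are assumed to be normalized
                          already; only the window size of 80 comes from the text.

```python
def sliding_window_pairs(titles, window=80):
    """Yield (left, right) pairs: each sorted title against the next `window` titles."""
    ordered = sorted(titles)
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + 1 + window]:
            yield left, right


# A full block of 200 records, using zero-padded numbers as stand-in titles.
block = [str(i).zfill(3) for i in range(200)]
comparisons = sum(1 for _ in sliding_window_pairs(block))
```

                          For a full block of 200 records and a window of 80 this gives 12,760
                          comparisons, instead of the 19,900 needed to compare every pair in the block.
                          -->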
                        </div>
                      </div>
                      <div *ngIf="!dedupClusteringReadMore" class="uk-width-3-5@m uk-text-center clickable"
                           (click)="dedupClusteringReadMore = true">
                        <a class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
                      </div>
                      <div *ngIf="dedupClusteringReadMore" class="uk-width-3-5@m uk-text-center clickable"
                           (click)="dedupClusteringReadMore = false">
                        <a class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
                      </div>
                    </li>
                    <li>
                      <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                           src="assets/graph-assets/about/architecture/deduplication.svg" alt="Deduplication">
                      <div
                          [class]="'uk-margin-bottom uk-margin-medium-right uk-text-small '+(dedupMatchingAndElectionReadMore ? '' : 'lines-18 multi-line-ellipsis')">
                        <div>
                          <div>
                            Once the clusters have been built, the algorithm proceeds with the comparisons.
                            Comparisons are driven by a decision tree that:
                          </div>
                          <ul class="uk-list">
                            <li class="uk-margin-small-bottom">
                              <div>
                                <span class="portal-color">1.</span> Tries to capture equivalence via PIDs: if records
                                share a PID then they are equivalent;
                              </div>
                            </li>
                            <li class="uk-margin-small-bottom">
                              <div>
                                <span class="portal-color">2.</span> Tries to capture difference:
                              </div>
                              <ul class="uk-list">
                                <li class="uk-margin-small-bottom">
                                  <div>
                                    <span class="portal-color">a.</span>
                                    If record titles contain different “numbers” then they are different (this rule is
                                    debatable and should be fine-tuned);
                                  </div>
                                </li>
                                <li class="uk-margin-small-bottom">
                                  <div>
                                    <span class="portal-color">b.</span>
                                    If the records have a different number of authors then they are different;
                                  </div>
                                </li>
                                <li class="uk-margin-small-bottom">
                                  <div>
                                    <span class="portal-color">c.</span>
                                    Note that different PIDs do not imply different records, as different versions may
                                    have different PIDs.
                                  </div>
                                </li>
                              </ul>
                            </li>
                            <li>
                              <div><span class="portal-color">3.</span> Measures equivalence:</div>
                              <ul class="uk-list portal-circle">
                                <li>
                                  <div>
                                    The titles of the two records are normalised and compared for similarity by applying
                                    the Levenshtein distance algorithm.
                                    The comparison returns a score in the range [0,1], where 0 means “very different” and
                                    1 means “equal”.
                                    If the score is greater than or equal to 0.99, the two records are identified as
                                    duplicates.
                                  </div>
                                </li>
                                <li>
                                  <div>Dates are not considered for equivalence matching because different versions of the
                                    same record should be merged and may be published on different dates, e.g. the pre-print
                                    and the published version of an article.
                                  </div>
                                </li>
                              </ul>
                            </li>
                          </ul>
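                          <!--
                          Editorial sketch, kept in a comment so the rendered page is unchanged. The
                          decision tree listed above could be approximated as below. The record layout
                          (a dict with pids, title, authors), the reading of "numbers" as runs of digits,
                          and the normalization of the edit distance into [0,1] as
                          1 - distance / length of the longer title are all assumptions; only the three
                          steps and the 0.99 threshold come from the text.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def title_similarity(a, b):
    """Assumed normalization: 1 means equal, 0 means very different."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def same_record(a, b, threshold=0.99):
    """Apply the three steps in order; records are dicts with pids, title, authors."""
    if set(a["pids"]) & set(b["pids"]):
        return True   # step 1: a shared PID means equivalent
    if re.findall(r"\d+", a["title"]) != re.findall(r"\d+", b["title"]):
        return False  # step 2a: the titles carry different numbers
    if len(a["authors"]) != len(b["authors"]):
        return False  # step 2b: different number of authors
    return title_similarity(a["title"], b["title"]) >= threshold  # step 3


# Hypothetical records, invented for this example.
r1 = {"pids": ["10.1/x"], "title": "Graph study part 1", "authors": ["A", "B"]}
r2 = {"pids": ["10.1/x"], "title": "Totally different", "authors": ["C"]}
r3 = {"pids": [], "title": "Graph study part 2", "authors": ["A", "B"]}
r4 = {"pids": ["10.2/y"], "title": "graph study part 1", "authors": ["X", "Y"]}
```

                          Here r1 and r2 match on the shared PID, r1 and r3 are separated by the
                          different number in the title, and r1 and r4 match on title similarity even
                          though their PIDs differ (rule 2c).
                          -->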
                          <div>
                            Once the equivalence relationships between pairs of records are set, the groups of equivalent
                            records are obtained (transitive closure, i.e. “mesh”).
                            From each such group a new representative object is obtained, which inherits all properties from
                            the merged records and keeps track of their provenance.
                            The ID of the representative record is obtained by prepending the prefix “dedup_” to the MD5 of
                            the first ID in the group (given their lexicographical ordering).
                            A new, more stable function to generate the ID is under development, which exploits the DOI
                            when the records to be merged include a Crossref or a DataCite record.
                          </div>
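                          <!--
                          Editorial sketch, kept in a comment so the rendered page is unchanged. The
                          grouping and ID election described above could be expressed as below. The
                          function names are hypothetical, union-find is one possible way to compute
                          the transitive closure, and hashing the UTF-8 bytes of the ID and using the
                          hexadecimal digest are assumptions; only the "dedup_" prefix, the MD5, and
                          the lexicographical choice of the first ID come from the text.

```python
import hashlib


def group_equivalent(pairs):
    """Union-find over equivalent ID pairs; returns the groups as sorted lists."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())


def representative_id(member_ids):
    """Prefix 'dedup_' plus the MD5 of the lexicographically first member ID."""
    first = min(member_ids)
    return "dedup_" + hashlib.md5(first.encode("utf-8")).hexdigest()


# Hypothetical record IDs, invented for this example.
groups = group_equivalent([("r3", "r1"), ("r1", "r2"), ("r9", "r8")])
ids = [representative_id(group) for group in groups]
```

                          The three records r1, r2, r3 end up in one group even though r2 and r3
                          were never matched directly, and the group ID depends only on r1.
                          -->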
                        </div>
                      </div>
                      <div *ngIf="!dedupMatchingAndElectionReadMore" class="uk-width-3-5@m uk-text-center clickable"
                           (click)="dedupMatchingAndElectionReadMore = true">
                        <a class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
                      </div>
                      <div *ngIf="dedupMatchingAndElectionReadMore" class="uk-width-3-5@m uk-text-center clickable"
                           (click)="dedupMatchingAndElectionReadMore = false">
                        <a class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
                      </div>
                    </li>
                  </ul>
                </div>
                <!--              </div>-->
                <!--              <div class="uk-width-expand">-->
                <!--                <img src="assets/graph-assets/about/architecture/deduplication.svg">-->
                <!--              </div>-->
              </div>
            </li>
            <li>
              <div class="uk-grid">
                <!--              <div class="uk-width-3-5@m">-->
                <div class="uk-margin-bottom uk-margin-medium-right uk-text-small">
                  <ul class="uk-subnav button-tab uk-grid uk-grid-small" uk-switcher>
                    <li><a>General</a></li>
                    <li><a>Mining</a></li>
                    <li><a>Bulk tagging/ Deduction</a></li>
                    <li><a>Propagation</a></li>
                  </ul>

                  <ul class="uk-switcher uk-margin">
                    <li>
                      <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                           src="assets/graph-assets/about/architecture/enrichment.svg" alt="Enrichment">
                      <div class="uk-margin-bottom uk-margin-medium-right uk-text-small">
                        <p>
                          The aggregation processes run continuously and apply the vocabularies as they are at a given
                          moment in time.
                          A vocabulary may change after the aggregation of a data source has finished, so the
                          aggregated content may not reflect the current status of the controlled vocabularies.
                          <br><br>
                          In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes
                          applied on the raw and on the de-duplicated graph may introduce values that do not comply with
                          the current status of the OpenAIRE controlled vocabularies.
                          For these reasons, we included a final cleansing step at the end of the workflow
                          materialisation.
                          The output of the final cleansing step is the final version of the OpenAIRE Research Graph.
                        </p>
                      </div>
                    </li>
                    <li>
                      <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                           src="assets/graph-assets/about/architecture/enrichment.svg" alt="Enrichment">
                      <div
                          [class]="'uk-margin-bottom uk-margin-medium-right uk-text-small '+(enrichmentMiningReadMore ? '' : 'lines-18 multi-line-ellipsis')">
                        <div>
                          <div>
                            The OpenAIRE Research Graph is enriched by links mined by OpenAIRE’s full-text mining
                            algorithms, which scan the plain texts of publications for funding information, references to
                            datasets, software URIs, accession numbers of bioentities, and EPO patent mentions.
                            Custom mining modules also link research objects to specific research communities, initiatives
                            and infrastructures.
                            In addition, other inference modules provide content-based document classification, document
                            similarity, citation matching, and author affiliation matching.
                            <br><br>
                            <span class="portal-color">Project mining</span>
                            in OpenAIRE text mines the full-texts of publications in order to extract matches to funding
                            project codes/IDs.
                            The mining algorithm works by utilising
                            (i) the grant identifier, and
                            (ii) the project acronym (if available) of each project.
                            The mining algorithm:
                            (1) preprocesses/normalizes the full-texts using several functions, which depend on the
                            characteristics of each funder (i.e., the format of the grant identifiers), such as stopword
                            and/or punctuation removal, tokenization, stemming, and conversion to lowercase;
                            (2) matches grant identifiers against the normalized text using database techniques; and
                            (3) validates and cleans the results by looking at the context around the matched ID for
                            relevant metadata and positive or negative words/phrases, in order to calculate a confidence
                            value for each publication-to-project link.
                            A confidence threshold is set to keep accuracy high while minimising false positives, such
                            as matches with page or report numbers, post/zip codes, parts of telephone numbers, DOIs or
                            URLs, and accession numbers.
                            The algorithm also applies rules for disambiguating results, as different funders can share
                            identical project IDs; for example, grant number 633172 could refer to H2020 project EuroMix
                            but also to Australian-funded NHMRC project “Brain activity (EEG) analysis and brain imaging
                            techniques to measure the neurobiological effects of sleep apnea”.
                            Project mining was the first Text & Data Mining (TDM) service of OpenAIRE.
                            Performance results vary from funder to funder, but precision is higher than 98% for all
                            funders and 99.5% for EC projects.
                            Recall is higher than 95% (99% for EC projects) when projects are properly acknowledged using
                            project/grant IDs.
                            <br><br>
                            <span class="portal-color">Dataset extraction</span>
                            runs on publication full-texts as described in “High pass text-filtering for Citation
                            matching”, TPDL 2017[1].
                            In particular, we search for citations to datasets using their DOIs, titles and other metadata
                            (i.e., dates, creator names, publishers, etc.).
                            We extract parts of the text which look like citations and search for datasets using database
                            join and pattern matching techniques.
                            Based on the experiments described in the paper, precision of the dataset extraction module is
                            98.5% and recall is 97.4%, although these figures are probably overestimated since they do not
                            take into account corruptions that may take place during PDF-to-text extraction.
                            They are calculated on the extracted full-texts of small samples from PubMed and arXiv.
                            <br><br>
                            <span class="portal-color">Software extraction</span>
                            also runs on the parts of the text that look like citations.
                            We search the citations for links to software in open software repositories, specifically
                            GitHub, SourceForge, Bitbucket and the Google Code Archive.
                            After that, we search for the links that are included in Software Heritage (SH,
                            https://www.softwareheritage.org) and return the permanent URL that SH provides for each
                            software project.
                            We also enrich this content with user names, titles and descriptions of the software projects
                            using web mining techniques.
                            Since software mining is based on URL matching, our precision is 100% (we return a software
                            link only if we find it in the text, so there is no need to disambiguate).
                            Recall cannot be calculated for this mining task.
                            Although we apply all the necessary normalizations to the URLs to overcome common issues
                            (e.g., http or https, presence or absence of www, lower/upper case), we do not count cases
                            where software is mentioned by name rather than by a link to one of the supported software
                            repositories.
                            <br><br>
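<!--
Developer note (illustrative only): a minimal sketch of the URL normalization and matching
step described above (scheme, www prefix and letter case). The host list follows the
repositories named in the text; the exact host names, the regular expression and the
function names are assumptions, and the Software Heritage lookup is not shown.

```python
import re

# Assumed host names for the repositories named in the text.
SUPPORTED_HOSTS = {"github.com", "sourceforge.net", "bitbucket.org", "code.google.com"}

# Anything that starts like a URL, up to the next whitespace.
CANDIDATE_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def normalize(url):
    """Return a lower-case 'host/path' key without scheme, www or trailing punctuation."""
    key = url.strip().lower()
    key = re.sub(r"^https?://", "", key)
    key = re.sub(r"^www\.", "", key)
    return key.rstrip(".,;:)]}/")


def software_links(text):
    """Return the normalized links to supported repositories, in order of appearance."""
    found = []
    for match in CANDIDATE_RE.finditer(text):
        key = normalize(match.group(0))
        host = key.split("/", 1)[0]
        if host in SUPPORTED_HOSTS and "/" in key and key not in found:
            found.append(key)
    return found
```

Because a link is returned only when it literally appears in the text, every result is a true
mention, which is the reasoning behind the 100% precision stated above. Software mentioned
by name only is never found by this approach.
-->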
                            <span class="portal-color">For the extraction of bio-entities</span>, we focus on Protein Data
                            Bank (PDB) entries.
                            We have downloaded the database of PDB codes and we update it regularly.
                            We search the whole full text of each publication for references to PDB codes.
                            We apply disambiguation rules (e.g., for PDB codes that coincide with antibody codes) so that
                            we return valid results.
                            Current precision is 98%.
                            Although it is risky to report recall rates, since these are usually overestimated, we have
                            calculated a recall rate of 98% using small samples of PubMed publications.
                            Moreover, our technique is able to identify about 30% more links to proteins than the ones
                            that are tagged in PubMed XML records.
                            <br><br>
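<!--
Developer note (illustrative only): a minimal sketch of matching PDB codes against a
downloaded list of known codes. It assumes classic four-character PDB identifiers (a digit
from 1 to 9 followed by three alphanumeric characters). The context rule (a cue such as
"PDB" must appear near the candidate) is a hypothetical stand-in for the disambiguation
rules mentioned above, and the names CUES and find_pdb_codes are assumptions.

```python
import re

PDB_ID_RE = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")

# Hypothetical context cues used to reject look-alikes such as years or antibody codes.
CUES = ("pdb", "protein data bank")


def find_pdb_codes(text, known_codes, window=40):
    """Return known PDB codes that occur near a context cue, in order of appearance."""
    lowered = text.lower()
    hits = []
    for match in PDB_ID_RE.finditer(text):
        code = match.group(0).upper()
        if code not in known_codes:
            continue  # not in the downloaded list of PDB codes
        start = max(0, match.start() - window)
        context = lowered[start:match.end() + window]
        if any(cue in context for cue in CUES) and code not in hits:
            hits.append(code)
    return hits
```

The lookup in known_codes plays the role of the regularly updated PDB database; the window
size trades recall for precision and would need tuning on real full texts.
-->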
                            <span class="portal-color">Other text-mining modules</span> include mining for links to EPO
                            patents, and custom mining modules for linking research objects to specific research
                            communities, initiatives and infrastructures, e.g. the COVID-19 mining module.
                            Apart from text-mining modules, OpenAIRE also provides a document classification service that
                            analyses the free text of publication abstracts.
                            The purpose of the document classification module is to assign a scientific text to one or
                            more predefined content classes.
                            In OpenAIRE, the currently used taxonomies are arXiv, MeSH (Medical Subject Headings), ACM and
                            DDC (Dewey Decimal Classification, or Dewey Decimal System).
                            <br><br>
                            <hr>
                            [1] Foufoulas, Y., Stamatogiannakis, L., Dimitropoulos, H., & Ioannidis, Y. (2017, September).
                            High-Pass Text Filtering for Citation Matching.
                            In International Conference on Theory and Practice of Digital Libraries (pp. 355-366).
                            Springer, Cham.
                          </div>
                        </div>
                      </div>
                      <div *ngIf="!enrichmentMiningReadMore" class="uk-width-3-5@m uk-text-center clickable"
                           (click)="enrichmentMiningReadMore = true">
                        <a class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
                      </div>
                      <div *ngIf="enrichmentMiningReadMore" class="uk-width-3-5@m uk-text-center clickable"
                           (click)="enrichmentMiningReadMore = false">
                        <a class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
                      </div>
                    </li>
                    <li>
                      <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                           src="assets/graph-assets/about/architecture/enrichment.svg" alt="Enrichment">
                      <div class="uk-margin-bottom uk-margin-medium-right uk-text-small">
                        The Deduction process (also known as “bulk tagging”) enriches each record with new information
                        that can be derived from the existing property values.
                        <br><br>
                        As of September 2020, three procedures are in place to relate a research product to a research
                        initiative, infrastructure (RI) or community (RC) based on:
                        <ul class="portal-circle">
                          <li>subjects (2.7M results tagged)</li>
                          <li>Zenodo community (16K results tagged)</li>
                          <li>the data source it comes from (250K results tagged)</li>
                        </ul>
                        The lists of subjects, Zenodo communities and data sources used to enrich the products are
                        defined by the managers of the community gateway or infrastructure monitoring dashboard
                        associated with the RC/RI.
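<!--
Developer note (illustrative only): a minimal sketch of the three bulk tagging criteria
listed above (subjects, Zenodo community, data source). The record and configuration
layouts, the field names and the identifiers used in the usage line are hypothetical.

```python
def bulk_tag(record, configs):
    """Return the sorted ids of the RCs/RIs whose criteria the record meets.

    record:  dict with optional keys "subjects", "zenodo_communities", "datasource"
    configs: list of dicts with key "id" and optional sets under the keys
             "subjects", "zenodo_communities" and "datasources"
    """
    subjects = set(record.get("subjects", []))
    zenodo = set(record.get("zenodo_communities", []))
    datasource = record.get("datasource")
    tags = set()
    for cfg in configs:
        by_subject = bool(subjects & cfg.get("subjects", set()))
        by_zenodo = bool(zenodo & cfg.get("zenodo_communities", set()))
        by_datasource = datasource in cfg.get("datasources", set())
        if by_subject or by_zenodo or by_datasource:
            tags.add(cfg["id"])
    return sorted(tags)
```

For example, with a configuration whose "subjects" set contains "marine biology", a record
carrying that subject is tagged with the configuration's id, whatever its data source.
-->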
                      </div>
                    </li>
                    <li>
                      <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                           src="assets/graph-assets/about/architecture/enrichment.svg" alt="Enrichment">
                      <div
                          [class]="'uk-margin-bottom uk-margin-medium-right uk-text-small '+(enrichmentPropagationReadMore ? '' : 'lines-18 multi-line-ellipsis')">
                        <div>
                          <div>
                            This process “propagates” properties and links from one product to another if there is a
                            “strong” semantic relationship between the two.
                            <br><br>
                            As of September 2020, the following procedures are in place:
                            <ul class="portal-circle">
                              <li>
                                Propagation of the property “country” to results from institutional repositories:
                                e.g. a publication collected from an institutional repository maintained by an Italian
                                university will be enriched with the property “country = IT”.
                              </li>
                              <li>
                                Propagation of links to projects: e.g. a publication linked to project P “is supplemented
                                by” a dataset D.
                                Dataset D will get the link to project P.
                                The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
                              </li>
                              <li>
                                Propagation of related community/infrastructure/initiative from organizations to products
                                via affiliation relationships: e.g. a publication has an author affiliated with
                                organization O.
                                The manager of the community gateway C declared that the outputs of O are all relevant for
                                their community C.
                                The publication is tagged as relevant for C.
                              </li>
                              <li>
                                Propagation of related community/infrastructure/initiative to related products: e.g. a
                                publication associated with community C is supplemented by a dataset D.
                                Dataset D will get the association to C.
                                The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
                              </li>
                              <li>
                                Propagation of ORCID identifiers to related products, if the products have the same
                                authors: e.g. a publication has ORCIDs for its authors and is supplemented by a dataset D.
                                Dataset D has the same authors as the publication.
                                Authors of D are enriched with the ORCIDs available in the publication.
                                The relationships considered for this procedure are “isSupplementedBy” and “supplements”.
                              </li>
                            </ul>
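<!--
Developer note (illustrative only): a minimal sketch of the "propagation of links to
projects" procedure listed above. It assumes a single hop from the source of an
“isSupplementedBy” or “supplements” relationship to its target; whether the real pipeline
also propagates in the opposite direction or over several hops is not stated in the text.
The data layout and the function name are hypothetical.

```python
PROPAGATING_RELATIONS = {"isSupplementedBy", "supplements"}


def propagate_project_links(project_links, relations):
    """Return a copy of project_links enriched through strong relationships.

    project_links: dict mapping a product id to a set of project ids
    relations:     iterable of (source id, relation name, target id) triples
    """
    enriched = {pid: set(projects) for pid, projects in project_links.items()}
    for source, relation, target in relations:
        if relation not in PROPAGATING_RELATIONS:
            continue
        inherited = project_links.get(source, set())
        if inherited:
            # Only links present before propagation are copied (single hop).
            enriched.setdefault(target, set()).update(inherited)
    return enriched
```

With a publication linked to project P and a triple (publication, "isSupplementedBy", D),
the result maps D to the set containing P, as in the example given in the list above. The
community and ORCID procedures follow the same pattern with a different property copied.
-->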
                          </div>
                        </div>
                      </div>
                      <div *ngIf="!enrichmentPropagationReadMore" class="uk-width-3-5@m uk-text-center clickable"
                           (click)="enrichmentPropagationReadMore = true">
                        <a class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
                      </div>
                      <div *ngIf="enrichmentPropagationReadMore" class="uk-width-3-5@m uk-text-center clickable"
                           (click)="enrichmentPropagationReadMore = false">
                        <a class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
                      </div>
                    </li>
                  </ul>
                </div>
                <!--              </div>-->
                <!--              <div class="uk-width-expand">-->
                <!--                <img src="assets/graph-assets/about/architecture/enrichment.svg">-->
                <!--              </div>-->
              </div>
            </li>
            <li>
              <div class="uk-text-small uk-margin-large-top">
                <!--              <div class="uk-width-3-5@m">-->
                <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                     src="assets/graph-assets/about/architecture/post_cleaning.svg" alt="Post Cleaning">
                <div class="uk-margin-bottom uk-margin-medium-right">
                  <p>
                    Lorem ipsum...
                  </p>
                </div>
                <!--              </div>-->
                <!--              <div class="uk-width-expand">-->
                <!--                <img src="assets/graph-assets/about/architecture/post_cleaning.svg">-->
                <!--              </div>-->
              </div>
            </li>
            <li>
              <div class="uk-text-small uk-margin-large-top">
                <!--              <div class="uk-width-3-5@m">-->
                <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                     src="assets/graph-assets/about/architecture/indexing.svg" alt="Indexing">
                <div class="uk-margin-bottom uk-margin-medium-right">
                  <p>
                    The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the
                    OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party
                    applications and organizations, such as:
                  </p>
                  <ul class="portal-circle">
                    <li class="uk-margin-small-bottom">
                      <span class="portal-color">EOSC</span>:
                      the OpenAIRE Research Graph APIs and Portals will offer the EOSC an Open Science Resource
                      Catalogue, keeping an up-to-date map of all research results (publications, datasets, software),
                      services, organizations, projects and funders in Europe and beyond.
                    </li>
                    <li class="uk-margin-small-bottom">
                      <span class="portal-color">DSpace & EPrints</span>
                      repositories can install the OpenAIRE plugin to expose OpenAIRE-compliant metadata records via
                      their OAI-PMH endpoint and to let researchers link their depositions to the funding project by
                      selecting it from the list of projects provided by OpenAIRE.
                    </li>
                    <li>
                      <span class="portal-color">EC participant portal (Sygma - System for Grant Management)</span>
                      uses the OpenAIRE API in the “Continuous Reporting” section.
                      Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in
                      the OpenAIRE Research Graph that are linked to the project.
                      The user can select the research products from the list and easily compile the continuous reporting
                      data of the project.
                    </li>
                  </ul>
                </div>
                <!--              </div>-->
                <!--              <div class="uk-width-expand">-->
                <!--                <img src="assets/graph-assets/about/architecture/indexing.svg">-->
                <!--              </div>-->
              </div>
            </li>
            <li>
              <div class="uk-text-small uk-margin-large-top">
                <!--              <div class="uk-width-3-5@m">-->
                <img
                    class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image uk-padding-large uk-padding-remove-top uk-padding-remove-horizontal"
                    src="assets/graph-assets/about/architecture/stats_analysis.svg" alt="Stats Analysis">
                <div class="uk-margin-bottom uk-margin-medium-right">
                  <p>
                    The OpenAIRE Research Graph is also processed by a pipeline that extracts the statistics and
                    produces the charts for funders, research initiatives, infrastructures, and policy makers that
                    you can see on MONITOR.
                    Based on the information available in the graph, OpenAIRE provides a set of indicators for
                    monitoring funding and research impact and the uptake of Open Science publishing practices,
                    such as Open Access publishing of publications and datasets, availability of interlinks between
                    research products, and availability of post-print versions in institutional or thematic Open
                    Access repositories.
                  </p>
                </div>
                <!--              </div>-->
                <!--              <div class="uk-width-expand">-->
                <!--                <img src="assets/graph-assets/about/architecture/stats_analysis.svg">-->
                <!--              </div>-->
              </div>
            </li>
          </ul>
        </div>
      </div>
      <div class="uk-padding-small uk-margin-top">
        <h6>References</h6>
        <ul class="uk-text-small portal-circle">
          <li>
            <a href="https://aka.ms/msracad" target="_blank">Microsoft Academic Graph</a>
            which is made available under the ODC Attribution License.
            For more information on Microsoft Academic Graph please also read
            <a href="https://docs.microsoft.com/en-us/academic-services/graph/resources-faq" target="_blank">here</a>.
          </li>
          <li>
            <a href="https://www.openaire.eu/aggregation-and-content-provision-workflows" target="_blank">https://www.openaire.eu/aggregation-and-content-provision-workflows</a>
          </li>
        </ul>
      </div>
    </div>
  </div>
  <div id="metrics" class="uk-container uk-container-large uk-section">
    <div class="uk-padding-small">
      <h2 class="uk-text-center">Data & Metrics</h2>
      <h4 class="uk-text-center uk-margin-medium-top portal-color">Coming soon...</h4>
      <!--        <div>-->
      <!--          <h3 class="uk-margin-medium-top portal-color">Data</h3>-->
      <!--          <div></div>-->
      <!--        </div>-->
      <!--        <div>-->
      <!--          <h3 class="uk-margin-medium-top portal-color">Metrics</h3>-->
      <!--          <div></div>-->
      <!--        </div>-->
    </div>
  </div>
  <div id="infrastructure" class="uk-container uk-container-large uk-section">
    <div class="uk-padding-small">
      <h2 class="uk-text-center">Infrastructure</h2>
      <div>
        <div class="uk-flex uk-flex-center">
          <p class="uk-width-3-4@m uk-padding-small">
            The OpenAIRE Research Graph runs on a wide variety of hardware and software. As of December 2019, the
            hardware infrastructure is the following:
          </p>
        </div>
        <img src="assets/graph-assets/about/infrastructure.png">
      </div>
    </div>
  </div>
  <div id="team" class="uk-container uk-container-large uk-section">
    <div class="uk-padding-small">
      <h2 class="uk-text-center">Team</h2>
      <div>
        <div class="uk-margin-bottom">
          <div class="uk-flex uk-flex-middle uk-grid" uk-grid="">
            <div class="uk-text-center uk-width-1-1@s uk-width-1-3@m uk-first-column">
              <img src="assets/graph-assets/about/team.svg">
            </div>

            <div class="uk-text-center">
              <div class="uk-margin-medium-bottom">
                Key team members contributing to the Research Graph
              </div>
              <div>
                <a class="uk-button portal-button" target="_blank" href="https://www.openaire.eu/research-graph-team">
                  Meet the team
                  <span class="space">
                <svg height="16" viewBox="0 0 16 16" width="16" xmlns="http://www.w3.org/2000/svg">
                  <path
                      d="M12.578,4.244,5.667,11.155V8.167A.833.833,0,1,0,4,8.167v5A.832.832,0,0,0,4.833,14h5a.833.833,0,0,0,0-1.667H6.845l6.911-6.911a.833.833,0,1,0-1.178-1.178h0Z"
                      fill="#fff" id="arrow-down-left2" transform="translate(7.071 19.799) rotate(-135)">
                  </path>
                </svg>
              </span>
                </a>
              </div>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</div>