<schema2jsonld [URL]="properties.domain + '/about'"
               [logoURL]="properties.domain + '/assets/common-assets/logo-small-graph.png'"
               [description]="description"
               type="other"
               [name]="title">
</schema2jsonld>
<div class="about">
  <div class="uk-section">
    <div class="uk-margin-large-left uk-margin-medium-bottom">
      <breadcrumbs [breadcrumbs]="breadcrumbs"></breadcrumbs>
    </div>
    <div class="firstBackground">
      <div class="uk-container">
        <h2 class="uk-text-center">About</h2>
        <div class="uk-flex uk-flex-center">
          <div class="uk-padding-small uk-width-4-5@m">
            <p>
              Open Science is gradually becoming the modus operandi in research practices, affecting the way researchers
              collaborate, publish, discover, and access scientific knowledge.
              Scientists increasingly publish research results beyond the article, sharing all scientific
              products (metadata and files) generated during an experiment, such as datasets, software, and experiments.
              They publish in scholarly communication data sources (e.g. institutional repositories, data archives,
              software repositories), rely where possible on persistent identifiers (e.g. DOI, ORCID, Grid.ac, PDBs),
              specify semantic links to other research products (e.g. supplementedBy, citedBy, versionOf), and possibly
              to projects and/or the relevant funders.
              By following such practices, scientists are implicitly constructing the Global Open Science Graph, where
              by "graph" we mean a collection of objects interlinked by semantic relationships.
              <br><br>
              The OpenAIRE Research Graph includes metadata and links between scientific products (e.g. literature,
              datasets, software, and "other research products"), organizations, funders, funding streams, projects,
              communities, and (provenance) data sources. The details of the <a
                href="https://zenodo.org/record/2643199#.XOqdstMzZ24" target="_blank">graph data model</a> can be found
              on Zenodo.org.
              <br><br>
              The Graph is obtained as an aggregation of the metadata and links collected from ~70,000
              trusted sources, further enriched with metadata and links provided by:</p>
            <ul class="portal-circle">
              <li class="uk-margin-bottom">OpenAIRE end-users, e.g. researchers, project administrators, and data curators
                providing links from scientific products to projects, funders, communities, or other products;
              </li>
              <li class="uk-margin-bottom">OpenAIRE full-text mining algorithms run over ~10 million Open Access article
                full-texts;
              </li>
              <li>Research infrastructure scholarly services, bridged to the graph via OpenAIRE, exposing metadata of
                products such as research workflows, experiments, research objects, software, etc.
              </li>
            </ul>
          </div>
        </div>
      </div>
    </div>
  </div>
  <div id="architecture" class="uk-container uk-section">
    <div class="uk-padding-small">
      <h2 class="uk-text-center">Architecture</h2>
      <div class="uk-flex uk-flex-center">
        <div class="uk-width-4-5@m">
          <h3 class="uk-margin-medium-top portal-color">How we build it</h3>
          <div>
            <p>
              OpenAIRE collects metadata records from more than 70K scholarly communication sources from all over the
              world, including Open Access institutional repositories, data archives, and journals.
              All the metadata records (i.e. descriptions of research products) are put together in a data lake,
              together with records from Crossref, Unpaywall, ORCID, Grid.ac, and information about projects provided
              by national and international funders.
              Dedicated inference algorithms applied to metadata and to the full-texts of Open Access publications
              enrich the content of the data lake with links between research results and projects, author
              affiliations, subject classification, and links to entries from domain-specific databases.
              Duplicated organisations and results are identified and merged to obtain an open, trusted, public
              resource enabling explorations of the scholarly communication landscape like never before.
            </p>
          </div>
        </div>
      </div>
      <div class="uk-flex uk-flex-center uk-inline uk-margin-medium-top">
        <img [src]="'assets/graph-assets/about/architecture/'+architectureImage"
             class="uk-width-4-5 architecture-image">
        <a class="uk-position-absolute uk-transform-center uk-padding" style="left: 27%; top: 51.5%"
           (click)="changeTab(0)" routerLink="/about" fragment="tabs_card"
           (mouseenter)="architectureImage = 'aggregation_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
          <action-point [class.uk-invisible]="architectureImage == 'aggregation_hover.png'"></action-point>
        </a>
        <a class="uk-position-absolute uk-transform-center uk-padding" style="left: 47%; top: 51.5%"
           (click)="changeTab(1)" routerLink="/about" fragment="tabs_card"
           (mouseenter)="architectureImage = 'deduplication_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
          <action-point [class.uk-invisible]="architectureImage == 'deduplication_hover.png'"></action-point>
        </a>
        <a class="uk-position-absolute uk-transform-center uk-padding" style="left: 58%; top: 51.5%"
           (click)="changeTab(2)" routerLink="/about" fragment="tabs_card"
           (mouseenter)="architectureImage = 'enrichment_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
          <action-point [class.uk-invisible]="architectureImage == 'enrichment_hover.png'"></action-point>
        </a>
        <a class="uk-position-absolute uk-transform-center uk-padding" style="left: 70%; top: 51.5%"
           (click)="changeTab(3)" routerLink="/about" fragment="tabs_card"
           (mouseenter)="architectureImage = 'post_cleaning_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
          <action-point [class.uk-invisible]="architectureImage == 'post_cleaning_hover.png'"></action-point>
        </a>
        <a class="uk-position-absolute uk-transform-center uk-padding" style="left: 75%; top: 35%;"
           (click)="changeTab(4)" routerLink="/about" fragment="tabs_card"
           (mouseenter)="architectureImage = 'indexing_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
          <action-point [class.uk-invisible]="architectureImage == 'indexing_hover.png'"></action-point>
        </a>
        <a class="uk-position-absolute uk-transform-center uk-padding" style="left: 75%; top: 72%"
           (click)="changeTab(5)" routerLink="/about" fragment="tabs_card"
           (mouseenter)="architectureImage = 'stats_analysis_hover.png'" (mouseleave)="architectureImage = 'gray.png'">
          <action-point [class.uk-invisible]="architectureImage == 'stats_analysis_hover.png'"></action-point>
        </a>
      </div>
      <div id="tabs_card"
           class="uk-margin-xlarge-top uk-padding-small">
        <div class="uk-card uk-card-default uk-card-body architecture-card">
          <ul #tabs uk-tab class="uk-tab">
            <li><a>Aggregation</a></li>
            <li><a>Deduplication</a></li>
            <li><a>Enrichment</a></li>
            <li><a>Post-Cleaning</a></li>
            <li><a>Indexing</a></li>
            <li><a>Stats Analysis</a></li>
          </ul>
          <ul class="uk-switcher uk-margin">
            <li>
              <div class=" uk-margin-large-top uk-text-small">
                <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                     src="assets/graph-assets/about/architecture/aggregation.png" alt="Aggregation">
                <div class="uk-margin-bottom uk-margin-medium-right uk-text-small lines-18"
                     [class.multi-line-ellipsis]="!aggregationReadMore">
                  <div>
                    OpenAIRE collects metadata records from a variety of content providers as described in
                    <a href="https://www.openaire.eu/aggregation-and-content-provision-workflows" target="_blank">https://www.openaire.eu/aggregation-and-content-provision-workflows</a>.
                    <br><br>
                    OpenAIRE aggregates metadata records describing objects of the research life-cycle from content
                    providers compliant with the
                    <a href="https://guidelines.openaire.eu" target="_blank">OpenAIRE guidelines</a>
                    and from entity registries (i.e. data sources offering authoritative lists of entities, like
                    OpenDOAR, re3data, DOAJ, and funder databases).
                    After collection, metadata are transformed according to the OpenAIRE internal metadata model, which
                    is used to generate the final OpenAIRE Research Graph that you can access from the OpenAIRE portal
                    and the APIs.
                    <br><br>
                    The transformation process includes the application of cleaning functions whose goal is to ensure
                    that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever
                    applicable, to a common controlled vocabulary.
                    The controlled vocabularies used for cleansing are accessible at
                    <a href="http://api.openaire.eu/vocabularies"
                       target="_blank">http://api.openaire.eu/vocabularies</a>.
                    Each vocabulary features a set of controlled terms, each with one code, one label, and a set of
                    synonyms.
                    If a synonym is found as a field value, the value is updated with the corresponding term.
                    Also, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources
                    that are too big to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges
                    Crossref, ORCID, Microsoft Academic Graph, and Unpaywall), and ScholeXplorer, one of the Scholix
                    hubs offering a large set of links between research literature and data.
                  </div>
                </div>
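<!--
Maintainer note: the cleaning step described in the Aggregation tab (synonym replacement against a controlled
vocabulary, plus date harmonisation to YYYY-MM-dd) could be sketched roughly as follows. This is only an
illustration under stated assumptions: the vocabulary entries, the accepted input date formats, and all function
names are hypothetical and are not taken from the OpenAIRE code base.

```python
from datetime import datetime

# Hypothetical vocabulary: each term has one code, one label and a set of synonyms.
VOCABULARY = [
    {"code": "0001", "label": "Article", "synonyms": {"journal article", "research article"}},
    {"code": "0021", "label": "Dataset", "synonyms": {"data set", "research data"}},
]

# Assumed input formats; anything else is left untouched.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")


def build_synonym_index(vocabulary):
    """Map every label and synonym (case-insensitive) to its controlled term."""
    index = {}
    for term in vocabulary:
        for name in {term["label"], *term["synonyms"]}:
            index[name.strip().lower()] = term
    return index


def clean_value(value, index):
    """Replace a field value with the controlled label when it is a known synonym."""
    term = index.get(value.strip().lower())
    return term["label"] if term else value


def harmonise_date(value):
    """Rewrite a date as YYYY-MM-dd when it matches one of the assumed formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value
```

Values that match neither a synonym nor a known date format are returned unchanged, so the sketch never discards
information it cannot interpret.
-->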
                <div *ngIf="!aggregationReadMore" class="uk-text-center clickable">
                  <a (click)="aggregationReadMore = true" class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
                </div>
                <div *ngIf="aggregationReadMore" class="uk-text-center clickable">
                  <a (click)="aggregationReadMore = false" routerLink="./" fragment="tabs_card" class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
                </div>
              </div>
            </li>
            <li>
              <div class="uk-margin-bottom uk-text-small">
                  <ul class="uk-subnav button-tab" uk-switcher>
                    <li><a>Clustering</a></li>
                    <li><a>Matching & Election</a></li>
                  </ul>
                  <ul class="uk-switcher uk-margin align-list">
                    <li>
                      <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                           src="assets/graph-assets/about/architecture/deduplication.svg" alt="Deduplication">
                      <div class="uk-margin-bottom uk-margin-medium-right uk-text-small lines-18"
                           [class.multi-line-ellipsis]="!dedupClusteringReadMore">
                        <div>
                          <div>
                            Clustering is a common heuristic used to avoid the N x N complexity of matching all
                            pairs of objects to identify the equivalent ones.
                            The challenge is to identify a clustering function that maximizes the chance of comparing
                            only records that may lead to a match, while minimizing the number of equivalent records
                            that are never compared.
                            Since the equivalence function is to some extent tolerant of minimal errors (e.g. swapped
                            characters in the title, or minimal differences in letters), we need this function to be
                            neither too precise (e.g. a hash of the title) nor too flexible (e.g. random ngrams of the
                            title).
                            On the other hand, experience tells us that in some cases the equality of two records can
                            only be determined by their PIDs (e.g. DOI), as the metadata properties are very different
                            across different versions and no clustering function will ever bring them into the same
                            cluster.
                            To meet these requirements, OpenAIRE clustering for products works with two functions:
                          </div>
                          <ul class="portal-circle">
                            <li>
                              <div>DOI: the function uses the DOI as the key when one is provided as part of the record
                                properties;
                              </div>
                            </li>
                            <li>
                              <div>
                                Title-based function: the function generates a key that depends on
                                (i) the number of significant words in the title (after normalization, stemming, etc.),
                                (ii) modulo 10 of the number of characters of such words, and
                                (iii) a string obtained by alternating the functions prefix(3) and suffix(3) (and
                                vice versa) over the first 3 words (2 words if the title only has 2). For example, the
                                title “Entity deduplication in big data graphs for scholarly communication” becomes
                                “entity deduplication big data graphs scholarly communication” with two keys,
                                “7.1entionbig” and “7.1itydedbig”
                                (where 1 is modulo 10 of 54 characters of the normalized title).
                              </div>
                            </li>
                          </ul>
                          <div>
                            To give an idea, this configuration generates around 77 million blocks, which we limited to
                            200 records each (only 15K blocks are affected by the cut), and entails 260 billion matches.
                            Matches in a block are performed using a “sliding window” set to 80 records. The records are
                            sorted lexicographically on a normalized version of their titles. The first record is
                            matched against the 80 following ones, then the second, and so on, for an NlogN complexity.
                          </div>
                        </div>
                      </div>
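<!--
Maintainer note: the "sliding window" comparison described in the Clustering tab could be sketched as below.
This is an illustration, not the production code: the record shape (dicts with "id" and "title"), the
normalization rule, and the choice to truncate an oversized block after sorting are all assumptions.

```python
def normalize_title(title):
    # Assumption: lower-casing and dropping punctuation stands in for the real normalization.
    kept = "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def sliding_window_pairs(records, window=80, max_block_size=200):
    """Yield the ID pairs to compare inside one block.

    Records are sorted on their normalized title, the block is cut to
    max_block_size records, and each record is paired with the `window`
    records that follow it in the sorted order.
    """
    block = sorted(records, key=lambda r: normalize_title(r["title"]))[:max_block_size]
    for i, left in enumerate(block):
        for right in block[i + 1 : i + 1 + window]:
            yield left["id"], right["id"]
```

With a block of N records this produces at most N * window comparisons after an N log N sort, instead of the
N * (N - 1) / 2 comparisons of an all-pairs match.
-->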
                      <div *ngIf="!dedupClusteringReadMore" class="uk-text-center clickable">
                        <a (click)="dedupClusteringReadMore = true" class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
                      </div>
                      <div *ngIf="dedupClusteringReadMore" class="uk-text-center clickable">
                        <a (click)="dedupClusteringReadMore = false;" routerLink="./" fragment="tabs_card" class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
                      </div>
                    </li>
                    <li>
                      <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                           src="assets/graph-assets/about/architecture/deduplication.svg" alt="Deduplication">
                      <div class="uk-margin-bottom uk-margin-medium-right uk-text-small lines-18"
                           [class.multi-line-ellipsis]="!dedupMatchingAndElectionReadMore">
                        <div>
                          <div>
                            Once the clusters have been built, the algorithm proceeds with the comparisons.
                            Comparisons are driven by a decision tree that:
                          </div>
                          <ul class="uk-list">
                            <li class="uk-margin-small-bottom">
                              <div>
                                <span class="portal-color">1.</span> Tries to capture equivalence via PIDs: if records
                                share a PID, then they are equivalent;
                              </div>
                            </li>
                            <li class="uk-margin-small-bottom">
                              <div>
                                <span class="portal-color">2.</span> Tries to capture difference:
                              </div>
                              <ul class="uk-list">
                                <li class="uk-margin-small-bottom">
                                  <div>
                                    <span class="portal-color">a.</span>
                                    If record titles contain different “numbers”, then they are different (this rule is
                                    debatable and should be fine-tuned);
                                  </div>
                                </li>
                                <li class="uk-margin-small-bottom">
                                  <div>
                                    <span class="portal-color">b.</span>
                                    If records contain a different number of authors, then they are different;
                                  </div>
                                </li>
                                <li class="uk-margin-small-bottom">
                                  <div>
                                    <span class="portal-color">c.</span>
                                    Note that different PIDs do not imply different records, as different versions may
                                    have different PIDs.
                                  </div>
                                </li>
                              </ul>
                            </li>
                            <li>
                              <div><span class="portal-color">3.</span> Measures equivalence:</div>
                              <ul class="uk-list portal-circle">
                                <li>
                                  <div>
                                    The titles of the two records are normalised and compared for similarity by applying
                                    the Levenshtein distance algorithm.
                                    The algorithm returns a number in the range [0,1], where 0 means “very different”
                                    and 1 means “equal”.
                                    If the similarity score is greater than or equal to 0.99, the two records are
                                    identified as duplicates.
                                  </div>
                                </li>
                                <li>
                                  <div>Dates are not considered for equivalence matching because different versions of the
                                    same record should be merged and may be published on different dates, e.g. the
                                    pre-print and the published version of an article.
                                  </div>
                                </li>
                              </ul>
                            </li>
                          </ul>
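<!--
Maintainer note: the decision tree described in the Matching & Election tab could be sketched as follows. The record
shape (dicts with "title", "authors" and "pids"), the normalization rule, and the way the edit distance is
scaled into [0,1] (dividing by the longer title) are assumptions made for illustration only.

```python
import re


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a, b):
    # Assumption: 1.0 means "equal", 0.0 means "very different".
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def normalize(title):
    # Assumption: lower-case, drop punctuation, collapse whitespace.
    return " ".join(re.sub(r"[^a-z0-9 ]", "", title.lower()).split())


def are_duplicates(r1, r2, threshold=0.99):
    # 1. A shared PID means the records are equivalent.
    if set(r1["pids"]) & set(r2["pids"]):
        return True
    # 2a. Different numbers in the titles mean the records are different.
    if re.findall(r"\d+", r1["title"]) != re.findall(r"\d+", r2["title"]):
        return False
    # 2b. A different number of authors means the records are different.
    if len(r1["authors"]) != len(r2["authors"]):
        return False
    # 3. Otherwise, compare the normalized titles against the threshold.
    return similarity(normalize(r1["title"]), normalize(r2["title"])) >= threshold
```

Differing PIDs are deliberately not treated as evidence of difference, in line with rule 2c.
-->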
                          <div>
                            Once the equivalence relationships between pairs of records are set, the groups of
                            equivalent records are obtained (transitive closure, i.e. “mesh”).
                            From each such group a new representative object is obtained, which inherits all properties
                            from the merged records and keeps track of their provenance.
                            The ID of the representative record is obtained by prepending the prefix “dedup_” to the
                            MD5 of the first ID (given their lexicographical ordering).
                            A new, more stable function to generate the ID is under development, which exploits the DOI
                            when one of the records to be merged includes a Crossref or a DataCite record.
                          </div>
                        </div>
                      </div>
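<!--
Maintainer note: the grouping and election step described above could be sketched as follows. The transitive
closure is computed here with a union-find structure, which is one possible technique and not necessarily the one
used in production; the function names and the UTF-8 encoding of the ID before hashing are assumptions.

```python
import hashlib


def group_equivalent(record_ids, equivalent_pairs):
    """Return the groups (size > 1) induced by the transitive closure of the pairs."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in equivalent_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for rid in record_ids:
        groups.setdefault(find(rid), []).append(rid)
    return [sorted(members) for members in groups.values() if len(members) > 1]


def representative_id(member_ids):
    """Prefix "dedup_" to the MD5 of the lexicographically first member ID."""
    first = min(member_ids)
    return "dedup_" + hashlib.md5(first.encode("utf-8")).hexdigest()
```

Because the ID depends on whichever member sorts first, adding or removing a record can change it, which is
consistent with the remark that a more stable function is under development.
-->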
                      <div *ngIf="!dedupMatchingAndElectionReadMore" class="uk-text-center clickable">
                        <a (click)="dedupMatchingAndElectionReadMore = true" class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
                      </div>
                      <div *ngIf="dedupMatchingAndElectionReadMore" class="uk-text-center clickable">
                        <a (click)="dedupMatchingAndElectionReadMore = false" routerLink="./" fragment="tabs_card" class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
                      </div>
                    </li>
                  </ul>
                </div>
            </li>
            <li>
              <div class="uk-margin-bottom uk-text-small">
                  <ul class="uk-subnav button-tab uk-grid uk-grid-small" uk-switcher>
                    <li><a>Mining</a></li>
                    <li><a>Bulk tagging/ Deduction</a></li>
                    <li><a>Propagation</a></li>
                  </ul>
                  <ul class="uk-switcher uk-margin">
                    <li>
                      <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                           src="assets/graph-assets/about/architecture/enrichment.svg" alt="Enrichment">
                      <div class="uk-margin-bottom uk-margin-medium-right uk-text-small lines-18"
                           [class.multi-line-ellipsis]="!enrichmentMiningReadMore">
                        <div>
                          The OpenAIRE Research Graph is enriched by links mined by OpenAIRE’s full-text mining
                          algorithms, which scan the plain texts of publications for funding information, references to
                          datasets, software URIs, accession numbers of bio-entities, and EPO patent mentions.
                          Custom mining modules also link research objects to specific research communities, initiatives
                          and infrastructures.
                          In addition, other inference modules provide content-based document classification, document
                          similarity, citation matching, and author affiliation matching.
                          <br><br>
                          <span class="portal-color">Project mining</span>
                          in OpenAIRE text-mines the full-texts of publications in order to extract matches to funding
                          project codes/IDs.
                          The mining algorithm works by utilising
                          (i) the grant identifier, and
                          (ii) the project acronym (if available) of each project.
                          The mining algorithm:
                          (1) preprocesses/normalizes the full-texts using several functions, which depend on the
                          characteristics of each funder (i.e., the format of the grant identifiers), such as stopword
                          and/or punctuation removal, tokenization, stemming, and conversion to lowercase; then
                          (2) matches grant identifiers against the normalized text using database
                          techniques; and
                          (3) validates and cleans the results by looking at the
                          context around the matched ID for relevant metadata and positive or negative words/phrases, in
                          order to calculate a confidence value for each publication-to-project link.
                          A confidence threshold is set to maximise accuracy while minimising false positives, such
                          as matches with page or report numbers, post/zip codes, parts of telephone numbers, DOIs or
                          URLs, and accession numbers.
                          The algorithm also applies rules for disambiguating results, as different funders can share
                          identical project IDs; for example, grant number 633172 could refer to the H2020 project
                          EuroMix but also to the Australian-funded NHMRC project “Brain activity (EEG) analysis and
                          brain imaging techniques to measure the neurobiological effects of sleep apnea”.
                          Project mining works very well and was the first Text & Data Mining (TDM) service of OpenAIRE.
                          Performance results vary from funder to funder, but precision is higher than 98% for all
                          funders and 99.5% for EC projects.
                          Recall is higher than 95% (99% for EC projects) when projects are properly acknowledged using
                          project/grant IDs.
                          <br><br>
                          <span class="portal-color">Dataset extraction</span>
                          runs on publication full-texts as described in “High-Pass Text Filtering for Citation
                          Matching”, TPDL 2017 [1].
                          In particular, we search for citations to datasets using their DOIs, titles and other metadata
                          (i.e., dates, creator names, publishers, etc.).
                          We extract parts of the text which look like citations and search for datasets using database
                          join and pattern matching techniques.
                          Based on the experiments described in the paper, the precision of the dataset extraction
                          module is 98.5% and the recall is 97.4%, but recall is probably overestimated since it does
                          not take into account corruptions that may take place during PDF-to-text extraction.
                          These figures are calculated on the extracted full-texts of small samples from PubMed and
                          arXiv.
                          <br><br>
                          <span class="portal-color">Software extraction</span>
                          also runs on parts of the text which look like citations.
                          We search the citations for links to software in open software repositories, specifically
                          GitHub, SourceForge, Bitbucket and the Google Code archive.
                          After that, we search for links that are included in Software Heritage (SH,
                          https://www.softwareheritage.org) and return the permanent URL that SH provides for each
                          software project.
                          We also enrich this content with user names, titles and descriptions of the software projects
                          using web mining techniques.
                          Since software mining is based on URL matching, our precision is 100% (we return a software
                          link only if we find it in the text and there is no need to disambiguate).
                          Recall cannot be calculated for this mining task.
                          Although we apply all the necessary normalizations to the URLs in order to overcome common
                          issues (e.g., http or https, presence or absence of www, lower/upper case), we do not detect
                          cases where a piece of software is mentioned by name rather than by a link to one of the
                          supported software repositories.
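<!--
Maintainer note: the URL normalization mentioned above (http vs https, www or not, lower/upper case) could be
sketched as follows. The host names in SUPPORTED_HOSTS, the URL regular expression and the function names are
assumptions for illustration; the Software Heritage lookup and the web-mining enrichment are not covered.

```python
import re
from urllib.parse import urlsplit

# Assumed host names for the repositories named in the text.
SUPPORTED_HOSTS = {"github.com", "sourceforge.net", "bitbucket.org", "code.google.com"}

# Assumed pattern: an http(s) URL up to the next whitespace, quote or closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def normalize_url(url):
    """Drop the scheme, a leading "www.", letter case, and trailing punctuation or slashes."""
    parts = urlsplit(url.strip().rstrip(".,;"))
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.lower().rstrip("/")


def extract_software_links(citation_text):
    """Return the normalized links that point into one of the supported repositories."""
    links = set()
    for match in URL_PATTERN.findall(citation_text):
        normalized = normalize_url(match)
        host = normalized.split("/", 1)[0]
        if host in SUPPORTED_HOSTS and "/" in normalized:
            links.add(normalized)
    return sorted(links)
```

Two spellings of the same repository URL collapse to one normalized link, while software mentioned only by name
yields nothing, which mirrors the recall limitation described in the text.
-->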
                          <br><br>
                          <span class="portal-color">For the extraction of bio-entities</span>, we focus on Protein Data
                          Bank (PDB) entries.
                          We have downloaded the database of PDB codes and we update it regularly.
                          We search through the whole full-text of each publication for references to PDB codes.
                          We apply disambiguation rules (e.g., to handle PDB codes that coincide with antibody codes,
                          among other issues) so that we return valid results.
                          Current precision is 98%.
                          Although it is risky to mention recall rates since these are usually overestimated, we have
                          calculated a recall rate of 98% using small samples of PubMed publications.
                          Moreover, our technique is able to identify about 30% more links to proteins than the ones
                          that are tagged in PubMed XMLs.
                          <br><br>
445
                          <span class="portal-color">Other text-mining modules</span> include mining for links to EPO
                          patents, and custom mining modules for linking research objects to specific research
                          communities, initiatives and infrastructures, e.g. the COVID-19 mining module.
                          Apart from text-mining modules, OpenAIRE also provides a document classification service that
                          analyses the free text of publication abstracts.
                          The purpose of the document classification module is to assign one or more predefined
                          content classes to a scientific text.
                          In OpenAIRE, the currently used taxonomies are arXiv, MeSH (Medical Subject Headings), ACM and
                          DDC (Dewey Decimal Classification, or Dewey Decimal System).
                          <br><br>
                          <hr>
                          [1] Foufoulas, Y., Stamatogiannakis, L., Dimitropoulos, H., & Ioannidis, Y. (2017, September).
                          High-Pass Text Filtering for Citation Matching.
                          In International Conference on Theory and Practice of Digital Libraries (pp. 355-366).
                          Springer, Cham.
                        </div>
                      </div>
                      <div *ngIf="!enrichmentMiningReadMore" class="uk-text-center clickable">
                        <a (click)="enrichmentMiningReadMore = true" class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
                      </div>
                      <div *ngIf="enrichmentMiningReadMore" class="uk-text-center clickable">
                        <a (click)="enrichmentMiningReadMore = false" routerLink="./" fragment="tabs_card" class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
                      </div>
                    </li>
                    <li>
                      <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                           src="assets/graph-assets/about/architecture/enrichment.svg" alt="Enrichment">
                      <div class="uk-margin-bottom uk-margin-medium-right uk-text-small">
                        The Deduction process (also known as “bulk tagging”) enriches each record with new information
                        that can be derived from the existing property values.
                        <br><br>
                        As of September 2020, three procedures are in place to relate a research product to a research
                        initiative, infrastructure (RI) or community (RC) based on:
                        <ul class="portal-circle">
                          <li>subjects (2.7M results tagged)</li>
                          <li>Zenodo community (16K results tagged)</li>
                          <li>the data source it comes from (250K results tagged)</li>
                        </ul>
                        The lists of subjects, Zenodo communities and data sources used to enrich the products are
                        defined by the managers of the community gateway or infrastructure monitoring dashboard
                        associated with the RC/RI.
                      </div>
                    </li>
                    <li>
                      <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                           src="assets/graph-assets/about/architecture/enrichment.svg" alt="Enrichment">
                      <div class="uk-margin-bottom uk-margin-medium-right uk-text-small lines-18"
                           [class.multi-line-ellipsis]="!enrichmentPropagationReadMore">
                          <div>
                            This process “propagates” properties and links from one product to another if there is a
                            “strong” semantic relationship between the two.
                            <br><br>
                            As of September 2020, the following procedures are in place:
                            <ul class="portal-circle">
                              <li>
                                Propagation of the property “country” to results from institutional repositories:
                                e.g. a publication collected from an institutional repository maintained by an Italian
                                university will be enriched with the property “country = IT”.
                              </li>
                              <li>
                                Propagation of links to projects: e.g. a publication linked to project P “is
                                supplemented by” a dataset D.
                                Dataset D will get the link to project P.
                                The relationships considered for this procedure are “isSupplementedBy” and
                                “supplements”.
                              </li>
                              <li>
                                Propagation of related community/infrastructure/initiative from organizations to
                                products via affiliation relationships: e.g. a publication has an author affiliated
                                with organization O.
                                The manager of the community gateway C declared that the outputs of O are all relevant
                                for their community C.
                                The publication is tagged as relevant for C.
                              </li>
                              <li>
                                Propagation of related community/infrastructure/initiative to related products: e.g.
                                a publication associated with community C is supplemented by a dataset D.
                                Dataset D will get the association to C.
                                The relationships considered for this procedure are “isSupplementedBy” and
                                “supplements”.
                              </li>
                              <li>
                                Propagation of ORCID identifiers to related products, if the products have the same
                                authors: e.g. a publication has ORCIDs for its authors and is supplemented by a
                                dataset D. Dataset D has the same authors as the publication. Authors of D are
                                enriched with the ORCIDs available in the publication.
                                The relationships considered for this procedure are “isSupplementedBy” and
                                “supplements”.
                              </li>
                            </ul>
                          </div>
                      </div>
                      <div *ngIf="!enrichmentPropagationReadMore" class="uk-text-center clickable">
                        <a (click)="enrichmentPropagationReadMore = true" class="custom-explore-toggle">Read more<span uk-icon="chevron-down"></span></a>
                      </div>
                      <div *ngIf="enrichmentPropagationReadMore" class="uk-text-center clickable">
                        <a (click)="enrichmentPropagationReadMore = false" routerLink="./" fragment="tabs_card" class="custom-explore-toggle">Read less<span uk-icon="chevron-up"></span></a>
                      </div>
                    </li>
                  </ul>
                </div>
            </li>
            <li>
              <div class="uk-text-small uk-margin-large-top">
                <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                     src="assets/graph-assets/about/architecture/post_cleaning.svg" alt="Post Cleaning">
                <div class="uk-margin-bottom uk-margin-medium-right">
                  <p>
                    The aggregation processes run continuously and apply vocabularies as they are at a given
                    moment in time.
                    A vocabulary may change after the aggregation of a data source has finished, so the
                    aggregated content may no longer reflect the current status of the controlled vocabularies.
                    <br><br>
                    In addition, the integration of ScholeXplorer and DOIBoost and some enrichment processes applied to
                    the raw and the de-duplicated graph may introduce values that do not comply with the current
                    status of the OpenAIRE controlled vocabularies.
                    For these reasons, we included a final cleansing step at the end of the workflow materialisation.
                    The output of the final cleansing step is the final version of the OpenAIRE Research Graph.
                  </p>
                </div>
              </div>
            </li>
            <li>
              <div class="uk-text-small uk-margin-large-top">
                <img class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image"
                     src="assets/graph-assets/about/architecture/indexing.svg" alt="Indexing">
                <div class="uk-margin-bottom uk-margin-medium-right">
                  <p>
                    The final version of the OpenAIRE Research Graph is indexed on a Solr server that is used by the
                    OpenAIRE portals (EXPLORE, CONNECT, PROVIDE) and APIs, the latter adopted by several third-party
                    applications and organizations, such as:
                  </p>
                  <ul class="portal-circle">
                    <li class="uk-margin-small-bottom">
                      <span class="portal-color">EOSC</span>:
                      The OpenAIRE Research Graph APIs and Portals will offer the EOSC an Open Science Resource
                      Catalogue, keeping an up-to-date map of all research results (publications, datasets, software),
                      services, organizations, projects and funders in Europe and beyond.
                    </li>
                    <li class="uk-margin-small-bottom">
                      <span class="portal-color">DSpace & EPrints</span>
                      repositories can install the OpenAIRE plugin to expose OpenAIRE-compliant metadata records via
                      their OAI-PMH endpoint and offer researchers the possibility to link their deposits to the
                      funding project, by selecting it from the list of projects provided by OpenAIRE.
                    </li>
                    <li>
                      <span class="portal-color">EC participant portal (Sygma - System for Grant Management)</span>
                      uses the OpenAIRE API in the “Continuous Reporting” section.
                      Sygma automatically fetches from the OpenAIRE Search API the list of publications and datasets in
                      the OpenAIRE Research Graph that are linked to the project.
                      The user can select the research products from the list and easily compile the continuous
                      reporting data of the project.
                    </li>
                  </ul>
                </div>
              </div>
            </li>
            <li>
              <div class="uk-text-small uk-margin-large-top">
                <img
                    class="uk-width-2-5@m uk-align-right@m uk-margin-remove-adjacent tab-image uk-padding-large uk-padding-remove-top uk-padding-remove-horizontal"
                    src="assets/graph-assets/about/architecture/stats_analysis.svg" alt="Stats Analysis">
                <div class="uk-margin-bottom uk-margin-medium-right">
                  <p>
                    The OpenAIRE Research Graph is also processed by a pipeline that extracts the statistics and
                    produces the charts for funders, research initiatives, infrastructures, and policy makers that
                    you can see on MONITOR.
                    Based on the information available in the graph, OpenAIRE provides a set of indicators for
                    monitoring funding and research impact and the uptake of Open Science publishing practices,
                    such as Open Access publishing of publications and datasets, availability of interlinks between
                    research products, availability of post-print versions in institutional or thematic Open Access
                    repositories, etc.
                  </p>
                </div>
              </div>
            </li>
          </ul>
        </div>
      </div>
      <div class="uk-padding-small uk-margin-top">
        <h6>References</h6>
        <ul class="uk-text-small portal-circle">
          <li>
            <a href="https://aka.ms/msracad" target="_blank">Microsoft Academic Graph</a>
            which is made available under the ODC Attribution License.<br>
            For more information on Microsoft Academic Graph please also read
            <a href="https://docs.microsoft.com/en-us/academic-services/graph/resources-faq" target="_blank">here</a>.
          </li>
          <li>
            <a href="https://www.openaire.eu/aggregation-and-content-provision-workflows" target="_blank">https://www.openaire.eu/aggregation-and-content-provision-workflows</a>
          </li>
        </ul>
        <a class="portal-link uk-icon-link uk-text-small uk-text-bold uk-text-uppercase" routerLink="/resources/references">
          See all references <icon name="arrow_right" class="uk-margin-small-left"></icon>
        </a>
      </div>
    </div>
  </div>
  <div id="metrics" class="uk-container uk-container-large uk-section">
    <div class="uk-padding-small">
      <h2 class="uk-text-center uk-margin-medium-bottom">Data & Metrics</h2>
      <numbers colorClass="uk-text-secondary"></numbers>
    </div>
  </div>
  <div id="infrastructure" class="uk-container uk-section">
    <div class="uk-padding-small">
      <h2 class="uk-text-center">Infrastructure</h2>
      <div>
        <div class="uk-flex uk-flex-center uk-grid uk-grid-stack">
          <p class="uk-width-4-5@m uk-padding-small">
            The OpenAIRE Research Graph is operated and maintained at the <a
              href="https://icm.edu.pl/en/centre-of-technology/" target="_blank">ICM cutting-edge Technology centre</a>
            with facilities and staff that guarantee robust operation of the whole system.
            The Okeanos supercomputer hosting the graph has 26016 cores in total, providing 1082 Tflop/s.
            The whole setup is energy efficient, with a power efficiency of 1.554 Gflops/Watt, resulting in 160th
            place on the "Top500 by energy-efficiency" list (as of 2019).
          </p>
          <img class="infrastructure-image uk-margin-top uk-margin-bottom"
               src="assets/graph-assets/about/infrastructure.png">
          <p class="uk-width-4-5@m uk-padding-small">
            ICM supports the continuous operation of the infrastructure, including data aggregation, deduplication,
            inference and provision, ensuring seamless 24/7 system uptime and availability.
            System administration activities cover hardware maintenance, provisioning of new computational
            resources, High Availability solutions that provide resilience to failures through service-level
            redundancy, and Load Balancing to distribute workloads uniformly across servers.
            The most crucial parts of the persisted graph are covered by backups, along with well-defined restore
            procedures.
            All monitoring activities rely on aggregated system-level monitoring, accessible via various
            dashboards that give a better overview of system stability and of potential needs to extend system
            elements.
            System-level monitoring is supplemented by availability monitoring of all the publicly accessible
            endpoints.
            Hence, the public OpenAIRE API is offered to third parties at a high standard.
          </p>
          <p class="uk-width-4-5@m uk-padding-small">
            All maintenance operations are undertaken by experienced system administrators and are founded on
            well-established routines and emergency maintenance procedures.
          </p>
        </div>
      </div>
    </div>
  </div>
  <div id="team" class="uk-container uk-container-large uk-section">
    <div class="uk-padding-small">
      <h2 class="uk-text-center">Team</h2>
      <div>
        <div class="uk-margin-bottom">
          <img class="uk-align-center uk-align-left@m uk-margin-remove-adjacent"
               src="assets/graph-assets/about/team.svg" alt="Team">
          <div class="uk-text-center uk-width-1-2@m uk-align-center uk-margin-remove-adjacent">
            <div class="uk-margin-medium-bottom">
              Key team members contributing to the Research Graph
            </div>
            <div><a class="uk-button portal-button" routerLink="./team">
                Meet the team
                <icon name="arrow_right" ratio="0.8" class="space"></icon>
              </a>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</div>