OpenAIRE Research Graph » History » Version 8
Alessia Bardi, 05/11/2021 02:02 PM
Format for the clustering function section
1 | 1 | Alessia Bardi | h1. The OpenAIRE Research Graph |
---|---|---|---|
2 | |||
3 | The OpenAIRE Research Graph is one of the largest open scholarly record collections worldwide, key to fostering Open Science and establishing its practices in daily research activities. |
||
4 | Conceived as a public and transparent good, populated out of data sources trusted by scientists, the Graph aims at bringing discovery, monitoring, and assessment of science back into the hands of the scientific community. |
||
5 | |||
6 | Imagine a vast collection of research products, all linked together, contextualised and openly available. For the past ten years OpenAIRE has been working to gather this valuable record. It is a massive collection of metadata and links between scientific products (articles, datasets, software, and other research products) and entities such as organisations, funders, funding streams, projects, communities, and data sources. |
||
7 | |||
8 | As of today, the OpenAIRE Research Graph aggregates around 450 million metadata records, with links, collected from 10K data sources trusted by scientists, including: |
||
9 | * Repositories registered in OpenDOAR or re3data.org |
||
10 | * Open Access journals registered in DOAJ |
||
11 | * Crossref |
||
12 | * Unpaywall |
||
13 | * ORCID |
||
14 | * Microsoft Academic Graph |
||
15 | * DataCite |
||
16 | |||
17 | After cleaning, deduplication, enrichment and full-text mining processes, the graph is analysed to produce statistics for OpenAIRE MONITOR (https://monitor.openaire.eu) and the Open Science Observatory (https://osobservatory.openaire.eu), made discoverable via OpenAIRE EXPLORE (https://explore.openaire.eu), and made programmatically accessible as described at https://develop.openaire.eu. |
||
18 | 3 | Alessia Bardi | JSON dumps are also published on Zenodo. |
19 | 1 | Alessia Bardi | |
20 | 3 | Alessia Bardi | TODO: image of high-level data model (entities and semantic relationships, we can draw here: https://docs.google.com/drawings/d/1c4s7Pk2r9NgV_KXkmX6mwCKBIQ-yK3_m6xsxB-3Km1s/edit) |
21 | |||
22 | 1 | Alessia Bardi | h2. Graph Data Dumps |
23 | |||
24 | 2 | Alessia Bardi | To serve different user needs, several dumps are available, all under the "Zenodo community called OpenAIRE Research Graph":https://zenodo.org/communities/openaire-research-graph. |
25 | Here we provide detailed documentation about the full dump: |
||
26 | |||
27 | * JSON dump: https://doi.org/10.5281/zenodo.3516917 |
||
28 | * JSON schema: https://doi.org/10.5281/zenodo.4238938 |
||
29 | |||
30 | [[Json schema]] |
||
31 | [[FAQ]] |
||
32 | 1 | Alessia Bardi | |
33 | h2. Graph provision processes |
||
34 | |||
35 | 3 | Alessia Bardi | h3. Deduplication business logic |
36 | |||
37 | h4. Deduplication business logic for research results |
||
38 | |||
39 | Metadata records about the same scholarly work can be collected from different providers. Each metadata record can possibly carry different information because, for example, some providers are not aware of links to projects, keywords or other details. Another common case is when OpenAIRE collects one metadata record from a repository about a pre-print and another record from a journal about the published article. For the provision of statistics, OpenAIRE must identify those cases and “merge” the two metadata records, so that the scholarly work is counted only once in the statistics OpenAIRE produces. |
||
40 | |||
41 | Duplicates among research results are identified only among results of the same type (publications, datasets, software, other research products). If two duplicate results are aggregated one as a dataset and the other as software, for example, they will never be compared and therefore never identified as duplicates. |
||
42 | OpenAIRE supports different deduplication strategies based on the type of results. |
||
43 | |||
44 | *Methodology overview* |
||
45 | |||
46 | The deduplication process can be divided into two different phases: |
||
47 | * Candidate identification (clustering) |
||
48 | 6 | Alessia Bardi | * Decision tree |
49 | 3 | Alessia Bardi | |
50 | The implementation of each phase differs depending on the type of results being processed. |
||
51 | |||
52 | |||
53 | *Strategy for publications* |
||
54 | |||
55 | 1 | Alessia Bardi | _Candidate identification (clustering)_ |
56 | 6 | Alessia Bardi | |
57 | 1 | Alessia Bardi | Due to the high number of metadata records collected by OpenAIRE, it would not be feasible to compute all possible comparisons between all metadata records. |
58 | 6 | Alessia Bardi | The goal of this phase is to limit the number of comparisons by creating groups (or clusters) of records that are likely “similar”. Every record can be added to more than one group. |
59 | The decision of inclusion in a group is performed by 2 clustering functions: |
||
60 | * Lowercase: doi (in pid list and alternate identifiers list) |
||
61 | * WordsStatsSuffixPrefixChain: suffixprefix with statistics on the full title (number_of_words & number_of_letters%10) |
||
62 | Example: |
||
63 | If the title is: “Search for the Standard Model Higgs Boson” |
||
64 | The clustering function produces 2 keys (i.e. adds the publication to two clusters): [5-3-seaardmod, 5-3-rchstadel] |
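The grouping step described above can be illustrated with a minimal Python sketch. It is not the OpenAIRE implementation: the record layout (@id@, @dois@) and the helper names @lowercase_doi_keys@ and @candidate_pairs@ are assumptions made for this example, and only a simplified stand-in for the Lowercase function is shown.

```python
from collections import defaultdict
from itertools import combinations


def lowercase_doi_keys(record):
    """Hypothetical stand-in for the Lowercase function: one key per DOI, lowercased."""
    return {doi.lower() for doi in record.get("dois", [])}


def candidate_pairs(records, clustering_functions):
    """Group records by clustering key and return the pairs sharing at least one key."""
    clusters = defaultdict(set)
    for record in records:
        for function in clustering_functions:
            for key in function(record):
                # A record may land in several clusters, one per key.
                clusters[key].add(record["id"])

    pairs = set()
    for ids in clusters.values():
        # Only records inside the same cluster are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up records, for illustration only.
records = [
    {"id": "r1", "dois": ["10.1000/ABC"]},
    {"id": "r2", "dois": ["10.1000/abc"]},
    {"id": "r3", "dois": ["10.1000/xyz"]},
]
pairs = candidate_pairs(records, [lowercase_doi_keys])
print(pairs)
```

With three records, an exhaustive comparison would need three pairs; here only @r1@ and @r2@ share a key, so only that pair goes on to the decision tree.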
||
65 | 3 | Alessia Bardi | |
66 | |||
67 | 6 | Alessia Bardi | _Decision tree_ |
68 | |||
69 | For each pair of publications in a cluster the following strategy (depicted in the figure below) is applied. |
||
70 | The pid lists (in the @pid@ and @alternateid@ elements) are cross-compared. If at least 50% of the pids are in common, the Levenshtein distance on titles is applied with a low threshold (0.9). |
||
71 | Otherwise, the number of authors and the title version are checked for equality. If both are equal, the Levenshtein distance on titles is applied with a higher threshold (0.99). |
||
72 | The publications are matched as duplicates if the distance is higher than the threshold; in every other case they are considered distinct publications. |
||
73 | |||
74 | 3 | Alessia Bardi | !dedup-results.png! |
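The decision tree for publications can be sketched in Python as follows. This is only an illustration under stated assumptions, not the OpenAIRE code: the page does not say how the pid overlap is measured, how the Levenshtein distance is normalised, or what exactly a "title version" is. Here the overlap is taken relative to the shorter pid list, the title score is @1 - distance / longest_title_length@ (so that "higher than the threshold" means "more similar"), and the title version is approximated by the digits found in the title. All function and field names are hypothetical.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def title_similarity(t1, t2):
    """Assumed normalisation: 1.0 for identical titles, lower for more edits."""
    t1, t2 = t1.lower().strip(), t2.lower().strip()
    longest = max(len(t1), len(t2))
    return 1.0 if longest == 0 else 1.0 - levenshtein(t1, t2) / longest


def pid_overlap(p1, p2):
    """Assumed measure: shared pids divided by the size of the smaller pid list."""
    s1 = {p.lower() for p in p1}
    s2 = {p.lower() for p in p2}
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / min(len(s1), len(s2))


def title_version(title):
    """Assumed notion of 'title version': the digit groups that appear in the title."""
    return re.findall(r"\d+", title)


def are_duplicates(a, b):
    """Apply the two branches of the decision tree to a pair of publications."""
    if pid_overlap(a["pids"], b["pids"]) >= 0.5:
        return title_similarity(a["title"], b["title"]) > 0.9
    same_authors = len(a["authors"]) == len(b["authors"])
    same_version = title_version(a["title"]) == title_version(b["title"])
    if same_authors and same_version:
        return title_similarity(a["title"], b["title"]) > 0.99
    return False


# Made-up records, for illustration only.
preprint = {"pids": ["10.1000/abc"], "authors": ["A", "B"],
            "title": "Search for the Standard Model Higgs Boson"}
article = {"pids": ["10.1000/ABC", "pmid:1"], "authors": ["A", "B", "C"],
           "title": "Search for the Standard Model Higgs boson."}
part_two = {"pids": [], "authors": ["A", "B"],
            "title": "Search for the Standard Model Higgs Boson, part 2"}

print(are_duplicates(preprint, article))
print(are_duplicates(preprint, part_two))
```

In this sketch the pre-print and the article share a pid, so the looser 0.9 threshold applies and the trailing full stop in the title does not prevent a match. The third record shares no pid and has a different title version, so it is kept distinct.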
75 | |||
76 | *Strategy for datasets* |
||
77 | |||
78 | *Strategy for software* |
||
79 | |||
80 | *Strategy for other types of research products* |
||
81 | |||
82 | *Clustering functions* |
||
83 | |||
84 | 8 | Alessia Bardi | _NgramPairs_ |
85 | 3 | Alessia Bardi | It produces a list of concatenations of a pair of ngrams generated from different words. |
86 | Example: |
||
87 | Input string: “Search for the Standard Model Higgs Boson” |
||
88 | Parameters: ngram length = 3 |
||
89 | List of ngrams: “sea”, “sta”, “mod”, “hig” |
||
90 | 1 | Alessia Bardi | Ngram pairs: “seasta”, “stamod”, “modhig” |
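A minimal Python sketch that reproduces the example above. The filtering rules are assumptions inferred from the example rather than documented behaviour: words shorter than four characters ("for", "the") are skipped, and at most four ngrams are kept (which is why "Boson" contributes nothing). The parameter names @max_ngrams@ and @min_word_len@ are hypothetical.

```python
def ngram_pairs(text, ngram_len=3, max_ngrams=4, min_word_len=4):
    """Concatenate the leading ngrams of consecutive words, two at a time."""
    # Assumption: short words are ignored, as "for" and "the" are in the example.
    words = [w for w in text.lower().split() if len(w) >= min_word_len]
    # Assumption: only the first max_ngrams ngrams are used.
    ngrams = [w[:ngram_len] for w in words][:max_ngrams]
    return [first + second for first, second in zip(ngrams, ngrams[1:])]


keys = ngram_pairs("Search for the Standard Model Higgs Boson")
print(keys)  # ['seasta', 'stamod', 'modhig']
```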
91 | 8 | Alessia Bardi | |
92 | _SuffixPrefix_ |
||
93 | 3 | Alessia Bardi | It produces ngram pairs in a particular way: it concatenates the suffix of a word with the prefix of the next word in the input string. |
94 | Example: |
||
95 | Input string: “Search for the Standard Model Higgs Boson” |
||
96 | Parameters: suffix and prefix length = 3 |
||
97 | Output list: “ardmod” (suffix of the word “Standard” + prefix of the word “Model”), “rchsta” (suffix of the word “Search” + prefix of the word “Standard”) |
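A minimal Python sketch of the same idea. As an assumption, words shorter than four characters are skipped, which is consistent with “rchsta” joining “Search” directly to “Standard”. The sketch returns a key for every pair of adjacent words, whereas the example above lists only two of them; whether the real function limits the number of keys is not stated here. The parameter name @min_word_len@ is hypothetical.

```python
def suffix_prefix(text, length=3, min_word_len=4):
    """Join the suffix of each word with the prefix of the word that follows it."""
    # Assumption: short words such as "for" and "the" are ignored.
    words = [w for w in text.lower().split() if len(w) >= min_word_len]
    return [left[-length:] + right[:length] for left, right in zip(words, words[1:])]


keys = suffix_prefix("Search for the Standard Model Higgs Boson")
print(keys)
```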
||
98 | 1 | Alessia Bardi | |
99 | 3 | Alessia Bardi | h3. TODOs |
100 | 1 | Alessia Bardi | |
101 | * OpenAIRE entity identifier & PID mapping policy |
||
102 | * Aggregation business logic by major sources: |
||
103 | ** Unpaywall integration |
||
104 | ** Crossref integration |
||
105 | ** ORCID integration |
||
106 | ** Cross cleaning actions: hostedBy patch |
||
107 | ** Scholexplorer business logic (relationship resolution) |
||
108 | ** DataCite |
||
109 | 3 | Alessia Bardi | ** EuropePMC |
110 | 1 | Alessia Bardi | ** more…. |
111 | * Deduplication business logic |
||
112 | ** -For research outputs- |
||
113 | ** For research organizations |
||
114 | * Enrichment |
||
115 | ** Mining business logic |
||
116 | ** Deduction-based inference |
||
117 | ** Propagation business logic |
||
118 | * Post-cleaning business logic |
||
119 | * FAQ |