Deduplication » History » Version 1
Alessia Bardi, 10/11/2021 02:58 PM
1 | 1 | Alessia Bardi | h1. Deduplication |
---|---|---|---|
2 | |||
3 | h2. Deduplication business logic for research results |
||
4 | |||
5 | Metadata records about the same scholarly work can be collected from different providers. Each metadata record may carry different information because, for example, some providers are not aware of links to projects, keywords or other details. Another common case is when OpenAIRE collects one metadata record about a pre-print from a repository and another record about the published article from a journal. For the provision of statistics, OpenAIRE must identify those cases and “merge” the two metadata records, so that the scholarly work is counted only once in the statistics OpenAIRE produces. |
||
6 | |||
7 | Duplicates are identified only among research results of the same type (publications, datasets, software, other research products). If two duplicate results are aggregated one as a dataset and the other as software, for example, they will never be compared and therefore never identified as duplicates. |
||
8 | OpenAIRE supports different deduplication strategies based on the type of results. |
||
9 | |||
10 | *Methodology overview* |
||
11 | |||
12 | The deduplication process can be divided into three phases: |
||
13 | * Candidate identification (clustering) |
||
14 | * Decision tree |
||
15 | * Creation of representative record |
||
16 | |||
17 | The implementation of each phase is different based on the type of results that are being processed. |
||
18 | |||
19 | |||
20 | h3. Strategy for publications |
||
21 | |||
22 | _Candidate identification (clustering)_ |
||
23 | |||
24 | |||
25 | Clustering is a common heuristic used to avoid the N x N complexity of matching all pairs of objects to identify the equivalent ones. The challenge is to identify a clustering function that maximizes the chance of comparing only records that may lead to a match, while minimizing the number of equivalent records that are never compared. Since the equivalence function is to some extent tolerant to minimal errors (e.g. swapped characters in the title, or minimal differences in letters), this function must be neither too precise (e.g. a hash of the title) nor too flexible (e.g. random ngrams of the title). On the other hand, in some cases the equality of two records can only be determined by their PIDs (e.g. DOI), because the metadata properties differ so much across versions that no clustering function will ever bring them into the same cluster. To meet these requirements, OpenAIRE clustering for products works with two functions: |
||
26 | * DOI: the function generates the DOI as the key when it is provided as part of the record properties; |
||
27 | * Title-based function: the function generates a key that depends on (i) the number of significant words in the title (after normalization, stemming, etc.), (ii) modulo 10 of the number of characters of such words, and (iii) a string obtained by alternating the functions prefix(3) and suffix(3) (and vice versa) on the first 3 words (2 words if the title only has 2). For example, the title @“Entity deduplication in big data graphs for scholarly communication”@ becomes @“entity deduplication big data graphs scholarly communication”@ with two keys, @“7.1entionbig”@ and @“7.1itydedbig”@ (where 1 is modulo 10 of 54 characters of the normalized title). |
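The prefix/suffix alternation in the title-based function can be illustrated with a minimal Python sketch. It is not the OpenAIRE implementation: the stop-word list, the tokenization and the names @significant_words@, @alternate@ and @title_keys@ are assumptions made for illustration, and stemming is omitted. The digit after the dot follows one literal reading of the description (total characters of the significant words, modulo 10) and may not match how the production function counts characters.

```python
import re

# Hypothetical stop-word list: the real normalization rules are not given here.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def significant_words(title):
    """Lowercase the title, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [w for w in words if w not in STOPWORDS]


def alternate(words, start_with_prefix, n=3):
    """Concatenate prefix(n)/suffix(n) of each word, alternating per word."""
    parts = []
    use_prefix = start_with_prefix
    for word in words:
        parts.append(word[:n] if use_prefix else word[-n:])
        use_prefix = not use_prefix
    return "".join(parts)


def title_keys(title):
    """Return the two clustering keys (prefix-first and suffix-first)."""
    words = significant_words(title)
    n_chars = sum(len(w) for w in words)
    head = f"{len(words)}.{n_chars % 10}"
    first = words[:3]  # only the first 3 significant words feed the string part
    return [head + alternate(first, True), head + alternate(first, False)]


keys = title_keys(
    "Entity deduplication in big data graphs for scholarly communication"
)
print(keys)
```

For the example title, the string parts of the two keys are @entionbig@ (prefix, suffix, prefix) and @itydedbig@ (suffix, prefix, suffix). Records that share at least one key end up in the same cluster and are the only ones compared pairwise.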
||
28 | |||
29 | _Decision tree_ |
||
30 | |||
31 | For each pair of publications in a cluster the following strategy (depicted in the figure below) is applied. |
||
32 | Cross-compare the pid lists (in the @pid@ and @alternateid@ elements). If at least 50% of the pids are in common, compute the Levenshtein similarity of the titles with a low threshold (0.9). |
||
33 | Otherwise, check whether the number of authors and the title version are equal. If so, compute the Levenshtein similarity of the titles with a higher threshold (0.99). |
||
34 | The publications are matched as duplicates if the similarity is higher than the threshold; in every other case they are considered distinct publications. |
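The decision steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the production code: the record layout (dicts with @title@, @pids@ and @authors@), the way the pid overlap is measured (shared pids over the size of the smaller list), the reading of "title version" as the numbers appearing in the title, and the normalization of the Levenshtein distance into a 0..1 similarity are all hypothetical choices.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def title_similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical titles."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def pid_overlap(pids1, pids2):
    """Share of common pids, relative to the smaller list (assumption)."""
    s1, s2 = set(pids1), set(pids2)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / min(len(s1), len(s2))


def title_version(title):
    """Assumed meaning of 'title version': the numbers found in the title."""
    return tuple(re.findall(r"\d+", title))


def is_duplicate(r1, r2):
    t1, t2 = r1["title"].lower(), r2["title"].lower()
    if pid_overlap(r1["pids"], r2["pids"]) >= 0.5:
        return title_similarity(t1, t2) > 0.9    # low threshold
    same_authors = len(r1["authors"]) == len(r2["authors"])
    if same_authors and title_version(t1) == title_version(t2):
        return title_similarity(t1, t2) > 0.99   # higher threshold
    return False
```

With this sketch, two records that share a DOI and whose titles differ only by a small typo are matched, while the same pair without any common pid is not, because the stricter 0.99 threshold applies.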
||
35 | |||
36 | !{width: 50%}dedup-results.png! |
||
37 | |||
38 | _Creation of representative record_ |
||
39 | |||
40 | TODO |
||
41 | |||
42 | |||
43 | h3. Strategy for datasets |
||
44 | |||
45 | h3. Strategy for software |
||
46 | |||
47 | h3. Strategy for other types of research products |
||
48 | |||
49 | h3. Clustering functions |
||
50 | |||
51 | _NgramPairs_ |
||
52 | It produces a list of concatenations of a pair of ngrams generated from different words. |
||
53 | Example: |
||
54 | Input string: @“Search for the Standard Model Higgs Boson”@ |
||
55 | Parameters: ngram length = 3 |
||
56 | List of ngrams: @“sea”, “sta”, “mod”, “hig”@ |
||
57 | Ngram pairs: @“seasta”, “stamod”, “modhig”@ |
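A minimal Python sketch of this behaviour is given below. The stop-word list and the @max_ngrams@ limit are assumptions, introduced only because the example above drops "for" and "the" and lists four ngrams; the actual parameters of the function are not documented here.

```python
import re

# Hypothetical stop-word list, just large enough for the example.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def ngram_pairs(text, ngram_len=3, max_ngrams=4):
    """Concatenate the leading ngrams of consecutive significant words."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in STOPWORDS]
    # One ngram per word, keeping at most max_ngrams of them (assumption).
    ngrams = [w[:ngram_len] for w in words][:max_ngrams]
    return [left + right for left, right in zip(ngrams, ngrams[1:])]


print(ngram_pairs("Search for the Standard Model Higgs Boson"))
# ['seasta', 'stamod', 'modhig']
```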
||
58 | |||
59 | _SuffixPrefix_ |
||
60 | It produces ngram pairs by concatenating the suffix of a word with the prefix of the next word in the input string. |
||
61 | Example: |
||
62 | Input string: @“Search for the Standard Model Higgs Boson”@ |
||
63 | Parameters: suffix and prefix length = 3 |
||
64 | Output list: @“ardmod”@ (suffix of the word @“Standard”@ + prefix of the word @“Model”@), @“rchsta”@ (suffix of the word @“Search”@ + prefix of the word @“Standard”@) |
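The same idea can be sketched in Python. As before, the stop-word list and tokenization are assumptions. The sketch returns one key for every pair of consecutive significant words, whereas the example above lists only two of them, so the real function may apply a limit that is not modelled here.

```python
import re

# Hypothetical stop-word list, just large enough for the example.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def suffix_prefix(text, length=3):
    """Join the suffix of each significant word with the prefix of the next."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in STOPWORDS]
    return [left[-length:] + right[:length]
            for left, right in zip(words, words[1:])]


keys = suffix_prefix("Search for the Standard Model Higgs Boson")
print(keys[:2])
# ['rchsta', 'ardmod']
```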
||
65 | |||
66 | h3. Creation of representative record for Results |