Revision 60262

Updated documentation on OpenAIRE Research Graph dumps

View differences:

graph-dumps.html
165 165
               <div   uk-grid="" class="uk-grid uk-grid-stack">
166 166
                  <div   class="tm-main uk-width-1-1@s uk-width-1-1@m  uk-width-1-1@l uk-row-first uk-first-column">
167 167
                     <!-- Content GOES HERE-->
168
                     <div class="uk-alert-danger" uk-alert>
169
                        <h3>Contribute to improve the OpenAIRE Research Graph</h3>
170
                        <p>You can explore and test the beta release of the OpenAIRE Research Graph via the <a href="https://beta.explore.openaire.eu">OpenAIRE BETA Explore Portal</a> or via data dumps made available in <a href="https://zenodo.org/communities/openaire-research-graph">Zenodo</a>. </p>
171
                        <p>Help us make the graph ready for its first production release by providing your feedback.<br/>
172
                        Go to the <a href="https://trello.com/b/o1tEJ3rN/openaire-research-graph">OpenAIRE Research Graph Trello Board</a> to report content quality issues, including missing metadata records, wrong values, mistakes in the detection of duplicates and anything else that looks "weird" or wrong.
173
                        <p>Find the complete information about the OpenAIRE Research Graph, how to test it and contribute to improve it on <a href="https://www.openaire.eu/blogs/the-openaire-research-graph">our blog</a>.</p>
174
                     </div>
175
                     
176
                     <h2>OpenAIRE Research Graph Dumps</h2>
168
                     <h2>The OpenAIRE Research Graph</h2>
177 169
                     <p>The OpenAIRE Research Graph is one of the largest open scholarly record collections worldwide, key in fostering Open Science and establishing its practices in the daily research activities. 
178 170
                        Conceived as a public and transparent good, populated out of data sources trusted by scientists, the Graph aims at bringing discovery, monitoring, and assessment of science back in the hands of the scientific community. 
179 171
                     </p>
180 172
                     <p>Imagine a vast collection of research products all linked together, contextualised and openly available. 
181
                        For the past ten years OpenAIRE has been working to gather this valuable record. OpenAIRE is pleased to announce the beta release of its Research Graph, a massive collection of metadata and links between 
173
                        For the past ten years OpenAIRE has been working to gather this valuable record. It is a massive collection of metadata and links between 
182 174
                        scientific products such as articles, datasets, software, and other research products, entities like organisations, funders, funding streams, projects, communities, and data sources.
183 175
                    </p>
184 176
                     <p>As of today, the OpenAIRE Research Graph aggregates around 450 million metadata records with links, collected from 10,000 data sources trusted by scientists, including repositories registered in <a href="https://v2.sherpa.ac.uk/opendoar/">OpenDOAR</a>, Open Access journals registered in <a href="https://doaj.org/">DOAJ</a>, <a href="https://www.crossref.org/">Crossref</a>, <a href="https://unpaywall.org">Unpaywall</a>, <a href="https://orcid.org/">ORCID</a> and <a href="https://aka.ms/msracad">Microsoft Academic Graph</a>.
......
186 178
                        More than 10 million full texts of Open Access publications are mined by algorithms to enrich metadata records with additional properties and links among research products, funders, projects, communities, and organizations. 
187 179
                        Thanks to the mining algorithms, the graph is completed with 480 million semantic relations.
188 180
                     </p>
189
                     <p>The OpenAIRE Research graph is available via our <a href="https://beta.explore.openaire.eu">BETA Explore Portal</a> and you can download it from <a href="https://zenodo.org/communities/openaire-research-graph">Zenodo</a>.
190
                     </p>
181
                     <p>Detailed information can be found on <a href="https://graph.openaire.eu">https://graph.openaire.eu</a></p>
182
             
191 183
                     
192 184
                     <h3>Get the dumps</h3>
193 185
                    <div>
194
                       <p>The OpenAIRE Research Graph is exported as several dump files available on Zenodo (go to <a href="https://doi.org/10.5281/zenodo.3516917"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3516917.svg" alt="DOI"></a>), so you can download only the parts you are interested in. </p>
195
                      <ul>
196
                         <li> <strong>publications</strong>: metadata records about research literature (includes types of publications listed <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/publication">here</a>)</li>
197
                         <li> <strong>datasets</strong>: metadata records about research data (includes the subtypes listed <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/dataset">here</a>)</li>
198
                         <li> <strong>software</strong>: metadata records about research software (includes the subtypes listed <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/software">here</a>)</li>
199
                         <li> <strong>orps</strong>: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/other">here</a>)</li>
200
                         <li> <strong>organizations</strong>: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.</li>
201
                         <li> <strong>content_providers</strong>: metadata records about providers whose content is available in the OpenAIRE Research Graph. They include institutional and thematic repositories, journals, aggregators, funders' databases.</li>
202
                         <li> <strong>results_by_funder</strong>: metadata records about research results funded by a given funder. Each result includes information about its type (publications, datasets, software or other) and its specific sub-type (check the list of sub-types for <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/publication">publications</a>, <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/dataset">datasets</a>, <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/software">software</a>, and <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/other">other research products</a>).  </li>
203
                      </ul>
204
                       <p>The up-to-date list of funders available on OpenAIRE BETA can be found <a href="https://beta.explore.openaire.eu/search/entity-registries?datasourcetypename=%22Funder%20database%22">here on the BETA Explore portal</a>.</p>
205
                       <p> In the same <a href="https://zenodo.org/communities/openaire-research-graph">Zenodo community</a> you can also find the dumps of ScholeXplorer and DOIBoost.</p>
186
                       <p>To suit different needs, several dumps are available. 
187
                          All of them are published in the <a href="https://zenodo.org/communities/openaire-research-graph">Zenodo community called OpenAIRE Research Graph</a>.</p>
188
                          <ul>
189
                             <li>The <strong>whole OpenAIRE Research Graph Dump</strong><br/>
190
                                Dataset: <a href="https://doi.org/10.5281/zenodo.3516917"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3516917.svg" alt="DOI"></a><br/>
191
                                Schema: <a href="https://doi.org/10.5281/zenodo.4238938"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.4238938.svg" alt="DOI"></a><br/>
192
                                This dataset is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br/>
193
                                It is composed of several files so that you can download only the parts you are interested in. 
194
                                Each file is a tar archive containing gzip files, each with one JSON record per line. 
195
                             </li>
196
                             <li>The <strong>OpenAIRE COVID-19 dump</strong> <br/>
197
                                Dataset: <a href="https://doi.org/10.5281/zenodo.3980490"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3980490.svg" alt="DOI"></a><br/>
198
                                Schema: <a href="https://doi.org/10.5281/zenodo.3974225"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3974225.svg" alt="DOI"></a><br/>
199
                                This dataset is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br/>
200
                                It contains metadata records of publications, research data, software and projects on the topic of Corona Virus and COVID-19.
201
                                This dump is part of the <a href="https://www.openaire.eu/openaire-activities-for-covid-19">activities of OpenAIRE to support the fight against COVID-19</a> together with the <a href="https://covid-19.openaire.eu">OpenAIRE COVID-19 Gateway</a>. 
202
                                The dump consists of a tar archive containing gzip files with one json per line. 
203
                             </li>
204
                             <li>
205
                                The <strong>dumps about research communities, initiatives and infrastructures</strong> <br/>
206
                                Dataset: <a href="https://doi.org/10.5281/zenodo.3974604"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3974604.svg" alt="DOI"></a><br/>
207
                                Schema: <a href="https://doi.org/10.5281/zenodo.3974225"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3974225.svg" alt="DOI"></a><br/>
208
                                This dataset is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br/>
209
                                The dataset contains one file per community/initiative/infrastructure collaborating with OpenAIRE. Check out also their community gateways on <a href="https://connect.openaire.eu">CONNECT</a>.
210
                                Each file is a tar archive containing gzip files with one json per line. 
211
                             </li>
212
                             <li>The dump of <strong>ScholeXplorer</strong> <br/>
213
                                Dataset: <a href="https://doi.org/10.5281/zenodo.1200252"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.1200252.svg" alt="DOI"></a><br/>
214
                                Schema (Scholix version 3): <a href="https://doi.org/10.5281/zenodo.1120275"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.1120275.svg" alt="DOI"></a><br/>
215
                                This dataset is licensed under a <a rel="license" href="https://creativecommons.org/publicdomain/zero/1.0/">CC0 1.0 Universal (CC0 1.0) Public Domain Dedication</a>.<br/>
216
                                The dataset contains the GZ-compressed dump of the Scholix links exposed by the <a href="https://scholexplorer.openaire.eu">OpenAIRE ScholeXplorer service</a>. 
217
                             </li>
218
                             <li>The dump of <strong>DOIBoost</strong> <br/>
219
                                Dataset: <a href="https://doi.org/10.5281/zenodo.1438355"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.1438355.svg" alt="DOI"></a><br/>
220
                                Publication: <a href="https://doi.org/10.5281/zenodo.1441071"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.1441071.svg" alt="DOI"></a><br/>
221
                                Software: <a href="https://doi.org/10.5281/zenodo.1441057"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.1441057.svg" alt="DOI"></a><br/>
222
                                This dataset is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br/>
223
                                DOIBoost is a metadata collection that enriches Crossref with inputs from Microsoft Academic Graph, ORCID, and Unpaywall. 
224
                             </li>
225
                          </ul>
206 226
                    </div>
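The layout described above (a tar archive containing gzip files, each with one JSON record per line) can be read without unpacking anything to disk. The following is a minimal sketch, not an official client: the helper name `iter_records` and the `.gz` suffix check are assumptions made for illustration, and the fields of each record depend on the schema published with the dump.

```python
import gzip
import json
import tarfile


def iter_records(tar_path):
    """Yield one parsed JSON object per line from every .gz member of a tar archive.

    Hypothetical helper: it assumes the members of interest end in ".gz"
    and that each line of a member is a complete UTF-8 JSON document.
    """
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            # Skip directories and anything that is not a gzip member.
            if not member.isfile() or not member.name.endswith(".gz"):
                continue
            raw = tar.extractfile(member)
            # Decompress the member as a text stream, line by line.
            with gzip.open(raw, "rt", encoding="utf-8") as lines:
                for line in lines:
                    if line.strip():
                        yield json.loads(line)
```

Because the function is a generator, a loop such as `for record in iter_records("publication.tar")` (the file name is a placeholder) processes one record at a time and can be stopped after the first few records when exploring a large dump.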
227
                     <h3>Cite us</h3>
228
                     <p>If you use any of the dumps above for research purposes, please cite it following the recommendation on its Zenodo page.<br/>
229
                        The OpenAIRE Research Graph and DOIBoost include data from <a href="https://aka.ms/msracad">Microsoft Academic Graph</a> (MAG): please also acknowledge MAG following <a href="https://docs.microsoft.com/en-us/academic-services/graph/resources-faq#license">this guideline</a>.<br/>
230
                     </p>
207 231
                     
232
                   <h3>Still using the old XML dumps?</h3>
208 233
                     <div>
209
                        <p>The dumps contain XML records compliant to the <b>OpenAIRE data model</b> and to the <b>oaf metadata format</b> (the same format of the records exported via <a href="./oai-pmh.html">OAI-PMH</a>):</p>
210
                       <ul>
211
                          <li><a href="" target="_blank">See the description of the OpenAIRE data model</a></li>
212
                          <li><a href="https://www.openaire.eu/schema/latest/oaf.xsd" target="_blank">See the oaf XML schema</a></li>
213
                          <li><a href="https://www.openaire.eu/schema/latest/doc/oaf.html" target="_blank">See the oaf XML schema documentation (generated via Oxygen XML Editor)</a></li>
214
                       </ul>
215
                     <p>Keep reading for instructions on how to consume the dumps.</p>
234
                        Please migrate to the new json dumps. Meanwhile, you can still access the <a href="./graph-dumps-old.html">old documentation here</a>.
216 235
                     </div>
217 236

  
218
                     <h3>Consume the dumps</h3>
219
                     <div>
220
                        Each dump is a gzipped json file with many lines. Each line is in the form of:
221
                        <code>{"_id":{"$oid":"59b82504895be144859a9804"},"body":{"$binary":"base64(zip(XML_record))","$type":"00"}}</code><br/>
222
                        where the <code>body</code> field contains the base64 encoding of the compressed XML record. <br/>
223
                        In order to get the XMLs you have to:
224
                        <ol>
225
                           <li>Decompress (gunzip) the file</li>
226
                           <li>Read each line and take only the value of the <code>$binary</code> field</li>
227
                           <li>Base64-decode that value</li>
228
                           <li>Unzip the decoded bytes</li>
229
                        </ol>
230
                        
231
                        For example, to print the XMLs on the standard output you can run this command on MacOS/Unix/Linux based systems:
232
                        <code>gunzip -c file.json.gz | jq '.body."$binary"' -r | while IFS= read -r line; do echo "$line" | base64 --decode | bsdtar -x -O ; done </code><br/>
233
                        where 
234
                        <ul>
235
                           <li><code>file.json.gz</code> is the name you gave to the downloaded file dump;</li>
236
                           <li><code>jq</code> is a command to parse json files. It is not installed by default, but you can easily find it in the official repositories. <a href="https://stedolan.github.io/jq/download/">Click here for installation instructions</a>.</li>
237
                           <li><code>base64</code> and <code>bsdtar</code> are two command-line tools that are typically pre-installed.</li>
238
                        </ul>
239
                       Note that you should decide what to do with the XML records (parse them inline or store them somewhere).
240
                       We suggest starting with a few records to test and decide how to proceed, by adding a <code>head</code> command after the <code>gunzip</code>, like:
241
                     <code>gunzip -c file.json.gz | head -n 10 | jq '.body."$binary"' -r | while IFS= read -r line; do echo "$line" | base64 --decode | bsdtar -x -O ; done</code>
242
                     </div>
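The same decoding steps for the old XML dumps can also be sketched in Python. This is an illustration under stated assumptions rather than a supported tool: the function names `decode_body` and `iter_xml` are hypothetical, the compressed payload is assumed to be a zip archive whose first entry is the XML record (which is what the `bsdtar -x -O` step above suggests), and the XML is assumed to be UTF-8.

```python
import base64
import gzip
import io
import json
import zipfile


def decode_body(line):
    """Return the XML text stored in one line of an old-format dump."""
    doc = json.loads(line)
    # The $binary field holds the base64 encoding of the zipped record.
    zipped = base64.b64decode(doc["body"]["$binary"])
    with zipfile.ZipFile(io.BytesIO(zipped)) as archive:
        # Assumption: the archive holds a single entry, the XML record.
        first = archive.namelist()[0]
        return archive.read(first).decode("utf-8")


def iter_xml(dump_path, limit=None):
    """Yield decoded XML records from a gzipped dump, optionally only the first `limit`."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as lines:
        for count, line in enumerate(lines):
            if limit is not None and count >= limit:
                break
            yield decode_body(line)
```

Calling `iter_xml("file.json.gz", limit=10)` mirrors the `head -n 10` variant of the shell command: it decodes only the first ten lines, which is a cheap way to inspect the records before processing a whole dump.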
243
                     
244
                     <h3>Cite us</h3>
245
                     <p>If you use the OpenAIRE Research Graph for research purposes, please cite it as:<br/>
246
                        <i>Manghi, Paolo, Atzori, Claudio, Bardi, Alessia, Shirrwagen, Jochen, Dimitropoulos, Harry, La Bruzzo, Sandro, … Summan, Friedrich. (2019). OpenAIRE Research Graph Dump [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3516917</i><br/>
247
                        If you want to cite a specific version, please follow the suggestion on Zenodo. For the current version (1.0.0-beta), please use: <br/>
248
                        <i>Manghi, Paolo, Atzori, Claudio, Bardi, Alessia, Shirrwagen, Jochen, Dimitropoulos, Harry, La Bruzzo, Sandro, … Summan, Friedrich. (2019). OpenAIRE Research Graph Dump (Version 1.0.0-beta) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3516918</i><br/>
249
                        The OpenAIRE Research Graph includes data from <a href="https://aka.ms/msracad">Microsoft Academic Graph</a> (MAG): please also acknowledge MAG following <a href="https://docs.microsoft.com/en-us/academic-services/graph/resources-faq#license">this guideline</a>.
250
                     </p>
251
                     <h3>License</h3>
252
                     <p>The OpenAIRE Research Graph is released under the CC-BY license.</p>
253
                     <p>OpenAIRE is working to produce dumps that contain only metadata records that can be re-distributed with the CC0 license: stay tuned!</p>
254 237
                  </div>
255 238
               </div>
256 239
            </div>
