Revision 60262
Added by Alessia Bardi over 3 years ago
graph-dumps.html | ||
---|---|---|
165 | 165 |
<div uk-grid="" class="uk-grid uk-grid-stack"> |
166 | 166 |
<div class="tm-main uk-width-1-1@s uk-width-1-1@m uk-width-1-1@l uk-row-first uk-first-column"> |
167 | 167 |
<!-- Content GOES HERE--> |
168 |
<div class="uk-alert-danger" uk-alert> |
|
169 |
<h3>Contribute to improve the OpenAIRE Research Graph</h3> |
|
170 |
<p>You can explore and test the beta release of the OpenAIRE Research Graph via the <a href="https://beta.explore.openaire.eu">OpenAIRE BETA Explore Portal</a> or via data dumps made available in <a href="https://zenodo.org/communities/openaire-research-graph">Zenodo</a>. </p> |
|
171 |
<p>Help us make the graph ready for its first production release by providing your feedback.<br/> |
|
172 |
Go to the <a href="https://trello.com/b/o1tEJ3rN/openaire-research-graph">OpenAIRE Research Graph Trello Board</a> to report content quality issues, including missing metadata records, wrong values, mistakes in the detection of duplicates and anything else that looks "weird" or wrong. |
|
173 |
<p>Find the complete information about the OpenAIRE Research Graph, how to test it and contribute to improve it on <a href="https://www.openaire.eu/blogs/the-openaire-research-graph">our blog</a>.</p> |
|
174 |
</div> |
|
175 |
|
|
176 |
<h2>OpenAIRE Research Graph Dumps</h2> |
|
168 |
<h2>The OpenAIRE Research Graph</h2> |
|
177 | 169 |
<p>The OpenAIRE Research Graph is one of the largest open scholarly record collections worldwide, key in fostering Open Science and establishing its practices in the daily research activities. |
178 | 170 |
Conceived as a public and transparent good, populated out of data sources trusted by scientists, the Graph aims at bringing discovery, monitoring, and assessment of science back in the hands of the scientific community. |
179 | 171 |
</p> |
180 | 172 |
<p>Imagine a vast collection of research products all linked together, contextualised and openly available. |
181 |
For the past ten years OpenAIRE has been working to gather this valuable record. OpenAIRE is pleased to announce the beta release of its Research Graph, a massive collection of metadata and links between
|
|
173 |
For the past ten years OpenAIRE has been working to gather this valuable record. It is a massive collection of metadata and links between
|
|
182 | 174 |
scientific products such as articles, datasets, software, and other research products, entities like organisations, funders, funding streams, projects, communities, and data sources. |
183 | 175 |
</p> |
184 | 176 |
<p>As of today, the OpenAIRE Research Graph aggregates around 450Mi metadata records with links collected from 10,000 data sources trusted by scientists, including repositories registered in <a href="https://v2.sherpa.ac.uk/opendoar/">OpenDOAR</a>, Open Access journals registered in <a href="https://doaj.org/">DOAJ</a>, <a href="https://www.crossref.org/">Crossref</a>, <a href="https://unpaywall.org">Unpaywall</a>, <a href="https://orcid.org/">ORCID</a> and <a href="https://aka.ms/msracad">Microsoft Academic Graph</a>. |
... | ... | |
186 | 178 |
More than 10Mi full-texts of Open Access publications are mined by algorithms to enrich metadata records with additional properties and links among research products, funders, projects, communities, and organizations. |
187 | 179 |
Thanks to the mining algorithm, the graph is completed with 480Mi semantic relations. |
188 | 180 |
</p> |
189 |
<p>The OpenAIRE Research graph is available via our <a href="https://beta.explore.openaire.eu">BETA Explore Portal</a> and you can download it from <a href="https://zenodo.org/communities/openaire-research-graph">Zenodo</a>.
|
|
190 |
</p> |
|
181 |
<p>Detailed information can be found on <a href="https://graph.openaire.eu">https://graph.openaire.eu</a></p>
|
|
182 |
|
|
191 | 183 |
|
192 | 184 |
<h3>Get the dumps</h3> |
193 | 185 |
<div> |
194 |
<p>The OpenAIRE Research Graph is exported as several dump files available on Zenodo (go to <a href="https://doi.org/10.5281/zenodo.3516917"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3516917.svg" alt="DOI"></a>), so you can download only the parts you are interested in. </p> |
|
195 |
<ul> |
|
196 |
<li> <strong>publications</strong>: metadata records about research literature (includes types of publications listed <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/publication">here</a>)</li> |
|
197 |
<li> <strong>datasets</strong>: metadata records about research data (includes the subtypes listed <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/dataset">here</a>)</li> |
|
198 |
<li> <strong>software</strong>: metadata records about research software (includes the subtypes listed <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/software">here</a>)</li> |
|
199 |
<li> <strong>orps</strong>: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/other">here</a>)</li> |
|
200 |
<li> <strong>organizations</strong>: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.</li> |
|
201 |
<li> <strong>content_providers</strong>: metadata records about providers whose content is available in the OpenAIRE Research Graph. They include institutional and thematic repositories, journals, aggregators, funders' databases.</li> |
|
202 |
<li> <strong>results_by_funder</strong>: metadata records about research results funded by a given funder. Each result includes information about its type (publications, datasets, software or other) and its specific sub-type (check the list of sub-types for <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/publication">publications</a>, <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/dataset">datasets</a>, <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/software">software</a>, and <a href="http://api.openaire.eu/vocabularies/dnet:result_typologies/other">other research products</a>). </li> |
|
203 |
</ul> |
|
204 |
<p>The up-to-date list of funders available on OpenAIRE BETA can be found <a href="https://beta.explore.openaire.eu/search/entity-registries?datasourcetypename=%22Funder%20database%22">here on the BETA Explore portal</a>.</p> |
|
205 |
<p> In the same <a href="https://zenodo.org/communities/openaire-research-graph">Zenodo community</a> you can also find the dumps of ScholeXplorer and DOIBoost.</p> |
|
186 |
<p>To meet different user needs, several dumps are available. |
|
187 |
All are available under the <a href="https://zenodo.org/communities/openaire-research-graph">Zenodo community called OpenAIRE Research Graph</a>. |
|
188 |
<ul> |
|
189 |
<li>The <strong>whole OpenAIRE Research Graph Dump</strong><br/> |
|
190 |
Dataset: <a href="https://doi.org/10.5281/zenodo.3516917"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3516917.svg" alt="DOI"></a><br/> |
|
191 |
Schema: <a href="https://doi.org/10.5281/zenodo.4238938"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.4238938.svg" alt="DOI"></a><br/> |
|
192 |
This dataset is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br/> |
|
193 |
It is composed of several files so that you can download only the parts you are interested in. |
|
194 |
Each file is a tar archive containing gz files, each with one json per line. |
|
195 |
</li> |
|
196 |
<li>The <strong>OpenAIRE COVID-19 dump</strong> <br/> |
|
197 |
Dataset: <a href="https://doi.org/10.5281/zenodo.3980490"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3980490.svg" alt="DOI"></a><br/> |
|
198 |
Schema: <a href="https://doi.org/10.5281/zenodo.3974225"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3974225.svg" alt="DOI"></a><br/> |
|
199 |
This dataset is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br/> |
|
200 |
It contains metadata records of publications, research data, software and projects on the topic of Corona Virus and COVID-19. |
|
201 |
This dump is part of the <a href="https://www.openaire.eu/openaire-activities-for-covid-19">activities of OpenAIRE to support the fight against COVID-19</a> together with the <a href="https://covid-19.openaire.eu">OpenAIRE COVID-19 Gateway</a>. |
|
202 |
The dump consists of a tar archive containing gzip files with one json per line. |
|
203 |
</li> |
|
204 |
<li> |
|
205 |
The <strong>dumps about research communities, initiatives and infrastructures</strong> <br/> |
|
206 |
Dataset: <a href="https://doi.org/10.5281/zenodo.3974604"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3974604.svg" alt="DOI"></a><br/> |
|
207 |
Schema: <a href="https://doi.org/10.5281/zenodo.3974225"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.3974225.svg" alt="DOI"></a><br/> |
|
208 |
This dataset is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br/> |
|
209 |
The dataset contains one file per community/initiative/infrastructure collaborating with OpenAIRE. Check out also their community gateways on <a href="https://connect.openaire.eu">CONNECT</a>. |
|
210 |
Each file is a tar archive containing gzip files with one json per line. |
|
211 |
</li> |
|
212 |
<li>The dump of <strong>ScholeXplorer</strong> <br/> |
|
213 |
Dataset: <a href="https://doi.org/10.5281/zenodo.1200252"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.1200252.svg" alt="DOI"></a><br/> |
|
214 |
Schema (Scholix version 3): <a href="https://doi.org/10.5281/zenodo.1120275"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.1120275.svg" alt="DOI"></a><br/> |
|
215 |
This dataset is licensed under a <a rel="license" href="https://creativecommons.org/publicdomain/zero/1.0/">CC0 1.0 Universal (CC0 1.0) Public Domain Dedication</a>.<br/> |
|
216 |
The dataset contains the GZ-compressed dump of the Scholix links exposed by the <a href="https://scholexplorer.openaire.eu">OpenAIRE ScholeXplorer service</a>. |
|
217 |
</li> |
|
218 |
<li>The dump of <strong>DOIBoost</strong> <br/> |
|
219 |
Dataset: <a href="https://doi.org/10.5281/zenodo.1438355"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.1438355.svg" alt="DOI"></a><br/> |
|
220 |
Publication: <a href="https://doi.org/10.5281/zenodo.1441071"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.1441071.svg" alt="DOI"></a><br/> |
|
221 |
Software: <a href="https://doi.org/10.5281/zenodo.1441057"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.1441057.svg" alt="DOI"></a><br/> |
|
222 |
This dataset is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br/> |
|
223 |
DOIBoost is a metadata collection that enriches CrossRef with inputs from Microsoft Academic Graph, ORCID, and Unpaywall. |
|
224 |
</li> |
|
225 |
</ul> |
|
206 | 226 |
</div> |
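The layout described above for the new dumps (a tar archive containing gzip files, each with one JSON object per line) can be read without unpacking anything to disk. The following is a minimal Python sketch under that assumption; the function name `iter_records` is hypothetical, and the fields found in each record depend on the schema published on Zenodo for the dump you downloaded.

```python
import gzip
import json
import tarfile
from typing import Iterator


def iter_records(tar_path: str) -> Iterator[dict]:
    """Yield one dict per JSON line from every .gz member of a tar archive."""
    # Mode "r" lets tarfile detect the outer compression (if any) by itself.
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            # Skip directories and anything that is not a gzip part file.
            if not member.isfile() or not member.name.endswith(".gz"):
                continue
            raw = tar.extractfile(member)
            # gzip.open accepts an already open binary file object.
            with gzip.open(raw, "rt", encoding="utf-8") as lines:
                for line in lines:
                    if line.strip():  # tolerate blank lines
                        yield json.loads(line)
```

For example, `for record in iter_records("publication.tar"): print(record)` would stream the records of a downloaded file; the file name here is only a placeholder for whichever archive you fetched.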
227 |
<h3>Cite us</h3> |
|
228 |
<p>If you use any of the dumps above for research purposes, please cite it following the recommendation that you find on the Zenodo page.<br/> |
|
229 |
The OpenAIRE Research Graph and DOIBoost include data from <a href="https://aka.ms/msracad">Microsoft Academic Graph</a> (MAG): please acknowledge also MAG following <a href="https://docs.microsoft.com/en-us/academic-services/graph/resources-faq#license">this guideline</a>.<br/> |
|
230 |
</p> |
|
207 | 231 |
|
232 |
<h3>Still using the old XML dumps?</h3> |
|
208 | 233 |
<div> |
209 |
<p>The dumps contain XML records compliant with the <b>OpenAIRE data model</b> and with the <b>oaf metadata format</b> (the same format as the records exported via <a href="./oai-pmh.html">OAI-PMH</a>):</p> |
|
210 |
<ul> |
|
211 |
<li><a href="" target="_blank">See the description of the OpenAIRE data model</a></li> |
|
212 |
<li><a href="https://www.openaire.eu/schema/latest/oaf.xsd" target="_blank">See the oaf XML schema</a></li> |
|
213 |
<li><a href="https://www.openaire.eu/schema/latest/doc/oaf.html" target="_blank">See the oaf XML schema documentation (generated via Oxygen XML Editor)</a></li> |
|
214 |
</ul> |
|
215 |
<p>Keep reading for instructions on how to consume the dumps.</p> |
|
234 |
Please migrate to the new json dumps. Meanwhile, you can still access the <a href="./graph-dumps-old.html">old documentation here</a>. |
|
216 | 235 |
</div> |
217 | 236 |
|
218 |
<h3>Consume the dumps</h3> |
|
219 |
<div> |
|
220 |
Each dump is a gzipped json file with many lines. Each line is in the form of: |
|
221 |
<code>{"_id":{"$oid":"59b82504895be144859a9804"},"body":{"$binary":"base64(zip(XML_record))","$type":"00"}}</code><br/> |
|
222 |
where the <code>body</code> field contains the base64 encoding of the compressed XML record. <br/> |
|
223 |
In order to get the XMLs you have to: |
|
224 |
<ol> |
|
225 |
<li>Unzip the file</li> |
|
226 |
<li>Get only the value of the <code>$binary</code> field</li> |
|
227 |
<li>Read each line and base64 decode it</li> |
|
228 |
<li>Unzip the decoded string</li> |
|
229 |
</ol> |
|
230 |
|
|
231 |
For example, to print the XMLs on the standard output you can run this command on MacOS/Unix/Linux based systems: |
|
232 |
<code>gunzip -c file.json.gz | jq '.body."$binary"' -r | while IFS= read -r line; do echo "$line" | base64 --decode | bsdtar -x -O ; done </code><br/> |
|
233 |
where |
|
234 |
<ul> |
|
235 |
<li><code>file.json.gz</code> is the name you gave to the downloaded file dump;</li> |
|
236 |
<li><code>jq</code> is a command to parse json files. It is not installed by default, but you can easily find it in the official repositories. <a href="https://stedolan.github.io/jq/download/">Click here for installation instructions</a>.</li> |
|
237 |
<li><code>base64</code> and <code>bsdtar</code> are two command-line tools that are typically pre-installed.</li> |
|
238 |
</ul> |
|
239 |
Note that you should decide what to do with the extracted XML records (parse them inline or store them somewhere). |
|
240 |
We suggest starting with a few records to test and decide what to do, by adding a <code>head</code> command after the <code>gunzip</code>, like: |
|
241 |
<code>gunzip -c file.json.gz | head -n 10 | jq '.body."$binary"' -r | while IFS= read -r line; do echo "$line" | base64 --decode | bsdtar -x -O ; done</code> |
|
242 |
</div> |
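The same four steps for the old XML dumps (gunzip the file, take the <code>$binary</code> value, base64-decode it, unzip the result) can also be sketched in Python for systems where <code>jq</code> or <code>bsdtar</code> are not available. This is only an illustration: the names `decode_body` and `iter_xml` are hypothetical, and the sketch assumes that each zip payload holds a single UTF-8 encoded XML entry, which the description above does not state explicitly.

```python
import base64
import gzip
import io
import json
import zipfile
from typing import Iterator, Optional


def decode_body(b64_zip: str) -> str:
    """Base64-decode the value, then read the first entry of the zip inside."""
    payload = base64.b64decode(b64_zip)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        first = archive.namelist()[0]  # assumption: one XML entry per zip
        return archive.read(first).decode("utf-8")  # assumption: UTF-8 XML


def iter_xml(dump_path: str, limit: Optional[int] = None) -> Iterator[str]:
    """Yield XML records from a gzipped dump with one JSON object per line."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as lines:
        for count, line in enumerate(lines):
            # Mirrors the `head -n` trick: stop after `limit` lines if given.
            if limit is not None and count >= limit:
                break
            record = json.loads(line)
            yield decode_body(record["body"]["$binary"])
```

Calling `iter_xml("file.json.gz", limit=10)` plays the role of the `head -n 10` variant of the shell command, so you can inspect a few records before processing a whole dump.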
|
243 |
|
|
244 |
<h3>Cite us</h3> |
|
245 |
<p>If you use the OpenAIRE Research Graph for research purposes, please cite it as:<br/> |
|
246 |
<i>Manghi, Paolo, Atzori, Claudio, Bardi, Alessia, Shirrwagen, Jochen, Dimitropoulos, Harry, La Bruzzo, Sandro, … Summan, Friedrich. (2019). OpenAIRE Research Graph Dump [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3516917</i><br/> |
|
247 |
If you want to cite a specific version, please follow the suggestion on Zenodo. For the current version (1.0.0-beta), please use: <br/> |
|
248 |
<i>Manghi, Paolo, Atzori, Claudio, Bardi, Alessia, Shirrwagen, Jochen, Dimitropoulos, Harry, La Bruzzo, Sandro, … Summan, Friedrich. (2019). OpenAIRE Research Graph Dump (Version 1.0.0-beta) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3516918</i><br/> |
|
249 |
The OpenAIRE Research graph includes data from <a href="https://aka.ms/msracad">Microsoft Academic Graph</a> (MAG): please acknowledge also MAG following <a href="https://docs.microsoft.com/en-us/academic-services/graph/resources-faq#license">this guideline</a>. |
|
250 |
</p> |
|
251 |
<h3>License</h3> |
|
252 |
<p>The OpenAIRE Research Graph is released under CC-BY license.</p> |
|
253 |
<p>OpenAIRE is working to produce dumps that only contain metadata records that can be re-distributed with the CC0 license: stay tuned!</p> |
|
254 | 237 |
</div> |
255 | 238 |
</div> |
256 | 239 |
</div> |
Updated documentation on OpenAIRE Research Graph dumps