
Metadata

Name
All Computer Science Papers @ arXiv.org -- A High-Quality Gold Standard for Citation-based Tasks
Repository
ZENODO
Identifier
doi:10.5281/zenodo.3535002
Description
We propose a newly-created gold standard data set for citation-based tasks. This gold standard is based on all computer science papers in arXiv.org.

Abstract. Analyzing and recommending citations with their specific citation contexts have recently received much attention due to the growing number of available publications. Although data sets such as CiteSeerX have been created for evaluating approaches for such tasks, those data sets exhibit striking defects. This is understandable if one considers that both information extraction and entity linking as well as entity resolution need to be performed. In this paper, we propose a new evaluation data set for citation-dependent tasks based on arXiv.org publications. Our data set is characterized by the fact that it exhibits almost zero noise in the extracted content and that all citations are linked to their correct publications. Besides the pure content, available on a sentence-basis, cited publications are annotated directly in the text via global identifiers. As far as possible, referenced publications are further linked to DBLP. Our data set consists of over 15M sentences and is freely available for research purposes. It can be used for training and testing citation-based tasks, such as recommending citations, determining the functions or importance of citations, and summarizing documents based on their citations.


More information can be found in our publication "A High-Quality Gold Standard for Citation-based Tasks" (LREC'18).

You can cite the data set as follows:

@inproceedings{DBLP:conf/lrec/0001TJ18,
author = {Michael F{\"{a}}rber and
Alexander Thiemann and
Adam Jatowt},
title = "{A High-Quality Gold Standard for Citation-based Tasks}",
booktitle = "{Proceedings of the Eleventh International Conference on Language Resources
and Evaluation}",
series = "{LREC'18}",
location = "{Miyazaki, Japan}",
year = {2018},
url = {http://www.lrec-conf.org/proceedings/lrec2018/summaries/283.html}
}


Data or Study Types
multiple
Source Organization
Unknown
Access Conditions
available
Year
2019
Access Hyperlink
https://doi.org/10.5281/zenodo.3535002

Distributions

  • Encoding Format: HTML ; URL: https://doi.org/10.5281/zenodo.3535002

This project was funded in part by grant U24AI117966 from the NIH National Institute of Allergy and Infectious Diseases as part of the Big Data to Knowledge program. We thank all members of the bioCADDIE community for their valuable input on the overall project.