12 research outputs found
unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata
In recent years, scholarly data sets have been used for various purposes, such as paper recommendation, citation recommendation, citation context analysis, and citation context-based document summarization. The evaluation of approaches to such tasks and their applicability in real-world scenarios heavily depend on the used data set. However, existing scholarly data sets are limited in several regards.
Here, we propose a new data set based on all publications from all scientific disciplines available on arXiv.org. In addition to the papers' plain text, we provide in-text citations annotated via global identifiers. Furthermore, citing and cited publications were linked to the Microsoft Academic Graph, providing access to rich metadata. Our data set consists of over one million documents and 29.2 million citation contexts. The data set, which is made freely available for research purposes, can not only enhance the future evaluation of research-paper-based and citation-context-based approaches but also serve as a basis for new ways to analyze in-text citations.
See https://github.com/IllDepence/unarXive for the source code which has been used for creating the data set.
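To illustrate how annotated citation contexts like those in the data set can be consumed, the sketch below extracts a fixed-size text window around each in-text citation marker. The `{{cite:…}}` marker syntax used here is a hypothetical placeholder; the actual unarXive annotation format is defined in the repository linked above.

```python
import re

# Hypothetical marker format for illustration only; the real unarXive
# annotation syntax may differ.
CITE_MARKER = re.compile(r"\{\{cite:([^}]+)\}\}")

def citation_contexts(text, window=100):
    """Return (citation_id, surrounding_text) pairs for each in-text citation.

    `window` is the number of characters kept on each side of the marker,
    a simple stand-in for sentence- or paragraph-based context extraction.
    """
    contexts = []
    for m in CITE_MARKER.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        contexts.append((m.group(1), text[start:end]))
    return contexts
```

In practice one would segment on sentence boundaries rather than raw character offsets, but the character window keeps the sketch dependency-free.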
To cite our data set, and for further information, please refer to our journal article:
Tarek Saier, Michael Färber: "unarXive: A Large Scholarly Data Set with Publications’ Full-Text, Annotated In-Text Citations, and Links to Metadata", Scientometrics, 2020, http://dx.doi.org/10.1007/s11192-020-03382-z
Cross-lingual citations in English papers: a large-scale analysis of prevalence, usage, and impact
Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-English publications are often not included in data sets, or that language metadata is not available. Because of this, citations between publications of differing languages (cross-lingual citations) have only been studied to a very limited degree. In this paper, we present an analysis of cross-lingual citations based on over one million English papers, spanning three scientific disciplines and three decades. Our investigation covers differences between cited languages and disciplines, trends over time, and the usage characteristics as well as the impact of cross-lingual citations. Among our findings are an increasing rate of citations to publications written in Chinese, citations being primarily to local non-English languages, and consistency in citation intent between cross- and monolingual citations. To facilitate further research, we make our collected data and source code publicly available.
CoCon: A Data Set on Combined Contextualized Research Artifact Use
In the wake of information overload in academia, methodologies and systems for search, recommendation, and prediction to aid researchers in identifying relevant research are actively studied and developed. Existing work, however, is limited in terms of granularity, focusing only on the level of papers or on a single type of artifact, such as data sets. To enable more holistic analyses and systems dealing with academic publications and their content, we propose CoCon, a large scholarly data set reflecting the combined use of research artifacts, contextualized in academic publications' full-text. Our data set comprises 35k artifacts (data sets, methods, models, and tasks) and 340k publications. We additionally formalize a link prediction task for "combined research artifact use prediction" and provide code to facilitate analyses of our data and the development of ML applications on it. All data and code is publicly available at https://github.com/IllDepence/contextgraph.
Comment: submitted to JCDL202
How does author affiliation affect preprint citation count? Analyzing citation bias at the institution and country level
Citing is an important aspect of scientific discourse and is important for quantifying the scientific impact of researchers. Previous works observed that citations are made not only based on pure scholarly contributions but also based on non-scholarly attributes, such as the affiliation or gender of authors. In this way, citation bias is produced. Existing works, however, have not analyzed preprints with respect to citation bias, although they play an increasingly important role in modern scholarly communication. In this paper, we investigate whether preprints are affected by citation bias with respect to author affiliation. We measure citation bias for bioRxiv preprints and their publisher versions at the institution level and the country level, using the Lorenz curve and Gini coefficient. This allows us to mitigate the effects of confounding factors and see whether or not citation biases related to author affiliation have an increased effect on preprint citations. We observe consistently higher Gini coefficients for preprints than for publisher versions. Thus, we can confirm that citation bias exists and that it is more severe in the case of preprints. As preprints are on the rise, affiliation-based citation bias is, thus, an important topic not only for authors (e.g., when deciding what to cite), but also for people and institutions that use citations for scientific impact quantification (e.g., funding agencies deciding about funding based on citation counts).
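The Gini coefficient used in the analysis above measures how unevenly a quantity (here, citations) is distributed across groups (institutions or countries): 0 means perfect equality, values near 1 mean concentration in a few groups. A minimal sketch, computing it from a list of per-group citation counts:

```python
def gini(values):
    """Gini coefficient of non-negative counts (0 = perfect equality).

    Uses the standard closed form over the sorted values:
    G = sum_i (2i - n - 1) * x_i / (n * sum(x)), for i = 1..n.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, 1)) / (n * total)
```

For example, four institutions with equal citation counts yield a coefficient of 0, while all citations going to a single institution out of four yields 0.75 (the maximum for n = 4 under this formula).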
Sequence Labeling for Citation Field Extraction from Cyrillic Script References
Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While approaches, tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than English and scripts other than the Latin alphabet. A significant portion of the scientific literature that is thereby excluded consists of publications written in Cyrillic script languages. To address this problem, we introduce a new multilingual and multidisciplinary data set of over 100,000 labeled reference strings. The data set covers multiple Cyrillic languages and contains over 700 manually labeled references, while the rest are generated synthetically. With random samples of varying size of this data, we train multiple well-performing sequence labeling BERT models and thus show the usability of our proposed data set. To this end, we showcase an implementation of a multilingual BERT model trained on the synthetic data and evaluated on the manually labeled references. Our model achieves an F1 score of 0.93 and thereby significantly outperforms a state-of-the-art model we retrain and evaluate on our data.
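Citation field extraction as described above is typically framed as token-level sequence labeling with BIO tags. The helper below sketches how field-annotated reference tokens can be converted to BIO tags for model training; the field names (`AUTHOR`, `YEAR`) are illustrative and need not match the data set's actual label scheme.

```python
def to_bio(labeled_tokens):
    """Convert (token, field) pairs to BIO tags.

    A field of None means the token lies outside any labeled span ("O");
    the first token of each span gets "B-<field>", continuations "I-<field>".
    """
    tags = []
    prev = None
    for _token, field in labeled_tokens:
        if field is None:
            tags.append("O")
            prev = None
        elif field == prev:
            tags.append(f"I-{field}")
        else:
            tags.append(f"B-{field}")
            prev = field
    return tags
```

These tags would then be aligned to BERT subword tokens and fed to a token classification head; that alignment step is model-specific and omitted here.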
unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network
Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Especially data sets derived from publications' full-text have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time coverage, citation network completeness, and representation of full-text content. To address these points, we propose a new version of the data set unarXive. We base our data processing pipeline and output format on two existing data sets, and improve on each of them. Our resulting data set comprises 1.9M publications spanning multiple disciplines and 32 years. It furthermore has a more complete citation network than its predecessors and retains a richer representation of document structure as well as non-textual publication content such as mathematical notation. In addition to the data set, we provide ready-to-use training/test data for citation recommendation and IMRaD classification. All data and source code is publicly available at https://github.com/IllDepence/unarXive.
Comment: submitted to JCDL202