Theory and Practice of Data Citation
Citations are the cornerstone of knowledge propagation and the primary means
of assessing the quality of research, as well as directing investments in
science. Science is increasingly becoming "data-intensive", where large volumes
of data are collected and analyzed to discover complex patterns through
simulations and experiments, and most scientific reference works have been
replaced by online curated datasets. Yet, given a dataset, there is no
quantitative, consistent and established way of knowing how it has been used
over time, who contributed to its curation, what results have been yielded or
what value it has.
The development of a theory and practice of data citation is fundamental for
considering data as first-class research objects with the same relevance and
centrality as traditional scientific products. Many works in recent years have
discussed data citation from different viewpoints: illustrating why data
citation is needed, defining the principles and outlining recommendations for
data citation systems, and providing computational methods for addressing
specific issues of data citation.
The current panorama is many-faceted and an overall view that brings together
diverse aspects of this topic is still missing. Therefore, this paper aims to
describe the lay of the land for data citation, both from the theoretical (the
why and what) and the practical (the how) angle.
Comment: 24 pages, 2 tables, pre-print accepted in Journal of the Association
for Information Science and Technology (JASIST), 201
Forecasting the Spreading of Technologies in Research Communities
Technologies such as algorithms, applications and formats are an important part of the knowledge produced and reused in the research process. Typically, a technology is expected to originate in the context of a research area and then spread and contribute to several other fields. For example, Semantic Web technologies have been successfully adopted by a variety of fields, e.g., Information Retrieval, Human Computer Interaction, Biology, and many others. Unfortunately, the spreading of technologies across research areas may be a slow and inefficient process, since it is easy for researchers to be unaware of potentially relevant solutions produced by other research communities. In this paper, we hypothesise that it is possible to learn typical technology propagation patterns from historical data and to exploit this knowledge i) to anticipate where a technology may be adopted next and ii) to alert relevant stakeholders about emerging and relevant technologies in other fields. To do so, we propose the Technology-Topic Framework, a novel approach which uses a semantically enhanced technology-topic model to forecast the propagation of technologies to research areas. A formal evaluation of the approach on a set of technologies in the Semantic Web and Artificial Intelligence areas has produced excellent results, confirming the validity of our solution.
SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications
We describe the SemEval task of extracting keyphrases and relations between
them from scientific documents, which is crucial for understanding which
publications describe which processes, tasks and materials. Although this was a
new task, we had a total of 26 submissions across 3 evaluation scenarios. We
expect the task and the findings reported in this paper to be relevant for
researchers working on understanding scientific content, as well as the broader
knowledge base population and information extraction communities.
An Analysis of Publication Venues for Automatic Differentiation Research
We present the results of our analysis of publication venues for papers on
automatic differentiation (AD), covering academic journals and conference
proceedings. Our data are collected from the AD publications database
maintained by the autodiff.org community website. The database is purpose-built
for the AD field and is expanding via submissions by AD researchers. Therefore,
it provides a relatively noise-free list of publications relating to the field.
However, it does include noise in the form of variant spellings of journal and
conference names. We handle this by manually correcting and merging these
variants under the official names of corresponding venues. We also share the
raw data we get after these corrections.
Comment: 6 pages, 3 figures
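The merging of variant venue spellings described in this abstract can be illustrated with a minimal Python sketch. It assumes a hand-curated alias table; the table contents, the function names `normalize_venue` and `count_venues`, and the sample spellings are hypothetical and are not taken from the autodiff.org database or from the authors' own procedure.

```python
from collections import Counter

# Hypothetical hand-curated table: variant spelling (lowercased, with
# whitespace collapsed) -> official venue name. Illustrative entries only.
VENUE_ALIASES = {
    "acm trans. math. softw.": "ACM Transactions on Mathematical Software",
    "acm transactions on mathematical software": "ACM Transactions on Mathematical Software",
    "optim. methods softw.": "Optimization Methods and Software",
}


def normalize_venue(raw: str) -> str:
    """Map a raw venue string to its official name if a known alias matches."""
    key = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    return VENUE_ALIASES.get(key, raw.strip())


def count_venues(raw_names):
    """Count publications per venue after merging known variants."""
    return Counter(normalize_venue(name) for name in raw_names)
```

Names without a table entry pass through unchanged, so any remaining variants stay visible in the counts and can be added to the table by hand.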
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.
Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
"Seed+Expand": A validated methodology for creating high quality publication oeuvres of individual researchers
The study of science at the individual micro-level frequently requires the
disambiguation of author names. The creation of authors' publication oeuvres
involves matching the list of unique author names to names used in publication
databases. Despite recent progress in the development of unique author
identifiers, e.g., ORCID, VIVO, or DAI, author disambiguation remains a key
problem when it comes to large-scale bibliometric analysis using data from
multiple databases. This study introduces and validates a new methodology
called seed+expand for semi-automatic bibliographic data collection for a given
set of individual authors. Specifically, we identify the oeuvre of a set of
Dutch full professors during the period 1980-2011. In particular, we combine
author records from the National Research Information System (NARCIS) with
publication records from the Web of Science. Starting with an initial list of
8,378 names, we identify "seed publications" for each author using five
different approaches. Subsequently, we "expand" the set of publications using three
different approaches. The different approaches are compared and resulting
oeuvres are evaluated on precision and recall using a "gold standard" dataset
of authors for which verified publications in the period 2001-2010 are
available.
Comment: Paper accepted for the ISSI 2013, small changes in the text due to
referee comments, one figure added (Fig 3)
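The precision and recall evaluation mentioned in this abstract can be sketched as follows. This is a generic set-based computation, assuming each publication is represented by a unique identifier; the function name `precision_recall` and the identifiers in the usage example are hypothetical, and the sketch does not reproduce the seed+expand procedure itself.

```python
def precision_recall(retrieved, gold):
    """Return (precision, recall) of a retrieved oeuvre against a gold standard.

    Both arguments are iterables of publication identifiers. Empty inputs
    yield 0.0 for the corresponding measure instead of dividing by zero.
    """
    retrieved, gold = set(retrieved), set(gold)
    true_positives = len(retrieved & gold)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, `precision_recall({"p1", "p2", "p3", "p4"}, {"p1", "p2", "p5"})` gives a precision of 0.5 (two of four retrieved publications are verified) and a recall of 2/3 (two of three verified publications are retrieved).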
Scientific Information Extraction with Semi-supervised Neural Tagging
This paper addresses the problem of extracting keyphrases from scientific
articles and categorizing them as corresponding to a task, process, or
material. We cast the problem as sequence tagging and introduce semi-supervised
methods to a neural tagging model, which builds on recent advances in named
entity recognition. Since annotated training data is scarce in this domain, we
introduce a graph-based semi-supervised algorithm together with a data
selection scheme to leverage unannotated articles. Both inductive and
transductive semi-supervised learning strategies outperform state-of-the-art
information extraction performance on the 2017 SemEval Task 10 ScienceIE task.
Comment: accepted by EMNLP 201
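Casting keyphrase extraction as sequence tagging means assigning one label to every token. The sketch below assumes a BIO encoding, which is a common choice but is not stated in the abstract; the function name `spans_to_bio` and the example sentence are hypothetical, and no part of the neural model or the semi-supervised algorithm is reproduced here.

```python
def spans_to_bio(tokens, spans):
    """Convert labeled keyphrase spans into per-token BIO tags.

    `spans` holds (start, end, label) triples with token indices and an
    exclusive end. Tokens outside every span are tagged "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"           # first token of the keyphrase
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # remaining tokens of the keyphrase
    return tags


tokens = ["We", "train", "a", "neural", "tagger"]
tags = spans_to_bio(tokens, [(3, 5, "Process")])
```

Here `tags` pairs "neural" with `B-Process` and "tagger" with `I-Process`, while the other tokens receive `O`. A tagger trained on such sequences predicts the keyphrase boundaries and the category (task, process, or material) together.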