GERBIL: General Entity Annotator Benchmarking Framework
We present GERBIL, an evaluation framework for semantic entity annotation. The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools on multiple datasets. By these means, we aim to ensure that both tool developers and end users can derive meaningful insights pertaining to the extension, integration and use of annotation applications. In particular, GERBIL provides comparable results to tool developers so as to allow them to easily discover the strengths and weaknesses of their implementations with respect to the state of the art. With the permanent experiment URIs provided by our framework, we ensure the reproducibility and archiving of evaluation results. Moreover, the framework generates data in machine-processable format, allowing for the efficient querying and post-processing of evaluation results. Finally, the tool diagnostics provided by GERBIL allow deriving insights pertaining to the areas in which tools should be further refined, thus allowing developers to create an informed agenda for extensions and end users to detect the right tools for their purposes. GERBIL aims to become a focal point for the state of the art, driving the research agenda of the community by presenting comparable objective evaluation results.
MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach
Entity linking has recently been the subject of a significant body of
research. Currently, the best performing approaches rely on trained
mono-lingual models. Porting these approaches to other languages is
consequently a difficult endeavor as it requires corresponding training data
and retraining of the models. We address this drawback by presenting a novel
multilingual, knowledge-base agnostic and deterministic approach to entity
linking, dubbed MAG. MAG is based on a combination of context-based retrieval
on structured knowledge bases and graph algorithms. We evaluate MAG on 23 data
sets and in 7 languages. Our results show that the best approach trained on
English datasets (PBOH) achieves a micro F-measure that is up to 4 times worse
on datasets in other languages. MAG, on the other hand, achieves
state-of-the-art performance on English datasets and reaches a micro F-measure
that is up to 0.6 higher than that of PBOH on non-English languages. Comment: Accepted in K-CAP 2017: Knowledge Capture Conference
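The abstract above reports results as micro F-measure. As a minimal sketch of what that metric means, the following computes micro- and macro-averaged F1 from per-document counts of true positives, false positives and false negatives. The function names and the example counts are illustrative assumptions, not taken from the MAG paper or from any benchmarking tool.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision or recall is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_f1(counts):
    """Pool the counts over all documents first, then compute a single F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)


def macro_f1(counts):
    """Compute F1 per document, then average the per-document scores."""
    return sum(f1(*c) for c in counts) / len(counts)


# Hypothetical (tp, fp, fn) counts for two documents, for illustration only.
example_counts = [(8, 2, 2), (1, 3, 3)]
micro = micro_f1(example_counts)
macro = macro_f1(example_counts)
```

Because the micro average pools counts before scoring, documents with many annotations weigh more heavily than in the macro average, which treats every document equally.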
The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
We introduce the STEM (Science, Technology, Engineering, and Medicine)
Dataset for Scientific Entity Extraction, Classification, and Resolution,
version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to
provide a benchmark for the evaluation of scientific entity extraction,
classification, and resolution tasks in a domain-independent fashion. It
comprises abstracts in 10 STEM disciplines that were found to be the most
prolific ones on a major publishing platform. We describe the creation of such
a multidisciplinary corpus and highlight the obtained findings in terms of the
following features: 1) a generic conceptual formalism for scientific entities
in a multidisciplinary scientific context; 2) the feasibility of the
domain-independent human annotation of scientific entities under such a generic
formalism; 3) a performance benchmark obtainable for automatic extraction of
multidisciplinary scientific entities using BERT-based neural models; 4) a
delineated 3-step entity resolution procedure for human annotation of the
scientific entities via encyclopedic entity linking and lexicographic word
sense disambiguation; and 5) human evaluations of Babelfy-returned encyclopedic
links and lexicographic senses for our entities. Our findings cumulatively
indicate that human annotation and automatic learning of multidisciplinary
scientific concepts, as well as their semantic disambiguation, are reasonable
in a setting as wide-ranging as STEM. Comment: Published in LREC 2020. Publication URL
https://www.aclweb.org/anthology/2020.lrec-1.268/; Dataset DOI
https://doi.org/10.25835/001754
Entities as topic labels: combining entity linking and labeled LDA to improve topic interpretability and evaluability
Digital humanities scholars strongly need a corpus exploration method that provides topics easier
to interpret than standard LDA topic models. To move towards this goal, here we propose a
combination of two techniques, Entity Linking and Labeled LDA. Our method identifies,
in an ontology, a series of descriptive labels for each document in a corpus. It then generates a
specific topic for each label. Having a direct relation between topics and labels makes interpretation
easier; using an ontology as background knowledge limits label ambiguity. As our topics are
described with a limited number of clear-cut labels, they promote interpretability and support
the quantitative evaluation of the obtained results. We illustrate the potential of the approach by
applying it to three datasets, namely the transcription of speeches from the European Parliament
fifth mandate, the Enron Corpus and the Hillary Clinton Email Dataset. While some of these
resources have already been adopted by the natural language processing community, they still
hold a large potential for humanities scholars, part of which could be exploited in studies that
will adopt the fine-grained exploration method presented in this paper.
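The core idea described above, one topic per label with each document limited to its own labels, can be sketched as a small collapsed Gibbs sampler. This is an illustrative assumption about how such a constraint might be implemented, not the authors' code: the function name, the hyperparameters `alpha` and `beta`, and the toy documents are hypothetical, and the labels are supplied by hand here rather than produced by an entity linker. The sketch also assumes every document has at least one label.

```python
import numpy as np


def labeled_lda_gibbs(docs, doc_labels, n_iters=50, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token lists; doc_labels: list of label lists, one per doc."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for doc in docs for w in doc})
    labels = sorted({l for ls in doc_labels for l in ls})
    w_id = {w: i for i, w in enumerate(vocab)}
    l_id = {l: i for i, l in enumerate(labels)}
    V, K = len(vocab), len(labels)

    n_dk = np.zeros((len(docs), K))  # tokens in doc d assigned to topic k
    n_kw = np.zeros((K, V))          # times word w is assigned to topic k
    n_k = np.zeros(K)                # total tokens assigned to topic k
    # Each document may only use the topics that correspond to its labels.
    allowed = [np.array(sorted({l_id[l] for l in ls})) for ls in doc_labels]

    # Random initialisation, restricted to each document's allowed topics.
    z = []
    for d, doc in enumerate(docs):
        zs = rng.choice(allowed[d], size=len(doc))
        z.append(zs)
        for w, k in zip(doc, zs):
            n_dk[d, k] += 1
            n_kw[k, w_id[w]] += 1
            n_k[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            ks = allowed[d]
            for i, w in enumerate(doc):
                v, k = w_id[w], z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d, k] -= 1
                n_kw[k, v] -= 1
                n_k[k] -= 1
                # Conditional distribution over the allowed topics only.
                p = (n_dk[d, ks] + alpha) * (n_kw[ks, v] + beta) / (n_k[ks] + V * beta)
                k = rng.choice(ks, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, v] += 1
                n_k[k] += 1
    return labels, vocab, n_kw, z


# Toy corpus with hand-assigned labels (hypothetical, for illustration only).
toy_docs = [["budget", "vote", "tax"], ["gas", "pipeline", "tax"], ["vote", "gas"]]
toy_labels = [["Parliament"], ["Energy"], ["Parliament", "Energy"]]
labels, vocab, topic_word, assignments = labeled_lda_gibbs(toy_docs, toy_labels)
```

Each row of `topic_word` holds the word counts for the topic named by the corresponding entry of `labels`, which is what makes the resulting topics directly readable by their label.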