GERBIL: General Entity Annotator Benchmarking Framework

Author: Baron Ciro
Both Andreas
Brümmer Martin
Ceccarelli Diego
Cherix Didier
Cornolti Marco
Eickmann Bernd
Ferragina Paolo
Lemke Christiane
Moro Andrea
Navigli Roberto
Ngonga Ngomo Axel Cyrille
Piccinno Francesco
Rizzo Giuseppe
Röder Michael
Sack Harald
Speck René
Troncy Raphaël
Usbeck Ricardo
Waitelonis Jörg
Wesemann Lars
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2015

We present GERBIL, an evaluation framework for semantic entity annotation. The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools on multiple datasets. By these means, we aim to ensure that both tool developers and end users can derive meaningful insights pertaining to the extension, integration and use of annotation applications. In particular, GERBIL provides comparable results to tool developers so as to allow them to easily discover the strengths and weaknesses of their implementations with respect to the state of the art. With the permanent experiment URIs provided by our framework, we ensure the reproducibility and archiving of evaluation results. Moreover, the framework generates data in machine-processable format, allowing for the efficient querying and post-processing of evaluation results. Finally, the tool diagnostics provided by GERBIL allow deriving insights pertaining to the areas in which tools should be further refined, thus allowing developers to create an informed agenda for extensions and end users to detect the right tools for their purposes. GERBIL aims to become a focal point for the state of the art, driving the research agenda of the community by presenting comparable objective evaluation results.

MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach

Author: Bryl Volha
Brümmer Martin
Consoli Sergio
Cucerzan Silviu
Devi Pooja
Ferreira Thiago Castro
Hoffart Johannes
Juan
Luo Gang
Nuzzolese Andrea-Giovanni
Röder Michael
Steinmetz Nadine
van Erp Marieke
Zhang Lei
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 17/10/2017

Entity linking has recently been the subject of a significant body of research. Currently, the best performing approaches rely on trained mono-lingual models. Porting these approaches to other languages is consequently a difficult endeavor as it requires corresponding training data and retraining of the models. We address this drawback by presenting a novel multilingual, knowledge-base agnostic and deterministic approach to entity linking, dubbed MAG. MAG is based on a combination of context-based retrieval on structured knowledge bases and graph algorithms. We evaluate MAG on 23 data sets and in 7 languages. Our results show that the best approach trained on English datasets (PBOH) achieves a micro F-measure that is up to 4 times worse on datasets in other languages. MAG, on the other hand, achieves state-of-the-art performance on English datasets and reaches a micro F-measure that is up to 0.6 higher than that of PBOH on non-English languages. Comment: Accepted in K-CAP 2017: Knowledge Capture Conference
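
The micro F-measure cited in the abstract above pools true positives, false positives and false negatives over all documents before computing precision and recall. The following is a minimal sketch of that calculation, assuming gold and predicted annotations are available as one set of (mention, entity) pairs per document. The function name `micro_f_measure`, the pair representation and the sample data are hypothetical illustrations and are not taken from MAG, PBOH or GERBIL.

```python
from typing import List, Set, Tuple

# Hypothetical representation: one (mention, entity identifier) pair per annotation.
Annotation = Tuple[str, str]


def micro_f_measure(gold_docs: List[Set[Annotation]],
                    pred_docs: List[Set[Annotation]]) -> float:
    """Pool counts over all documents, then compute a single F-measure."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)   # annotations that match the gold standard
        fp += len(pred - gold)   # predicted annotations not in the gold standard
        fn += len(gold - pred)   # gold annotations the system missed
    if tp == 0:
        return 0.0               # avoids division by zero when nothing matches
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: two documents, one wrong link in the first.
gold = [{("Paris", "ex:Paris"), ("Seine", "ex:Seine")}, {("Berlin", "ex:Berlin")}]
pred = [{("Paris", "ex:Paris"), ("Seine", "ex:Seine_River_Cruise")}, {("Berlin", "ex:Berlin")}]
score = micro_f_measure(gold, pred)
```

Because the counts are pooled, documents with many annotations weigh more heavily than short ones; a macro F-measure would instead average per-document scores.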

arXiv.org e-Print Archive

KORE 50^DYWC: An Evaluation Data Set for Entity Linking Based on DBpedia, YAGO, Wikidata, and Crunchbase

Author: Färber Michael
Mix Rico
Noullet Kristian
Publication venue: European Language Resources Association
Publication date: 07/12/2020

The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources

Author: Auer Sören
Brack Arthur
D'Souza Jennifer
Ewerth Ralph
Hoppe Anett
Jaradeh Mohamad Yaser
Publication date: 01/01/2020

We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. We describe the creation of such a multidisciplinary corpus and highlight the obtained findings in terms of the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of the domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark obtainable for automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated 3-step entity resolution procedure for human annotation of the scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of Babelfy-returned encyclopedic links and lexicographic senses for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts, as well as their semantic disambiguation, are reasonable in a setting as wide-ranging as STEM. Comment: Published in LREC 2020. Publication URL https://www.aclweb.org/anthology/2020.lrec-1.268/; Dataset DOI https://doi.org/10.25835/001754

arXiv.org e-Print Archive

Repositorium für Naturwissenschaften und Technik

Entities as Topic Labels: Combining Entity Linking and Labeled LDA to Improve Topic Interpretability and Evaluability

Author: Lauscher Anne
Nanni Federico
Ponzetto Simone Paolo
Ruiz Fabo Pablo
Publication venue: Accademia University Press
Publication date: 01/01/2016

Digital humanities scholars strongly need a corpus exploration method that provides topics that are easier to interpret than standard LDA topic models. To move towards this goal, we propose a combination of two techniques, Entity Linking and Labeled LDA. Our method identifies in an ontology a series of descriptive labels for each document in a corpus. Then it generates a specific topic for each label. Having a direct relation between topics and labels makes interpretation easier; using an ontology as background knowledge limits label ambiguity. As our topics are described with a limited number of clear-cut labels, they promote interpretability and support the quantitative evaluation of the obtained results. We illustrate the potential of the approach by applying it to three datasets, namely the transcription of speeches from the European Parliament fifth mandate, the Enron Corpus and the Hillary Clinton Email Dataset. While some of these resources have already been adopted by the natural language processing community, they still hold a large potential for humanities scholars, part of which could be exploited in studies that will adopt the fine-grained exploration method presented in this paper.

MAnnheim DOCument Server