All that glitters...: Interannotator agreement in natural language processing
Evaluation has emerged as a central concern in natural language processing (NLP) over the last few decades. Evaluation is done against a gold standard, a manually linguistically annotated dataset, which is assumed to provide the ground truth against which the accuracy of the NLP system can be assessed automatically. In this article, some methodological questions in connection with the creation of gold standard datasets are discussed, in particular the (non-)expectations of linguistic expertise in annotators, and the interannotator agreement measure that is standardly, but often unreflectively, used as a kind of quality index of NLP gold standards.
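The agreement measure the abstract refers to is typically a chance-corrected coefficient such as Cohen's kappa. As a minimal sketch (a toy two-annotator setup, not code from the article), kappa compares observed agreement against the agreement expected by chance from each annotator's label distribution:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's marginal label counts.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical POS annotations from two annotators over six tokens.
ann1 = ["NN", "VB", "NN", "JJ", "NN", "VB"]
ann2 = ["NN", "VB", "JJ", "JJ", "NN", "NN"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.478
```

A kappa of 1.0 means perfect agreement; values well below that signal that raw percentage agreement overstates the reliability of the gold standard, which is exactly the kind of unreflective use the article questions.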
Increasing the Recall of Corpus Annotation Error Detection
Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories.
Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler.
NEALT Proceedings Series, Vol. 1 (2007), 19-30.
© 2007 The editors and contributors.
Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt.
Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/4476
LAUDATIO-Repository: Accessing a heterogeneous field of linguistic corpora with the help of an open access repository
Open access to digital historical research data for historical linguistics enables a fruitful exchange of research sources and research methods. To achieve this goal, the LAUDATIO-Repository provides long-term open access to historical corpus linguistic data. In developing the LAUDATIO-Repository, we also want to explore how to build repositories that are useful for a set of well-defined communities but are also flexible enough to be used and extended to serve other communities not considered beforehand. Considering the user community's needs requires a clear understanding of the community's user scenarios and research
Building and querying parallel treebanks
This paper describes our work on building a trilingual parallel treebank. We have annotated constituent structure trees from three text genres (a philosophy novel, economy reports and a technical user manual). Our parallel treebank includes word and phrase alignments. The alignment information was manually checked using a graphical tool that allows the annotator to view a pair of trees from parallel sentences. This tool comes with a powerful search facility that exceeds the expressiveness of previous popular treebank query engines.
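The word alignments the abstract mentions are commonly stored as index pairs linking source and target tokens. The following is a hypothetical minimal sketch of that representation, not the actual data model of the tool described above:

```python
def aligned_words(src, tgt, alignment):
    """Resolve a set of (src_index, tgt_index) pairs to word pairs.

    src, tgt: tokenized parallel sentences; alignment: set of index pairs.
    """
    return [(src[i], tgt[j]) for i, j in sorted(alignment)]

# Toy German-English sentence pair with a one-to-one alignment.
src = ["das", "Haus", "ist", "klein"]
tgt = ["the", "house", "is", "small"]
alignment = {(0, 0), (1, 1), (2, 2), (3, 3)}
print(aligned_words(src, tgt, alignment))
# → [('das', 'the'), ('Haus', 'house'), ('ist', 'is'), ('klein', 'small')]
```

Phrase alignments generalize this by pairing spans of nodes in the two constituent trees rather than single token indices.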
Graphical error mining for linguistically annotated corpora
Corpora contain linguistically annotated data. Producing these annotations is a complex process that easily leads to inconsistencies within the annotation. Since corpora are used to evaluate automatic language processing systems, the evaluation may suffer when there are too many errors within the data.
This thesis focuses on finding erroneous annotations within corpora. To detect sequence annotation errors within part-of-speech tags, we implemented the algorithm introduced by Dickinson and Meurers (2003). Additionally, for structured annotations we chose the approach of Boyd et al. (2008), which targets inconsistencies within dependency structures.
We designed and built a graphical user interface (GUI) that is easy to handle and user-friendly. Implementing state-of-the-art algorithms for error detection behind a user-friendly interface broadens their range of application, because the algorithms can be used by a wider audience without deep computing expertise. It provides even non-expert users with the capability to find inconsistent POS tags and dependency structures within a corpus. We evaluate the system using the German TIGER corpus and the English Penn Treebank. For the TIGER corpus we also perform a manual evaluation in which we sample 115 6-grams and check manually whether these contain errors. We find that 94.96% are erroneous and that a human can easily decide the correct tag. For a further 4.20% we can say that these are errors, but determining the correct tag is very difficult. In total we detect errors with a precision of 99.16%. Only one case (0.84%) is not caused by inconsistency but constitutes genuine ambiguity.
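The core idea behind the Dickinson and Meurers (2003) method referenced above is the variation n-gram: a word n-gram that recurs in the corpus with different tag assignments is a candidate annotation error (or a genuine ambiguity). A minimal sketch of that detection step, under a simplified representation of a tagged corpus, might look like:

```python
from collections import defaultdict

def variation_ngrams(tagged_sents, n=3):
    """Flag word n-grams that occur with more than one tag sequence.

    tagged_sents: list of sentences, each a list of (word, tag) pairs.
    Returns {word_ngram: set_of_tag_sequences} for all variations found.
    """
    seen = defaultdict(set)
    for sent in tagged_sents:
        words = [w for w, _ in sent]
        tags = [t for _, t in sent]
        for i in range(len(sent) - n + 1):
            seen[tuple(words[i : i + n])].add(tuple(tags[i : i + n]))
    # Identical word context, differing tags: candidate error or ambiguity.
    return {gram: tagsets for gram, tagsets in seen.items() if len(tagsets) > 1}

# Toy corpus: "work" is tagged VB in one occurrence and NN in the other.
corpus = [
    [("to", "TO"), ("work", "VB"), ("on", "IN")],
    [("to", "TO"), ("work", "NN"), ("on", "IN")],
]
print(sorted(variation_ngrams(corpus, n=3)))  # → [('to', 'work', 'on')]
```

The thesis's manual evaluation numbers reflect the final filtering step this sketch omits: each flagged n-gram must still be judged by a human as an error or a true ambiguity.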
Inducing grammars from linguistic universals and realistic amounts of supervision
The best performing NLP models to date are learned from large volumes of manually-annotated data. For tasks like part-of-speech tagging and grammatical parsing, high performance can be achieved with plentiful supervised data. However, such resources are extremely costly to produce, making them an unlikely option for building NLP tools in under-resourced languages or domains. This dissertation is concerned with reducing the annotation required to learn NLP models, with the goal of opening up the range of domains and languages to which NLP technologies may be applied. In this work, we explore the possibility of learning from a degree of supervision that is at or close to the amount that could reasonably be collected from annotators for a particular domain or language that currently has none. We show that just a small amount of annotation input — even that which can be collected in just a few hours — can provide enormous advantages if we have learning algorithms that can appropriately exploit it. This work presents new algorithms, models, and approaches designed to learn grammatical information from weak supervision. In particular, we look at ways of intersecting a variety of different forms of supervision in complementary ways, thus lowering the overall annotation burden. Sources of information include tag dictionaries, morphological analyzers, constituent bracketings, and partial tree annotations, as well as unannotated corpora. For example, we present algorithms that are able to combine faster-to-obtain type-level annotation with unannotated text to remove the need for slower-to-obtain token-level annotation. Much of this dissertation describes work on Combinatory Categorial Grammar (CCG), a grammatical formalism notable for its use of structured, logic-backed categories that describe how each word and constituent fits into the overall syntax of the sentence. 
This work shows how linguistic universals intrinsic to the CCG formalism itself can be encoded as Bayesian priors to improve learning.
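One of the cheaper forms of supervision the abstract mentions, a type-level tag dictionary, can be illustrated with a small sketch (a hypothetical helper, not the dissertation's algorithms): the dictionary restricts each known word to its licensed tags, shrinking the search space that token-level learning must resolve.

```python
def prune_with_tag_dict(sentence, tag_dict, all_tags):
    """Restrict each token's candidate tags via a type-level tag dictionary.

    Known words keep only their dictionary tags; unknown words keep the
    full tag set. This is type-level supervision: far cheaper to collect
    than annotating every token in context.
    """
    return [sorted(tag_dict.get(word, all_tags)) for word in sentence]

ALL_TAGS = {"DT", "NN", "VB", "JJ"}
TAG_DICT = {"the": {"DT"}, "dog": {"NN"}, "runs": {"VB", "NN"}}
print(prune_with_tag_dict(["the", "dog", "runs", "fast"], TAG_DICT, ALL_TAGS))
# → [['DT'], ['NN'], ['NN', 'VB'], ['DT', 'JJ', 'NN', 'VB']]
```

A weakly supervised learner then only has to disambiguate the remaining candidates (e.g. "runs" as NN vs. VB) from unannotated text, which is the kind of complementary combination of supervision sources the dissertation explores.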
Annotation, exploitation and evaluation of parallel corpora
Exchange between the translation studies and computational linguistics communities has traditionally not been very intense. Among other things, this is reflected in their different views on parallel corpora. While computational linguistics does not always strictly pay attention to the translation direction (e.g. when translation rules are extracted from (sub)corpora that actually consist only of translations), translation studies is, among other things, concerned with precisely comparing source and target texts (e.g. to draw conclusions on interference and standardization effects). However, there has recently been more exchange between the two fields, especially when it comes to the annotation of parallel corpora. This special issue brings together the different research perspectives. Its contributions show, from both perspectives, how the communities have come to interact in recent years.