Applying Reliability Metrics to Co-Reference Annotation
Studies of the contextual and linguistic factors that constrain discourse
phenomena such as reference are coming to depend increasingly on annotated
language corpora. In preparing the corpora, it is important to evaluate the
reliability of the annotation, but methods for doing so have not been readily
available. In this report, I present a method for computing reliability of
coreference annotation. First I review a method for applying the information
retrieval metrics of recall and precision to coreference annotation proposed by
Marc Vilain and his collaborators. I show how this method makes it possible to
construct contingency tables for computing Cohen's Kappa, a familiar
reliability metric. By comparing recall and precision to reliability on the
same data sets, I also show that recall and precision can be misleadingly high.
Because Kappa factors out chance agreement among coders, it is a preferable
measure for developing annotated corpora where no pre-existing target
annotation exists.
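To make the chance correction concrete, here is a minimal sketch of Cohen's Kappa for two coders. The labels and the reduction of coreference decisions to categorical judgments are hypothetical illustrations, not the paper's construction of contingency tables from the Vilain et al. alignment.

```python
# Minimal sketch of Cohen's Kappa for two coders (hypothetical data);
# Kappa subtracts the agreement expected by chance from observed agreement.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: proportion of items both coders label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected (chance) agreement from each coder's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Example: two coders assign each markable to a coreference-related category.
coder1 = ["chain1", "chain1", "chain2", "chain2", "chain3"]
coder2 = ["chain1", "chain2", "chain2", "chain2", "chain3"]
print(cohens_kappa(coder1, coder2))  # -> 0.6875 for this toy example
```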
Evaluating an Evaluation Method: The Pyramid Method Applied to 2003 Document Understanding Conference (DUC) Data
A pyramid evaluation dataset was created for DUC 2003 in order to compare results with DUC 2005, and to provide an independent test of the evaluation metric. The main differences between the DUC 2003 and 2005 datasets pertain to document length, cluster sizes, and model summary length. For five of the DUC 2003 document sets, two pyramids each were constructed by annotators working independently. Scores of the same peer using different pyramids were highly correlated. Sixteen systems were evaluated on eight document sets. Analysis of variance using Tukey's Honest Significant Difference method showed significant differences among all eight document sets, and more significant differences among the sixteen systems than for DUC 2005.
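The kind of pairwise comparison described can be sketched with standard tools. The data, system names, and group sizes below are invented for illustration and do not reproduce the paper's analysis.

```python
# Hypothetical sketch of pairwise comparisons with Tukey's Honest Significant
# Difference, in the spirit of the analysis described above (toy values only).
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Each row: one pyramid score for one system on one document set.
scores = pd.DataFrame({
    "system": ["sysA", "sysA", "sysB", "sysB", "sysC", "sysC"],
    "score":  [0.61, 0.58, 0.42, 0.47, 0.30, 0.35],
})

# All pairwise system comparisons at a family-wise error rate of 0.05.
result = pairwise_tukeyhsd(endog=scores["score"], groups=scores["system"], alpha=0.05)
print(result.summary())
```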
Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation
Annotation projects dealing with complex semantic or pragmatic phenomena face the dilemma of creating annotation schemes that oversimplify the phenomena, or that capture distinctions conventional reliability metrics cannot measure adequately. The solution to the dilemma is to develop metrics that quantify the decisions that annotators are asked to make. This paper discusses MASI, a distance metric for comparing sets, and illustrates its use in quantifying the reliability of a specific dataset. Annotations of Summary Content Units (SCUs) generate models referred to as pyramids which can be used to evaluate unseen human summaries or machine summaries. The paper presents reliability results for five pairs of pyramids created for document sets from the 2003 Document Understanding Conference (DUC). The annotators worked independently of each other. Differences between application of MASI to pyramid annotation and its previous application to co-reference annotation are discussed. In addition, it is argued that a paradigmatic reliability study should relate measures of inter-annotator agreement to independent assessments, such as significance tests of the annotated variables with respect to other phenomena. In effect, what counts as sufficiently reliable inter-annotator agreement depends on the use to which the annotated data will be put.
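As a rough sketch, the commonly cited formulation of MASI weights the Jaccard overlap of two sets by how monotonically one set relates to the other; the code below illustrates that idea with hypothetical SCU labels and is not taken from the paper.

```python
# Sketch of MASI as a similarity between two annotators' set-valued labels:
# Jaccard overlap weighted by a monotonicity factor.
def masi(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    jaccard = len(a & b) / len(a | b)
    if a == b:
        m = 1.0        # identical sets
    elif a <= b or b <= a:
        m = 2 / 3      # one set subsumes the other
    elif a & b:
        m = 1 / 3      # overlap, but neither subsumes the other
    else:
        m = 0.0        # disjoint sets
    return jaccard * m

# Two annotators assign a summary clause to overlapping sets of SCU labels.
print(masi({"scu1", "scu2"}, {"scu1", "scu2", "scu3"}))  # 2/3 * 2/3 ~ 0.44
```

For use with agreement coefficients that expect a distance rather than a similarity, one minus this value serves as the distance between the two label sets.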
Evaluating Content Selection in Summarization: The Pyramid Method
We present an empirically grounded method for evaluating content selection in summarization. It incorporates the idea that no single best model summary exists for a collection of documents. Our method quantifies the relative importance of facts to be conveyed. We argue that it is reliable, predictive, and diagnostic, and thus considerably improves on the human evaluation method currently used in the Document Understanding Conference.
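A minimal sketch of the kind of score a pyramid supports is given below, assuming SCU weights have already been derived from how many model summaries express each SCU; the function and data are illustrative, not taken from the paper.

```python
# Illustrative pyramid-style content score: the total weight of the SCUs a peer
# summary expresses, normalized by the best total attainable with the same
# number of SCUs (names and values are hypothetical).
def pyramid_score(peer_scus, scu_weights):
    observed = sum(scu_weights[s] for s in peer_scus)
    # An ideal summary of the same size would contain the highest-weight SCUs.
    top = sorted(scu_weights.values(), reverse=True)[:len(peer_scus)]
    return observed / sum(top) if top else 0.0

# SCU weight = number of model (human) summaries expressing that SCU.
weights = {"scu1": 4, "scu2": 3, "scu3": 2, "scu4": 1}
print(pyramid_score({"scu1", "scu3"}, weights))  # (4+2) / (4+3) ~ 0.86
```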
A WOz Variant with Contrastive Conditions
We present a variant of the WOz paradigm we refer to as incremental ablation. The new feature involves incrementally restricting the human wizard’s capacities in the direction of a dialog system. We lay out a data collection design with six conditions of user-system and user-wizard interactions that allows us to more precisely identify how to close the communication gap between humans and systems. We describe the application of the method to the analysis of contexts in which ASR errors occur, giving us a means to investigate the problem-solving strategies humans would resort to if their communication channel were restricted to be more like the machine’s. We describe how we can use the methodology to collect data that is more relevant to a particular learning paradigm involving Markov Decision Processes (MDPs).
Evaluating Content Selection in Human- or Machine-Generated Summaries: The Pyramid Scoring Method
From the outset of automated generation of summaries, the difficulty of evaluation has been widely discussed. Despite many promising attempts, we believe it remains an unsolved problem. Here we present a method for scoring the content of summaries of any length against a weighted inventory of content units, which we refer to as a pyramid. Our method is derived from empirical analysis of human-generated summaries, and provides an informative metric for human- or machine-generated summaries.
The Pyramid Method: Incorporating human content selection variation in summarization evaluation.
Human variation in content selection in summarization has given rise to some fundamental research questions: How can one incorporate the observed variation in suitable evaluation measures? How can such measures reflect the fact that summaries conveying different content can be equally good and informative? In this article, we address these very questions by proposing a method for analysis of multiple human abstracts into semantic content units. Such analysis allows us not only to quantify human variation in content selection, but also to assign empirical importance weights to different content units. It serves as the basis for an evaluation method, the Pyramid Method, that incorporates the observed variation and is predictive of different equally informative summaries. We discuss the reliability of content unit annotation, the properties of Pyramid scores, and their correlation with other evaluation methods.
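As a trivial illustration of how a correlation check between scoring methods might be run, the snippet below compares two score vectors for the same summaries; the values and the second metric are invented and are not the article's data.

```python
# Toy illustration of correlating pyramid scores with another metric's scores
# over the same set of summaries (invented values, for illustration only).
from scipy.stats import pearsonr, spearmanr

pyramid_scores = [0.82, 0.61, 0.55, 0.40, 0.33]
other_metric   = [0.75, 0.70, 0.50, 0.45, 0.30]

print(pearsonr(pyramid_scores, other_metric))   # linear correlation
print(spearmanr(pyramid_scores, other_metric))  # rank correlation
```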