Applying Reliability Metrics to Co-Reference Annotation
Studies of the contextual and linguistic factors that constrain discourse
phenomena such as reference are coming to depend increasingly on annotated
language corpora. In preparing the corpora, it is important to evaluate the
reliability of the annotation, but methods for doing so have not been readily
available. In this report, I present a method for computing reliability of
coreference annotation. First I review a method for applying the information
retrieval metrics of recall and precision to coreference annotation proposed by
Marc Vilain and his collaborators. I show how this method makes it possible to
construct contingency tables for computing Cohen's Kappa, a familiar
reliability metric. By comparing recall and precision to reliability on the
same data sets, I also show that recall and precision can be misleadingly high.
Because Kappa factors out chance agreement among coders, it is a preferable
measure for developing annotated corpora where no pre-existing target
annotation exists.
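To make the chance correction concrete, here is a minimal sketch of Cohen's Kappa for two coders. The labels and the reduction of coreference decisions to categorical judgments are hypothetical illustrations, not the paper's construction of contingency tables from the Vilain et al. alignment.

```python
# Minimal sketch of Cohen's Kappa for two coders (hypothetical data);
# Kappa subtracts the agreement expected by chance from observed agreement.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: proportion of items both coders label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected (chance) agreement from each coder's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Example: two coders assign each markable to a coreference-related category.
coder1 = ["chain1", "chain1", "chain2", "chain2", "chain3"]
coder2 = ["chain1", "chain2", "chain2", "chain2", "chain3"]
print(cohens_kappa(coder1, coder2))  # -> 0.6875 for this toy example
```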
Evaluating an Evaluation Method: The Pyramid Method Applied to 2003 Document Understanding Conference (DUC) Data
A pyramid evaluation dataset was created for DUC 2003 in order to compare results with DUC 2005, and to provide an independent test of the evaluation metric. The main differences between the DUC 2003 and 2005 datasets pertain to document length, cluster sizes, and model summary length. For five of the DUC 2003 document sets, two pyramids each were constructed by annotators working independently. Scores of the same peer using different pyramids were highly correlated. Sixteen systems were evaluated on eight document sets. Analysis of variance using Tukey's Honest Significant Difference method showed significant differences among all eight document sets, and more significant differences among the sixteen systems than for DUC 2005.
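The kind of pairwise comparison described can be sketched with standard tools. The data, system names, and group sizes below are invented for illustration and do not reproduce the paper's analysis.

```python
# Hypothetical sketch of pairwise comparisons with Tukey's Honest Significant
# Difference, in the spirit of the analysis described above (toy values only).
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Each row: one pyramid score for one system on one document set.
scores = pd.DataFrame({
    "system": ["sysA", "sysA", "sysB", "sysB", "sysC", "sysC"],
    "score":  [0.61, 0.58, 0.42, 0.47, 0.30, 0.35],
})

# All pairwise system comparisons at a family-wise error rate of 0.05.
result = pairwise_tukeyhsd(endog=scores["score"], groups=scores["system"], alpha=0.05)
print(result.summary())
```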
Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation
Annotation projects dealing with complex semantic or pragmatic phenomena face the dilemma of creating annotation schemes that oversimplify the phenomena, or that capture distinctions conventional reliability metrics cannot measure adequately. The solution to the dilemma is to develop metrics that quantify the decisions that annotators are asked to make. This paper discusses MASI, a distance metric for comparing sets, and illustrates its use in quantifying the reliability of a specific dataset. Annotations of Summary Content Units (SCUs) generate models referred to as pyramids which can be used to evaluate unseen human summaries or machine summaries. The paper presents reliability results for five pairs of pyramids created for document sets from the 2003 Document Understanding Conference (DUC). The annotators worked independently of each other. Differences between application of MASI to pyramid annotation and its previous application to co-reference annotation are discussed. In addition, it is argued that a paradigmatic reliability study should relate measures of inter-annotator agreement to independent assessments, such as significance tests of the annotated variables with respect to other phenomena. In effect, what counts as sufficiently reliable inter-annotator agreement depends on the use to which the annotated data will be put.
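As a rough sketch, the commonly cited formulation of MASI weights the Jaccard overlap of two sets by how monotonically one set relates to the other; the code below illustrates that idea with hypothetical SCU labels and is not taken from the paper.

```python
# Sketch of MASI as a similarity between two annotators' set-valued labels:
# Jaccard overlap weighted by a monotonicity factor.
def masi(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    jaccard = len(a & b) / len(a | b)
    if a == b:
        m = 1.0        # identical sets
    elif a <= b or b <= a:
        m = 2 / 3      # one set subsumes the other
    elif a & b:
        m = 1 / 3      # overlap, but neither subsumes the other
    else:
        m = 0.0        # disjoint sets
    return jaccard * m

# Two annotators assign a summary clause to overlapping sets of SCU labels.
print(masi({"scu1", "scu2"}, {"scu1", "scu2", "scu3"}))  # 2/3 * 2/3 ~ 0.44
```

For use with agreement coefficients that expect a distance rather than a similarity, one minus this value serves as the distance between the two label sets.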
Evaluating Content Selection in Summarization: The Pyramid Method
We present an empirically grounded method for evaluating content selection in summarization. It incorporates the idea that no single best model summary exists for a collection of documents. Our method quantifies the relative importance of facts to be conveyed. We argue that it is reliable, predictive, and diagnostic, and thus considerably improves on the human evaluation method currently used in the Document Understanding Conference.
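A minimal sketch of the kind of score a pyramid supports is given below, assuming SCU weights have already been derived from how many model summaries express each SCU; the function and data are illustrative, not taken from the paper.

```python
# Illustrative pyramid-style content score: the total weight of the SCUs a peer
# summary expresses, normalized by the best total attainable with the same
# number of SCUs (names and values are hypothetical).
def pyramid_score(peer_scus, scu_weights):
    observed = sum(scu_weights[s] for s in peer_scus)
    # An ideal summary of the same size would contain the highest-weight SCUs.
    top = sorted(scu_weights.values(), reverse=True)[:len(peer_scus)]
    return observed / sum(top) if top else 0.0

# SCU weight = number of model (human) summaries expressing that SCU.
weights = {"scu1": 4, "scu2": 3, "scu3": 2, "scu4": 1}
print(pyramid_score({"scu1", "scu3"}, weights))  # (4+2) / (4+3) ~ 0.86
```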
A WOz Variant with Contrastive Conditions
We present a variant of the WOz paradigm we refer to as incremental ablation. The new feature involves incrementally restricting the human wizard’s capacities in the direction of a dialog system. We lay out a data collection design with six conditions of user-system and user-wizard interactions that allows us to more precisely identify how to close the communication gap between humans and systems. We describe the application of the method to the analysis of contexts in which ASR errors occur, giving us a means to investigate the problem-solving strategies humans would resort to if their communication channel were restricted to be more like the machine’s. We describe how we can use the methodology to collect data that is more relevant to a particular learning paradigm involving Markov Decision Processes (MDPs).
Evaluating Content Selection in Human- or Machine-Generated Summaries: The Pyramid Scoring Method
From the outset of automated generation of summaries, the difficulty of evaluation has been widely discussed. Despite many promising attempts, we believe it remains an unsolved problem. Here we present a method for scoring the content of summaries of any length against a weighted inventory of content units, which we refer to as a pyramid. Our method is derived from empirical analysis of human-generated summaries, and provides an informative metric for human- or machine-generated summaries.
The Pyramid Method: Incorporating human content selection variation in summarization evaluation.
Human variation in content selection in summarization has given rise to some fundamental research questions: How can one incorporate the observed variation in suitable evaluation measures? How can such measures reflect the fact that summaries conveying different content can be equally good and informative? In this article, we address these very questions by proposing a method for analysis of multiple human abstracts into semantic content units. Such analysis allows us not only to quantify human variation in content selection, but also to assign empirical importance weights to different content units. It serves as the basis for an evaluation method, the Pyramid Method, that incorporates the observed variation and is predictive of different equally informative summaries. We discuss the reliability of content unit annotation, the properties of Pyramid scores, and their correlation with other evaluation methods.
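As a trivial illustration of how a correlation check between scoring methods might be run, the snippet below compares two score vectors for the same summaries; the values and the second metric are invented and are not the article's data.

```python
# Toy illustration of correlating pyramid scores with another metric's scores
# over the same set of summaries (invented values, for illustration only).
from scipy.stats import pearsonr, spearmanr

pyramid_scores = [0.82, 0.61, 0.55, 0.40, 0.33]
other_metric   = [0.75, 0.70, 0.50, 0.45, 0.30]

print(pearsonr(pyramid_scores, other_metric))   # linear correlation
print(spearmanr(pyramid_scores, other_metric))  # rank correlation
```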