85 research outputs found
Applying Reliability Metrics to Co-Reference Annotation
Studies of the contextual and linguistic factors that constrain discourse
phenomena such as reference are coming to depend increasingly on annotated
language corpora. In preparing the corpora, it is important to evaluate the
reliability of the annotation, but methods for doing so have not been readily
available. In this report, I present a method for computing reliability of
coreference annotation. First I review a method for applying the information
retrieval metrics of recall and precision to coreference annotation proposed by
Marc Vilain and his collaborators. I show how this method makes it possible to
construct contingency tables for computing Cohen's Kappa, a familiar
reliability metric. By comparing recall and precision to reliability on the
same data sets, I also show that recall and precision can be misleadingly high.
Because Kappa factors out chance agreement among coders, it is a preferable
measure for developing annotated corpora where no pre-existing target
annotation exists.Comment: 10 pages, 2-column format; uuencoded, gzipped, tarfil
Evaluating Content Selection in Human- or Machine-Generated Summaries: The Pyramid Scoring Method
From the outset of automated generation of summaries, the difficulty of evaluation has been widely discussed. Despite many promising attempts, we believe it remains an unsolved problem. Here we present a method for scoring the content of summaries of any length against a weighted inventory of content units, which we refer to as a pyramid. Our method is derived from empirical analysis of human-generated summaries, and provides an informative metric for human or machine-generated summaries
Abstractive Multi-Document Summarization via Phrase Selection and Merging
We propose an abstraction-based multi-document summarization framework that
can construct new sentences by exploring more fine-grained syntactic units than
sentences, namely, noun/verb phrases. Different from existing abstraction-based
approaches, our method first constructs a pool of concepts and facts
represented by phrases from the input documents. Then new sentences are
generated by selecting and merging informative phrases to maximize the salience
of phrases and meanwhile satisfy the sentence construction constraints. We
employ integer linear optimization for conducting phrase selection and merging
simultaneously in order to achieve the global optimal solution for a summary.
Experimental results on the benchmark data set TAC 2011 show that our framework
outperforms the state-of-the-art models under automated pyramid evaluation
metric, and achieves reasonably well results on manual linguistic quality
evaluation.Comment: 11 pages, 1 figure, accepted as a full paper at ACL 201
Modeling Weather Impact on a Secondary Electrical Grid
Weather can cause problems for underground electrical grids by increasing the probability of serious “manhole events” such as fires and explosions. In this work, we compare a model that incorporates weather features associated with the dates of serious events into a single logistic regression, with a more complex approach that has three interdependent log linear models for weather, baseline manhole vulnerability, and vulnerability of manholes to weather. The latter approach more naturally incorporates the dependencies between the weather, structure properties, and structure vulnerability
Survey on Sociodemographic Bias in Natural Language Processing
Deep neural networks often learn unintended biases during training, which
might have harmful effects when deployed in real-world settings. This paper
surveys 209 papers on bias in NLP models, most of which address
sociodemographic bias. To better understand the distinction between bias and
real-world harm, we turn to ideas from psychology and behavioral economics to
propose a definition for sociodemographic bias. We identify three main
categories of NLP bias research: types of bias, quantifying bias, and
debiasing. We conclude that current approaches on quantifying bias face
reliability issues, that many of the bias metrics do not relate to real-world
biases, and that current debiasing techniques are superficial and hide bias
rather than removing it. Finally, we provide recommendations for future work.Comment: 23 pages, 1 figur
Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation
We present a probabilistic model that uses both prosodic and lexical cues for
the automatic segmentation of speech into topically coherent units. We propose
two methods for combining lexical and prosodic information using hidden Markov
models and decision trees. Lexical information is obtained from a speech
recognizer, and prosodic features are extracted automatically from speech
waveforms. We evaluate our approach on the Broadcast News corpus, using the
DARPA-TDT evaluation metrics. Results show that the prosodic model alone is
competitive with word-based segmentation methods. Furthermore, we achieve a
significant reduction in error by combining the prosodic and word-based
knowledge sources.Comment: 27 pages, 8 figure
Annotation scheme for social network extraction from text
Abstract We are interested in extracting social networks from text. We present a novel annotation scheme for a new type of event, called social event, in which two people participate such that at least one of them is cognizant of the other. We compare our scheme in detail to the ACE scheme. We perform a detailed analysis of interannotator agreement, which shows that our annotations are reliable
- …