A Corpus of Sentence-level Revisions in Academic Writing: A Step towards Understanding Statement Strength in Communication
The strength with which a statement is made can have a significant impact on
the audience. For example, international relations can be strained by how the
media in one country describes an event in another; and papers can be rejected
because they overstate or understate their findings. It is thus important to
understand the effects of statement strength. A first step is to be able to
distinguish between strong and weak statements. However, even this problem is
understudied, partly due to a lack of data. Since strength is inherently
relative, revisions of texts that make claims are a natural source of data on
strength differences. In this paper, we introduce a corpus of sentence-level
revisions from academic writing. We also describe insights gained from our
annotation efforts for this task. Comment: 6 pages, to appear in Proceedings of ACL 2014 (short paper)
Detection is the central problem in real-word spelling correction
Real-word spelling correction differs from non-word spelling correction in
its aims and its challenges. Here we show that the central problem in real-word
spelling correction is detection. Methods from non-word spelling correction,
which focus instead on selection among candidate corrections, do not address
detection adequately, because detection is either assumed in advance or heavily
constrained. As we demonstrate in this paper, merely discriminating between the
intended word and a random close variation of it within the context of a
sentence is a task that can be performed with high accuracy using
straightforward models. Trigram models are sufficient in almost all cases. The
difficulty comes when every word in the sentence is a potential error, with a
large set of possible candidate corrections. Despite their strengths, trigram
models cannot reliably find true errors without introducing many more, at least
not when used in the obvious sequential way without added structure. The
detection task exposes weaknesses not visible in the selection task.
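The discrimination task described above, choosing between an intended word and a close variant in context, can be sketched with a toy add-one-smoothed trigram model. The corpus, class, and function names below are illustrative placeholders, not the paper's actual setup:

```python
from collections import defaultdict

# Hypothetical toy corpus; a real system would train on a large corpus.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ran to the door",
    "a dog ran to the mat",
]

def trigrams(tokens):
    """Yield padded trigrams for a token list."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

class TrigramModel:
    """Add-one-smoothed trigram model (illustrative only)."""

    def __init__(self, sentences):
        self.counts = defaultdict(int)   # trigram counts
        self.context = defaultdict(int)  # bigram-context counts
        self.vocab = set()
        for s in sentences:
            toks = s.split()
            self.vocab.update(toks)
            for tri in trigrams(toks):
                self.counts[tri] += 1
                self.context[tri[:2]] += 1

    def prob(self, tri):
        # Laplace smoothing over the vocabulary plus sentence markers.
        v = len(self.vocab) + 2
        return (self.counts[tri] + 1) / (self.context[tri[:2]] + v)

    def score(self, sentence):
        p = 1.0
        for tri in trigrams(sentence.split()):
            p *= self.prob(tri)
        return p

model = TrigramModel(CORPUS)
intended = "the cat sat on the mat"
variant = "the cat sat on the mad"  # real-word error: "mad" for "mat"
print(model.score(intended) > model.score(variant))
```

With a fixed error position and a single candidate, the model only has to compare two sentence scores, which is the easy setting the abstract describes; the hard detection setting treats every word as a potential error.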
Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History
We evaluate measures of contextual fitness on the task of detecting real-word spelling errors. For that purpose, we extract naturally occurring errors and their contexts from the Wikipedia revision history. We show that such natural errors are better suited for evaluation than the previously used artificially created errors. In particular, the precision of statistical methods has been largely over-estimated, while the precision of knowledge-based approaches has been under-estimated. Additionally, we show that knowledge-based approaches can be improved by using semantic relatedness measures that make use of knowledge beyond classical taxonomic relations. Finally, we show that statistical and knowledge-based methods can be combined for increased performance.
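One simple way to combine a statistical and a knowledge-based contextual-fitness measure, as the abstract suggests, is linear interpolation of the two scores. The scorer functions and the weight below are illustrative stubs, not the paper's actual models:

```python
# Hypothetical combination of two contextual-fitness measures.
# Both scorers are placeholder stubs standing in for real models.

def statistical_fitness(word, context):
    # Placeholder for, e.g., a normalized n-gram probability of
    # `word` appearing in `context`.
    return 0.2 if word == "mad" else 0.8

def knowledge_fitness(word, context):
    # Placeholder for, e.g., the average semantic relatedness of
    # `word` to the content words in `context`.
    return 0.1 if word == "mad" else 0.7

def combined_fitness(word, context, alpha=0.5):
    """Interpolate the two measures; alpha weights the statistical score."""
    return (alpha * statistical_fitness(word, context)
            + (1 - alpha) * knowledge_fitness(word, context))

context = ["the", "cat", "sat", "on", "the"]
print(combined_fitness("mat", context) > combined_fitness("mad", context))
```

The interpolation weight would in practice be tuned on held-out errors, such as those extracted from the revision history.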