A Corpus of Sentence-level Revisions in Academic Writing: A Step towards Understanding Statement Strength in Communication
The strength with which a statement is made can have a significant impact on
the audience. For example, international relations can be strained by how the
media in one country describes an event in another; and papers can be rejected
because they overstate or understate their findings. It is thus important to
understand the effects of statement strength. A first step is to be able to
distinguish between strong and weak statements. However, even this problem is
understudied, partly due to a lack of data. Since strength is inherently
relative, revisions of texts that make claims are a natural source of data on
strength differences. In this paper, we introduce a corpus of sentence-level
revisions from academic writing. We also describe insights gained from our
annotation efforts for this task. Comment: 6 pages, to appear in Proceedings of ACL 2014 (short paper)
Detection is the central problem in real-word spelling correction
Real-word spelling correction differs from non-word spelling correction in
its aims and its challenges. Here we show that the central problem in real-word
spelling correction is detection. Methods from non-word spelling correction,
which focus instead on selection among candidate corrections, do not address
detection adequately, because detection is either assumed in advance or heavily
constrained. As we demonstrate in this paper, merely discriminating between the
intended word and a random close variation of it within the context of a
sentence is a task that can be performed with high accuracy using
straightforward models. Trigram models are sufficient in almost all cases. The
difficulty comes when every word in the sentence is a potential error, with a
large set of possible candidate corrections. Despite their strengths, trigram
models cannot reliably find true errors without introducing many more, at least
not when used in the obvious sequential way without added structure. The
detection task exposes weaknesses not visible in the selection task.
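The discrimination task described above, choosing between an intended word and a close variant in context, can be sketched with a toy add-one-smoothed trigram model. The corpus, class, and function names below are illustrative placeholders, not the paper's actual setup:

```python
from collections import defaultdict

# Hypothetical toy corpus; a real system would train on a large corpus.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ran to the door",
    "a dog ran to the mat",
]

def trigrams(tokens):
    """Yield padded trigrams for a token list."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

class TrigramModel:
    """Add-one-smoothed trigram model (illustrative only)."""

    def __init__(self, sentences):
        self.counts = defaultdict(int)   # trigram counts
        self.context = defaultdict(int)  # bigram-context counts
        self.vocab = set()
        for s in sentences:
            toks = s.split()
            self.vocab.update(toks)
            for tri in trigrams(toks):
                self.counts[tri] += 1
                self.context[tri[:2]] += 1

    def prob(self, tri):
        # Laplace smoothing over the vocabulary plus sentence markers.
        v = len(self.vocab) + 2
        return (self.counts[tri] + 1) / (self.context[tri[:2]] + v)

    def score(self, sentence):
        p = 1.0
        for tri in trigrams(sentence.split()):
            p *= self.prob(tri)
        return p

model = TrigramModel(CORPUS)
intended = "the cat sat on the mat"
variant = "the cat sat on the mad"  # real-word error: "mad" for "mat"
print(model.score(intended) > model.score(variant))
```

With a fixed error position and a single candidate, the model only has to compare two sentence scores, which is the easy setting the abstract describes; the hard detection setting treats every word as a potential error.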
Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History
We evaluate measures of contextual fitness on the task of detecting real-word spelling errors. For that purpose, we extract naturally occurring errors and their contexts from the Wikipedia revision history. We show that such natural errors are better suited for evaluation than the previously used artificially created errors. In particular, the precision of statistical methods has been largely over-estimated, while the precision of knowledge-based approaches has been under-estimated. Additionally, we show that knowledge-based approaches can be improved by using semantic relatedness measures that make use of knowledge beyond classical taxonomic relations. Finally, we show that statistical and knowledge-based methods can be combined for increased performance.
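One simple way to combine a statistical and a knowledge-based contextual-fitness measure, as the abstract suggests, is linear interpolation of the two scores. The scorer functions and the weight below are illustrative stubs, not the paper's actual models:

```python
# Hypothetical combination of two contextual-fitness measures.
# Both scorers are placeholder stubs standing in for real models.

def statistical_fitness(word, context):
    # Placeholder for, e.g., a normalized n-gram probability of
    # `word` appearing in `context`.
    return 0.2 if word == "mad" else 0.8

def knowledge_fitness(word, context):
    # Placeholder for, e.g., the average semantic relatedness of
    # `word` to the content words in `context`.
    return 0.1 if word == "mad" else 0.7

def combined_fitness(word, context, alpha=0.5):
    """Interpolate the two measures; alpha weights the statistical score."""
    return (alpha * statistical_fitness(word, context)
            + (1 - alpha) * knowledge_fitness(word, context))

context = ["the", "cat", "sat", "on", "the"]
print(combined_fitness("mat", context) > combined_fitness("mad", context))
```

The interpolation weight would in practice be tuned on held-out errors, such as those extracted from the revision history.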