A Bayesian hybrid method for context-sensitive spelling correction
Two classes of methods have been shown to be useful for resolving lexical
ambiguity. The first relies on the presence of particular words within some
distance of the ambiguous target word; the second uses the pattern of words and
part-of-speech tags around the target word. These methods have complementary
coverage: the former captures the lexical "atmosphere" (discourse topic,
tense, etc.), while the latter captures local syntax. Yarowsky has exploited
this complementarity by combining the two methods using decision lists. The
idea is to pool the evidence provided by the component methods, and to then
solve a target problem by applying the single strongest piece of evidence,
whatever type it happens to be. This paper takes Yarowsky's work as a starting
point, applying decision lists to the problem of context-sensitive spelling
correction. Decision lists are found, by and large, to outperform either
component method. However, it is found that further improvements can be
obtained by taking into account not just the single strongest piece of
evidence, but ALL the available evidence. A new hybrid method, based on
Bayesian classifiers, is presented for doing this, and its performance
improvements are demonstrated.
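The contrast between the two strategies can be made concrete. In the sketch below (all feature names, weights, and the prior are invented for illustration), each piece of evidence is a log-likelihood ratio for one member of a confusion set; a Yarowsky-style decision list applies only the single strongest matching piece, while the Bayesian hybrid sums all of them:

```python
# Hypothetical evidence for disambiguating "peace" vs "piece":
# log P(feature | peace) - log P(feature | piece), as if estimated
# from a corpus.  The features and weights are invented.
evidence = {
    "word:treaty_within_10": 2.3,
    "word:pie_within_10": -1.9,
    "pattern:a___of": -2.6,
    "tag_before:DT": 0.4,
}

def decision_list(features):
    """Yarowsky-style: apply only the single strongest matching piece."""
    matched = [evidence[f] for f in features if f in evidence]
    if not matched:
        return "peace"  # back off to the (assumed) more frequent word
    strongest = max(matched, key=abs)
    return "peace" if strongest > 0 else "piece"

def naive_bayes(features, log_prior=0.0):
    """Hybrid method: sum the log-likelihood ratios of ALL matching evidence."""
    score = log_prior + sum(evidence.get(f, 0.0) for f in features)
    return "peace" if score > 0 else "piece"
```

With features `{"pattern:a___of", "word:treaty_within_10", "tag_before:DT"}` the two disagree: the decision list fires only on the strongest cue (the local pattern, -2.6) and answers "piece", while the Bayesian sum (2.3 + 0.4 - 2.6 > 0) lets the weaker contextual cues outvote it and answers "peace".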
Fifty years of spellchecking
A short history of spellchecking from the late 1950s to the present day, describing its development through dictionary lookup, affix stripping, correction, confusion sets, and edit distance to the use of gigantic databases
Applying Winnow to Context-Sensitive Spelling Correction
Multiplicative weight-updating algorithms such as Winnow have been studied
extensively in the COLT literature, but only recently have people started to
use them in applications. In this paper, we apply a Winnow-based algorithm to a
task in natural language: context-sensitive spelling correction. This is the
task of fixing spelling errors that happen to result in valid words, such as
substituting "to" for "too", "casual" for "causal", and so
on. Previous approaches to this problem have been statistics-based; we compare
Winnow to one of the more successful such approaches, which uses Bayesian
classifiers. We find that: (1) When the standard (heavily-pruned) set of
features is used to describe problem instances, Winnow performs comparably to
the Bayesian method; (2) When the full (unpruned) set of features is used,
Winnow is able to exploit the new features and convincingly outperform Bayes;
and (3) When a test set is encountered that is dissimilar to the training set,
Winnow is better than Bayes at adapting to the unfamiliar test set, using a
strategy we will present for combining learning on the training set with
unsupervised learning on the (noisy) test set.
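For readers unfamiliar with Winnow, a minimal sketch of its core idea follows; this is the basic mistake-driven multiplicative update over Boolean features, not the paper's full architecture, and the parameter values are illustrative:

```python
def winnow_train(examples, n_features, threshold=None, alpha=2.0):
    """Basic Winnow: multiplicative weight updates over Boolean features.
    Mistake-driven: promote active features after a false negative,
    demote them after a false positive.  Weights start at 1."""
    if threshold is None:
        threshold = n_features / 2  # a common default choice
    w = [1.0] * n_features
    for active, label in examples:  # active = indices of features that fire
        predicted = sum(w[i] for i in active) >= threshold
        if predicted and not label:      # false positive: demote
            for i in active:
                w[i] /= alpha
        elif not predicted and label:    # false negative: promote
            for i in active:
                w[i] *= alpha
    return w, threshold
```

Because only the weights of *active* features are ever touched, the number of mistakes grows with the number of relevant features rather than the total number of features, which is why multiplicative updates suit the huge, sparse feature spaces of the unpruned setting.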
A large list of confusion sets for spellchecking assessed against a corpus of real-word errors
One of the methods that has been proposed for dealing with real-word errors (errors that occur when a correctly spelled word is substituted for the one intended) is the "confusion-set" approach - a confusion set being a small group of words that are likely to be confused with one another. Using a list of confusion sets drawn up in advance, a spellchecker, on finding one of these words in a text, can assess whether one of the other members of its set would be a better fit and, if it appears to be so, propose that word as a correction. Much of the research using this approach has suffered from two weaknesses. The first is the small number of confusion sets used. The second is that systems have largely been tested on artificial errors. In this paper we address these two weaknesses. We describe the creation of a realistically sized list of confusion sets, then the assembling of a corpus of real-word errors, and then we assess the potential of that list in relation to that corpus
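The confusion-set mechanism the paper describes can be sketched as follows; the three small sets and the toy scorer below are stand-ins (the paper's list is realistically sized, and the scorer would be a learned context model):

```python
# Illustrative confusion sets; the paper's list is far larger
# and is drawn up in advance.
CONFUSION_SETS = [
    {"their", "there", "they're"},
    {"loose", "lose"},
    {"casual", "causal"},
]

def candidates(word):
    """Members of the word's confusion set (including the word itself)."""
    for s in CONFUSION_SETS:
        if word in s:
            return s
    return {word}

def check(tokens, score):
    """Flag any token whose confusion-set sibling fits the context better.
    `score(tokens, i, c)` can be any context model (n-gram, Bayesian, ...)."""
    corrections = []
    for i, word in enumerate(tokens):
        best = max(candidates(word), key=lambda c: score(tokens, i, c))
        if best != word:
            corrections.append((i, word, best))
    return corrections

# A toy scorer standing in for a learned context model:
def toy_score(tokens, i, c):
    if c == "lose" and i > 0 and tokens[i - 1] == "to":
        return 2
    return 1 if c == tokens[i] else 0  # mild bias toward the word as written
```

Running `check(["to", "loose", "it"], toy_score)` flags position 1 and proposes "lose". Note the two weaknesses the paper targets live outside this loop: how large and realistic `CONFUSION_SETS` is, and whether the errors used for evaluation are real or artificial.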
Learning to Resolve Natural Language Ambiguities: A Unified Approach
We analyze a few of the commonly used statistics-based and machine learning
algorithms for natural language disambiguation tasks and observe that they can
be re-cast as learning linear separators in the feature space. Each of the
methods makes a priori assumptions, which it employs, given the data, when
searching for its hypothesis. Nevertheless, as we show, it searches a space
that is as rich as the space of all linear separators. We use this to build an
argument for a data-driven approach which merely searches for a good linear
separator in the feature space, without further assumptions on the domain or a
specific problem.
We present such an approach - a sparse network of linear separators,
utilizing the Winnow learning algorithm - and show how to use it in a variety
of ambiguity resolution problems. The learning approach presented is
attribute-efficient and, therefore, appropriate for domains having a very
large number of attributes.
In particular, we present an extensive experimental comparison of our
approach with other methods on several well studied lexical disambiguation
tasks such as context-sensitive spelling correction, prepositional phrase
attachment and part of speech tagging. In all cases we show that our approach
either outperforms other methods tried for these tasks or performs comparably
to the best
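The abstract's central observation, that these statistical methods are linear separators in feature space, can be verified concretely for naive Bayes. The sketch below (with invented probabilities) converts a Bernoulli naive Bayes model over Boolean features into explicit weights and a bias, and the two forms make identical predictions:

```python
import math

# Claim: a naive Bayes classifier over Boolean features x_i is a linear
# separator.  Comparing P(+) * prod_i P(x_i|+) against P(-) * prod_i P(x_i|-)
# in log space gives a rule of the form  b + sum_i w_i * x_i > 0.
def nb_as_linear(p_pos, p_neg, prior_pos=0.5):
    """Convert per-feature Bernoulli likelihoods into weights (w, b).
    p_pos[i] = P(x_i = 1 | +), p_neg[i] = P(x_i = 1 | -)."""
    b = math.log(prior_pos / (1 - prior_pos))
    w = []
    for pp, pn in zip(p_pos, p_neg):
        # absorb the x_i = 0 contribution into the bias term
        w.append(math.log(pp / pn) - math.log((1 - pp) / (1 - pn)))
        b += math.log((1 - pp) / (1 - pn))
    return w, b

def nb_predict(x, p_pos, p_neg, prior_pos=0.5):
    """Direct naive Bayes prediction, for checking the equivalence."""
    lp, ln = math.log(prior_pos), math.log(1 - prior_pos)
    for xi, pp, pn in zip(x, p_pos, p_neg):
        lp += math.log(pp if xi else 1 - pp)
        ln += math.log(pn if xi else 1 - pn)
    return lp > ln
```

Since the hypothesis space is the same, the argument goes, one may as well search it directly for a good separator rather than commit to the particular one naive Bayes's independence assumptions pick out.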
Large scale experiments on correction of confused words
The paper describes a new approach to automatically learn contextual knowledge for spelling and grammar correction; we aim particularly to deal with cases where the words are all in the dictionary and so it is not obvious that there is an error. Traditional approaches are dictionary based, or use elementary tagging or partial parsing of the sentence to obtain context knowledge. Our approach uses affix information and only the most frequent words to reduce the complexity in terms of training time and running time for context-sensitive spelling correction. We build large scale confused word sets based on keyboard adjacency and apply our new approach to learn the contextual knowledge to detect and correct them. We explore the performance of auto-correction under conditions where significance and probability are set by the user
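Building confusion sets from keyboard adjacency, as the paper describes, might be sketched like this; the partial QWERTY neighbour map and the tiny dictionary are assumptions for the example, not the paper's actual resources:

```python
# Partial QWERTY adjacency map (remaining keys omitted for brevity).
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "r": "edft", "t": "rfgy", "o": "iklp", "i": "ujko",
}

def adjacency_variants(word):
    """All strings reachable by replacing one letter with an adjacent key."""
    out = set()
    for i, ch in enumerate(word):
        for nb in QWERTY_NEIGHBOURS.get(ch, ""):
            out.add(word[:i] + nb + word[i + 1:])
    return out

def confusion_set(word, dictionary):
    """The word plus its keyboard-adjacent variants that are real words."""
    return {word} | (adjacency_variants(word) & dictionary)
```

For example, `confusion_set("tip", {"tip", "top", "rip", "tap"})` yields `{"tip", "top", "rip"}`: "top" and "rip" are one adjacent-key slip away, while "tap" is not, since "a" is not next to "i". Because every variant is in the dictionary, each resulting error is a real word, which is exactly why context rather than lookup is needed to catch it.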
An inverse context-associative method of automated spelling correction
An inverse context-associative method for the automated correction of spelling errors is theoretically substantiated and proposed; it improves the speed and accuracy of the corresponding software. A performance measure for a spelling corrector, namely the accuracy of its operation, is defined. The method's effectiveness in correcting spelling errors in an array of heterogeneous word combinations is demonstrated against the criteria of correction speed and accuracy.