A Bayesian hybrid method for context-sensitive spelling correction
Two classes of methods have been shown to be useful for resolving lexical
ambiguity. The first relies on the presence of particular words within some
distance of the ambiguous target word; the second uses the pattern of words and
part-of-speech tags around the target word. These methods have complementary
coverage: the former captures the lexical "atmosphere" (discourse topic,
tense, etc.), while the latter captures local syntax. Yarowsky has exploited
this complementarity by combining the two methods using decision lists. The
idea is to pool the evidence provided by the component methods, and to then
solve a target problem by applying the single strongest piece of evidence,
whatever type it happens to be. This paper takes Yarowsky's work as a starting
point, applying decision lists to the problem of context-sensitive spelling
correction. Decision lists are found, by and large, to outperform either
component method. However, it is found that further improvements can be
obtained by taking into account not just the single strongest piece of
evidence, but ALL the available evidence. A new hybrid method, based on
Bayesian classifiers, is presented for doing this, and its performance
improvements are demonstrated.
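The contrast between the two strategies can be made concrete. In the sketch below (all feature names, weights, and the prior are invented for illustration), each piece of evidence is a log-likelihood ratio for one member of a confusion set; a Yarowsky-style decision list applies only the single strongest matching piece, while the Bayesian hybrid sums all of them:

```python
# Hypothetical evidence for disambiguating "peace" vs "piece":
# log P(feature | peace) - log P(feature | piece), as if estimated
# from a corpus.  The features and weights are invented.
evidence = {
    "word:treaty_within_10": 2.3,
    "word:pie_within_10": -1.9,
    "pattern:a___of": -2.6,
    "tag_before:DT": 0.4,
}

def decision_list(features):
    """Yarowsky-style: apply only the single strongest matching piece."""
    matched = [evidence[f] for f in features if f in evidence]
    if not matched:
        return "peace"  # back off to the (assumed) more frequent word
    strongest = max(matched, key=abs)
    return "peace" if strongest > 0 else "piece"

def naive_bayes(features, log_prior=0.0):
    """Hybrid method: sum the log-likelihood ratios of ALL matching evidence."""
    score = log_prior + sum(evidence.get(f, 0.0) for f in features)
    return "peace" if score > 0 else "piece"
```

With features `{"pattern:a___of", "word:treaty_within_10", "tag_before:DT"}` the two disagree: the decision list fires only on the strongest cue (the local pattern, -2.6) and answers "piece", while the Bayesian sum (2.3 + 0.4 - 2.6 > 0) lets the weaker contextual cues outvote it and answers "peace".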
Fifty years of spellchecking
A short history of spellchecking from the late 1950s to the present day, describing its development through dictionary lookup, affix stripping, correction, confusion sets, and edit distance to the use of gigantic databases
Applying Winnow to Context-Sensitive Spelling Correction
Multiplicative weight-updating algorithms such as Winnow have been studied
extensively in the COLT literature, but only recently have people started to
use them in applications. In this paper, we apply a Winnow-based algorithm to a
task in natural language: context-sensitive spelling correction. This is the
task of fixing spelling errors that happen to result in valid words, such as
substituting "to" for "too", "casual" for "causal", and so
on. Previous approaches to this problem have been statistics-based; we compare
Winnow to one of the more successful such approaches, which uses Bayesian
classifiers. We find that: (1) When the standard (heavily-pruned) set of
features is used to describe problem instances, Winnow performs comparably to
the Bayesian method; (2) When the full (unpruned) set of features is used,
Winnow is able to exploit the new features and convincingly outperform Bayes;
and (3) When a test set is encountered that is dissimilar to the training set,
Winnow is better than Bayes at adapting to the unfamiliar test set, using a
strategy we will present for combining learning on the training set with
unsupervised learning on the (noisy) test set.
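For readers unfamiliar with Winnow, a minimal sketch of its core idea follows; this is the basic mistake-driven multiplicative update over Boolean features, not the paper's full architecture, and the parameter values are illustrative:

```python
def winnow_train(examples, n_features, threshold=None, alpha=2.0):
    """Basic Winnow: multiplicative weight updates over Boolean features.
    Mistake-driven: promote active features after a false negative,
    demote them after a false positive.  Weights start at 1."""
    if threshold is None:
        threshold = n_features / 2  # a common default choice
    w = [1.0] * n_features
    for active, label in examples:  # active = indices of features that fire
        predicted = sum(w[i] for i in active) >= threshold
        if predicted and not label:      # false positive: demote
            for i in active:
                w[i] /= alpha
        elif not predicted and label:    # false negative: promote
            for i in active:
                w[i] *= alpha
    return w, threshold
```

Because only the weights of *active* features are ever touched, the number of mistakes grows with the number of relevant features rather than the total number of features, which is why multiplicative updates suit the huge, sparse feature spaces of the unpruned setting.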
A large list of confusion sets for spellchecking assessed against a corpus of real-word errors
One of the methods that has been proposed for dealing with real-word errors (errors that occur when a correctly spelled word is substituted for the one intended) is the "confusion-set" approach - a confusion set being a small group of words that are likely to be confused with one another. Using a list of confusion sets drawn up in advance, a spellchecker, on finding one of these words in a text, can assess whether one of the other members of its set would be a better fit and, if it appears to be so, propose that word as a correction. Much of the research using this approach has suffered from two weaknesses. The first is the small number of confusion sets used. The second is that systems have largely been tested on artificial errors. In this paper we address these two weaknesses. We describe the creation of a realistically sized list of confusion sets, then the assembling of a corpus of real-word errors, and then we assess the potential of that list in relation to that corpus
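The confusion-set mechanism the paper describes can be sketched as follows; the three small sets and the toy scorer below are stand-ins (the paper's list is realistically sized, and the scorer would be a learned context model):

```python
# Illustrative confusion sets; the paper's list is far larger
# and is drawn up in advance.
CONFUSION_SETS = [
    {"their", "there", "they're"},
    {"loose", "lose"},
    {"casual", "causal"},
]

def candidates(word):
    """Members of the word's confusion set (including the word itself)."""
    for s in CONFUSION_SETS:
        if word in s:
            return s
    return {word}

def check(tokens, score):
    """Flag any token whose confusion-set sibling fits the context better.
    `score(tokens, i, c)` can be any context model (n-gram, Bayesian, ...)."""
    corrections = []
    for i, word in enumerate(tokens):
        best = max(candidates(word), key=lambda c: score(tokens, i, c))
        if best != word:
            corrections.append((i, word, best))
    return corrections

# A toy scorer standing in for a learned context model:
def toy_score(tokens, i, c):
    if c == "lose" and i > 0 and tokens[i - 1] == "to":
        return 2
    return 1 if c == tokens[i] else 0  # mild bias toward the word as written
```

Running `check(["to", "loose", "it"], toy_score)` flags position 1 and proposes "lose". Note the two weaknesses the paper targets live outside this loop: how large and realistic `CONFUSION_SETS` is, and whether the errors used for evaluation are real or artificial.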
Learning to Resolve Natural Language Ambiguities: A Unified Approach
We analyze a few of the commonly used statistics-based and machine learning
algorithms for natural language disambiguation tasks and observe that they can
be re-cast as learning linear separators in the feature space. Each of the
methods makes a priori assumptions, which it employs, given the data, when
searching for its hypothesis. Nevertheless, as we show, it searches a space
that is as rich as the space of all linear separators. We use this to build an
argument for a data-driven approach which merely searches for a good linear
separator in the feature space, without further assumptions on the domain or a
specific problem.
We present such an approach - a sparse network of linear separators,
utilizing the Winnow learning algorithm - and show how to use it in a variety
of ambiguity resolution problems. The learning approach presented is
attribute-efficient and, therefore, appropriate for domains having a very
large number of attributes.
In particular, we present an extensive experimental comparison of our
approach with other methods on several well studied lexical disambiguation
tasks such as context-sensitive spelling correction, prepositional phrase
attachment and part of speech tagging. In all cases we show that our approach
either outperforms other methods tried for these tasks or performs comparably
to the best
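The abstract's central observation, that these statistical methods are linear separators in feature space, can be verified concretely for naive Bayes. The sketch below (with invented probabilities) converts a Bernoulli naive Bayes model over Boolean features into explicit weights and a bias, and the two forms make identical predictions:

```python
import math

# Claim: a naive Bayes classifier over Boolean features x_i is a linear
# separator.  Comparing P(+) * prod_i P(x_i|+) against P(-) * prod_i P(x_i|-)
# in log space gives a rule of the form  b + sum_i w_i * x_i > 0.
def nb_as_linear(p_pos, p_neg, prior_pos=0.5):
    """Convert per-feature Bernoulli likelihoods into weights (w, b).
    p_pos[i] = P(x_i = 1 | +), p_neg[i] = P(x_i = 1 | -)."""
    b = math.log(prior_pos / (1 - prior_pos))
    w = []
    for pp, pn in zip(p_pos, p_neg):
        # absorb the x_i = 0 contribution into the bias term
        w.append(math.log(pp / pn) - math.log((1 - pp) / (1 - pn)))
        b += math.log((1 - pp) / (1 - pn))
    return w, b

def nb_predict(x, p_pos, p_neg, prior_pos=0.5):
    """Direct naive Bayes prediction, for checking the equivalence."""
    lp, ln = math.log(prior_pos), math.log(1 - prior_pos)
    for xi, pp, pn in zip(x, p_pos, p_neg):
        lp += math.log(pp if xi else 1 - pp)
        ln += math.log(pn if xi else 1 - pn)
    return lp > ln
```

Since the hypothesis space is the same, the argument goes, one may as well search it directly for a good separator rather than commit to the particular one naive Bayes's independence assumptions pick out.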
Large scale experiments on correction of confused words
The paper describes a new approach to automatically learn contextual knowledge for spelling and grammar correction; we aim particularly to deal with cases where the words are all in the dictionary and so it is not obvious that there is an error. Traditional approaches are dictionary based, or use elementary tagging or partial parsing of the sentence to obtain context knowledge. Our approach uses affix information and only the most frequent words to reduce the complexity in terms of training time and running time for context-sensitive spelling correction. We build large scale confused word sets based on keyboard adjacency and apply our new approach to learn the contextual knowledge to detect and correct them. We explore the performance of auto-correction under conditions where significance and probability are set by the user
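Building confusion sets from keyboard adjacency, as the paper describes, might be sketched like this; the partial QWERTY neighbour map and the tiny dictionary are assumptions for the example, not the paper's actual resources:

```python
# Partial QWERTY adjacency map (remaining keys omitted for brevity).
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "r": "edft", "t": "rfgy", "o": "iklp", "i": "ujko",
}

def adjacency_variants(word):
    """All strings reachable by replacing one letter with an adjacent key."""
    out = set()
    for i, ch in enumerate(word):
        for nb in QWERTY_NEIGHBOURS.get(ch, ""):
            out.add(word[:i] + nb + word[i + 1:])
    return out

def confusion_set(word, dictionary):
    """The word plus its keyboard-adjacent variants that are real words."""
    return {word} | (adjacency_variants(word) & dictionary)
```

For example, `confusion_set("tip", {"tip", "top", "rip", "tap"})` yields `{"tip", "top", "rip"}`: "top" and "rip" are one adjacent-key slip away, while "tap" is not, since "a" is not next to "i". Because every variant is in the dictionary, each resulting error is a real word, which is exactly why context rather than lookup is needed to catch it.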
An inverse context-associative method of automated spelling correction
An inverse context-associative method for the automated correction of spelling errors is theoretically substantiated and proposed; it improves the speed and accuracy of the corresponding software. A performance measure for a spelling corrector, namely the accuracy of its operation, is defined. The method's effectiveness in correcting spelling errors in an array of heterogeneous word combinations is demonstrated against the criteria of correction speed and accuracy.