A Winnow-Based Approach to Context-Sensitive Spelling Correction
A large class of machine-learning problems in natural language require the
characterization of linguistic context. Two characteristic properties of such
problems are that their feature space is of very high dimensionality, and their
target concepts refer to only a small subset of the features in the space.
Under such conditions, multiplicative weight-update algorithms such as Winnow
have been shown to have exceptionally good theoretical properties. We present
an algorithm combining variants of Winnow and weighted-majority voting, and
apply it to a problem in the aforementioned class: context-sensitive spelling
correction. This is the task of fixing spelling errors that happen to result in
valid words, such as substituting "to" for "too", "casual" for "causal", etc.
We evaluate our algorithm, WinSpell, by comparing it against BaySpell, a
statistics-based method representing the state of the art for this task. We
find: (1) When run with a full (unpruned) set of features, WinSpell achieves
accuracies significantly higher than BaySpell was able to achieve in either the
pruned or unpruned condition; (2) When compared with other systems in the
literature, WinSpell exhibits the highest performance; (3) The primary reason
that WinSpell outperforms BaySpell is that WinSpell learns a better linear
separator; (4) When run on a test set drawn from a different corpus than the
training set was drawn from, WinSpell is better able than BaySpell to adapt,
using a strategy we will present that combines supervised learning on the
training set with unsupervised learning on the (noisy) test set.
Comment: To appear in Machine Learning, Special Issue on Natural Language
Learning, 1999. 25 page
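The abstract above mentions weighted-majority voting but does not say how it is combined with Winnow. As a rough illustration only, here is a generic weighted-majority sketch in which each "expert" could be one classifier variant. The function name `weighted_majority`, the 0/1 vote encoding, and the penalty `beta` are assumptions made for this example, not details taken from the paper.

```python
def weighted_majority(expert_predictions, labels, beta=0.5):
    """Generic weighted-majority voting (illustrative sketch, not WinSpell).

    expert_predictions: list of rounds; each round is a list of 0/1 votes,
                        one vote per expert.
    labels:             the true 0/1 label for each round.
    beta:               assumed penalty factor in (0, 1) for a wrong expert.
    """
    weights = [1.0] * len(expert_predictions[0])
    mistakes = 0
    for votes, label in zip(expert_predictions, labels):
        # Sum the weight behind each possible answer.
        vote_for_one = sum(w for w, v in zip(weights, votes) if v == 1)
        vote_for_zero = sum(w for w, v in zip(weights, votes) if v == 0)
        prediction = 1 if vote_for_one >= vote_for_zero else 0
        if prediction != label:
            mistakes += 1
        # Every expert that voted wrongly loses weight, whether or not
        # the combined prediction was correct.
        weights = [w * beta if v != label else w
                   for w, v in zip(weights, votes)]
    return weights, mistakes
```

In this sketch an expert that is repeatedly wrong sees its weight shrink geometrically, so the combined vote comes to be dominated by the more reliable experts.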
Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information
In computing, spell checking is the process of detecting incorrectly spelled
words in a text and, in some cases, suggesting corrections for them. A spell
checker is a computer program that uses a dictionary of words to perform this
check; the larger the dictionary, the higher the error detection rate. Because
spell checkers are based on regular dictionaries, they suffer from a data
sparseness problem: they cannot capture the large vocabulary of proper names,
domain-specific terms, technical jargon, special acronyms, and terminology. As
a result, they exhibit a low error detection rate and often fail to catch major
errors in the
text. This paper proposes a new context-sensitive spelling correction method
for detecting and correcting non-word and real-word errors in digital text
documents. The approach relies on statistics from the Google Web 1T 5-gram
data set, which consists of a large volume of n-gram word sequences extracted
from the World Wide Web. The proposed method comprises an error
detector that detects misspellings, a candidate spellings generator based on a
character 2-gram model that generates correction suggestions, and an error
corrector that performs contextual error correction. Experiments conducted on a
set of text documents from different domains, all containing misspellings,
showed an outstanding spelling error correction rate and a drastic reduction of
both non-word and real-word errors. In a further study, the proposed algorithm
is to be parallelized so as to lower the computational cost of the error
detection and correction processes.
Comment: LACSC - Lebanese Association for Computational Sciences -
http://www.lacsc.or
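The pipeline described above (generate candidates from character 2-grams, then pick one using word n-gram context) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's method: the Dice similarity, the function names, the use of trigrams rather than 5-grams, and the tiny hand-made `trigram_counts` table standing in for the Google Web 1T data are all hypothetical.

```python
def char_bigrams(word):
    """Set of character 2-grams, with '#' marking the word boundaries."""
    padded = f"#{word}#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice(a, b):
    """Dice similarity between the character 2-gram sets of two words."""
    x, y = char_bigrams(a), char_bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))


def candidates(word, vocabulary, k=3):
    """The k vocabulary words most similar to `word` (ties broken by name)."""
    return sorted(vocabulary, key=lambda v: (-dice(word, v), v))[:k]


def correct(tokens, position, vocabulary, trigram_counts, k=3):
    """Choose the candidate whose (left, candidate, right) count is highest.

    trigram_counts is an assumed stand-in for real web-scale n-gram counts.
    """
    left = tokens[position - 1] if position > 0 else "<s>"
    right = tokens[position + 1] if position + 1 < len(tokens) else "</s>"
    options = candidates(tokens[position], vocabulary, k)
    # max() keeps the first maximum, so with equal counts the candidate
    # with the higher character-level similarity wins.
    return max(options, key=lambda c: trigram_counts.get((left, c, right), 0))


# Hypothetical usage with made-up counts:
vocabulary = ["piece", "peace", "pierce", "place"]
trigram_counts = {("a", "piece", "of"): 900, ("a", "peace", "of"): 3}
best = correct(["a", "peice", "of"], 1, vocabulary, trigram_counts)
```

In this toy example the character-level ranking alone would prefer "peace", while the context counts move the final choice to "piece", which is the kind of effect contextual correction is meant to capture.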
Large scale experiments on correction of confused words
The paper describes a new approach to automatically learn contextual knowledge for spelling and grammar correction; we aim particularly to deal with cases where the words are all in the dictionary and so it is not obvious that there is an error. Traditional approaches are dictionary based, or use elementary tagging or partial parsing of the sentence to obtain context knowledge. Our approach uses affix information and only the most frequent words to reduce the complexity in terms of training time and running time for context-sensitive spelling correction. We build large scale confused word sets based on keyboard adjacency and apply our new approach to learn the contextual knowledge to detect and correct them. We explore the performance of auto-correction under conditions where significance and probability are set by the user
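As a rough illustration of building confused word sets from keyboard adjacency, the sketch below substitutes each letter with its neighbouring keys and keeps the variants that are themselves dictionary words. The abstract does not give the construction, so everything here is an assumption: the unstaggered QWERTY grid, the restriction to single substitutions, and the names `build_adjacency` and `confusion_set`.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_adjacency(rows=QWERTY_ROWS):
    """Map each key to its neighbours on a simplified, unstaggered grid."""
    adjacent = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            near = set()
            for dr in (-1, 0, 1):
                rr = r + dr
                if not 0 <= rr < len(rows):
                    continue
                for dc in (-1, 0, 1):
                    cc = c + dc
                    # Skip the key itself and positions off the row.
                    if (dr or dc) and 0 <= cc < len(rows[rr]):
                        near.add(rows[rr][cc])
            adjacent[key] = near
    return adjacent


def confusion_set(word, dictionary, adjacent):
    """Dictionary words reachable by replacing one letter with a neighbour."""
    found = set()
    for i, ch in enumerate(word):
        for alt in adjacent.get(ch, ()):
            variant = word[:i] + alt + word[i + 1:]
            if variant in dictionary:
                found.add(variant)
    return found


# Hypothetical usage:
adjacent = build_adjacency()
words = {"form", "from", "firm", "farm", "fork", "dorm"}
confused = confusion_set("form", words, adjacent)
```

Here "from" is not returned because it differs from "form" by a transposition, not by an adjacent-key substitution; a fuller construction would have to decide which error types to cover.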
Spelling correction in the NLP system 'LOLITA': dictionary organisation and search algorithms
This thesis describes the design and implementation of a spelling correction system and associated dictionaries, for the Natural Language Processing System 'LOLITA'. The dictionary storage is based upon a trie (M-ary tree) data-structure. The design of the dictionary is described, and the way in which the data-structure is implemented is also discussed. The spelling correction system makes use of the trie structure in order to limit repetition and 'garden path' searching. The spelling correction algorithms used are a variation on the 'reverse minimum edit-distance' technique. These algorithms have been modified in order to place more emphasis on generation in order of likelihood. The system will correct up to two simple errors (i.e. insertion, omission, substitution or transposition of characters) per word. The individual algorithms are presented in turn and their combination into a unified strategy to correct misspellings is demonstrated. The system was implemented in the programming language Haskell; a pure functional, class-based language, with non-strict semantics and polymorphic type-checking. The use of several features of this language, in particular lazy evaluation, and their corresponding advantages over more traditional languages are described. The dictionaries and spelling correcting facilities are in use in the LOLITA system. Issues pertaining to 'real word' error correction, arising from the system's use in an NLP context, are also discussed
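To illustrate how a trie can limit repeated work during approximate lookup, here is a minimal sketch of an edit-distance search over a trie. It is not the thesis's 'reverse minimum edit-distance' variant and it is written in Python, not Haskell: it uses the common technique of carrying one row of the Levenshtein table per trie node and pruning branches that can no longer match. It also counts a transposition as two edits, unlike the system described. All names are assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the child node
        self.is_word = False


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def search(root, target, max_edits=2):
    """Return {word: distance} for stored words within max_edits of target."""
    results = {}
    first_row = list(range(len(target) + 1))

    def walk(node, prefix, prev_row):
        # prev_row[-1] is the distance between `prefix` and the full target.
        if node.is_word and prev_row[-1] <= max_edits:
            results[prefix] = prev_row[-1]
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]
            for j in range(1, len(target) + 1):
                cost = 0 if target[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,          # skip a target char
                               prev_row[j] + 1,         # extra char in prefix
                               prev_row[j - 1] + cost)) # match or substitute
            # Shared prefixes are scored once; dead branches are abandoned.
            if min(row) <= max_edits:
                walk(child, prefix + ch, row)

    walk(root, "", first_row)
    return results


# Hypothetical usage:
root = TrieNode()
for w in ["spell", "spelling", "shell", "smell", "spill", "help"]:
    insert(root, w)
near = search(root, "spel", max_edits=2)
```

Because words sharing a prefix share the rows computed for that prefix, the search avoids recomputing the same comparisons for every dictionary entry.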
Applying Winnow to Context-Sensitive Spelling Correction
Multiplicative weight-updating algorithms such as Winnow have been studied
extensively in the COLT literature, but only recently have people started to
use them in applications. In this paper, we apply a Winnow-based algorithm to a
task in natural language: context-sensitive spelling correction. This is the
task of fixing spelling errors that happen to result in valid words, such as
substituting "to" for "too", "casual" for "causal", and so
on. Previous approaches to this problem have been statistics-based; we compare
Winnow to one of the more successful such approaches, which uses Bayesian
classifiers. We find that: (1) When the standard (heavily-pruned) set of
features is used to describe problem instances, Winnow performs comparably to
the Bayesian method; (2) When the full (unpruned) set of features is used,
Winnow is able to exploit the new features and convincingly outperform Bayes;
and (3) When a test set is encountered that is dissimilar to the training set,
Winnow is better than Bayes at adapting to the unfamiliar test set, using a
strategy we will present for combining learning on the training set with
unsupervised learning on the (noisy) test set.
Comment: 9 page
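For readers unfamiliar with multiplicative weight updates, the following is a minimal sketch of a basic Winnow-style learner over sparse binary features. It is a textbook-style illustration, not the specific variant evaluated in the paper: the promotion factor `alpha`, the default threshold of half the feature count, the number of epochs, and the function names are all assumptions.

```python
def winnow_train(examples, n_features, alpha=2.0, threshold=None, epochs=5):
    """Train a basic Winnow-style classifier (illustrative sketch).

    examples: list of (active_feature_indices, label), label in {0, 1}.
    Only the features active in an example take part in prediction and
    update, which suits very high-dimensional, sparse feature spaces.
    """
    theta = threshold if threshold is not None else n_features / 2
    w = [1.0] * n_features
    for _ in range(epochs):
        for active, label in examples:
            pred = 1 if sum(w[i] for i in active) >= theta else 0
            if pred == label:
                continue  # mistake-driven: no update on correct predictions
            # Promote active weights on a missed positive, demote them
            # on a false positive.
            factor = alpha if label == 1 else 1.0 / alpha
            for i in active:
                w[i] *= factor
    return w, theta


def winnow_predict(w, theta, active):
    return 1 if sum(w[i] for i in active) >= theta else 0


# Hypothetical usage: the target concept is "feature 0 is active".
data = [([0, 1], 1), ([2, 3], 0), ([0, 4], 1),
        ([1, 5], 0), ([0], 1), ([1, 2, 3, 4, 5], 0)]
w, theta = winnow_train(data, n_features=6)
```

On this toy data the weight of the one relevant feature grows while the irrelevant ones shrink, which reflects why such updates are attractive when the target depends on only a few of many features.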