10 research outputs found

    A large list of confusion sets for spellchecking assessed against a corpus of real-word errors

    One of the methods that has been proposed for dealing with real-word errors (errors that occur when a correctly spelled word is substituted for the one intended) is the "confusion-set" approach - a confusion set being a small group of words that are likely to be confused with one another. Using a list of confusion sets drawn up in advance, a spellchecker, on finding one of these words in a text, can assess whether one of the other members of its set would be a better fit and, if it appears to be so, propose that word as a correction. Much of the research using this approach has suffered from two weaknesses. The first is the small number of confusion sets used. The second is that systems have largely been tested on artificial errors. In this paper we address these two weaknesses. We describe the creation of a realistically sized list of confusion sets, then the assembling of a corpus of real-word errors, and then we assess the potential of that list in relation to that corpus
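
    A minimal sketch of the confusion-set idea described above, assuming a toy bigram count in place of a real language model; the confusion sets, counts and function names here are invented for illustration, not the list or scoring used in the paper.

        # Toy confusion-set spellchecker: for each word that belongs to a confusion
        # set, ask whether another member of the set fits the surrounding words better.
        from collections import Counter

        CONFUSION_SETS = [{"their", "there", "they're"}, {"loose", "lose"}, {"form", "from"}]

        # Stand-in for a bigram language model trained on a corpus.
        BIGRAMS = Counter({("over", "there"): 50, ("their", "house"): 40,
                           ("loose", "screw"): 10, ("lose", "weight"): 30})

        def score(prev, word, nxt):
            """Score a candidate by how often it co-occurs with its neighbours."""
            return BIGRAMS[(prev, word)] + BIGRAMS[(word, nxt)]

        def check(tokens):
            """Return (position, original, suggestion) for likely real-word errors."""
            suggestions = []
            for i, word in enumerate(tokens):
                for conf_set in CONFUSION_SETS:
                    if word in conf_set:
                        prev = tokens[i - 1] if i > 0 else "<s>"
                        nxt = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                        best = max(conf_set, key=lambda w: score(prev, w, nxt))
                        if score(prev, best, nxt) > score(prev, word, nxt):
                            suggestions.append((i, word, best))
            return suggestions

        print(check("i will loose weight over their".split()))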

    Multidimensional Pareto optimization of touchscreen keyboards for speed, familiarity and improved spell checking

    The paper presents a new optimization technique for keyboard layouts based on Pareto front optimization. We used this multifactorial technique to create two new touchscreen phone keyboard layouts based on three design metrics: minimizing finger travel distance in order to maximize text entry speed, a new metric that maximizes spell correction quality by minimizing neighbouring key ambiguity, and maximizing familiarity through a similarity function with the standard Qwerty layout. The paper describes the optimization process and the resulting layouts for a standard trapezoid-shaped keyboard and a more rectangular layout. Fitts' law modelling predicts an 11% improvement in entry speed, without taking into account the significantly improved error correction potential and its subsequent effect on speed. In initial user tests, typing speed dropped from approx. 21 wpm with Qwerty to 13 wpm (64%) on first use of our layout but recovered to 18 wpm (85%) within four short trial sessions, and was still improving. NASA TLX forms showed no significant difference in load between Qwerty and our new layout in the fourth session. Together, we believe this shows that the new layouts are faster and can be quickly adopted by users.
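
    A minimal sketch of the Pareto-front selection step described above, assuming each candidate layout has already been reduced to three metric values; the random metric values are placeholders for the paper's finger-travel, key-ambiguity and Qwerty-similarity measures.

        # Keep every layout that no other layout beats on all three metrics at once.
        import random

        random.seed(0)
        # Metrics oriented so that lower is better: travel distance, neighbouring-key
        # ambiguity, and dissimilarity from the standard Qwerty layout.
        candidates = [tuple(random.random() for _ in range(3)) for _ in range(200)]

        def dominates(a, b):
            """a dominates b if it is no worse on every metric and better on at least one."""
            return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

        pareto_front = [c for c in candidates
                        if not any(dominates(other, c) for other in candidates)]

        print(f"{len(pareto_front)} non-dominated layouts out of {len(candidates)}")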

    A Winnow-Based Approach to Context-Sensitive Spelling Correction

    A large class of machine-learning problems in natural language requires the characterization of linguistic context. Two characteristic properties of such problems are that their feature space is of very high dimensionality, and that their target concepts refer to only a small subset of the features in the space. Under such conditions, multiplicative weight-update algorithms such as Winnow have been shown to have exceptionally good theoretical properties. We present an algorithm combining variants of Winnow and weighted-majority voting, and apply it to a problem in the aforementioned class: context-sensitive spelling correction. This is the task of fixing spelling errors that happen to result in valid words, such as substituting "to" for "too", "casual" for "causal", etc. We evaluate our algorithm, WinSpell, by comparing it against BaySpell, a statistics-based method representing the state of the art for this task. We find: (1) When run with a full (unpruned) set of features, WinSpell achieves accuracies significantly higher than BaySpell was able to achieve in either the pruned or unpruned condition; (2) When compared with other systems in the literature, WinSpell exhibits the highest performance; (3) The primary reason that WinSpell outperforms BaySpell is that WinSpell learns a better linear separator; (4) When run on a test set drawn from a different corpus than the training set, WinSpell is better able than BaySpell to adapt, using a strategy we present that combines supervised learning on the training set with unsupervised learning on the (noisy) test set. Comment: To appear in Machine Learning, Special Issue on Natural Language Learning, 1999. 25 pages.
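
    A bare-bones sketch of a Winnow-style multiplicative update for a single confusion pair, assuming binary context-word features; this is not the full WinSpell combination of Winnow variants and weighted-majority voting, and the example data, threshold and promotion factor are invented.

        ALPHA, THETA = 1.5, 0.5   # promotion/demotion factor and decision threshold

        def train_winnow(examples, n_epochs=10):
            """examples: list of (active_context_features, label) with label in {0, 1}."""
            weights = {}
            for _ in range(n_epochs):
                for features, label in examples:
                    for f in features:
                        weights.setdefault(f, 1.0)      # all weights start at 1
                    score = sum(weights[f] for f in features)
                    prediction = 1 if score > THETA * len(features) else 0
                    if prediction != label:
                        factor = ALPHA if label == 1 else 1.0 / ALPHA
                        for f in features:              # multiplicative update,
                            weights[f] *= factor        # active features only
            return weights

        # Toy usage: label 1 = "too", label 0 = "to"; features are neighbouring words.
        data = [({"much", "far"}, 1), ({"go", "the"}, 0),
                ({"way", "far"}, 1), ({"want", "go"}, 0)]
        print(sorted(train_winnow(data).items(), key=lambda kv: -kv[1])[:5])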

    An extended spell checker for unknown words


    Figure Text Extraction in Biomedical Literature

    Background: Figures are ubiquitous in biomedical full-text articles, and they represent important biomedical knowledge. However, the sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Therefore, we are developing the Biomedical Figure Search engine.

    Detection of semantic errors in Arabic texts

    Detecting semantic errors in a text is still a challenging area of investigation. A lot of research has been done on lexical and syntactic errors, while fewer studies have tackled semantic errors, as they are more difficult to treat. Compared to other languages, Arabic poses a particular challenge for this problem: because its words are graphically very similar to one another, the risk of semantic errors in Arabic texts is greater. Moreover, the language presents special cases and unique complexities. This paper deals with the detection of semantic errors in Arabic texts, but the approach we have adopted can also be applied to texts in other languages. It combines four contextual methods (using statistics and linguistic information) in order to decide on the semantic validity of a word in a sentence. We chose to implement our approach on a distributed architecture, namely a Multi Agent System (MAS). The implemented system achieved a precision of about 90% and a recall of about 83%.
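
    An illustrative sketch of combining several contextual scorers into a single validity decision, in the spirit of the multi-method approach described above; the three toy scorers, their data and the voting rule are placeholders for the four statistical and linguistic methods the paper runs inside a multi-agent system.

        def cooccurrence_score(word, context):
            """Placeholder: fraction of context words seen alongside `word` in a corpus."""
            seen_with = {"bank": {"money", "account"}, "river": {"water", "boat"}}
            return len(seen_with.get(word, set()) & set(context)) / max(len(context), 1)

        def ngram_score(word, context):
            """Placeholder for an n-gram language-model probability."""
            return 0.6 if word == "bank" and "money" in context else 0.2

        def semantic_field_score(word, context):
            """Placeholder for a lexical-resource (semantic field) check."""
            return 0.5

        def is_semantically_valid(word, context, threshold=0.4):
            """Average the methods' scores; flag the word if the combined score is too low."""
            methods = (cooccurrence_score, ngram_score, semantic_field_score)
            return sum(m(word, context) for m in methods) / len(methods) >= threshold

        print(is_semantically_valid("bank", ["deposit", "money", "account"]))   # True
        print(is_semantically_valid("river", ["deposit", "money", "account"]))  # False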

    Contextual Spelling Correction Using Latent Semantic Analysis

    Contextual spelling errors are defined as the use of an incorrect, though valid, word in a particular sentence or context. Traditional spelling checkers flag misspelled words, but they do not typically attempt to identify words that are used incorrectly in a sentence. We explore the use of Latent Semantic Analysis for correcting these incorrectly used words and the results are compared to earlier work based on a Bayesian classifier
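
    A minimal sketch of the LSA idea above, assuming a tiny hand-made term-document matrix: reduce it with a truncated SVD, fold the sentence context into the latent space, and keep the confusable candidate whose latent vector lies closest to the context. The corpus, vocabulary and confusion pair are invented for illustration.

        import numpy as np

        vocab = ["casual", "causal", "wear", "clothes", "effect", "inference"]
        # Rows = terms, columns = toy documents (co-occurrence contexts).
        X = np.array([[2, 0, 1, 0],
                      [0, 2, 0, 1],
                      [2, 0, 0, 0],
                      [1, 0, 1, 0],
                      [0, 2, 0, 1],
                      [0, 1, 0, 2]], dtype=float)

        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        k = 2
        term_vecs = U[:, :k] * s[:k]            # term representations in latent space

        def context_vector(words):
            """Fold a bag of context words into the latent space by averaging term vectors."""
            return term_vecs[[vocab.index(w) for w in words if w in vocab]].mean(axis=0)

        def choose(candidates, context):
            """Pick the candidate whose latent vector is most similar to the context."""
            ctx = context_vector(context)
            cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
            return max(candidates, key=lambda w: cos(term_vecs[vocab.index(w)], ctx))

        print(choose(["casual", "causal"], ["wear", "clothes"]))       # -> "casual"
        print(choose(["casual", "causal"], ["effect", "inference"]))   # -> "causal"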

    Semantic vector representations of senses, concepts and entities and their applications in natural language processing

    Representation learning lies at the core of Artificial Intelligence (AI) and Natural Language Processing (NLP). Most recent research has focused on developing representations at the word level. In particular, the representation of words in a vector space has been viewed as one of the most important successes of lexical semantics and NLP in recent years. The generalization power and flexibility of these representations have enabled their integration into a wide variety of text-based applications, where they have proved extremely beneficial. However, these representations are hampered by an important limitation, as they are unable to model different meanings of the same word. In order to deal with this issue, in this thesis we analyze and develop flexible semantic representations of meanings, i.e. senses, concepts and entities. This finer distinction enables us to model semantic information at a deeper level, which in turn is essential for dealing with ambiguity. In addition, we view these (vector) representations as a connecting bridge between lexical resources and textual data, encoding knowledge from both sources. We argue that these sense-level representations, like word embeddings before them, constitute a first step towards seamlessly integrating explicit knowledge into NLP applications, while focusing on the deeper sense level. Their use not only aims at solving the inherent lexical ambiguity of language, but also represents a first step towards the integration of background knowledge into NLP applications. Multilinguality is another key feature of these representations, as we explore the construction of language-independent and multilingual techniques that can be applied to arbitrary languages, and also across languages. We propose simple unsupervised and supervised frameworks which make use of these vector representations for word sense disambiguation, a key application in natural language understanding, and for other downstream applications such as text categorization and sentiment analysis. Given the nature of the vectors, we also investigate their effectiveness for improving and enriching knowledge bases, by reducing the sense granularity of their sense inventories and extending them with domain labels, hypernyms and collocations.
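
    A minimal sketch of the kind of sense-level disambiguation these representations enable, assuming tiny hand-made word and sense vectors in place of embeddings learned from corpora and lexical resources: pick the sense whose vector is closest to the averaged context vector.

        import numpy as np

        word_vecs = {                     # toy word embeddings for the context
            "money":   np.array([0.9, 0.1]),
            "deposit": np.array([0.8, 0.2]),
            "water":   np.array([0.1, 0.9]),
            "shore":   np.array([0.2, 0.8]),
        }
        sense_vecs = {                    # toy sense embeddings for the ambiguous word
            "bank#finance": np.array([0.95, 0.05]),
            "bank#river":   np.array([0.05, 0.95]),
        }

        def disambiguate(context_words, senses):
            """Return the sense closest (by cosine similarity) to the mean context vector."""
            ctx = np.mean([word_vecs[w] for w in context_words if w in word_vecs], axis=0)
            cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            return max(senses, key=lambda s: cos(sense_vecs[s], ctx))

        print(disambiguate(["money", "deposit"], ["bank#finance", "bank#river"]))  # bank#finance
        print(disambiguate(["water", "shore"], ["bank#finance", "bank#river"]))    # bank#river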