    Misspelling Oblivious Word Embeddings

    In this paper we present a method to learn word embeddings that are resilient to misspellings. Existing word embeddings have limited applicability to malformed texts, which contain a non-negligible amount of out-of-vocabulary words. We propose a method combining FastText with subwords and a supervised task of learning misspelling patterns. In our method, misspellings of each word are embedded close to their correct variants. We train these embeddings on a new dataset we are releasing publicly. Finally, we experimentally show the advantages of this approach on both intrinsic and extrinsic NLP tasks using public test sets. Comment: 9 Page

    Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-Gram Embeddings

    We present an unsupervised context-sensitive spelling correction method for clinical free-text that uses word and character n-gram embeddings. Our method generates misspelling replacement candidates and ranks them according to their semantic fit, by calculating a weighted cosine similarity between the vectorized representation of a candidate and the misspelling context. To tune the parameters of this model, we generate self-induced spelling error corpora. We perform our experiments for two languages. For English, we greatly outperform off-the-shelf spelling correction tools on a manually annotated MIMIC-III test set, and counter the frequency bias of a noisy channel model, showing that neural embeddings can be successfully exploited to improve upon the state-of-the-art. For Dutch, we also outperform an off-the-shelf spelling correction tool on manually annotated clinical records from the Antwerp University Hospital, but can offer no empirical evidence that our method counters the frequency bias of a noisy channel model in this case as well. However, both our context-sensitive model and our implementation of the noisy channel model obtain high scores on the test set, establishing a state-of-the-art for Dutch clinical spelling correction with the noisy channel model. Comment: Appears in volume 7 of the CLIN Journal, http://www.clinjournal.org/biblio/volum
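
    The ranking step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy three-dimensional vectors, the function names, and the reciprocal-distance weighting of context words are all assumptions made for illustration, since the abstract only states that a weighted cosine similarity between a candidate and the misspelling context is used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_vector(left, right, vectors):
    """Weighted sum of context word vectors.

    Assumption: a word at distance d from the misspelling gets weight 1/d.
    Words without a vector are skipped.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    # Reverse the left context so that enumeration starts at the nearest word.
    for side in (list(reversed(left)), list(right)):
        for distance, word in enumerate(side, start=1):
            vec = vectors.get(word)
            if vec is None:
                continue
            weight = 1.0 / distance
            for i, x in enumerate(vec):
                total[i] += weight * x
    return total


def rank_candidates(candidates, left, right, vectors):
    """Order replacement candidates by cosine similarity to the context vector."""
    ctx = context_vector(left, right, vectors)
    scored = [(cosine(vectors[c], ctx), c) for c in candidates if c in vectors]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]


# Hypothetical toy embeddings, hand-made for this example only.
TOY_VECTORS = {
    "patient": [0.9, 0.1, 0.0],
    "received": [0.7, 0.3, 0.1],
    "daily": [0.6, 0.2, 0.2],
    "insulin": [0.8, 0.2, 0.1],
    "insulate": [0.0, 0.1, 0.9],
}

# Context of the misspelling in "patient received insuln daily".
ranking = rank_candidates(
    ["insulate", "insulin"], ["patient", "received"], ["daily"], TOY_VECTORS
)
print(ranking)
```

    With these toy vectors the candidate whose vector points in the same direction as the surrounding words is ranked first. A real system would take its vectors from trained word and character n-gram embeddings rather than a hand-written table.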

    Ordering the suggestions of a spellchecker without using context.

    Having located a misspelling, a spellchecker generally offers some suggestions for the intended word. Even without using context, a spellchecker can draw on various types of information in ordering its suggestions. A series of experiments is described, beginning with a basic corrector that implements a well-known algorithm for reversing single simple errors, and making successive enhancements to take account of substring matches, pronunciation, known error patterns, syllable structure and word frequency. The improvement in the ordering produced by each enhancement is measured on a large corpus of misspellings. The final version is tested on other corpora against a widely used commercial spellchecker and a research prototype.
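
    As a rough illustration of the starting point described above, the sketch below reverses single simple errors (one omitted, inserted, substituted, or transposed letter) and orders the surviving dictionary words by frequency. The abstract does not name the algorithm or give any data, so the function names and the tiny frequency table are hypothetical, and only one of the listed enhancements (word frequency) is shown.

```python
import string


def single_edit_variants(word, alphabet=string.ascii_lowercase):
    """All strings reachable from `word` by undoing one simple error."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    # Drop the input itself, which substitution of a letter by itself reproduces.
    return set(deletes + transposes + substitutes + inserts) - {word}


def suggest(misspelling, frequencies):
    """Dictionary words one edit away, most frequent first (ties alphabetical)."""
    candidates = [w for w in single_edit_variants(misspelling) if w in frequencies]
    return sorted(candidates, key=lambda w: (-frequencies[w], w))


# Hypothetical word counts, invented for this example.
TOY_FREQUENCIES = {"the": 500, "they": 120, "then": 80, "tea": 10}

print(suggest("teh", TOY_FREQUENCIES))
```

    Here "the" (a transposition) and "tea" (a substitution) are both one edit from "teh", and the frequency table puts "the" first. "then" and "they" need two edits, so this basic corrector does not propose them.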

    More blogging features for author identification

    In this paper we present a novel improvement in the field of authorship identification in personal blogs. The improvement comes from a hybrid collection of linguistic features that best capture the style of users in diary blogs. The feature sets comprise LIWC, with its psychological background; a collection of syntactic and part-of-speech (POS) features; and misspelling-error features. Furthermore, we analyze the contribution of each feature set to the final result and compare the outcomes of different combinations of the selected feature sets. Our new categorization of misspelled words, which are mapped into numerical features, noticeably enhances the classification results. The paper also confirms the best ranges of several parameters that affect the final result of authorship identification, such as the number of authors, the number of words in each post, and the number of documents/posts for each author/user. The results and evaluation show that the utilized features are compact, while their performance is highly comparable with that of other, much larger feature sets.

    The adaptation of an English spellchecker for Japanese writers

    It has been pointed out that the spelling errors made by second-language writers writing in English have features that are to some extent characteristic of their first language, and the suggestion has been made that a spellchecker could be adapted to take account of these features. In the work reported here, a corpus of spelling errors made by Japanese writers writing in English was compared with a corpus of errors made by native speakers. While the great majority of errors were common to the two corpora, some distinctively Japanese error patterns were evident against this common background, notably a difficulty in deciding between the letters b and v, and the letters l and r, and a tendency to add syllables. A spellchecker that had been developed for native speakers of English was adapted to cope with these errors. A brief account is given of the spellchecker’s mode of operation to indicate how it lent itself to modifications of this kind. The native-speaker spellchecker and the Japanese-adapted version were run over the error corpora and the results show that these adaptations produced a modest but worthwhile improvement to the spellchecker’s performance in correcting Japanese-made errors.
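
    One way to picture an adaptation for the b/v and l/r confusions mentioned above is a candidate generator that swaps those letters and keeps the results found in a lexicon. This is only a sketch under stated assumptions: the abstract does not describe how the spellchecker was actually modified, the function names and the small lexicon are hypothetical, and the tendency to add syllables is not handled here.

```python
from itertools import product

# Letter confusions named in the abstract: b/v and l/r.
CONFUSABLE = {"b": "v", "v": "b", "l": "r", "r": "l"}


def confusion_variants(word):
    """Every spelling obtained by swapping any subset of confusable letters."""
    options = [
        (ch, CONFUSABLE[ch]) if ch in CONFUSABLE else (ch,)
        for ch in word
    ]
    return {"".join(chars) for chars in product(*options)} - {word}


def adapted_suggestions(misspelling, lexicon):
    """Confusion variants that are real words, in alphabetical order."""
    return sorted(v for v in confusion_variants(misspelling) if v in lexicon)


# Hypothetical lexicon, invented for this example.
TOY_LEXICON = {"very", "berry", "correct", "collect", "river"}

print(adapted_suggestions("bery", TOY_LEXICON))
print(adapted_suggestions("corlect", TOY_LEXICON))
```

    The number of variants doubles with each confusable letter, which is acceptable for single words but would need pruning for long inputs.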

    Fifty years of spellchecking

    A short history of spellchecking from the late 1950s to the present day, describing its development through dictionary lookup, affix stripping, correction, confusion sets, and edit distance to the use of gigantic databases.

    Humour and Misspelling


    Bakelite and other Shibboleths: eBay listings and the 'policing' of 'amateur' collecting knowledges within the space of an online old radio forum

    eBay, the online auction site, is composed of thousands of item descriptions constructed by sellers themselves. Sellers may be collectors or antiques experts, but often they are amateurs selling off unwanted items. As such, eBay becomes an unprecedented public space for the performance of amateur collecting and consumption knowledges where experts are being disintermediated by non-expert knowledges. These knowledges have become a major source of discussion on an online old radio discussion forum and the case study presented here contends that amateur knowledges are strongly contested, often in separate online spaces, and as part of identity performance. While a ‘cult of the amateur’ may be occurring online, it is not happening without a fight over knowledge and its performance. eBay is shown as a relational space to the forum, allowing radio experts to perform their own group identity and related practices - distinguished from those seen on eBay. This paper examines these distinctions in detail - the identifying traits or 'Shibboleths' of eBay amateurs - such as the incorrect spelling of 'Bakelite'.