14,437 research outputs found

Misspelling Oblivious Word Embeddings

Author: Bojanowski Piotr
Edizel Bora
Ferreira Rui
Grave Edouard
Piktus Aleksandra
Silvestri Fabrizio
Publication date: 01/01/2019

In this paper we present a method to learn word embeddings that are resilient to misspellings. Existing word embeddings have limited applicability to malformed texts, which contain a non-negligible amount of out-of-vocabulary words. We propose a method combining FastText with subwords and a supervised task of learning misspelling patterns. In our method, misspellings of each word are embedded close to their correct variants. We train these embeddings on a new dataset we are releasing publicly. Finally, we experimentally show the advantages of this approach on both intrinsic and extrinsic NLP tasks using public test sets. Comment: 9 Page

arXiv.org e-Print Archive

Crossref

Archivio della ricerca- Università di Roma La Sapienza

Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-Gram Embeddings

Author: Daelemans Walter
Fivez Pieter
Šuster Simon
Publication date: 01/01/2017

We present an unsupervised context-sensitive spelling correction method for clinical free-text that uses word and character n-gram embeddings. Our method generates misspelling replacement candidates and ranks them according to their semantic fit, by calculating a weighted cosine similarity between the vectorized representation of a candidate and the misspelling context. To tune the parameters of this model, we generate self-induced spelling error corpora. We perform our experiments for two languages. For English, we greatly outperform off-the-shelf spelling correction tools on a manually annotated MIMIC-III test set, and counter the frequency bias of a noisy channel model, showing that neural embeddings can be successfully exploited to improve upon the state-of-the-art. For Dutch, we also outperform an off-the-shelf spelling correction tool on manually annotated clinical records from the Antwerp University Hospital, but can offer no empirical evidence that our method counters the frequency bias of a noisy channel model in this case as well. However, both our context-sensitive model and our implementation of the noisy channel model obtain high scores on the test set, establishing a state-of-the-art for Dutch clinical spelling correction with the noisy channel model. Comment: Appears in volume 7 of the CLIN Journal, http://www.clinjournal.org/biblio/volum

arXiv.org e-Print Archive

Institutional Repository Universiteit Antwerpen
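
The ranking step described in the abstract above (scoring each replacement candidate by a weighted cosine similarity against the misspelling context) might be sketched roughly as follows. This is only an illustration under stated assumptions: the reciprocal-distance weighting, the toy three-dimensional vectors, and the names `context_vector` and `rank_candidates` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_vector(context, embeddings):
    """Weighted sum of context word vectors.

    `context` is a list of (word, distance) pairs, where distance >= 1 is the
    number of tokens between the word and the misspelling. The 1/distance
    weight is an assumption made for this sketch.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word, distance in context:
        vec = embeddings.get(word)
        if vec is None:  # skip words without an embedding
            continue
        weight = 1.0 / distance
        for i, x in enumerate(vec):
            total[i] += weight * x
    return total


def rank_candidates(candidates, context, embeddings):
    """Order candidates by cosine similarity to the weighted context vector."""
    ctx = context_vector(context, embeddings)
    scored = [(cosine(embeddings[c], ctx), c) for c in candidates if c in embeddings]
    scored.sort(reverse=True)
    return [c for _, c in scored]


# Toy vectors, invented purely for illustration.
TOY_EMBEDDINGS = {
    "patient": [0.9, 0.1, 0.0],
    "fever": [0.8, 0.2, 0.1],
    "cough": [0.85, 0.15, 0.05],
    "couch": [0.1, 0.9, 0.2],
}

ranking = rank_candidates(
    ["couch", "cough"],
    [("patient", 2), ("fever", 1)],
    TOY_EMBEDDINGS,
)
print(ranking)
```

With these toy vectors the clinically plausible candidate is ranked first, since its vector points in nearly the same direction as the weighted context.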

Ordering the suggestions of a spellchecker without using context.

Author: Mitton Roger
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 01/01/2009

Having located a misspelling, a spellchecker generally offers some suggestions for the intended word. Even without using context, a spellchecker can draw on various types of information in ordering its suggestions. A series of experiments is described, beginning with a basic corrector that implements a well-known algorithm for reversing single simple errors, and making successive enhancements to take account of substring matches, pronunciation, known error patterns, syllable structure and word frequency. The improvement in the ordering produced by each enhancement is measured on a large corpus of misspellings. The final version is tested on other corpora against a widely used commercial spellchecker and a research prototype.

Birkbeck Institutional Research Online
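
The abstract above does not name the "well-known algorithm for reversing single simple errors", so the following is only a guess at what such a basic corrector could look like. It assumes the four classic single-error types (one deleted, inserted, substituted, or transposed letter); the function names and the toy dictionary are hypothetical.

```python
import string


def single_edit_candidates(word, alphabet=string.ascii_lowercase):
    """All strings reachable from `word` by one simple edit."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Drop one letter.
    deletes = {left + right[1:] for left, right in splits if right}
    # Swap two adjacent letters.
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    # Substitute one letter.
    replaces = {
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    }
    # Insert one letter.
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    return deletes | transposes | replaces | inserts


def suggest(misspelling, dictionary):
    """Dictionary words that are exactly one simple edit away, in sorted order."""
    return sorted(single_edit_candidates(misspelling) & set(dictionary))


# Toy dictionary, invented for illustration.
TOY_DICTIONARY = {"their", "there", "three", "the"}

suggestions = suggest("thier", TOY_DICTIONARY)
print(suggestions)
```

A corrector like this returns an unordered candidate set; the enhancements listed in the abstract (pronunciation, known error patterns, word frequency, and so on) address how to order those candidates, which this sketch does not attempt.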

More blogging features for author identification

Author: Ahmed Amr
Mohtasseb Haytham
Publication date: 01/01/2009

In this paper we present a novel improvement in authorship identification for personal blogs. The improvement comes from a hybrid collection of linguistic features that best capture the style of users in diary blogs. The feature sets comprise LIWC, with its psychological background, a collection of syntactic and part-of-speech (POS) features, and misspelling-error features. Furthermore, we analyze the contribution of each feature set to the final result and compare the outcomes of different combinations of the selected feature sets. Our new categorization of misspelled words, which are mapped into numerical features, noticeably enhances the classification results. The paper also confirms the best ranges of several parameters that affect the final result of authorship identification, such as the number of authors, the number of words in each post, and the number of documents/posts for each author/user. The results and evaluation show that the utilized features are compact, while their performance is highly comparable with that of other, much larger feature sets.

University of Lincoln Institutional Repository

CiteSeerX

Edge Hill University Research Information Repository

The adaptation of an English spellchecker for Japanese writers

Author: Mitton Roger
Okada T.
Publication date: 01/09/2007

It has been pointed out that the spelling errors made by second-language writers writing in English have features that are to some extent characteristic of their first language, and the suggestion has been made that a spellchecker could be adapted to take account of these features. In the work reported here, a corpus of spelling errors made by Japanese writers writing in English was compared with a corpus of errors made by native speakers. While the great majority of errors were common to the two corpora, some distinctively Japanese error patterns were evident against this common background, notably a difficulty in deciding between the letters b and v, and the letters l and r, and a tendency to add syllables. A spellchecker that had been developed for native speakers of English was adapted to cope with these errors. A brief account is given of the spellchecker’s mode of operation to indicate how it lent itself to modifications of this kind. The native-speaker spellchecker and the Japanese-adapted version were run over the error corpora and the results show that these adaptations produced a modest but worthwhile improvement to the spellchecker’s performance in correcting Japanese-made errors.

Birkbeck Institutional Research Online

Fifty years of spellchecking

Author: Blair CR
Brooks G
Carlson AJ
Cucerzan S
Damerau FJ
Golding AR
Leech G
Levenshtein VI
McIlroy MD
Mihov S
Mitton R
Morris R
Oflazer K
Pedler J
Peterson JL
Pollock JL
Roger Mitton
Savary A
Sterling CM
Veronis J
Wagner RA
Wing AM
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2010

A short history of spellchecking from the late 1950s to the present day, describing its development through dictionary lookup, affix stripping, correction, confusion sets, and edit distance to the use of gigantic databases.

Crossref

Birkbeck Institutional Research Online

Humour and Misspelling

Author: Hayashi Sachiko
林幸子
Publication venue: 神奈川大学 (Kanagawa University)
Publication date: 28/03/1986

KANAGAWA University Repository

Bakelite and other Shibboleths: eBay listings and the 'policing' of 'amateur' collecting knowledges within the space of an online old radio forum

Author: Ellis Rebecca
Publication venue: Centre for Research in Economic Sociology and Innovation (CRESI) Working Paper 2009-05
Publication date: 01/01/2009

eBay, the online auction site, is composed of thousands of item descriptions constructed by sellers themselves. Sellers may be collectors or antiques experts, but often they are amateurs selling off unwanted items. As such, eBay becomes an unprecedented public space for the performance of amateur collecting and consumption knowledges where experts are being disintermediated by non-expert knowledges. These knowledges have become a major source of discussion on an online old radio discussion forum and the case study presented here contends that amateur knowledges are strongly contested, often in separate online spaces, and as part of identity performance. While a ‘cult of the amateur’ may be occurring online, it is not happening without a fight over knowledge and its performance. eBay is shown as a relational space to the forum, allowing radio experts to perform their own group identity and related practices - distinguished from those seen on eBay. This paper examines these distinctions in detail - the identifying traits or 'Shibboleths' of eBay amateurs - such as the incorrect spelling of 'Bakelite'.

University of Essex Research Repository