Misspelling Oblivious Word Embeddings
In this paper we present a method to learn word embeddings that are resilient
to misspellings. Existing word embeddings have limited applicability to
malformed texts, which contain a non-negligible amount of out-of-vocabulary
words. We propose a method combining FastText with subwords and a supervised
task of learning misspelling patterns. In our method, misspellings of each word
are embedded close to their correct variants. We train these embeddings on a
new dataset we are releasing publicly. Finally, we experimentally show the
advantages of this approach on both intrinsic and extrinsic NLP tasks using
public test sets. Comment: 9 Page
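To illustrate why subword information helps with misspellings, here is a minimal sketch of FastText-style character n-gram extraction. It is not the paper's training procedure: the function names `char_ngrams` and `ngram_overlap` are hypothetical, and the n-gram range of 3 to 6 is an assumption chosen to match the fastText library defaults. A misspelling shares many n-grams with its correct form, so a subword-based vector for the misspelling starts out closer to the intended word than to an unrelated one.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of `word` wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def ngram_overlap(a, b):
    """Jaccard overlap of the n-gram sets of two words (illustrative measure only)."""
    grams_a, grams_b = set(char_ngrams(a)), set(char_ngrams(b))
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    print(char_ngrams("cat", 3, 3))
    print(ngram_overlap("because", "becuase"))  # shares boundary n-grams
    print(ngram_overlap("because", "table"))    # no shared n-grams
```

The overlap score is only a rough proxy; the abstract describes an additional supervised objective that pulls misspellings toward their correct variants, which this sketch does not implement.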
Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-Gram Embeddings
We present an unsupervised context-sensitive spelling correction method for
clinical free-text that uses word and character n-gram embeddings. Our method
generates misspelling replacement candidates and ranks them according to their
semantic fit, by calculating a weighted cosine similarity between the
vectorized representation of a candidate and the misspelling context. To tune
the parameters of this model, we generate self-induced spelling error corpora.
We perform our experiments for two languages. For English, we greatly
outperform off-the-shelf spelling correction tools on a manually annotated
MIMIC-III test set, and counter the frequency bias of a noisy channel model,
showing that neural embeddings can be successfully exploited to improve upon
the state-of-the-art. For Dutch, we also outperform an off-the-shelf spelling
correction tool on manually annotated clinical records from the Antwerp
University Hospital, but can offer no empirical evidence that our method
counters the frequency bias of a noisy channel model in this case as well.
However, both our context-sensitive model and our implementation of the noisy
channel model obtain high scores on the test set, establishing a
state-of-the-art for Dutch clinical spelling correction with the noisy channel
model. Comment: Appears in volume 7 of the CLIN Journal,
http://www.clinjournal.org/biblio/volum
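The ranking step described in this abstract can be sketched as follows, under stated assumptions. The toy two-dimensional vectors, the function names, and the reciprocal-distance weighting are all hypothetical; the abstract says only that a weighted cosine similarity is used and does not give the weighting scheme.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_vector(context, vectors):
    """Sum context word vectors, weighting each by 1/distance (assumed scheme).

    `context` is a list of (word, distance_from_misspelling) pairs.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word, distance in context:
        if word not in vectors:
            continue  # skip out-of-vocabulary context words
        weight = 1.0 / distance
        for k, x in enumerate(vectors[word]):
            total[k] += weight * x
    return total


def rank_candidates(candidates, context, vectors):
    """Return (candidate, score) pairs sorted by similarity to the context."""
    ctx = context_vector(context, vectors)
    scored = [(c, cosine(vectors[c], ctx)) for c in candidates if c in vectors]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings, for illustration only.
TOY_VECTORS = {
    "patient": [1.0, 0.1],
    "fever": [0.9, 0.2],
    "heart": [0.8, 0.3],
    "hearth": [0.1, 1.0],
}

if __name__ == "__main__":
    context = [("patient", 1), ("fever", 2)]
    print(rank_candidates(["hearth", "heart"], context, TOY_VECTORS))
```

With these toy vectors the clinically plausible candidate ranks first because its vector points in nearly the same direction as the weighted context.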
Ordering the suggestions of a spellchecker without using context.
Having located a misspelling, a spellchecker generally offers some suggestions for the intended word. Even without using context, a spellchecker can draw on various types of information in ordering its suggestions. A series of experiments is described, beginning with a basic corrector that implements a well-known algorithm for reversing single simple errors, and making successive enhancements to take account of substring matches, pronunciation, known error patterns, syllable structure and word frequency. The improvement in the ordering produced by each enhancement is measured on a large corpus of misspellings. The final version is tested on other corpora against a widely used commercial spellchecker and a research prototype.
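A basic corrector of the kind mentioned in this abstract can be sketched as below. The abstract does not name the algorithm, so this assumes the common single-error model (one insertion, deletion, substitution, or transposition) and orders suggestions by word frequency alone, which is only one of the enhancements listed. The function names and the frequency table are hypothetical.

```python
import string


def single_edits(word, alphabet=string.ascii_lowercase):
    """Return all strings that are one simple edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    replaces = {left + c + right[1:] for left, right in splits if right for c in alphabet}
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return (deletes | transposes | replaces | inserts) - {word}


def suggest(misspelling, word_freq):
    """Dictionary words one edit away, most frequent first (ties alphabetical)."""
    candidates = single_edits(misspelling) & set(word_freq)
    return sorted(candidates, key=lambda w: (-word_freq[w], w))


# Hypothetical frequency counts, for illustration only.
WORD_FREQ = {"the": 100, "they": 30, "then": 20, "tea": 5}

if __name__ == "__main__":
    print(suggest("teh", WORD_FREQ))
```

Here "teh" yields "the" (a transposition) ahead of "tea" (a substitution) purely because of the assumed frequencies; the abstract's further enhancements would refine this ordering.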
More blogging features for author identification
In this paper we present a novel improvement in authorship identification for personal blogs. The improvement comes from a hybrid collection of linguistic features that best capture the style of users in diary blogs. The feature sets comprise LIWC, with its psychological background, a collection of syntactic and part-of-speech (POS) features, and misspelling-error features.
Furthermore, we analyze the contribution of each feature set to the final result and compare the outcomes of different combinations of the selected feature sets. Our new categorization of misspelled words, which are mapped to numerical features, noticeably improves the classification results. The paper also confirms the best ranges of several parameters that affect the final result of authorship identification, such as the number of authors, the number of words in each post, and the number of documents/posts for each author/user. The results and evaluation show that the utilized features are compact, while their performance is highly comparable with that of other, much larger feature sets.
The adaptation of an English spellchecker for Japanese writers
It has been pointed out that the spelling errors made by second-language writers writing in English have features that are to some extent characteristic of their first language, and the suggestion has been made that a spellchecker could be adapted to take account of these features. In the work reported here, a corpus of spelling errors made by Japanese writers writing in English was compared with a corpus of errors made by native speakers. While the great majority of errors were common to the two corpora, some distinctively Japanese error patterns were evident against this common background, notably a difficulty in deciding between the letters b and v, and the letters l and r, and a tendency to add syllables. A spellchecker that had been developed for native speakers of English was adapted to cope with these errors. A brief account is given of the spellchecker's mode of operation to indicate how it lent itself to modifications of this kind. The native-speaker spellchecker and the Japanese-adapted version were run over the error corpora and the results show that these adaptations produced a modest but worthwhile improvement to the spellchecker's performance in correcting Japanese-made errors.
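As a rough illustration of how the b/v and l/r confusions might be handled, the sketch below generates every variant of a misspelling obtained by swapping those letters and keeps the ones found in a dictionary. This is an assumption about one possible approach, not the adapted spellchecker's actual mechanism, and it does not model the tendency to add syllables. The names `CONFUSIONS`, `confusion_variants`, and `correct` are hypothetical.

```python
from itertools import product

# Letters treated as interchangeable (the b/v and l/r pairs from the abstract).
CONFUSIONS = {"b": "bv", "v": "vb", "l": "lr", "r": "rl"}


def confusion_variants(word):
    """Return all strings formed by swapping confusable letters in `word`."""
    # Each position offers either the letter itself or its confusable partner.
    options = [CONFUSIONS.get(ch, ch) for ch in word]
    return {"".join(chars) for chars in product(*options)} - {word}


def correct(misspelling, dictionary):
    """Dictionary words reachable from `misspelling` by confusion swaps only."""
    return sorted(confusion_variants(misspelling) & set(dictionary))


if __name__ == "__main__":
    print(correct("rabel", {"label", "rebel", "table"}))
```

The number of variants doubles with each confusable letter, so a practical system would likely bound or weight these substitutions instead of enumerating them all.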
Fifty years of spellchecking
A short history of spellchecking from the late 1950s to the present day, describing its development through dictionary lookup, affix stripping, correction, confusion sets, and edit distance to the use of gigantic databases.
Bakelite and other Shibboleths: eBay listings and the 'policing' of 'amateur' collecting knowledges within the space of an online old radio forum
eBay, the online auction site, is composed of thousands of item descriptions constructed by sellers themselves. Sellers may be collectors or antiques experts, but often they are amateurs selling off unwanted items. As such, eBay becomes an unprecedented public space for the performance of amateur collecting and consumption knowledges where experts are being disintermediated by non-expert knowledges. These knowledges have become a major source of discussion on an online old radio discussion forum and the case study presented here contends that amateur knowledges are strongly contested, often in separate online spaces, and as part of identity performance. While a 'cult of the amateur' may be occurring online, it is not happening without a fight over knowledge and its performance. eBay is shown as a relational space to the forum, allowing radio experts to perform their own group identity and related practices - distinguished from those seen on eBay. This paper examines these distinctions in detail - the identifying traits or 'Shibboleths' of eBay amateurs - such as the incorrect spelling of 'Bakelite'.