Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-Gram Embeddings
We present an unsupervised context-sensitive spelling correction method for
clinical free-text that uses word and character n-gram embeddings. Our method
generates misspelling replacement candidates and ranks them according to their
semantic fit, by calculating a weighted cosine similarity between the
vectorized representation of a candidate and the misspelling context. To tune
the parameters of this model, we generate self-induced spelling error corpora.
We perform our experiments for two languages. For English, we greatly
outperform off-the-shelf spelling correction tools on a manually annotated
MIMIC-III test set, and counter the frequency bias of a noisy channel model,
showing that neural embeddings can be successfully exploited to improve upon
the state-of-the-art. For Dutch, we also outperform an off-the-shelf spelling
correction tool on manually annotated clinical records from the Antwerp
University Hospital, but can offer no empirical evidence that our method
counters the frequency bias of a noisy channel model in this case as well.
However, both our context-sensitive model and our implementation of the noisy
channel model obtain high scores on the test set, establishing a
state-of-the-art for Dutch clinical spelling correction with the noisy channel
model.
Comment: Appears in volume 7 of the CLIN Journal,
http://www.clinjournal.org/biblio/volum
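The ranking step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy three-dimensional vectors, the inverse-distance context weighting, and the names `cosine`, `context_vector`, and `rank_candidates` are all assumptions made for the example. The abstract does not specify the weighting scheme, and a real system would use trained word and character n-gram embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_vector(context, vectors):
    """Sum the context word vectors, each weighted by 1 / distance to the misspelling.

    The inverse-distance weighting is an assumption made for this sketch.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word, distance in context:
        if word not in vectors:
            continue  # skip out-of-vocabulary context words
        weight = 1.0 / distance
        total = [t + weight * x for t, x in zip(total, vectors[word])]
    return total


def rank_candidates(candidates, context, vectors):
    """Return (candidate, score) pairs sorted by similarity to the weighted context."""
    ctx = context_vector(context, vectors)
    scored = [(c, cosine(vectors[c], ctx)) for c in candidates if c in vectors]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real embeddings would be learned from clinical text.
TOY_VECTORS = {
    "patient": [0.9, 0.1, 0.0],
    "fever": [0.8, 0.3, 0.1],
    "pain": [0.85, 0.2, 0.05],
    "pane": [0.0, 0.1, 0.95],
}

# Candidates for a misspelling such as "paim", with context words given as
# (word, distance from the misspelling) pairs.
ranking = rank_candidates(["pane", "pain"], [("patient", 2), ("fever", 1)], TOY_VECTORS)
print(ranking[0][0])
```

With these toy vectors the clinically plausible candidate "pain" ranks above "pane", because its vector points in nearly the same direction as the weighted context.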
Arabic Spelling Correction using Supervised Learning
In this work, we address the problem of spelling correction in the Arabic
language utilizing the new corpus provided by the QALB (Qatar Arabic Language
Bank) project, which is an annotated corpus of sentences with errors and their
corrections. The corpus contains edit, add before, split, merge, add after,
move and other error types. We are concerned with the first four error types as
they contribute more than 90% of the spelling errors in the corpus. The
proposed system uses separate models to address each error type on its own and
then integrates all the models to provide an efficient and robust system that
achieves an overall recall of 0.59, precision of 0.58 and F1 score of 0.58
including all the error types on the development set. Our system participated
in the QALB 2014 shared task "Automatic Arabic Error Correction" and achieved
an F1 score of 0.6, earning sixth place out of nine participants.
Comment: System description paper submitted to the EMNLP 2014
conference shared task "Automatic Arabic Error Correction" (Mohit et al.,
2014) in the Arabic NLP workshop. 6 pages
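As a quick consistency check on the development-set figures quoted in this abstract, the F1 score can be recomputed as the harmonic mean of precision and recall. The helper name `f1_score` is hypothetical, and the inputs are the rounded values from the abstract, so the result is only approximate.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Rounded development-set values reported in the abstract.
dev_f1 = f1_score(precision=0.58, recall=0.59)
print(round(dev_f1, 3))  # approximately 0.585
```

The result, about 0.585, agrees with the reported development-set F1 of 0.58 to within the rounding of the published precision and recall.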
Grammatical Error Correction: A Survey of the State of the Art
Grammatical Error Correction (GEC) is the task of automatically detecting and
correcting errors in text. The task not only includes the correction of
grammatical errors, such as missing prepositions and mismatched subject-verb
agreement, but also orthographic and semantic errors, such as misspellings and
word choice errors respectively. The field has seen significant progress in the
last decade, motivated in part by a series of five shared tasks, which drove
the development of rule-based methods, statistical classifiers, statistical
machine translation, and finally neural machine translation systems which
represent the current dominant state of the art. In this survey paper, we
condense the field into a single article and first outline some of the
linguistic challenges of the task, introduce the most popular datasets that are
available to researchers (for both English and other languages), and summarise
the various methods and techniques that have been developed with a particular
focus on artificial error generation. We next describe the many different
approaches to evaluation as well as concerns surrounding metric reliability,
especially in relation to subjective human judgements, before concluding with
an overview of recent progress and suggestions for future work and remaining
challenges. We hope that this survey will serve as a comprehensive resource for
researchers who are new to the field or who want to be kept apprised of recent
developments.
'The frozen accident' as an evolutionary adaptation: A rate distortion theory perspective on the dynamics and symmetries of genetic coding mechanisms
We survey some interpretations and related issues concerning the frozen accident hypothesis due to F. Crick and how it can be explained in terms of several natural mechanisms involving error correction codes, spin glasses, symmetry breaking and the characteristic robustness of genetic networks. The approach to most of these questions involves using elements of Shannon's rate distortion theory, incorporating a semantic system that is meaningful for the relevant alphabets and vocabulary implemented in transmission of the genetic code. We apply the fundamental homology between information source uncertainty and the free energy density of a thermodynamical system with respect to transcriptional regulators and the communication channels of sequence/structure in proteins. This leads to the suggestion that the frozen accident may have been a type of evolutionary adaptation.
Correcting input noise in SMT as a char-based translation problem
Misspelled words have a direct impact on the final quality obtained by Statistical Machine Translation (SMT) systems, as the input becomes noisy and unpredictable. This paper presents some improvement strategies for translating real-life noisy input. The proposed strategies are based on a preprocessing step consisting of a character-based translator.
Peer Reviewed. Preprint