Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-Gram Embeddings
We present an unsupervised context-sensitive spelling correction method for
clinical free-text that uses word and character n-gram embeddings. Our method
generates misspelling replacement candidates and ranks them according to their
semantic fit, by calculating a weighted cosine similarity between the
vectorized representation of a candidate and the misspelling context. To tune
the parameters of this model, we generate self-induced spelling error corpora.
We perform our experiments for two languages. For English, we greatly
outperform off-the-shelf spelling correction tools on a manually annotated
MIMIC-III test set, and counter the frequency bias of a noisy channel model,
showing that neural embeddings can be successfully exploited to improve upon
the state-of-the-art. For Dutch, we also outperform an off-the-shelf spelling
correction tool on manually annotated clinical records from the Antwerp
University Hospital, but can offer no empirical evidence that our method
counters the frequency bias of a noisy channel model in this case as well.
However, both our context-sensitive model and our implementation of the noisy
channel model obtain high scores on the test set, establishing a
state-of-the-art for Dutch clinical spelling correction with the noisy channel
model.
Comment: Appears in volume 7 of the CLIN Journal, http://www.clinjournal.org/biblio/volum
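The candidate-ranking idea above can be sketched in a few lines: vectorize each replacement candidate, build a weighted average of the context vectors, and rank candidates by cosine similarity to that context. This is a minimal illustration only; the function names, the toy embeddings, and the distance-decay weighting scheme are assumptions, not the paper's actual candidate generator or tuned weights.

```python
# Minimal sketch of context-sensitive candidate ranking via cosine similarity.
# All names, vectors, and the decay weighting are illustrative assumptions.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(candidates, context_words, embed, decay=0.9):
    """Rank replacement candidates by weighted cosine similarity to the context.

    embed maps words to vectors; earlier (closer) context words get higher weight.
    """
    weights = [decay ** d for d in range(len(context_words))]
    ctx = sum(w * embed[t] for w, t in zip(weights, context_words))
    scored = [(cosine(embed[c], ctx), c) for c in candidates if c in embed]
    return [c for _, c in sorted(scored, reverse=True)]

# Toy 2-d embeddings: "patient" fits a clinical context better than "patent".
emb = {"patient": np.array([1.0, 0.0]),
       "patent": np.array([0.0, 1.0]),
       "ward": np.array([0.9, 0.1]),
       "admitted": np.array([0.8, 0.2])}
print(rank_candidates(["patient", "patent"], ["admitted", "ward"], emb))
# → ['patient', 'patent']
```

In this toy example the clinical context vectors pull "patient" above the homophone-like misspelling candidate "patent", which is the frequency-bias-countering behavior the abstract describes.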
Weakly-supervised Fine-grained Event Recognition on Social Media Texts for Disaster Management
People increasingly use social media to report emergencies, seek help or
share information during disasters, which makes social networks an important
tool for disaster management. To meet these time-critical needs, we present a
weakly supervised approach for rapidly building high-quality classifiers that
label each individual Twitter message with fine-grained event categories. Most
importantly, we propose a novel method to create high-quality labeled data in a
timely manner that automatically clusters tweets containing an event keyword
and asks a domain expert to disambiguate event word senses and label clusters
quickly. In addition, to process extremely noisy and often rather short
user-generated messages, we enrich tweet representations using preceding
context tweets and reply tweets in building event recognition classifiers. The
evaluation on two hurricanes, Harvey and Florence, shows that using only 1-2
person-hours of human supervision, the rapidly trained weakly supervised
classifiers outperform supervised classifiers trained using more than ten
thousand annotated tweets created in over 50 person-hours.
Comment: In Proceedings of the AAAI 2020 (AI for Social Impact Track). Link: https://aaai.org/ojs/index.php/AAAI/article/view/539
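The labeling shortcut described above — cluster tweets that contain an event keyword, then have an expert label whole clusters rather than individual tweets — can be illustrated with a simple greedy threshold clustering. The greedy scheme, the term-frequency vectors, and the 0.5 threshold are stand-in assumptions; the paper's actual clustering method is not reproduced here.

```python
# Illustrative sketch: keyword-triggered greedy clustering of tweets so that a
# domain expert labels clusters, not individual messages. All parameters assumed.
from collections import Counter
import math

def tf_vector(text):
    return Counter(text.lower().split())

def cos(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_tweets(tweets, keyword, threshold=0.5):
    """Group tweets mentioning `keyword`; each new tweet joins the first
    sufficiently similar cluster, else starts its own."""
    hits = [t for t in tweets if keyword in t.lower()]
    clusters = []
    for t in hits:
        v = tf_vector(t)
        for c in clusters:
            if cos(v, tf_vector(c[0])) >= threshold:
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters

tweets = ["water rising fast need rescue",
          "water rising fast in my street need help",
          "donate water bottles at the shelter"]
groups = cluster_tweets(tweets, "water")
# Two clusters emerge: a "flooding" sense and a "donation" sense of the keyword.
# One expert judgment per cluster then labels every member tweet at once.
```

The payoff is exactly the one the abstract claims: the expert disambiguates a handful of keyword senses instead of annotating thousands of tweets.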
FASTSUBS: An Efficient and Exact Procedure for Finding the Most Likely Lexical Substitutes Based on an N-gram Language Model
Lexical substitutes have found use in areas such as paraphrasing, text
simplification, machine translation, word sense disambiguation, and part of
speech induction. However, the computational complexity of accurately
identifying the most likely substitutes for a word has made large scale
experiments difficult. In this paper I introduce a new search algorithm,
FASTSUBS, that is guaranteed to find the K most likely lexical substitutes for
a given word in a sentence based on an n-gram language model. The computation
is sub-linear in both K and the vocabulary size V. An implementation of the
algorithm and a dataset with the top 100 substitutes of each token in the WSJ
section of the Penn Treebank are available at http://goo.gl/jzKH0.
Comment: 4 pages, 1 figure, to appear in IEEE Signal Processing Letters
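The problem FASTSUBS solves can be made concrete with the brute-force baseline it improves on: score every vocabulary word for a slot under the language model and keep the top K. The sketch below uses a toy bigram model with made-up log-probabilities; it is O(V) per slot, whereas FASTSUBS finds the same exact top-K answer sub-linearly in both K and V.

```python
# Brute-force top-K lexical substitution under a toy bigram LM.
# The vocabulary and log-probabilities are illustrative assumptions, not from
# the paper; FASTSUBS computes the same exact result without scoring all of V.
import heapq

bigram = {  # log P(w2 | w1) for a tiny vocabulary
    ("the", "cat"): -0.5, ("the", "dog"): -0.7, ("the", "car"): -1.2,
    ("cat", "sat"): -0.4, ("dog", "sat"): -0.6, ("car", "sat"): -2.0,
}
vocab = ["cat", "dog", "car"]
UNSEEN = -10.0  # back-off penalty for unseen bigrams (assumed)

def score(word, left, right):
    """Contextual log-probability of `word`: log P(word|left) + log P(right|word)."""
    return bigram.get((left, word), UNSEEN) + bigram.get((word, right), UNSEEN)

def top_k_substitutes(k, left, right):
    return heapq.nlargest(k, vocab, key=lambda w: score(w, left, right))

print(top_k_substitutes(2, "the", "sat"))
# → ['cat', 'dog']
```

Scoring every vocabulary item like this is exactly what becomes infeasible at scale, which is the motivation for the exact sub-linear search the paper introduces.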
The interaction of knowledge sources in word sense disambiguation
Word sense disambiguation (WSD) is a computational linguistics task likely to benefit from the tradition of combining different knowledge sources in artificial intelligence research. An important step in the exploration of this hypothesis is to determine which linguistic knowledge sources are most useful and whether their combination leads to improved results.
We present a sense tagger which uses several knowledge sources. Tested accuracy exceeds 94% on our evaluation corpus. Our system attempts to disambiguate all content words in running text rather than limiting itself to treating a restricted vocabulary of words. It is argued that this approach is more likely to assist the creation of practical systems.
Improved Noisy Student Training for Automatic Speech Recognition
Recently, a semi-supervised learning method known as "noisy student training"
has been shown to improve image classification performance of deep networks
significantly. Noisy student training is an iterative self-training method that
leverages augmentation to improve network performance. In this work, we adapt
and improve noisy student training for automatic speech recognition, employing
(adaptive) SpecAugment as the augmentation method. We find effective methods to
filter, balance and augment the data generated in between self-training
iterations. By doing so, we are able to obtain word error rates (WERs)
4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h
subset of LibriSpeech as the supervised set and the rest (860h) as the
unlabeled set. Furthermore, we are able to achieve WERs 1.7%/3.4% on the
clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight
as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the
previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h
(4.74%/12.20%) and LibriSpeech (1.9%/4.1%).
Comment: 5 pages, 5 figures, 4 tables; v2: minor revisions, reference added
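The iterative self-training loop described above has a simple high-level shape: train a teacher on the supervised set, transcribe the unlabeled audio, filter and balance the pseudo-labels, and retrain on the union with augmentation. The skeleton below is a generic sketch of that loop; `train`, `augment`, `transcribe`, and `filter_confident` are placeholders for the paper's actual components (e.g. adaptive SpecAugment and their filtering/balancing heuristics).

```python
# High-level sketch of a noisy student self-training loop. The four callables
# are placeholders, not the paper's implementation.
def noisy_student(labeled, unlabeled, train, augment, transcribe,
                  filter_confident, rounds=3):
    model = train(labeled, augment)               # teacher on supervised data
    for _ in range(rounds):
        # Pseudo-label the unlabeled pool with the current model.
        pseudo = [(x, transcribe(model, x)) for x in unlabeled]
        # Drop low-confidence transcripts and rebalance between iterations.
        pseudo = filter_confident(pseudo)
        # The student is trained on labeled + pseudo-labeled data, with
        # augmentation injecting the "noise" that gives the method its name.
        model = train(labeled + pseudo, augment)
    return model
```

The key design point the abstract highlights is what happens between iterations: careful filtering, balancing, and augmentation of the generated data, not just blind retraining on raw pseudo-labels.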
Mining question-answer pairs from web forum: a survey of challenges and resolutions
Internet forums, also known as discussion boards, are popular web applications. Members of a board discuss issues and share ideas to form a community, and as a result generate a huge amount of content on different topics on a daily basis. Interest in information extraction and knowledge discovery from such sources has been on the increase in the research community. A number of factors limit the potential for mining knowledge from forums: a lexical chasm, or lexical gap, that renders some Natural Language Processing (NLP) techniques less effective; an informal tone that creates noisy data; drifting discussion topics that prevent focused mining; and asynchronous posting, which makes it difficult to establish post-reply relationships. This survey introduces these challenges within the framework of question answering. It describes the problems; cites and explores useful publications for further examination; and provides an overview of resolution strategies and findings relevant to the challenges.