6 research outputs found
Improving Robustness of Machine Translation with Synthetic Noise
Modern Machine Translation (MT) systems perform consistently well on clean,
in-domain text. However, most human-generated text, particularly in the realm of
social media, is full of typos, slang, dialect, idiolect, and other noise, which
can have a disastrous impact on the accuracy of the output translation. In this
paper we leverage the Machine Translation of Noisy Text (MTNT) dataset to
enhance the robustness of MT systems by emulating naturally occurring noise in
otherwise clean data. By synthesizing noise in this manner, we are ultimately able
to make a vanilla MT system resilient to naturally occurring noise and to
partially mitigate the resulting loss in accuracy.
Comment: Accepted at NAACL 2019
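To make the noise-emulation idea concrete, here is a minimal sketch of perturbing clean source text with character-level edits. It is an illustration under assumed operations and rates; the paper itself models noise on the error statistics observed in MTNT rather than on hand-picked probabilities like these.

```python
import random

# Illustrative sketch only: the operations and rates below are assumptions, not
# the paper's procedure, which derives its noise model from MTNT statistics.
def add_synthetic_noise(sentence, p_typo=0.05, p_drop=0.02, p_swap=0.02, seed=None):
    """Return a noised copy of a clean sentence via simple character edits."""
    rng = random.Random(seed)
    chars = list(sentence)
    out, i = [], 0
    while i < len(chars):
        c, r = chars[i], rng.random()
        if r < p_drop:
            i += 1                                   # drop a character
            continue
        if r < p_drop + p_swap and i + 1 < len(chars):
            out.extend([chars[i + 1], c])            # swap adjacent characters
            i += 2
            continue
        if r < p_drop + p_swap + p_typo and c.isalpha():
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))  # random substitution
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)

# Noising only the source side of otherwise clean parallel data before training.
print(add_synthetic_noise("modern machine translation systems perform well", seed=0))
```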
Language Model Adaptation for Language and Dialect Identification of Text
This article describes an unsupervised language model adaptation approach
that can be used to enhance the performance of language identification methods.
The approach is applied to a current version of the HeLI language
identification method, which is now called HeLI 2.0. We describe the HeLI 2.0
method in detail. The resulting system is evaluated using the datasets from the
German dialect identification and Indo-Aryan language identification shared
tasks of the VarDial workshops 2017 and 2018. The new approach with language
model adaptation provides considerably higher F1-scores than the previous HeLI
method or the other systems that participated in the shared tasks. The results
indicate that unsupervised language model adaptation should be considered as an
option in all language identification tasks, especially in those where
encountering out-of-domain data is likely.
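A rough sketch of the adaptation idea follows, assuming simple character-trigram models rather than the full HeLI 2.0 scoring: test lines identified with high confidence are folded back into the winning language's model before the remaining lines are re-scored. The function names, the smoothing, and the thresholds are illustrative assumptions.

```python
from collections import Counter
import math

# Hedged sketch of unsupervised language model adaptation for LID; not the HeLI 2.0
# implementation, just the folding-back idea on top of character-trigram models.
def trigrams(text):
    text = f"  {text.lower()}  "
    return [text[i:i + 3] for i in range(len(text) - 2)]

def neg_log_prob(line, model, total):
    # Add-one smoothed negative log-probability of the line (lower is better).
    return -sum(math.log((model[g] + 1) / (total + len(model) + 1)) for g in trigrams(line))

def identify_with_adaptation(train, test_lines, rounds=3, top_frac=0.2):
    models = {lang: Counter(g for s in sents for g in trigrams(s)) for lang, sents in train.items()}
    totals = {lang: sum(m.values()) for lang, m in models.items()}
    labels = {}
    for _ in range(rounds):
        scored = []
        for line in test_lines:
            ranked = sorted((neg_log_prob(line, models[l], totals[l]), l) for l in models)
            (best_score, best_lang), (second_score, _) = ranked[0], ranked[1]
            labels[line] = best_lang
            scored.append((second_score - best_score, line, best_lang))  # margin = confidence
        # Adaptation step: add the most confidently labelled lines to that language's model.
        scored.sort(reverse=True)
        for _, line, lang in scored[: max(1, int(top_frac * len(scored)))]:
            models[lang].update(trigrams(line))
            totals[lang] = sum(models[lang].values())
    return labels
```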
MTNT: A Testbed for Machine Translation of Noisy Text
Noisy or non-standard input text can cause disastrous mistranslations in most
modern Machine Translation (MT) systems, and there has been growing research
interest in creating noise-robust MT systems. However, as yet there are no
publicly available parallel corpora with naturally occurring noisy inputs
and translations, and thus previous work has resorted to evaluating on
synthetically created datasets. In this paper, we propose a benchmark dataset
for Machine Translation of Noisy Text (MTNT), consisting of noisy comments on
Reddit (www.reddit.com) and professionally sourced translations. We
commissioned translations of English comments into French and Japanese, as well
as French and Japanese comments into English, on the order of 7k-37k sentences
per language pair. We qualitatively and quantitatively examine the types of
noise included in this dataset, then demonstrate that existing MT models fail
badly on a number of noise-related phenomena, even after performing adaptation
on a small training set of in-domain data. This indicates that this dataset can
provide an attractive testbed for methods tailored to handling noisy text in
MT. The data is publicly available at www.cs.cmu.edu/~pmichel1/mtnt/.
Comment: EMNLP 2018 Long Paper
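As a hedged illustration of how the clean-versus-noisy gap on such a testbed might be measured, the sketch below scores a translation system with sacrebleu on a pair of test files; `translate` and the file names are placeholders, not part of the MTNT release.

```python
import sacrebleu

# Sketch only: `system_translate` is any callable mapping a source sentence to a
# hypothesis; the file names are placeholders for clean and noisy test sets.
def corpus_bleu_from_files(system_translate, src_path, ref_path):
    with open(src_path, encoding="utf-8") as f:
        sources = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]
    hypotheses = [system_translate(s) for s in sources]
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

# clean_bleu = corpus_bleu_from_files(translate, "newstest.fr", "newstest.en")
# noisy_bleu = corpus_bleu_from_files(translate, "mtnt.fr", "mtnt.en")
# A large drop from clean_bleu to noisy_bleu indicates sensitivity to natural noise.
```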
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
Large text corpora are increasingly important for a wide variety of Natural
Language Processing (NLP) tasks, and automatic language identification (LangID)
is a core technology needed to collect such datasets in a multilingual context.
LangID is largely treated as solved in the literature, with models reported
that achieve over 90% average F1 on as many as 1,366 languages. We train LangID
models on up to 1,629 languages with comparable quality on held-out test sets,
but find that human-judged LangID accuracy for web-crawl text corpora created
using these models is only around 5% for many lower-resource languages,
suggesting a need for more robust evaluation. Further analysis revealed a
variety of error modes, arising from domain mismatch, class imbalance, language
similarity, and insufficiently expressive models. We propose two classes of
techniques to mitigate these errors: wordlist-based tunable-precision filters
(for which we release curated lists in about 500 languages) and
transformer-based semi-supervised LangID models, which increase median dataset
precision from 5.5% to 71.2%. These techniques enable us to create an initial
data set covering 100K or more relatively clean sentences in each of 500+
languages, paving the way towards a 1,000-language web text corpus.
Comment: Accepted to COLING 2020. 9 pages with 8 page abstract
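The wordlist-based precision filter lends itself to a short sketch; the whitespace tokenization and the 0.8 threshold below are assumptions, and in practice the wordlists would be the curated lists released with the paper.

```python
# Sketch of a tunable-precision wordlist filter: keep only sentences whose
# in-vocabulary token fraction meets a threshold; raising the threshold trades
# recall for precision. Tokenization and threshold are illustrative assumptions.
def wordlist_filter(sentences, wordlist, min_in_vocab=0.8):
    vocab = {w.lower() for w in wordlist}
    kept = []
    for sent in sentences:
        tokens = sent.lower().split()
        if not tokens:
            continue
        in_vocab = sum(t in vocab for t in tokens) / len(tokens)
        if in_vocab >= min_in_vocab:
            kept.append(sent)
    return kept

# Example: filter a crawled corpus with a (hypothetical) Swahili wordlist.
# clean_sw = wordlist_filter(crawled_sentences, swahili_wordlist, min_in_vocab=0.85)
```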
Language Identification on Massive Datasets of Short Message using an Attention Mechanism CNN
Language Identification (LID) is a challenging task, especially when the
input texts are short and noisy such as posts and statuses on social media or
chat logs on gaming forums. The task has been tackled by either designing a
feature set for a traditional classifier (e.g. Naive Bayes) or applying a deep
neural network classifier (e.g. Bi-directional Gated Recurrent Unit,
Encoder-Decoder). These methods are usually trained and tested on a huge amount
of private data, then used and evaluated as off-the-shelf packages by other
researchers using their own datasets, and consequently the various results
published are not directly comparable. In this paper, we first create a new
massive labelled dataset based on one year of Twitter data. We use this dataset
to test several existing language identification systems, in order to obtain a
set of coherent benchmarks, and we make our dataset publicly available so that
others can add to this set of benchmarks. Finally, we propose a shallow but
efficient neural LID system, which is an n-gram regional convolutional neural
network enhanced with an attention mechanism. Experimental results show that
our architecture is able to predict tens of thousands of samples per second and
surpasses all state-of-the-art systems with an improvement of 5%.
Comment: 9 pages, 5 tables, 1 figure
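One possible reading of such an architecture, sketched in PyTorch with assumed layer sizes and n-gram widths rather than the paper's exact configuration: character n-gram convolutions followed by attention pooling over positions.

```python
import torch
import torch.nn as nn

# Hedged sketch, not the paper's model: embedding sizes, channel counts, n-gram
# widths, and the attention form are all assumptions for illustration.
class AttentionCNNLID(nn.Module):
    def __init__(self, vocab_size, num_langs, emb_dim=64, channels=128, ngram_widths=(2, 3, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, channels, kernel_size=k, padding=k // 2) for k in ngram_widths
        )
        feat_dim = channels * len(ngram_widths)
        self.attn = nn.Linear(feat_dim, 1)        # scores each character position
        self.out = nn.Linear(feat_dim, num_langs)

    def forward(self, char_ids):                  # char_ids: (batch, seq_len) integer codes
        x = self.embed(char_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        feats = [torch.relu(conv(x)) for conv in self.convs]
        # Different kernel widths/paddings can disagree by one position; crop to the shortest.
        min_len = min(f.size(2) for f in feats)
        h = torch.cat([f[:, :, :min_len] for f in feats], dim=1).transpose(1, 2)  # (batch, seq, feat)
        weights = torch.softmax(self.attn(h).squeeze(-1), dim=1)  # (batch, seq) attention weights
        pooled = (weights.unsqueeze(-1) * h).sum(dim=1)           # attention-weighted sum
        return self.out(pooled)                                   # (batch, num_langs) logits

# logits = AttentionCNNLID(vocab_size=300, num_langs=100)(torch.randint(1, 300, (8, 120)))
```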
Towards Unbiased and Accurate Deferral to Multiple Experts
Machine learning models are often deployed in concert with humans in the
pipeline, with the model having an option to defer to a domain expert in cases
where it has low confidence in its inference. Our goal is to design mechanisms
for ensuring accuracy and fairness in such prediction systems that combine
machine learning model inferences and domain expert predictions. Prior work on
"deferral systems" in classification settings has focused on the setting of a
pipeline with a single expert and aimed to accommodate the inaccuracies and
biases of this expert to simultaneously learn an inference model and a deferral
system. Our work extends this framework to settings where multiple experts are
available, with each expert having their own domain of expertise and biases. We
propose a framework that simultaneously learns a classifier and a deferral
system, with the deferral system choosing to defer to one or more human experts
in cases of input where the classifier has low confidence. We test our
framework on a synthetic dataset and a content moderation dataset with biased
synthetic experts, and show that it significantly improves the accuracy and
fairness of the final predictions, compared to the baselines. We also collect
crowdsourced labels for the content moderation task to construct a real-world
dataset for the evaluation of hybrid machine-human frameworks and show that our
proposed learning framework outperforms baselines on this real-world dataset as
well.
Comment: This paper has been accepted for publication at the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES 2021).
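As an inference-time illustration only (the paper learns the classifier and the deferral policy jointly), the sketch below defers to the expert with the highest estimated accuracy whenever the classifier's confidence falls below a threshold; the threshold and the per-expert accuracy estimates are placeholder inputs, not learned components.

```python
import numpy as np

# Hedged sketch of the defer-or-predict decision for a single input; in the paper
# the classifier and deferral system are trained jointly rather than thresholded.
def predict_or_defer(class_probs, expert_preds, expert_accuracy, threshold=0.8):
    """class_probs: (num_classes,) classifier probabilities for one input.
    expert_preds: dict expert_name -> that expert's predicted label for this input.
    expert_accuracy: dict expert_name -> estimated accuracy on inputs like this one."""
    if class_probs.max() >= threshold:
        return int(np.argmax(class_probs)), "model"               # confident: keep the model's label
    best_expert = max(expert_accuracy, key=expert_accuracy.get)   # defer to the strongest expert
    return expert_preds[best_expert], best_expert

# Example with two synthetic experts on a three-class task.
label, source = predict_or_defer(
    np.array([0.4, 0.35, 0.25]),
    expert_preds={"expert_a": 2, "expert_b": 1},
    expert_accuracy={"expert_a": 0.7, "expert_b": 0.9},
)
```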