6,120 research outputs found
(Self-Attentive) Autoencoder-based Universal Language Representation for Machine Translation
Universal language representation is the holy grail in machine translation
(MT). Thanks to the new neural MT approach, it seems that there are good
perspectives towards this goal. In this paper, we propose a new architecture
based on combining variational autoencoders with encoder-decoders and
introducing an interlingual loss as an additional training objective. By adding
and forcing this interlingual loss, we are able to train multiple encoders and
decoders for each language, sharing a common universal representation. Since
the final objective of this universal representation is producing close results
for similar input sentences (in any language), we propose to evaluate it by
encoding the same sentence in two different languages, decoding both latent
representations into the same language and comparing both outputs. Preliminary
results on the WMT 2017 Turkish/English task show that the proposed
architecture is capable of learning a universal language representation and
simultaneously training both translation directions with state-of-the-art
results. Comment: 7 pages, 4 figures
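The interlingual loss added as an extra training objective can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the interlingual term is a mean-squared distance between the two latent sentence vectors, and the function names (`interlingual_loss`, `combined_loss`) are hypothetical.

```python
def interlingual_loss(z_src, z_tgt):
    """Assumed distance term: mean squared difference between the
    latent representations of the same sentence in two languages.
    Driving this toward zero pushes all encoders toward one shared,
    language-independent space."""
    return sum((a - b) ** 2 for a, b in zip(z_src, z_tgt)) / len(z_src)

def combined_loss(translation_loss, z_src, z_tgt, weight=1.0):
    """Translation objective plus the weighted interlingual term,
    used as an additional training objective."""
    return translation_loss + weight * interlingual_loss(z_src, z_tgt)

# Identical latent vectors contribute no interlingual penalty,
# so the combined loss reduces to the translation loss alone.
z = [0.5, -1.0, 2.0]
print(combined_loss(2.5, z, z))
```

The `weight` hyperparameter trades off translation quality against how tightly the per-language latent spaces are pulled together.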
A neural interlingua for multilingual machine translation
We incorporate an explicit neural interlingua into a multilingual
encoder-decoder neural machine translation (NMT) architecture. We demonstrate
that our model learns a language-independent representation by performing
direct zero-shot translation (without using pivot translation), and by using
the source sentence embeddings to create an English Yelp review classifier
that, through the mediation of the neural interlingua, can also classify French
and German reviews. Furthermore, we show that, despite using a smaller number
of parameters than a pairwise collection of bilingual NMT models, our approach
produces comparable BLEU scores for each language pair in WMT15. Comment: Accepted in WMT 1
Improving Multilingual Semantic Textual Similarity with Shared Sentence Encoder for Low-resource Languages
Measuring the semantic similarity between two sentences (or Semantic Textual
Similarity - STS) is fundamental in many NLP applications. Despite the
remarkable results in supervised settings with adequate labeling, little
attention has been paid to this task in low-resource languages with
insufficient labeling. Existing approaches mostly leverage machine translation
techniques to translate sentences into a rich-resource language. These
approaches either introduce language biases or are impractical in industrial
applications, where spoken-language scenarios are common and strict efficiency
is required. In this work, we propose a multilingual framework to tackle the
STS task in low-resource languages, e.g. Spanish, Arabic, Indonesian and Thai,
by utilizing the rich annotation data of a rich-resource language, e.g.
English. Our approach extends a basic monolingual STS framework to a shared
multilingual encoder pretrained with a translation task to incorporate
rich-resource language data. By exploiting the nature of a shared multilingual
encoder, one sentence can have multiple representations for different target
translation languages, which are used in an ensemble model to improve
similarity evaluation. We demonstrate the superiority of our method over other
state-of-the-art approaches on the SemEval STS task through its significant
improvement over non-MT methods, as well as on an online industrial product
where the MT method fails to beat the baseline while our approach still yields
consistent improvements.
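The ensemble idea above, averaging similarity over the multiple per-target-language representations of the same sentence pair, can be sketched as follows. This is a simplified illustration under assumptions not stated in the abstract: cosine similarity as the pairwise score and a plain mean as the ensemble; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ensemble_sts(reprs_a, reprs_b):
    """Assumed ensemble: average the similarity over the
    representations produced for each target translation
    language (one list entry per target language)."""
    scores = [cosine(u, v) for u, v in zip(reprs_a, reprs_b)]
    return sum(scores) / len(scores)

# Two target languages -> two representations per sentence.
sent_a = [[1.0, 0.0], [0.0, 1.0]]
sent_b = [[1.0, 0.0], [0.0, 1.0]]
print(ensemble_sts(sent_a, sent_b))
```

Averaging over target languages smooths out noise any single representation introduces, which is the motivation the abstract gives for the ensemble.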
XNLI: Evaluating Cross-lingual Sentence Representations
State-of-the-art natural language processing systems rely on supervision in
the form of annotated data to learn competent models. These models are
generally trained on data in a single language (usually English), and cannot be
directly used beyond that language. Since collecting data in every language is
not realistic, there has been a growing interest in cross-lingual language
understanding (XLU) and low-resource cross-language transfer. In this work, we
construct an evaluation set for XLU by extending the development and test sets
of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15
languages, including low-resource languages such as Swahili and Urdu. We hope
that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence
understanding by providing an informative standard evaluation task. In
addition, we provide several baselines for multilingual sentence understanding,
including two based on machine translation systems, and two that use parallel
data to train aligned multilingual bag-of-words and LSTM encoders. We find that
XNLI represents a practical and challenging evaluation suite, and that directly
translating the test data yields the best performance among available
baselines. Comment: EMNLP 201
Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model
A significant roadblock in multilingual neural language modeling is the lack
of labeled non-English data. One potential method for overcoming this issue is
learning cross-lingual text representations that can be used to transfer the
performance from training on English tasks to non-English tasks, despite little
to no task-specific non-English data. In this paper, we explore a natural setup
for learning cross-lingual sentence representations: the dual-encoder. We
provide a comprehensive evaluation of our cross-lingual representations on a
number of monolingual, cross-lingual, and zero-shot/few-shot learning tasks,
and also give an analysis of different learned cross-lingual embedding spaces. Comment: Accepted at the 4th Workshop on Representation Learning for NLP
(RepL4NLP-2019
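A dual-encoder for cross-lingual sentence representations is commonly trained with an in-batch ranking objective: each source sentence should score its own translation above every other target in the batch. The sketch below illustrates that objective in plain Python; the softmax form, the dot-product scorer, and the name `ranking_loss` are assumptions for illustration, not the paper's exact formulation.

```python
import math

def ranking_loss(src_embs, tgt_embs):
    """Assumed in-batch softmax ranking loss for a dual-encoder:
    for each source embedding, the matching translation sits on
    the diagonal of the batch score matrix, and the loss is the
    negative log-softmax of that diagonal score."""
    total = 0.0
    for i, s in enumerate(src_embs):
        # Dot-product scores against every target in the batch.
        scores = [sum(a * b for a, b in zip(s, t)) for t in tgt_embs]
        log_z = math.log(sum(math.exp(x) for x in scores))
        total += log_z - scores[i]  # -log p(correct target)
    return total / len(src_embs)

# Aligned pairs on the diagonal -> near-zero loss.
src = [[10.0, 0.0], [0.0, 10.0]]
tgt = [[1.0, 0.0], [0.0, 1.0]]
print(ranking_loss(src, tgt))
```

Because both encoders are trained against the same score matrix, translations end up close in one shared space, which is what makes the zero-shot transfer evaluated in the paper possible.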
Cross-Lingual Transfer Learning for Multilingual Task Oriented Dialog
One of the first steps in the utterance interpretation pipeline of many
task-oriented conversational AI systems is to identify user intents and the
corresponding slots. Since data collection for machine learning models for this
task is time-consuming, it is desirable to make use of existing data in a
high-resource language to train models in low-resource languages. However,
development of such models has largely been hindered by the lack of
multilingual training data. In this paper, we present a new data set of 57k
annotated utterances in English (43k), Spanish (8.6k) and Thai (5k) across the
domains weather, alarm, and reminder. We use this data set to evaluate three
different cross-lingual transfer methods: (1) translating the training data,
(2) using cross-lingual pre-trained embeddings, and (3) a novel method of using
a multilingual machine translation encoder as contextual word representations.
We find that given several hundred training examples in the target
language, the latter two methods outperform translating the training data.
Further, in very low-resource settings, multilingual contextual word
representations give better results than using cross-lingual static embeddings.
We also compare the cross-lingual methods to using monolingual resources in the
form of contextual ELMo representations and find that given just small amounts
of target language data, this method outperforms all cross-lingual methods,
which highlights the need for more sophisticated cross-lingual methods. Comment: 11 pages, to be presented at NAACL 201
Towards Interlingua Neural Machine Translation
Common intermediate language representation in neural machine translation can
be used to extend bilingual to multilingual systems by incremental training. In
this paper, we propose a new architecture based on introducing an interlingual
loss as an additional training objective. By adding and forcing this
interlingual loss, we are able to train multiple encoders and decoders for each
language, sharing a common intermediate representation. Translation results on
the low-resourced tasks (Turkish-English and Kazakh-English, from the popular
Workshop on Machine Translation benchmark) show BLEU improvements of up to
2.8. However, results on larger datasets (Russian-English and Kazakh-English,
from the same benchmark) show BLEU losses of the same magnitude. While our
system only provides improvements in translation quality for the low-resourced
tasks, it is capable of quickly deploying new language pairs without
retraining the rest of the system, which may be a game-changer in some
situations (e.g. in a disaster crisis where international help is required for
a small region, or to quickly develop a translation system for a client). What
is most relevant about our architecture is that it is capable of: (1) reducing
the number of production systems, with respect to the number of languages,
from quadratic to linear; (2) incrementally adding a new language to the
system without retraining the languages already present; and (3) allowing
translations from the new language to all the others present in the system.
Comment: arXiv admin note: substantial text overlap with arXiv:1810.0635
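The quadratic-to-linear claim in point (1) is simple counting: a pairwise deployment needs one directed model per ordered language pair, while the interlingua setup needs only one encoder and one decoder per language. A small sketch (hypothetical function names) makes the scaling concrete:

```python
def pairwise_systems(n_languages):
    """Directed bilingual models needed without an interlingua:
    one per ordered pair of distinct languages -> O(n^2)."""
    return n_languages * (n_languages - 1)

def interlingua_modules(n_languages):
    """One encoder plus one decoder per language, all sharing
    the common intermediate representation -> O(n)."""
    return 2 * n_languages

# At 10 languages the pairwise approach already needs 90 models,
# while the interlingua approach needs 20 modules.
print(pairwise_systems(10), interlingua_modules(10))
```

Adding language n+1 under the interlingua scheme trains just two new modules (its encoder and decoder) against the frozen shared representation, which is the incremental-training property points (2) and (3) describe.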
Self-Attentive Model for Headline Generation
Headline generation is a special type of text summarization task. While the
amount of available training data for this task is almost unlimited, it still
remains challenging, as learning to generate headlines for news articles
requires the model to reason about natural language. To overcome
this challenge, we applied the recent Universal Transformer architecture
paired with byte-pair encoding and achieved new state-of-the-art results on the
New York Times Annotated corpus with ROUGE-L F1-score 24.84 and ROUGE-2
F1-score 13.48. We also present the new RIA corpus and reach ROUGE-L F1-score
36.81 and ROUGE-2 F1-score 22.15 on it. Comment: accepted for ECIR 201
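The ROUGE-L F1 metric reported above is computed from the longest common subsequence between a candidate headline and the reference. A minimal sketch over token lists (sentence-level only, without the stemming or multi-reference handling that full ROUGE toolkits apply):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists,
    via the standard dynamic-programming table."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision
    (LCS / candidate length) and recall (LCS / reference length)."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p = lcs / len(candidate)
    r = lcs / len(reference)
    return 2 * p * r / (p + r)

print(rouge_l_f1("new rules announced".split(), "new rules announced".split()))
```

Unlike ROUGE-2, which only rewards contiguous bigram matches, the LCS formulation credits in-order matches even when other tokens intervene, which suits the heavy rewording typical of headlines.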
Multi-task Learning for Universal Sentence Embeddings: A Thorough Evaluation using Transfer and Auxiliary Tasks
Learning distributed sentence representations is one of the key challenges in
natural language processing. Previous work demonstrated that a recurrent
neural network (RNN)-based sentence encoder trained on a large collection of
annotated natural language inference data is effective for transfer learning
to other related tasks. In this paper, we show that joint learning of multiple
tasks results in more generalizable sentence representations, by conducting
extensive experiments and analysis comparing multi-task and single-task
learned sentence encoders. Quantitative analysis using auxiliary tasks shows
that multi-task learning helps to embed better semantic information in the
sentence representations compared to single-task learning. In addition, we
compare multi-task sentence encoders with contextualized word representations
and show that combining both can further boost the performance of transfer
learning.
Audio-Linguistic Embeddings for Spoken Sentences
We propose spoken sentence embeddings which capture both acoustic and
linguistic content. While existing works operate at the character, phoneme, or
word level, our method learns long-term dependencies by modeling speech at the
sentence level. Formulated as an audio-linguistic multitask learning problem,
our encoder-decoder model simultaneously reconstructs acoustic and natural
language features from audio. Our results show that spoken sentence embeddings
outperform phoneme and word-level baselines on speech recognition and emotion
recognition tasks. Ablation studies show that our embeddings can better model
high-level acoustic concepts while retaining linguistic content. Overall, our
work illustrates the viability of generic, multi-modal sentence embeddings for
spoken language understanding. Comment: International Conference on Acoustics, Speech, and Signal Processing
(ICASSP) 201