Distributional semantics and machine learning for statistical machine translation
In this work, we explore the use of distributional semantics and machine learning to improve statistical machine translation. For that purpose, we propose a logistic regression based machine learning model for dynamic phrase translation probability modeling. We prove that the proposed model can be seen as a generalization of the standard translation probabilities used in statistical machine translation, and use it to incorporate context and distributional semantic information through lexical, word cluster and word embedding features. In addition, we explore the use of word embeddings for phrase translation probability scoring as an alternative approach to incorporating distributional semantic knowledge into statistical machine translation. Our experiments show the effectiveness of the proposed models, achieving promising results over a strong baseline. At the same time, our work makes important contributions to bilingual word embedding mappings and word embedding based phrase similarity measures, which go beyond machine translation and have intrinsic value in the field of distributional semantics.
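As a rough illustration of the idea, the following sketch (not the authors' implementation; features and data are toy placeholders) scores translation candidates for a source phrase with a logistic regression over context features, which is what makes the translation probability dynamic rather than a fixed relative-frequency estimate.

```python
# Minimal sketch: context-dependent phrase translation probabilities via
# logistic regression. In the paper's setting, feature vectors would encode
# lexical, word cluster and word embedding information; here they are toy.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: features of one occurrence of a source phrase in context.
# Each label: index of the target phrase it was translated to.
X_train = np.array([
    [1.0, 0.2, 0.0],
    [0.0, 0.9, 0.3],
    [1.0, 0.1, 0.8],
    [0.0, 0.7, 0.5],
])
y_train = np.array([0, 1, 0, 1])

model = LogisticRegression().fit(X_train, y_train)

# predict_proba plays the role of p(target phrase | source phrase, context);
# with only one-hot source-phrase features and no regularization, it would
# recover the usual relative-frequency translation probabilities, which is
# the intuition behind the "generalization" claim above.
print(model.predict_proba(np.array([[1.0, 0.4, 0.2]])))
```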
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
We introduce an architecture to learn joint multilingual sentence
representations for 93 languages, belonging to more than 30 different language
families and written in 28 different scripts. Our system uses a single BiLSTM
encoder with a shared BPE vocabulary for all languages, which is coupled with
an auxiliary decoder and trained on publicly available parallel corpora. This
enables us to learn a classifier on top of the resulting sentence embeddings
using English annotated data only, and transfer it to any of the 93 languages
without any modification. Our approach sets a new state-of-the-art in zero-shot
cross-lingual natural language inference for 13 of the 14 languages in the XNLI
dataset. We also achieve very competitive results in cross-lingual
document classification (MLDoc dataset). Our sentence embeddings are also
strong at parallel corpus mining, establishing a new state-of-the-art in the
BUCC shared task for 3 of its 4 language pairs. Finally, we introduce a new
test set of aligned sentences in 122 languages based on the Tatoeba corpus, and
show that our sentence embeddings obtain strong results in multilingual
similarity search even for low-resource languages. Our PyTorch implementation,
pre-trained encoder, and the multilingual test set will be freely available.
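A minimal PyTorch sketch of the encoder described above (dimensions, vocabulary size, and initialization are illustrative assumptions, not the released model):

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """BiLSTM over shared-vocabulary (e.g. BPE) token ids, max-pooled over
    time into a fixed-size, language-agnostic sentence embedding."""
    def __init__(self, vocab_size=50000, emb_dim=320, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, token_ids):                       # (batch, seq_len)
        states, _ = self.bilstm(self.embed(token_ids))  # (batch, seq, 2*hidden)
        return states.max(dim=1).values                 # max-pool over time

encoder = SentenceEncoder()
ids = torch.randint(0, 50000, (2, 7))   # two toy "sentences" of token ids
emb = encoder(ids)                       # (2, 1024)

# Zero-shot transfer: a classifier trained on English embeddings only can be
# applied unchanged to embeddings of any language the encoder covers.
classifier = nn.Linear(emb.size(1), 3)   # e.g. 3 NLI labels
print(classifier(emb).shape)             # torch.Size([2, 3])
```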
Translation Artifacts in Cross-lingual Transfer Learning
Both human and machine translation play a central role in cross-lingual
transfer learning: many multilingual datasets have been created through
professional translation services, and using machine translation to translate
either the test set or the training set is a widely used transfer technique. In
this paper, we show that such translation processes can introduce subtle
artifacts that have a notable impact on existing cross-lingual models. For
instance, in natural language inference, translating the premise and the
hypothesis independently can reduce the lexical overlap between them, which
current models are highly sensitive to. We show that some previous findings in
cross-lingual transfer learning need to be reconsidered in the light of this
phenomenon. Based on the gained insights, we also improve the state-of-the-art
in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points,
respectively.
Comment: EMNLP 2020
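To make the premise-hypothesis effect concrete, here is a small illustrative sketch (the statistic and the examples are ours, not from the paper) of how independent translation can lower lexical overlap while preserving meaning:

```python
def lexical_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the premise."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h) if h else 0.0

# Original pair: high overlap.
print(lexical_overlap("the cat sat on the mat", "the cat sat on a mat"))
# Translating premise and hypothesis independently may render shared words
# differently, lowering overlap even though the meaning is unchanged.
print(lexical_overlap("the cat sat on the mat", "a feline rested on the rug"))
```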
On the Cross-lingual Transferability of Monolingual Representations
State-of-the-art unsupervised multilingual models (e.g., multilingual BERT)
have been shown to generalize in a zero-shot cross-lingual setting. This
generalization ability has been attributed to the use of a shared subword
vocabulary and joint training across multiple languages giving rise to deep
multilingual abstractions. We evaluate this hypothesis by designing an
alternative approach that transfers a monolingual model to new languages at the
lexical level. More concretely, we first train a transformer-based masked
language model on one language, and transfer it to a new language by learning a
new embedding matrix with the same masked language modeling objective, freezing
parameters of all other layers. This approach does not rely on a shared
vocabulary or joint training. However, we show that it is competitive with
multilingual BERT on standard cross-lingual classification benchmarks and on a
new Cross-lingual Question Answering Dataset (XQuAD). Our results contradict
common beliefs about the basis of the generalization ability of multilingual
models and suggest that deep monolingual models learn some abstractions that
generalize across languages. We also release XQuAD as a more comprehensive
cross-lingual benchmark, which comprises 240 paragraphs and 1190
question-answer pairs from SQuAD v1.1 translated into ten languages by
professional translators.
Comment: ACL 2020
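A minimal sketch of the lexical-transfer recipe, assuming a Hugging Face-style masked LM (the model name and initialization are illustrative; the paper trains its own transformer and learns a new embedding matrix for the target language's vocabulary):

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Re-initialize the input (word) embeddings for the new language's vocabulary
# (the same vocab size is kept here for simplicity) and freeze everything else.
model.get_input_embeddings().weight.data.normal_(mean=0.0, std=0.02)
for name, param in model.named_parameters():
    param.requires_grad = "word_embeddings" in name

print([n for n, p in model.named_parameters() if p.requires_grad])
# Training then proceeds with the standard MLM objective on the new
# language's monolingual corpus; only the embedding matrix is updated.
```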
An Effective Approach to Unsupervised Machine Translation
While machine translation has traditionally relied on large amounts of
parallel corpora, a recent research line has managed to train both Neural
Machine Translation (NMT) and Statistical Machine Translation (SMT) systems
using monolingual corpora only. In this paper, we identify and address several
deficiencies of existing unsupervised SMT approaches by exploiting subword
information, developing a theoretically well-founded unsupervised tuning
method, and incorporating a joint refinement procedure. Moreover, we use our
improved SMT system to initialize a dual NMT model, which is further fine-tuned
through on-the-fly back-translation. Together, we obtain large improvements
over the previous state-of-the-art in unsupervised machine translation. For
instance, we get 22.5 BLEU points in English-to-German WMT 2014, 5.5 points
more than the previous best unsupervised system, and 0.5 points more than the
(supervised) shared task winner back in 2014.
Comment: ACL 2019
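A schematic, runnable sketch of the on-the-fly back-translation loop used in the dual NMT fine-tuning step (TranslationModel is a stand-in interface we assume for illustration; a real system would wrap actual NMT models):

```python
class TranslationModel:
    """Stand-in for an NMT model exposing translation and a training step."""
    def __init__(self, name):
        self.name = name
    def translate(self, batch):          # would run beam search / sampling
        return [f"<{self.name}> {s}" for s in batch]
    def train_step(self, src_batch, tgt_batch):  # would do a gradient update
        return 0.0                       # placeholder loss

def backtranslation_step(s2t, t2s, mono_src, mono_tgt):
    # Target monolingual text -> synthetic source; train the source-to-target
    # model on (synthetic source, real target). Then symmetrically for the
    # other direction, which is what makes the setup "dual".
    loss_s2t = s2t.train_step(t2s.translate(mono_tgt), mono_tgt)
    loss_t2s = t2s.train_step(s2t.translate(mono_src), mono_src)
    return loss_s2t, loss_t2s

s2t, t2s = TranslationModel("en-de"), TranslationModel("de-en")
print(backtranslation_step(s2t, t2s, ["a cat"], ["eine Katze"]))
```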
A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
Recent work has managed to learn cross-lingual word embeddings without
parallel data by mapping monolingual embeddings to a shared space through
adversarial training. However, their evaluation has focused on favorable
conditions, using comparable corpora or closely-related languages, and we show
that they often fail in more realistic scenarios. This work proposes an
alternative approach based on a fully unsupervised initialization that
explicitly exploits the structural similarity of the embeddings, and a robust
self-learning algorithm that iteratively improves this solution. Our method
succeeds in all tested scenarios and obtains the best published results in
standard datasets, even surpassing previous supervised systems. Our
implementation is released as an open source project at
https://github.com/artetxem/vecmap
Comment: ACL 2018
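The core of the self-learning loop can be sketched in a few lines of numpy (heavily simplified: the released vecmap implementation adds the unsupervised initialization, CSLS-style retrieval, stochastic dictionary induction, and the other robustness tricks discussed in the paper):

```python
import numpy as np

def self_learning(X, Z, seed_dict, iters=10):
    """X, Z: row-normalized source/target embeddings; seed_dict: list of
    (source index, target index) pairs from the initialization."""
    d = seed_dict
    for _ in range(iters):
        # 1) Solve the orthogonal (Procrustes) mapping for the current dictionary.
        src, tgt = (list(t) for t in zip(*d))
        u, _, vt = np.linalg.svd(Z[tgt].T @ X[src])
        W = vt.T @ u.T                    # maps source space into target space
        # 2) Re-induce the dictionary by nearest-neighbor retrieval.
        d = list(enumerate((X @ W @ Z.T).argmax(axis=1)))
    return W, d

# Toy usage with random "embeddings" and a 5-pair seed dictionary.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Z = rng.normal(size=(100, 50)); Z /= np.linalg.norm(Z, axis=1, keepdims=True)
W, induced = self_learning(X, Z, [(i, i) for i in range(5)])
```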
Bilingual Lexicon Induction through Unsupervised Machine Translation
A recent research line has obtained strong results on bilingual lexicon
induction by aligning independently trained word embeddings in two languages
and using the resulting cross-lingual embeddings to induce word translation
pairs through nearest neighbor or related retrieval methods. In this paper, we
propose an alternative approach to this problem that builds on the recent work
on unsupervised machine translation. In this way, instead of directly inducing a
bilingual lexicon from cross-lingual embeddings, we use them to build a
phrase-table, combine it with a language model, and use the resulting machine
translation system to generate a synthetic parallel corpus, from which we
extract the bilingual lexicon using statistical word alignment techniques. As
such, our method can work with any word embedding and cross-lingual mapping
technique, and it does not require any additional resource besides the
monolingual corpus used to train the embeddings. When evaluated on the exact
same cross-lingual embeddings, our proposed method obtains an average
improvement of 6 accuracy points over nearest neighbor and 4 points over CSLS
retrieval, establishing a new state-of-the-art on the standard MUSE dataset.
Comment: ACL 2019
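For reference, the two retrieval baselines the comparison above is made against can be sketched as follows (our illustrative numpy code, not the paper's; CSLS penalizes "hub" words that are close to everything, a known weakness of plain nearest-neighbor retrieval):

```python
import numpy as np

def nn_retrieval(X, Z):
    """Row-normalized source (X) and target (Z) embeddings, shared space."""
    return (X @ Z.T).argmax(axis=1)

def csls_retrieval(X, Z, k=10):
    sims = X @ Z.T                            # cosine similarity matrix
    # Mean similarity of each word to its k nearest neighbors on the other
    # side: the hubness penalty terms of CSLS.
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)
    return (2 * sims - r_src[:, None] - r_tgt[None, :]).argmax(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Z = rng.normal(size=(8, 4)); Z /= np.linalg.norm(Z, axis=1, keepdims=True)
print(nn_retrieval(X, Z), csls_retrieval(X, Z, k=3))
```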