Effective Parallel Corpus Mining using Bilingual Sentence Embeddings
This paper presents an effective approach for parallel corpus mining using
bilingual sentence embeddings. Our embedding models are trained to produce
similar representations exclusively for bilingual sentence pairs that are
translations of each other. This is achieved using a novel training method that
introduces hard negatives consisting of sentences that are not translations but
that have some degree of semantic similarity. The quality of the resulting
embeddings is evaluated on parallel corpus reconstruction and by assessing
machine translation systems trained on gold vs. mined sentence pairs. We find
that the sentence embeddings can be used to reconstruct the United Nations
Parallel Corpus at the sentence level with a precision of 48.9% for en-fr and
54.9% for en-es. When adapted to document-level matching, we achieve a parallel
document matching accuracy comparable to that of the significantly more
computationally intensive approach of Uszkoreit et al. (2010). Using reconstructed
parallel data, we are able to train NMT models that perform nearly as well as
models trained on the original data (within 1-2 BLEU).
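A minimal sketch of the mining step this abstract describes, assuming sentence
embeddings have already been produced by the trained bilingual encoders; the
function name and threshold are illustrative, not the paper's:

```python
import numpy as np

def mine_parallel_pairs(src_emb, tgt_emb, threshold=0.5):
    """Keep (source, target) index pairs whose embeddings are nearest
    neighbors with cosine similarity above a threshold.

    src_emb: (n_src, d) source-language sentence embeddings
    tgt_emb: (n_tgt, d) target-language sentence embeddings
    """
    # L2-normalize so dot products equal cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                       # (n_src, n_tgt) similarities
    best = sim.argmax(axis=1)               # nearest target per source
    scores = sim[np.arange(len(src)), best]
    return [(i, int(j), float(s))
            for i, (j, s) in enumerate(zip(best, scores))
            if s >= threshold]
```

Brute-force similarity is quadratic in corpus size; at the scale of the UN
corpus an approximate nearest-neighbor index would stand in for the dense
matrix product shown here.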
Hierarchical Document Encoder for Parallel Corpus Mining
We explore using multilingual document embeddings for nearest neighbor mining
of parallel data. Three document-level representations are investigated: (i)
document embeddings generated by simply averaging multilingual sentence
embeddings; (ii) a neural bag-of-words (BoW) document encoding model; (iii) a
hierarchical multilingual document encoder (HiDE) that builds on our
sentence-level model. The results show document embeddings derived from
sentence-level averaging are surprisingly effective for clean datasets, but
suggest that models trained hierarchically at the document level are more
effective on noisy data. Analysis experiments demonstrate that our hierarchical
models are very robust to variations in the underlying sentence embedding
quality. Document embeddings trained with HiDE achieve state-of-the-art
performance on United Nations (UN) parallel document mining: 94.9% P@1 for
en-fr and 97.3% P@1 for en-es.
Comment: Accepted at WMT 2019.
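A hedged sketch of baseline (i) above, a document embedding obtained by
averaging sentence embeddings; the HiDE encoder itself is a trained model and
is not reproduced here:

```python
import numpy as np

def average_document_embedding(sentence_embeddings):
    """Baseline (i): represent a document by the mean of its
    multilingual sentence embeddings, L2-normalized so that dot
    products between documents are cosine similarities.

    sentence_embeddings: (n_sentences, d) array for one document.
    """
    doc = sentence_embeddings.mean(axis=0)
    return doc / np.linalg.norm(doc)
```

Document-level mining then reuses the same nearest-neighbor retrieval as the
sentence-level case, applied to these document vectors.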
Learning Feature Weights using Reward Modeling for Denoising Parallel Corpora
Large web-crawled corpora represent an excellent resource for improving the
performance of Neural Machine Translation (NMT) systems across several language
pairs. However, since these corpora are typically extremely noisy, their use is
fairly limited. Current approaches to dealing with this problem mainly focus on
filtering using heuristics or single features such as language model scores or
bilingual similarity. This work presents an alternative approach that learns
weights for multiple sentence-level features. These feature weights, which are
optimized directly for the task of improving translation performance, are used
to score and filter sentences in the noisy corpora more effectively. We provide
results of applying this technique to building NMT systems using the Paracrawl
corpus for Estonian-English and show that it beats strong single-feature
baselines and hand-designed combinations. Additionally, we analyze the
sensitivity of this method to different types of noise and explore whether the
learned weights generalize to other language pairs using the Maltese-English
Paracrawl corpus.
Comment: 10 pages, 2 figures.
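A simplified sketch of the scoring-and-filtering step, assuming the feature
weights have already been learned; the reward-modeling optimization of those
weights against downstream translation performance is the paper's contribution
and is not shown here:

```python
import numpy as np

def filter_corpus(features, weights, keep_fraction=0.5):
    """Score sentence pairs with a weighted sum of features and keep
    the top-scoring fraction of the corpus.

    features: (n_pairs, n_features) matrix, one row per sentence pair
              (e.g. LM score, bilingual similarity, length ratio)
    weights:  (n_features,) learned feature weights
    """
    scores = features @ weights
    cutoff = np.quantile(scores, 1.0 - keep_fraction)
    return np.flatnonzero(scores >= cutoff)   # indices of retained pairs
```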
Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax
In this paper, we present an approach to learn multilingual sentence
embeddings using a bi-directional dual-encoder with additive margin softmax.
The embeddings are able to achieve state-of-the-art results on the United
Nations (UN) parallel corpus retrieval task. In all the languages tested, the
system achieves P@1 of 86% or higher. We use pairs retrieved by our approach to
train NMT models that achieve similar performance to models trained on gold
pairs. We explore simple document-level embeddings constructed by averaging our
sentence embeddings. On the UN document-level retrieval task, these document
embeddings achieve a P@1 of around 97% for all language pairs tested.
Lastly, we evaluate the proposed model on the BUCC mining task. The learned
embeddings with raw cosine similarity scores achieve competitive results
compared to current state-of-the-art models, and with a second-stage scorer we
achieve a new state of the art on this task.
Comment: Accepted at IJCAI 2019 (International Joint Conference on Artificial Intelligence).
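A minimal sketch of a bidirectional additive margin softmax loss over in-batch
negatives, in the spirit of this abstract; the margin and scale values are
illustrative, not the paper's exact hyperparameters:

```python
import torch
import torch.nn.functional as F

def bidirectional_additive_margin_loss(src, tgt, margin=0.3, scale=10.0):
    """Additive margin softmax over in-batch negatives, applied in
    both translation directions.

    src, tgt: (batch, d) L2-normalized sentence embeddings where row i
    of src and row i of tgt form a translation pair.
    """
    sim = src @ tgt.t()                     # cosine similarity matrix
    # Subtract the margin from the true-pair (diagonal) similarities,
    # forcing translations to score well above in-batch negatives.
    sim = sim - margin * torch.eye(sim.size(0), device=sim.device)
    labels = torch.arange(sim.size(0), device=sim.device)
    # Source-to-target plus target-to-source softmax losses.
    return (F.cross_entropy(sim * scale, labels)
            + F.cross_entropy(sim.t() * scale, labels))
```

During training, src and tgt would come from the two towers of the dual
encoder; at inference, the towers produce the embeddings used for retrieval.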