An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification
End-to-end neural machine translation has overtaken statistical machine
translation in terms of translation quality for some language pairs, especially
those with large amounts of parallel data. Beyond this palpable improvement,
neural networks provide several new properties. A single system can be trained
to translate between many languages at almost no additional cost other than
training time. Furthermore, the internal representations learned by the network
serve as a new semantic representation of words (or sentences) which, unlike
standard word embeddings, are learned in an essentially bilingual or even
multilingual context. In view of these properties, the contribution of the
present work is two-fold. First, we systematically study NMT context
vectors, i.e. the output of the encoder, and their power as an interlingua
representation of a sentence. We assess their quality and effectiveness by
measuring similarities across translations, as well as across semantically
related and semantically unrelated sentence pairs. Second, as an extrinsic
evaluation of the
first point, we identify parallel sentences in comparable corpora, obtaining an
F1 of 98.2% on data from a shared task when using only NMT context vectors. Using
context vectors jointly with similarity measures, F1 reaches 98.9%.
Comment: 11 pages, 4 figures
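Since the abstract only describes the extrinsic setup, the following is a minimal sketch of the idea: pool the encoder's per-token context vectors into one sentence vector and threshold cosine similarity to flag parallel pairs. Mean pooling and the threshold value are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def sentence_vector(encoder_states: np.ndarray) -> np.ndarray:
    """Pool the NMT encoder's per-token context vectors (T, d)
    into a single fixed-size sentence representation (d,).
    Mean pooling is an assumption for illustration."""
    return encoder_states.mean(axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two sentence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def looks_parallel(src_states: np.ndarray,
                   tgt_states: np.ndarray,
                   threshold: float = 0.85) -> bool:
    """Flag a candidate cross-lingual sentence pair as parallel when
    the similarity of their pooled context vectors clears a threshold.
    The 0.85 value is illustrative, not taken from the paper."""
    return cosine(sentence_vector(src_states),
                  sentence_vector(tgt_states)) >= threshold
```

In practice the threshold would be tuned on held-out shared-task data; the paper's stronger result combines these vectors with additional similarity measures.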
Learning Bilingual Word Representations by Marginalizing Alignments
We present a probabilistic model that simultaneously learns alignments and
distributed representations for bilingual data. By marginalizing over word
alignments the model captures a larger semantic context than prior work relying
on hard alignments. The advantage of this approach is demonstrated in a
cross-lingual classification task, where we outperform the prior published
state of the art.
Comment: Proceedings of ACL 2014 (Short Papers)
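The key idea, marginalizing over word alignments rather than committing to a single hard alignment, can be sketched as computing the expected aligned source embedding for each target word. This is an illustrative reading of the abstract, not the paper's exact model; the names and shapes below are assumptions.

```python
import numpy as np

def expected_source_embedding(align_probs: np.ndarray,
                              source_embeddings: np.ndarray) -> np.ndarray:
    """For one target word, take the expectation of its aligned source
    embedding under the alignment distribution,

        E[e_src] = sum_i p(a = i) * e_i,

    instead of picking a single hard-aligned source word.

    align_probs:       (I,)   distribution over source positions
    source_embeddings: (I, d) one embedding per source word
    """
    return align_probs @ source_embeddings
```

Soft weighting lets every plausible source word contribute, which is what allows the model to capture a larger semantic context than hard-alignment approaches.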
Skip-Thought Vectors
We describe an approach for unsupervised learning of a generic, distributed
sentence encoder. Using the continuity of text from books, we train an
encoder-decoder model that tries to reconstruct the surrounding sentences of an
encoded passage. Sentences that share semantic and syntactic properties are
thus mapped to similar vector representations. We next introduce a simple
vocabulary expansion method to encode words that were not seen during
training, allowing us to expand our vocabulary to a million words. After
training our model, we extract and evaluate our vectors with linear models on 8
tasks: semantic relatedness, paraphrase detection, image-sentence ranking,
question-type classification and 4 benchmark sentiment and subjectivity
datasets. The end result is an off-the-shelf encoder that can produce highly
generic sentence representations that are robust and perform well in practice.
We will make our encoder publicly available.
Comment: 11 pages
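The vocabulary expansion step maps a large pretrained word-embedding space (e.g. word2vec) into the encoder's word-embedding space with a learned linear transformation, fit on the words the two vocabularies share. A minimal sketch, assuming an unregularized least-squares fit (the function names are illustrative):

```python
import numpy as np

def fit_expansion_map(w2v_shared: np.ndarray,
                      model_shared: np.ndarray) -> np.ndarray:
    """Fit a linear map W from the pretrained word2vec space (N, d_w2v)
    to the encoder's word-embedding space (N, d_model) by least squares
    over the N words present in both vocabularies."""
    W, *_ = np.linalg.lstsq(w2v_shared, model_shared, rcond=None)
    return W  # shape (d_w2v, d_model)

def embed_unseen_word(w2v_vec: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project a word never seen during training into the encoder's
    embedding space, expanding the usable vocabulary."""
    return w2v_vec @ W
```

Any word with a word2vec vector can then be fed to the trained encoder, even if it never appeared in the training corpus.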