Multilingual Universal Sentence Encoder for Semantic Retrieval
We introduce two pre-trained retrieval focused multilingual sentence encoding
models, respectively based on the Transformer and CNN model architectures. The
models embed text from 16 languages into a single semantic space using a
multi-task trained dual-encoder that learns tied representations using
translation based bridge tasks (Chidambaram et al., 2018). The models provide
performance that is competitive with the state-of-the-art on: semantic
retrieval (SR), translation pair bitext retrieval (BR) and retrieval question
answering (ReQA). On English transfer learning tasks, our sentence-level
embeddings approach, and in some cases exceed, the performance of monolingual,
English only, sentence embedding models. Our models are made available for
download on TensorFlow Hub. Comment: 6 pages, 6 tables, 2 listings, and 1 figure
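As a rough illustration of retrieval in a single semantic space, the sketch below ranks candidate sentences against a query by cosine similarity. The abstract does not state the scoring function, so cosine similarity is an assumption here, and the vectors are hand-made placeholders rather than outputs of the released models. The names `cosine` and `retrieve` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, candidates, top_k=1):
    """Return the texts of the top_k candidates most similar to the query.

    candidates is a list of (text, vector) pairs that are assumed to live
    in the same embedding space as query_vec, whatever their language.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

# Placeholder embeddings (assumption: invented for illustration only).
candidates = [
    ("Wie ist das Wetter heute?", [0.9, 0.1, 0.0]),
    ("Où est la gare ?", [0.1, 0.9, 0.2]),
]
query = [0.8, 0.2, 0.1]  # stands in for an embedded English weather question
best = retrieve(query, candidates)
```

Because all languages share one space in this setup, the same ranking function serves monolingual and cross-lingual retrieval without change.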
Zero-shot language transfer for cross-lingual sentence retrieval using bidirectional attention model
We present a neural architecture for cross-lingual mate sentence retrieval which encodes sentences in a joint multilingual space and learns to distinguish true translation pairs from semantically related sentences across languages. The proposed model combines a recurrent sequence encoder with a bidirectional attention layer and an intra-sentence attention mechanism. This way the final fixed-size sentence representations in each training sentence pair depend on the selection of contextualized token representations from the other sentence. The representations of both sentences are then combined using the bilinear product function to predict the relevance score. We show that, coupled with a shared
multilingual word embedding space, the proposed model strongly outperforms unsupervised cross-lingual ranking functions, and that further boosts can be achieved by combining the two approaches. Most importantly, we demonstrate the model's effectiveness in zero-shot language transfer settings: our multilingual framework boosts cross-lingual sentence retrieval performance for unseen language pairs without any training examples. This enables robust cross-lingual sentence retrieval
also for pairs of resource-lean languages, without any parallel data.
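The bilinear product mentioned in this abstract can be sketched as the score s = uᵀ W v for two fixed-size sentence vectors u and v. This is a minimal sketch in plain Python: the weight matrix `W` would be learned in the actual model, whereas here it is a hand-set placeholder, and the function name `bilinear_score` is hypothetical.

```python
def bilinear_score(u, W, v):
    """Compute the bilinear product u^T W v.

    u is a vector of length m, v a vector of length n, and W an m x n
    matrix given as a list of rows.
    """
    return sum(
        u[i] * W[i][j] * v[j]
        for i in range(len(u))
        for j in range(len(v))
    )

# Placeholder values (assumption: not taken from the paper).
u = [1.0, 2.0]
v = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]  # reduces the score to a dot product
swap = [[0.0, 1.0], [1.0, 0.0]]      # pairs each dimension of u with the other of v

score_identity = bilinear_score(u, identity, v)
score_swap = bilinear_score(u, swap, v)
```

With the identity matrix the score is an ordinary dot product; a learned `W` lets the model weight interactions between different dimensions of the two sentence representations.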
A Multi-Task Architecture on Relevance-based Neural Query Translation
We describe a multi-task learning approach to train a Neural Machine
Translation (NMT) model with a Relevance-based Auxiliary Task (RAT) for search
query translation. The translation process for the Cross-lingual Information
Retrieval (CLIR) task is usually treated as a black box and performed as
an independent step. However, an NMT model trained on sentence-level parallel
data is not aware of the vocabulary distribution of the retrieval corpus. We
address this problem with our multi-task learning architecture, which achieves
a 16% improvement over a strong NMT baseline on an Italian-English query-document
dataset. We show using both quantitative and qualitative analysis that our
model generates balanced and precise translations with the regularization
effect it achieves from the multi-task learning paradigm. Comment: Accepted for publication at ACL 201
Lessons learned in multilingual grounded language learning
Recent work has shown how to learn better visual-semantic embeddings by
leveraging image descriptions in more than one language. Here, we investigate
in detail which conditions affect the performance of this type of grounded
language learning model. We show that multilingual training improves over
bilingual training, and that low-resource languages benefit from training with
higher-resource languages. We demonstrate that a multilingual model can be
trained equally well on either translations or comparable sentence pairs, and
that annotating the same set of images in multiple languages enables further
improvements via an additional caption-caption ranking objective. Comment: CoNLL 201
Multilingual domain modeling in Twenty-One: automatic creation of a bi-directional translation lexicon from a parallel corpus
Within the project Twenty-One, which aims at the effective dissemination of information on ecology and sustainable development, a system is developed that supports cross-language information retrieval in any of the four languages Dutch, English, French and German. Knowledge of this application domain is needed to enhance existing translation resources for the purpose of lexical disambiguation. This paper describes an algorithm for the automated acquisition of a translation lexicon from a parallel corpus. The novelty of the presented algorithm is the statistical language model used. Because the algorithm is based on a symmetric translation model, it becomes possible to identify one-to-many and many-to-one relations between words of a language pair. We claim that the presented method has two advantages over algorithms that have been published before. Firstly, because the translation model is more powerful, the resulting bilingual lexicon will be more accurate. Secondly, the resulting bilingual lexicon can be used to translate in both directions between a language pair. Different versions of the algorithm were evaluated on the Dutch and English versions of the Agenda 21 corpus, which is a UN document on the application domain of sustainable development.