Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word
representations to multiple languages, such that words can be compared across
languages. In this paper, we shift the focus from words to documents and
introduce a method for embedding documents written in any language into a
single, language-independent vector space. For training, our approach leverages
a multilingual corpus where the same concept is covered in multiple languages
(but not necessarily via exact translations), such as Wikipedia. Our method,
Cr5 (Crosslingual reduced-rank ridge regression), starts by training a
ridge-regression-based classifier that uses language-specific bag-of-word
features in order to predict the concept that a given document is about. We
show that, when constraining the learned weight matrix to be of low rank, it
can be factored to obtain the desired mappings from language-specific
bags-of-words to language-independent embeddings. As opposed to most prior
methods, which use pretrained monolingual word vectors, postprocess them to
make them crosslingual, and finally average word vectors to obtain document
vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as
document-level. Moreover, since our algorithm uses the singular value
decomposition as its core operation, it is highly scalable. Experiments show
that our method achieves state-of-the-art performance on a crosslingual
document retrieval task. Finally, although not trained for embedding sentences
and words, it also achieves competitive performance on crosslingual sentence
and word retrieval tasks.
Comment: In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19)
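To make the low-rank factorization step concrete, here is a minimal NumPy sketch of reduced-rank ridge regression for a single language. It assumes a bag-of-words matrix X, one-hot concept labels Y, and a standard reduced-rank construction (full ridge solution followed by an SVD projection); the function name, the single-language simplification, and the factorization into A and B are illustrative assumptions, not the authors' released Cr5 implementation.

```python
import numpy as np

def reduced_rank_ridge(X, Y, lam, k):
    """Sketch of reduced-rank ridge regression for one language.

    X   : (n_docs, vocab) bag-of-words matrix for one language
    Y   : (n_docs, n_concepts) one-hot concept labels shared across languages
    lam : ridge penalty
    k   : target rank = embedding dimensionality
    Returns A (vocab, k) and B (k, n_concepts) with W_k = A @ B of rank k.
    """
    # Full-rank ridge solution: W = (X^T X + lam * I)^{-1} X^T Y
    vocab = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(vocab), X.T @ Y)

    # Project the fitted values onto their top-k right singular directions
    # to obtain the rank-k solution, then factor it.
    _, _, Vt = np.linalg.svd(X @ W, full_matrices=False)
    B = Vt[:k]           # (k, n_concepts): shared, language-independent factor
    A = W @ B.T          # (vocab, k): maps a bag-of-words vector to an embedding
    return A, B

# A document's language-independent embedding is then bow_vector @ A, so that
# documents from different languages (each mapped through its own A) can be
# compared by cosine similarity in the shared k-dimensional space.
```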
Lessons learned in multilingual grounded language learning
Recent work has shown how to learn better visual-semantic embeddings by
leveraging image descriptions in more than one language. Here, we investigate
in detail which conditions affect the performance of this type of grounded
language learning model. We show that multilingual training improves over
bilingual training, and that low-resource languages benefit from training with
higher-resource languages. We demonstrate that a multilingual model can be
trained equally well on either translations or comparable sentence pairs, and
that annotating the same set of images in multiple languages enables further
improvements via an additional caption-caption ranking objective.
Comment: CoNLL 201
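Image-caption and caption-caption ranking objectives of this kind are commonly implemented as a margin-based ranking loss over a batch of paired embeddings. The PyTorch sketch below shows that generic formulation with in-batch negatives; the function name, margin value, and use of all non-matching pairs as negatives are illustrative assumptions, not the authors' exact training objective.

```python
import torch

def ranking_loss(anchors, positives, margin=0.2):
    """Margin-based ranking loss with in-batch negatives (illustrative sketch).

    anchors, positives : (batch, dim) L2-normalized embeddings; row i of
    `positives` matches row i of `anchors` (e.g., two captions of the same
    image in different languages); all other rows serve as negatives.
    """
    scores = anchors @ positives.t()            # (batch, batch) cosine similarities
    pos = scores.diag().unsqueeze(1)            # similarity of each matching pair
    # Hinge on every non-matching pair, in both retrieval directions.
    cost_a = (margin + scores - pos).clamp(min=0)
    cost_p = (margin + scores - pos.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_a = cost_a.masked_fill(mask, 0)
    cost_p = cost_p.masked_fill(mask, 0)
    return (cost_a + cost_p).sum()
```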
BabelBERT: Massively Multilingual Transformers Meet a Massively Multilingual Lexical Resource
While pretrained language models (PLMs) primarily serve as general-purpose
text encoders that can be fine-tuned for a wide variety of downstream tasks,
recent work has shown that they can also be rewired to produce high-quality
word representations (i.e., static word embeddings) and yield good performance
in type-level lexical tasks. While existing work primarily focused on lexical
specialization of PLMs in monolingual and bilingual settings, in this work we
expose massively multilingual transformers (MMTs, e.g., mBERT or XLM-R) to
multilingual lexical knowledge at scale, leveraging BabelNet as the readily
available rich source of multilingual and cross-lingual type-level lexical
knowledge. Concretely, we leverage BabelNet's multilingual synsets to create
synonym pairs across languages and then subject the MMTs (mBERT and XLM-R)
to a lexical specialization procedure guided by a contrastive objective. We
show that such massively multilingual lexical specialization brings massive
gains in two standard cross-lingual lexical tasks, bilingual lexicon induction
and cross-lingual word similarity, as well as in cross-lingual sentence
retrieval. Crucially, we observe gains for languages unseen in specialization,
indicating that the multilingual lexical specialization enables generalization
to languages with no lexical constraints. In a series of subsequent controlled
experiments, we demonstrate that the pretraining quality of word
representations in the MMT for languages involved in specialization has a much
larger effect on performance than the linguistic diversity of the set of
constraints. Encouragingly, this suggests that lexical tasks involving
low-resource languages benefit the most from lexical knowledge of resource-rich
languages, which is generally much more readily available.
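As a rough illustration of the contrastive lexical specialization step, the following PyTorch sketch encodes cross-lingual synonym pairs with a massively multilingual transformer and applies an InfoNCE-style loss with in-batch negatives. The mean-pooling, temperature, choice of checkpoint, and helper names are assumptions for illustration, not the paper's exact procedure or hyperparameters.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Illustrative setup; the checkpoint and pooling are assumptions.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def encode(words):
    """Mean-pool subword representations to get one vector per word."""
    batch = tokenizer(words, padding=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

def contrastive_loss(pairs, temperature=0.05):
    """InfoNCE over cross-lingual synonym pairs, e.g. [("dog", "Hund"), ...],
    treating all other pairs in the batch as negatives."""
    left = F.normalize(encode([a for a, _ in pairs]), dim=-1)
    right = F.normalize(encode([b for _, b in pairs]), dim=-1)
    logits = left @ right.t() / temperature             # (batch, batch)
    targets = torch.arange(len(pairs))                   # matching index is the positive
    return F.cross_entropy(logits, targets)
```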