1,418 research outputs found
Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word
representations to multiple languages, such that words can be compared across
languages. In this paper, we shift the focus from words to documents and
introduce a method for embedding documents written in any language into a
single, language-independent vector space. For training, our approach leverages
a multilingual corpus where the same concept is covered in multiple languages
(but not necessarily via exact translations), such as Wikipedia. Our method,
Cr5 (Crosslingual reduced-rank ridge regression), starts by training a
ridge-regression-based classifier that uses language-specific bag-of-word
features in order to predict the concept that a given document is about. We
show that, when constraining the learned weight matrix to be of low rank, it
can be factored to obtain the desired mappings from language-specific
bags-of-words to language-independent embeddings. As opposed to most prior
methods, which use pretrained monolingual word vectors, postprocess them to
make them crosslingual, and finally average word vectors to obtain document
vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as
document-level. Moreover, since our algorithm uses the singular value
decomposition as its core operation, it is highly scalable. Experiments show
that our method achieves state-of-the-art performance on a crosslingual
document retrieval task. Finally, although not trained for embedding sentences
and words, it also achieves competitive performance on crosslingual sentence
and word retrieval tasks.
Comment: In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19).
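The low-rank factorization at the heart of Cr5 can be sketched in a few lines of NumPy (a minimal sketch: the shapes, the regularizer `lam`, and the rank are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def cr5_sketch(X, Y, lam=1.0, rank=8):
    """Ridge regression from bag-of-words X to concept labels Y, then a
    rank-k truncation whose left factor maps documents (from this
    language) into a shared low-dimensional space."""
    d = X.shape[1]
    # Closed-form ridge solution: W = (X^T X + lam*I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    # Truncated SVD imposes the low-rank constraint: W ~ U_k diag(s_k) V_k^T
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    proj = U[:, :rank] * s[:rank]    # (vocab, rank) language-specific map
    return lambda x: x @ proj        # bag-of-words -> shared embedding

rng = np.random.default_rng(0)
X = rng.random((50, 200))                    # 50 docs, 200-word vocabulary
Y = np.eye(10)[rng.integers(0, 10, 50)]      # 10 concepts, one-hot labels
embed = cr5_sketch(X, Y)
print(embed(X[0]).shape)  # → (8,)
```

Training one such classifier per language against the shared concept labels is what makes the resulting projections comparable across languages; the rank must not exceed the number of concepts.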
Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing
We introduce a novel method for multilingual transfer that utilizes deep
contextual embeddings, pretrained in an unsupervised fashion. While contextual
embeddings have been shown to yield richer representations of meaning compared
to their static counterparts, aligning them poses a challenge due to their
dynamic nature. To this end, we construct context-independent variants of the
original monolingual spaces and utilize their mapping to derive an alignment
for the context-dependent spaces. This mapping readily supports processing of a
target language, improving transfer with context-aware embeddings. Our
experimental results demonstrate the effectiveness of this approach for
zero-shot and few-shot learning of dependency parsing. Specifically, our method
consistently outperforms the previous state-of-the-art on 6 tested languages,
yielding an improvement of 6.8 LAS points on average.
Comment: NAACL 2019.
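The anchor-and-alignment idea can be illustrated with orthogonal Procrustes over NumPy arrays (a sketch under the assumption that per-word-type average contextual vectors serve as the context-independent anchors; not the paper's code):

```python
import numpy as np

def align_contextual(anchors_src, anchors_tgt):
    """Orthogonal Procrustes: find the rotation W minimizing
    ||anchors_src @ W - anchors_tgt||_F over orthogonal W.
    anchors_*: (vocab, dim) context-independent anchor vectors, e.g. the
    average contextual embedding of each word type."""
    U, _, Vt = np.linalg.svd(anchors_src.T @ anchors_tgt)
    return U @ Vt  # apply to any context-dependent embedding: h @ W

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 16))                   # source anchors
R, _ = np.linalg.qr(rng.standard_normal((16, 16)))   # hidden true rotation
B = A @ R                                            # target anchors
W = align_contextual(A, B)
print(np.allclose(W, R))  # → True (the rotation is recovered)
```

Because W is learned only from the static anchors, it can then be applied to every token-level contextual embedding, which is how the alignment carries over to the dynamic spaces.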
FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding
Large-scale cross-lingual language models (LMs), such as mBERT, Unicoder and
XLM, have achieved great success in cross-lingual representation learning.
However, when applied to zero-shot cross-lingual transfer tasks, most existing
methods use only single-language input for LM finetuning, without leveraging
the intrinsic cross-lingual alignment between different languages that proves
essential for multilingual tasks. In this paper, we propose FILTER, an enhanced
fusion method that takes cross-lingual data as input for XLM finetuning.
Specifically, FILTER first encodes text input in the source language and its
translation in the target language independently in the shallow layers, then
performs cross-language fusion to extract multilingual knowledge in the
intermediate layers, and finally performs further language-specific encoding.
During inference, the model makes predictions based on the text input in the
target language and its translation in the source language. For simple tasks
such as classification, translated text in the target language shares the same
label as the source language. However, this shared label becomes less accurate
or even unavailable for more complex tasks such as question answering, NER and
POS tagging. To tackle this issue, we further propose an additional
KL-divergence self-teaching loss for model training, based on auto-generated
soft pseudo-labels for translated text in the target language. Extensive
experiments demonstrate that FILTER achieves new state of the art on two
challenging multilingual multi-task benchmarks, XTREME and XGLUE.
Comment: Accepted to AAAI 2021; Top-1 performance on the XTREME (https://sites.research.google/xtreme, September 8, 2020) and XGLUE (https://microsoft.github.io/XGLUE, September 14, 2020) benchmarks.
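The KL-divergence self-teaching loss can be shown in isolation (NumPy sketch; the array shapes and the name `kl_self_teaching_loss` are assumptions for illustration, not FILTER's implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_self_teaching_loss(teacher_logits, student_logits):
    """Teacher logits become soft pseudo-labels; the student, run on the
    translated target-language input, is pushed to match them via
    KL(p_teacher || p_student), averaged over examples."""
    p = softmax(teacher_logits)              # auto-generated soft labels
    log_q = np.log(softmax(student_logits))
    return np.mean(np.sum(p * (np.log(p) - log_q), axis=-1))

t = np.array([[2.0, 0.5, -1.0]])   # teacher logits for one example
s = np.array([[1.5, 0.7, -0.8]])   # student logits on the translation
loss = kl_self_teaching_loss(t, s)  # small positive scalar
```

Soft targets of this kind are what make the loss usable for token-level tasks (QA, NER, POS), where copying a hard source-language label to the translation would be wrong.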
Soft Language Clustering for Multilingual Model Pre-training
Multilingual pre-trained language models have demonstrated impressive
(zero-shot) cross-lingual transfer abilities; however, their performance is
hindered when the target language is typologically distant from the source languages or
when pre-training data is limited in size. In this paper, we propose XLM-P,
which contextually retrieves prompts as flexible guidance for encoding
instances conditionally. Our XLM-P enables (1) lightweight modeling of
language-invariant and language-specific knowledge across languages, and (2)
easy integration with other multilingual pre-training methods. On the tasks of
XTREME including text classification, sequence labeling, question answering,
and sentence retrieval, both base- and large-size language models pre-trained
with our proposed method exhibit consistent performance improvement.
Furthermore, it provides substantial advantages for low-resource languages in
unsupervised sentence retrieval and for target languages that differ greatly
from the source language in cross-lingual transfer.
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
The milestone improvements brought about by deep representation learning and
pre-training techniques have led to large performance gains across downstream
NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large
high-quality visio-linguistic datasets for learning complementary information
(across image and text modalities). In this paper, we introduce the
Wikipedia-based Image Text (WIT) Dataset
(https://github.com/google-research-datasets/wit) to better facilitate
multimodal, multilingual learning. WIT is composed of a curated set of 37.6
million entity rich image-text examples with 11.5 million unique images across
108 Wikipedia languages. Its size enables WIT to be used as a pretraining
dataset for multimodal models, as we show when applied to downstream tasks such
as image-text retrieval. WIT has four main and unique advantages. First, WIT is
the largest multimodal dataset by number of image-text examples, exceeding the
next largest by 3x (at the time of writing). Second, WIT is massively multilingual (first of its kind)
with coverage over 100+ languages (each of which has at least 12K examples) and
provides cross-lingual texts for many images. Third, WIT represents a more
diverse set of concepts and real world entities relative to what previous
datasets cover. Lastly, WIT provides a very challenging real-world test set, as
we empirically illustrate using an image-text retrieval task as an example.
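An image-text retrieval evaluation of the kind mentioned above can be sketched generically with recall@k over cosine similarity (toy embeddings; nothing here is tied to a specific WIT release or model):

```python
import numpy as np

def recall_at_k(image_embs, text_embs, k=5):
    """image_embs, text_embs: (n, dim); row i of each is a matched pair.
    Returns the fraction of texts whose true image ranks in the top k."""
    im = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    tx = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = tx @ im.T                           # text-to-image cosine matrix
    topk = np.argsort(-sims, axis=1)[:, :k]    # best-k images per text
    hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
E = rng.standard_normal((100, 32))
print(recall_at_k(E, E, k=1))  # → 1.0 for identical embeddings
```

With real data the challenge WIT poses comes from many visually or topically similar pages competing for the top ranks, which this metric captures directly.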
Exploiting Cross-Lingual Representations For Natural Language Processing
Traditional approaches to supervised learning require a generous amount of labeled data for good generalization. While such annotation-heavy approaches have proven useful for some Natural Language Processing (NLP) tasks in high-resource languages (like English), they are unlikely to scale to languages where collecting labeled data is difficult and time-consuming. Translating supervision available in English is also not a viable solution, because developing a good machine translation system requires expensive-to-annotate resources that are not available for most languages.
In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation. These representations can be learned in an inexpensive manner, often from signals completely unrelated to the task of interest. I begin with a review of different ways of inducing such representations using a variety of cross-lingual signals and study algorithmic approaches of using them in a diverse set of downstream tasks. Examples of such tasks covered in this thesis include learning representations to transfer a trained model across languages for document classification, assist in monolingual lexical semantics like word sense induction, identify asymmetric lexical relationships like hypernymy between words in different languages, or combine supervision across languages through a shared feature space for cross-lingual entity linking. In all these applications, the representations make information expressed in other languages available in English, while requiring minimal additional supervision in the language of interest.