Analyzing the Limitations of Cross-lingual Word Embedding Mappings
Recent research in cross-lingual word embeddings has almost exclusively
focused on offline methods, which independently train word embeddings in
different languages and map them to a shared space through linear
transformations. While several authors have questioned the underlying
isomorphism assumption, which states that word embeddings in different
languages have approximately the same structure, it is not clear whether this
is an inherent limitation of mapping approaches or a more general issue when
learning cross-lingual embeddings. To answer this question, we experiment
with parallel corpora, which allow us to compare offline mapping to an
extension of skip-gram that jointly learns both embedding spaces. We observe
that, under these ideal conditions, joint learning yields more isomorphic
embeddings, is less sensitive to hubness, and obtains stronger results in
bilingual lexicon induction. We thus conclude that current mapping methods do
have strong limitations, calling for further research to jointly learn
cross-lingual embeddings with a weaker cross-lingual signal.
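The offline-mapping baseline questioned here learns a linear (typically orthogonal) transformation between two independently trained embedding spaces. As a minimal illustration, assuming a seed dictionary of paired vectors and synthetic data in place of real embeddings, the closed-form orthogonal Procrustes solution looks like this:

```python
import numpy as np

def procrustes_map(X, Y):
    """Orthogonal map W minimizing ||X @ W - Y||_F for paired rows of X and Y
    (the usual closed-form solution used by offline mapping methods)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy usage with random vectors standing in for independently trained embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))                       # source-side seed-pair vectors
W_true = np.linalg.qr(rng.normal(size=(300, 300)))[0]  # hidden rotation between the spaces
Y = X @ W_true + 0.01 * rng.normal(size=X.shape)       # near-isomorphic target vectors

W = procrustes_map(X, Y)
print(np.allclose(X @ W, Y, atol=0.1))                 # True: mapping recovers the shared space
```

The abstract's point is that this setup only works well when the two spaces are already nearly isomorphic; joint training on parallel data removes that assumption.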
Constrained Density Matching and Modeling for Cross-lingual Alignment of Contextualized Representations
Multilingual representations pre-trained with monolingual data exhibit
considerably unequal task performances across languages. Previous studies
address this challenge with resource-intensive contextualized alignment, which
assumes the availability of large parallel data, thereby leaving
under-represented language communities behind. In this work, we attribute the
data hunger of previous alignment techniques to two limitations: (i) the
inability to leverage data sufficiently and (ii) the lack of proper training.
To address these issues, we introduce supervised and unsupervised
density-based alignment approaches, Real-NVP and GAN-Real-NVP, driven by
normalizing flows; both dissect the alignment of multilingual subspaces into
density matching and density modeling. We
complement these approaches with our validation criteria in order to guide the
training process. Our experiments encompass 16 alignments, including our
approaches, evaluated across 6 language pairs, synthetic data and 5 NLP tasks.
We demonstrate the effectiveness of our approaches in the scenarios of limited
and no parallel data. First, our supervised approach trained on 20k parallel
sentences mostly surpasses Joint-Align and InfoXLM trained on over 100k
parallel sentences. Second, parallel data can be removed without sacrificing
performance when integrating our unsupervised approach in our bootstrapping
procedure, which is theoretically motivated to enforce equality of multilingual
subspaces. Moreover, we demonstrate the advantages of validation criteria over
validation data for guiding supervised training.
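The abstract does not spell out the architecture, but a Real-NVP-style flow is built from affine coupling layers trained by maximum likelihood under a simple base density. The following is a generic PyTorch sketch of that building block only; the layer sizes, the alternating flip, and the toy training loop are illustrative assumptions, not the authors' configuration:

```python
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Real-NVP affine coupling layer: rescales and shifts half of the
    dimensions conditioned on the other half; the log-determinant is tractable."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                       # bound the scales for stability
        z2 = x2 * torch.exp(s) + t
        return torch.cat([x1, z2], dim=-1), s.sum(dim=-1)

def nll(layers, x):
    """Negative log-likelihood of x under the flow with an N(0, I) base density."""
    log_det = torch.zeros(x.shape[0])
    for layer in layers:
        x, ld = layer(x)
        log_det = log_det + ld
        x = torch.flip(x, dims=[-1])            # alternate which half gets transformed
    log_pz = -0.5 * (x ** 2).sum(-1) - 0.5 * x.shape[-1] * math.log(2 * math.pi)
    return -(log_pz + log_det).mean()

# Toy density modeling of stand-in contextual embeddings for one language.
dim = 32
flow = nn.ModuleList([AffineCoupling(dim) for _ in range(4)])
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
emb = torch.randn(512, dim)                     # placeholder for real embeddings
for _ in range(200):
    opt.zero_grad()
    loss = nll(flow, emb)
    loss.backward()
    opt.step()
```

Density matching would additionally push the transformed source and target distributions toward each other; only the density-modeling half is sketched here.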
Analogy Training Multilingual Encoders
Language encoders represent words and phrases in ways that capture their local semantic relatedness, but are known to be globally inconsistent. Global inconsistency can seemingly be corrected for, in part, by leveraging signals from knowledge bases, but previous results are partial and limited to monolingual English encoders. We extract a large-scale multilingual, multi-word analogy dataset from Wikidata for diagnosing and correcting global inconsistencies, and implement a four-way Siamese BERT architecture for grounding multilingual BERT (mBERT) in Wikidata through analogy training. We show that analogy training not only improves the global consistency of mBERT and the isomorphism of language-specific subspaces, but also leads to significant gains on downstream tasks such as bilingual dictionary induction and sentence retrieval.
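A four-way Siamese setup encodes the four terms of an analogy a : b :: c : d with one shared encoder and trains on Wikidata analogies. A rough sketch of that idea follows; the offset-consistency loss, mean pooling, and the single toy analogy are illustrative stand-ins, not the paper's exact objective or data:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)      # one shared encoder = "Siamese" use

def embed(phrases):
    """Mean-pooled mBERT representations for a list of phrases."""
    batch = tok(phrases, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

def analogy_loss(a, b, c, d):
    """Encourage the b - a offset to match the d - c offset (one plausible
    analogy objective; the paper's exact loss may differ)."""
    return (1 - nn.functional.cosine_similarity(b - a, d - c)).mean()

# One toy step on a Wikidata-style analogy: Paris : France :: Tokyo : Japan.
opt = torch.optim.AdamW(encoder.parameters(), lr=2e-5)
a, b, c, d = (embed([x]) for x in ["Paris", "France", "Tokyo", "Japan"])
loss = analogy_loss(a, b, c, d)
loss.backward()
opt.step()
```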
On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation
Evaluation of cross-lingual encoders is usually performed either via
zero-shot cross-lingual transfer in supervised downstream tasks or via
unsupervised cross-lingual textual similarity. In this paper, we concern
ourselves with reference-free machine translation (MT) evaluation where we
directly compare source texts to (sometimes low-quality) system translations,
which represents a natural adversarial setup for multilingual encoders.
Reference-free evaluation holds the promise of web-scale comparison of MT
systems. We systematically investigate a range of metrics based on
state-of-the-art cross-lingual semantic representations obtained with
pretrained M-BERT and LASER. We find that they perform poorly as semantic
encoders for reference-free MT evaluation and identify their two key
limitations, namely, (a) a semantic mismatch between representations of mutual
translations and, more prominently, (b) the inability to punish
"translationese", i.e., low-quality literal translations. We propose two
partial remedies: (1) post-hoc re-alignment of the vector spaces and (2)
coupling of semantic-similarity based metrics with target-side language
modeling. In segment-level MT evaluation, our best metric surpasses
reference-based BLEU by 5.7 correlation points.
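Remedy (2) above couples a cross-lingual similarity score with target-side language modeling. A minimal sketch of such a combined metric is given below; the choice of mBERT mean pooling as the encoder, GPT-2 as the target-side language model, and the mixing weight alpha are assumptions for illustration, not the paper's exact metric:

```python
import torch
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM

# Cross-lingual sentence similarity from mean-pooled mBERT states
# (LASER would be another choice; this encoder is only a stand-in).
enc_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
# Target-side language model used to penalize disfluent, overly literal output.
lm_tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def sent_embed(text):
    batch = enc_tok(text, return_tensors="pt")
    return encoder(**batch).last_hidden_state.mean(dim=1).squeeze(0)

def lm_logprob(text):
    ids = lm_tok(text, return_tensors="pt").input_ids
    return -lm(ids, labels=ids).loss.item()    # mean token log-probability

def reference_free_score(source, hypothesis, alpha=0.5):
    """Blend cross-lingual similarity with target-side fluency.
    alpha is an illustrative mixing weight, not a value from the paper."""
    sim = torch.cosine_similarity(sent_embed(source), sent_embed(hypothesis), dim=0)
    return alpha * sim.item() + (1 - alpha) * lm_logprob(hypothesis)

print(reference_free_score("Der Hund bellt laut.", "The dog barks loudly."))
```

Note that the two components live on different scales, so in practice the weighting (or a learned combination) matters more than this sketch suggests.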
Robust Unsupervised Cross-Lingual Word Embedding using Domain Flow Interpolation
This paper investigates an unsupervised approach towards deriving a
universal, cross-lingual word embedding space, where words with similar
semantics from different languages are close to one another. Previous
adversarial approaches have shown promising results in inducing cross-lingual
word embedding without parallel data. However, the training stage shows
instability for distant language pairs. Instead of mapping the source language
space directly to the target language space, we propose to make use of a
sequence of intermediate spaces for smooth bridging. Each intermediate space
may be conceived as a pseudo-language space and is introduced via simple linear
interpolation. This approach is modeled after domain flow in computer vision,
but with a modified objective function. Experiments on intrinsic Bilingual
Dictionary Induction tasks show that the proposed approach can improve the
robustness of adversarial models with comparable or even better precision.
Further experiments on the downstream task of Cross-Lingual Natural Language
Inference show that the proposed model achieves significant performance
improvement for distant language pairs in downstream tasks compared to
state-of-the-art adversarial and non-adversarial models.
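The intermediate pseudo-language spaces are introduced via simple linear interpolation between the source and target spaces. The sketch below only illustrates that interpolation with synthetic data; the adversarial training and the modified domain-flow objective the abstract mentions are not reproduced here:

```python
import numpy as np

def intermediate_space(src_emb, tgt_emb, t):
    """Pseudo-language space at interpolation step t in [0, 1].

    src_emb, tgt_emb: (n, d) matrices of vectors sampled from the source and
    target embedding spaces (the row pairing here is arbitrary; it only
    defines a path between the two distributions).
    t = 0 gives the source space, t = 1 the target space.
    """
    return (1.0 - t) * src_emb + t * tgt_emb

rng = np.random.default_rng(0)
src = rng.normal(size=(5000, 300))              # stand-in source-language vectors
tgt = rng.normal(loc=0.5, size=(5000, 300))     # stand-in target-language vectors

# A smooth bridge of pseudo-language spaces: an adversarial mapping can be
# trained against these progressively harder targets instead of jumping
# directly from src to tgt.
schedule = [0.0, 0.25, 0.5, 0.75, 1.0]
bridges = [intermediate_space(src, tgt, t) for t in schedule]
```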
Large language models converge toward human-like concept organization
Large language models show human-like performance in knowledge extraction,
reasoning and dialogue, but it remains controversial whether this performance
is best explained by memorization and pattern matching, or whether it reflects
human-like inferential semantics and world knowledge. Knowledge bases such as
WikiData provide large-scale, high-quality representations of inferential
semantics and world knowledge. We show that large language models learn to
organize concepts in ways that are strikingly similar to how concepts are
organized in such knowledge bases. Knowledge bases model collective,
institutional knowledge, and large language models seem to induce such
knowledge from raw text. We show that bigger and better models exhibit more
human-like concept organization, across four families of language models and
three knowledge graph embeddings.
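One common way to quantify how similarly two systems organize the same set of concepts is representational similarity analysis: correlate the pairwise-distance structure of language-model concept vectors with that of knowledge-graph embeddings for the same entities. The abstract does not state the paper's exact measure, so the sketch below, with synthetic stand-in vectors, is only illustrative:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa(emb_a, emb_b):
    """Spearman correlation between the pairwise cosine-distance structures of
    two embedding sets whose rows refer to the same concepts."""
    return spearmanr(pdist(emb_a, "cosine"), pdist(emb_b, "cosine"))[0]

rng = np.random.default_rng(0)
lm_vecs = rng.normal(size=(200, 768))                            # stand-in LM vectors for 200 entities
kg_vecs = lm_vecs[:, :100] + 0.1 * rng.normal(size=(200, 100))   # stand-in KG embeddings sharing structure
print(rsa(lm_vecs, kg_vecs))   # closer to 1.0 = more similar concept organization
```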