On a Novel Application of Wasserstein-Procrustes for Unsupervised Cross-Lingual Learning
The emergence of unsupervised word embeddings, pre-trained on very large
monolingual text corpora, is at the core of the ongoing neural revolution in
Natural Language Processing (NLP). Initially introduced for English, such
pre-trained word embeddings quickly emerged for a number of other languages.
Subsequently, there have been a number of attempts to align the embedding
spaces across languages, which could enable a number of cross-language NLP
applications. Performing the alignment using unsupervised cross-lingual
learning (UCL) is especially attractive as it requires little data and often
rivals supervised and semi-supervised approaches. Here, we analyze popular
methods for UCL and find that their objectives are often, intrinsically,
versions of the Wasserstein-Procrustes problem. Hence, we devise an approach to
solve Wasserstein-Procrustes in a direct way, which can be used to refine and
to improve popular UCL methods such as iterative closest point (ICP),
multilingual unsupervised and supervised embeddings (MUSE) and supervised
Procrustes methods. Our evaluation experiments on standard datasets show
sizable improvements over these approaches. We believe that our rethinking of
the Wasserstein-Procrustes problem could enable further research, thus helping
to develop better algorithms for aligning word embeddings across languages. Our
code and instructions to reproduce the experiments are available at
https://github.com/guillemram97/wp-hungarian
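To make the objective concrete, below is a minimal sketch of the alternating scheme behind Wasserstein-Procrustes: with the orthogonal map fixed, the best matching is a linear assignment problem (solvable by the Hungarian algorithm, echoed in the repository name above); with the matching fixed, the best orthogonal map is classical Procrustes via SVD. The toy data and solver are illustrative assumptions, not the paper's exact refinement procedure.

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import linear_sum_assignment

def wasserstein_procrustes(X, Y, n_iters=20):
    """Alternate between the optimal matching and the optimal orthogonal map."""
    W = np.eye(X.shape[1])
    cols = np.arange(X.shape[0])
    for _ in range(n_iters):
        # With W fixed, minimizing ||XW - PY||_F^2 over permutations P is a
        # linear assignment problem on the negated inner products.
        rows, cols = linear_sum_assignment(-(X @ W) @ Y.T)
        # With the matching fixed, the optimal orthogonal W is classical
        # Procrustes: SVD of the cross-covariance of the matched rows.
        U, _, Vt = np.linalg.svd(X[rows].T @ Y[cols])
        W = U @ Vt
    return W, cols

# Toy check: Y shuffled and rotated by a small hidden orthogonal map Q.
rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 20))
A = 0.1 * rng.normal(size=(20, 20))
Q = expm(A - A.T)                   # small rotation near the identity
perm = rng.permutation(100)
X = Y[perm] @ Q.T                   # so that X @ Q == Y[perm]
W, matching = wasserstein_procrustes(X, Y)
print((matching == perm).mean())    # fraction of correctly matched rows
```

With larger hidden rotations, this alternating scheme from an identity start is prone to poor local minima, which is precisely the initialization difficulty that refinements of the kind the paper proposes aim to address.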
InfoOT: Information Maximizing Optimal Transport
Optimal transport aligns samples across distributions by minimizing the
transportation cost between them, e.g., the geometric distances. Yet, it
ignores coherence structure in the data such as clusters, does not handle
outliers well, and cannot integrate new data points. To address these
drawbacks, we propose InfoOT, an information-theoretic extension of optimal
transport that maximizes the mutual information between domains while
minimizing geometric distances. The resulting objective can still be formulated
as a (generalized) optimal transport problem, and can be efficiently solved by
projected gradient descent. This formulation yields a new projection method
that is robust to outliers and generalizes to unseen samples. Empirically,
InfoOT improves the quality of alignments across benchmarks in domain
adaptation, cross-domain retrieval, and single-cell alignment.
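As background for how such an extension is set up, the sketch below runs plain entropic OT with the POT library and applies the standard barycentric projection that maps source points into the target domain. InfoOT replaces this purely geometric objective with one that also maximizes a kernel-based mutual information estimate and robustifies the projection; the data and parameters here are illustrative assumptions, not the paper's method.

```python
import numpy as np
import ot  # POT library: pip install pot

rng = np.random.default_rng(0)
Xs = rng.normal(size=(60, 2))             # source samples
Xt = rng.normal(loc=3.0, size=(80, 2))    # target samples

a = np.full(60, 1 / 60)                   # uniform source weights
b = np.full(80, 1 / 80)                   # uniform target weights
C = ot.dist(Xs, Xt)                       # squared Euclidean cost matrix

# Entropic OT plan via Sinkhorn; InfoOT augments this geometric objective
# with a mutual-information term and solves it by projected gradient descent.
T = ot.sinkhorn(a, b, C / C.max(), reg=0.05)

# Barycentric projection: send each source point to the T-weighted mean of
# target points. InfoOT's conditional projection is a robust variant of this
# that also generalizes to unseen samples.
Xs_mapped = (T @ Xt) / T.sum(axis=1, keepdims=True)
print(Xs_mapped.shape)                    # (60, 2)
```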
Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment
Word translation without parallel corpora has become feasible, rivaling the
performance of supervised methods. Recent findings have shown that the accuracy
and robustness of unsupervised word translation (UWT) can be improved by making
use of visual observations, which are universal representations across
languages. In this work, we investigate the potential of using not only visual
observations but also pretrained language-image models for enabling a more
efficient and robust UWT. Specifically, we develop a novel UWT method dubbed
Word Alignment using Language-Image Pretraining (WALIP), which leverages visual
observations via the shared embedding space of images and texts provided by
CLIP models (Radford et al., 2021). WALIP has a two-step procedure. First, we
retrieve word pairs with high confidence of similarity, computed using our
proposed image-based fingerprints, which define the initial pivot for the word
alignment. Second, we apply our robust Procrustes algorithm to estimate the
linear mapping between two embedding spaces, which iteratively corrects and
refines the estimated alignment. Our extensive experiments show that WALIP
improves upon the state-of-the-art performance of bilingual word alignment for
a few language pairs across different word embeddings, and displays strong
robustness to dissimilarity between the language pairs or the training corpora
of the two word embeddings.
Comment: In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings).
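As a rough illustration of the second step, the sketch below fits an orthogonal map by Procrustes on a handful of seed pairs and then re-induces the dictionary by nearest-neighbor retrieval in each round. WALIP's actual robust Procrustes additionally filters pairs using the CLIP-based fingerprint confidences from step one, so treat this as a simplified stand-in under toy assumptions.

```python
import numpy as np

def refine_procrustes(X, Y, seed_pairs, n_rounds=5):
    """X, Y: row-normalized embedding matrices; seed_pairs: list of (i, j)."""
    src, tgt = map(np.array, zip(*seed_pairs))
    for _ in range(n_rounds):
        # Orthogonal Procrustes on the current dictionary (seeds in round one).
        U, _, Vt = np.linalg.svd(X[src].T @ Y[tgt])
        W = U @ Vt
        # Re-induce the dictionary: nearest target neighbor of each mapped
        # source word (cosine similarity, since rows are normalized). WALIP
        # instead keeps only pairs passing a robust confidence filter.
        sims = X @ W @ Y.T
        tgt = sims.argmax(axis=1)
        src = np.arange(X.shape[0])
    return W
```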
Multilinguality from Static Embedding Spaces: Algorithmic, Geometric, and Data Considerations
To date, most work towards developing natural language processing (NLP) technologies has focused on the English language. At the same time, there are an estimated 7000+ living languages in existence, the majority of which lack adequate representation on the internet. To reach the world, NLP systems need to function for all users in their preferred language, and in low-data and low-compute scenarios. In this work, we study multilinguality arising from monolingual embedding spaces, considering their behavior under data, algorithmic, and geometric change.
We begin with a motivating example from unsupervised machine translation (MT), which is an encouraging paradigm for bringing language technologies to the world, as it can be built using monolingual data alone. This lack of reliance on translated parallel data is important because such data is scarce for most languages. We explore unsupervised MT that extracts translation pairs from monolingual embedding spaces (the bilingual lexicon induction [BLI] task), finding that the quality of separately-trained spaces is critical to performance in BLI and downstream MT, and we show that mismatched data conditions cause performance to deteriorate rapidly (data change).
From there, we analyze the BLI task further, comparing two very different approaches. We show that differing mathematical framings lead to divergent performance, depending on the supervision available, and bring a newly-developed algorithm from combinatorial optimization to the NLP literature (algorithmic change). We propose a combination system which capitalizes on the strengths of both framings for better performance than either alone, and expand the analysis to 40 language pairs.
Finally, we address a root cause of faulty cross-lingual mapping and BLI failure, namely that separately-trained monolingual word embedding spaces are non-isomorphic, by controlling their relative geometric similarity at training time. Our method is a very different way of approaching BLI, starting with embedding training itself, and improves downstream BLI performance under challenging data scenarios of algorithm and domain mismatch (geometric change).
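As a concrete instance of the BLI retrieval step described above, the following sketch induces translation pairs from two already-mapped embedding spaces using CSLS, the standard hubness correction for cross-lingual nearest-neighbor retrieval. It is an illustrative assumption that this exact retrieval rule matches the thesis's systems.

```python
import numpy as np

def csls_retrieve(X, Y, k=10):
    """X: mapped source embeddings; Y: target embeddings (rows L2-normalized)."""
    sims = X @ Y.T
    # Average similarity of each point to its k nearest cross-lingual
    # neighbors, used to penalize "hub" words that are близко to everything.
    r_x = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_y = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    csls = 2 * sims - r_x - r_y
    return csls.argmax(axis=1)   # induced translation index per source word
```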
Uncovering Challenges of Solving the Continuous Gromov-Wasserstein Problem
Recently, the Gromov-Wasserstein Optimal Transport (GWOT) problem has
attracted special attention from the ML community. In this problem, given two
distributions supported on two (possibly different) spaces, one has to find the
most isometric map between them. In the discrete variant of GWOT, the task is
to learn an assignment between given discrete sets of points. In the more
advanced continuous formulation, one aims at recovering a parametric mapping
between unknown continuous distributions based on i.i.d. samples derived from
them. The clear geometrical intuition behind the GWOT makes it a natural choice
for several practical use cases, giving rise to a number of proposed solvers.
Some of them claim to solve the continuous version of the problem. At the same
time, GWOT is notoriously hard, both theoretically and numerically. Moreover,
all existing continuous GWOT solvers still heavily rely on discrete techniques.
Natural questions arise: to what extent do existing methods solve the GWOT
problem, what difficulties do they encounter, and under which conditions are
they successful? Our benchmark paper is an attempt to answer these questions. We
specifically focus on the continuous GWOT as the most interesting and debatable
setup. We crash-test existing continuous GWOT approaches in different
scenarios, carefully record and analyze the obtained results, and identify
issues. Our findings experimentally demonstrate that the scientific community is
still missing a reliable continuous GWOT solver, which necessitates further
research efforts. As the first step in this direction, we propose a new
continuous GWOT method which does not rely on discrete techniques and partially
solves some of the problems of the competitors. Our code is available at
https://github.com/Ark-130994/GW-Solvers
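To make the discrete variant concrete, the snippet below computes a Gromov-Wasserstein coupling between point clouds living in spaces of different dimension using the POT library; as the paper argues, continuous solvers still tend to lean on discrete machinery of this kind. Shapes and data are toy assumptions.

```python
import numpy as np
import ot  # POT library: pip install pot

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # point cloud in a 3-D space
Y = rng.normal(size=(60, 5))   # point cloud in a different, 5-D space

# GW compares intra-space distance matrices, so no cross-space cost is needed.
C1 = ot.dist(X, X, metric='euclidean')
C2 = ot.dist(Y, Y, metric='euclidean')
p = np.full(50, 1 / 50)        # uniform weights on each cloud
q = np.full(60, 1 / 60)

# Discrete GW coupling: approximately the "most isometric" soft assignment.
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')
print(T.shape)                 # (50, 60)
```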
Preference-based Representation Learning for Collections
In this thesis, I make contributions to the development of representation learning in the setting of external constraints and noisy supervision. A setting of external constraints refers to the scenario in which the learner must output a latent representation of the given data points while enforcing particular conditions. These conditions can be geometric constraints, for example forcing the embedding vectors to be close to each other based on particular relations, or forcing them to lie on a particular manifold, such as the manifold of vectors whose elements sum to 1, or even more complex constraints. The objects of interest in this thesis are elements of a collection X in an abstract space endowed with a similarity function that quantifies how similar two objects are. A collection is defined as a set of items in which the order is ignored but the multiplicity is relevant. Various types of collections are used as inputs or outputs in machine learning; the most common are perhaps sequences and sets.
Besides studying representation learning approaches in the presence of external constraints, this thesis tackles the case in which the similarity function cannot be evaluated directly. In recent years, the machine learning setting in which only binary answers to comparisons between tuples of elements are available has gained interest. Learning good representations in a scenario where clear distance information cannot be obtained is of fundamental importance; this problem is the opposite of the standard machine learning setting, where the similarity function between elements can be evaluated directly. Moreover, we tackle the case in which the learner is given noisy supervision signals, with a certain probability for each label to be incorrect. Another research question studied in this thesis is how to assess the quality of the learned representations and how a learner can convey its uncertainty about them.
After the introductory Chapter 1, the thesis is structured in three main parts. In the first part, I present the results of representation learning based on data points that are sequences, focusing on sentences and permutations, two particular types of sequences. The first contribution of this part consists in enforcing analogical relations between sentences; the second is learning appropriate representations for permutations, which are particular mathematical objects, using neural networks. The second part of the thesis tackles the question of learning perceptual embeddings from binary and noisy comparisons; in machine learning, this is referred to as the ordinal embedding problem. This part contains two chapters that elaborate on two different aspects of the problem: appropriately conveying the uncertainty of the representation, and learning the embeddings from aggregated and noisy feedback. Finally, the third part of the thesis contains applications of the findings of the previous part, namely unsupervised alignment of clouds of embedding vectors and entity set extension.
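A minimal sketch of the ordinal embedding problem from the second part: recover point positions purely from noisy binary triplet answers ("is i closer to j than to k?") using a hinge loss and stochastic gradient steps. The oracle, loss, and step sizes below are illustrative assumptions, not the thesis's algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 2
Z = rng.normal(size=(n, d))                  # hidden ground-truth points

def answer(i, j, k, noise=0.1):
    """Noisy comparison oracle: True if i is closer to j than to k."""
    truth = np.sum((Z[i] - Z[j])**2) < np.sum((Z[i] - Z[k])**2)
    return truth ^ (rng.random() < noise)    # flip the label with prob `noise`

triplets = [tuple(rng.choice(n, 3, replace=False)) for _ in range(3000)]
labels = [answer(*t) for t in triplets]

X = 0.01 * rng.normal(size=(n, d))           # learned embedding
lr, margin = 0.01, 0.1
for epoch in range(50):
    for (i, j, k), closer in zip(triplets, labels):
        if not closer:                       # orient so j is the near item
            j, k = k, j
        violation = margin + np.sum((X[i] - X[j])**2) - np.sum((X[i] - X[k])**2)
        if violation > 0:                    # hinge active: take a step
            gi = 2 * (X[k] - X[j])           # gradient of the active hinge term
            gj = 2 * (X[j] - X[i])
            gk = 2 * (X[i] - X[k])
            X[i], X[j], X[k] = X[i] - lr * gi, X[j] - lr * gj, X[k] - lr * gk
```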
Characterizing Similarities and Divergences in Conversational Tones in Humans and LLMs by Sampling with People
Conversational tones -- the manners and attitudes in which speakers
communicate -- are essential to effective communication. Amidst the increasing
popularization of Large Language Models (LLMs) over recent years, it becomes
necessary to characterize the divergences in their conversational tones
relative to humans. However, existing investigations of conversational
modalities rely on pre-existing taxonomies or text corpora, which suffer from
experimenter bias and may not be representative of real-world distributions for
the studies' psycholinguistic domains. Inspired by methods from cognitive
science, we propose an iterative method for simultaneously eliciting
conversational tones and sentences, where participants alternate between two
tasks: (1) one participant identifies the tone of a given sentence and (2) a
different participant generates a sentence based on that tone. We run 100
iterations of this process with human participants and GPT-4, then obtain a
dataset of sentences and frequent conversational tones. In an additional
experiment, humans and GPT-4 annotated all sentences with all tones. With data
from 1,339 human participants, 33,370 human judgments, and 29,900 GPT-4
queries, we show how our approach can be used to create an interpretable
geometric representation of relations between conversational tones in humans
and GPT-4. This work demonstrates how combining ideas from machine learning and
cognitive science can address challenges in human-computer interactions.
Comment: Accepted to the Main Conference at ACL 2024.
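Schematically, the elicitation procedure is a chain that alternates the two tasks. The sketch below makes the loop explicit, with hypothetical identify_tone and generate_sentence callables standing in for the human or GPT-4 participants; in the actual study, consecutive tasks are routed to different participants.

```python
def elicitation_chain(seed_sentence, identify_tone, generate_sentence, n_iters=100):
    """Alternate tone identification and sentence generation, collecting samples."""
    sentences, tones = [seed_sentence], []
    for _ in range(n_iters):
        tone = identify_tone(sentences[-1])   # task 1: label the last sentence's tone
        sentence = generate_sentence(tone)    # task 2: write a sentence in that tone
        tones.append(tone)
        sentences.append(sentence)
    return sentences, tones
```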
