On a Novel Application of Wasserstein-Procrustes for Unsupervised Cross-Lingual Learning
The emergence of unsupervised word embeddings, pre-trained on very large
monolingual text corpora, is at the core of the ongoing neural revolution in
Natural Language Processing (NLP). Initially introduced for English, such
pre-trained word embeddings quickly emerged for a number of other languages.
Subsequently, there have been a number of attempts to align the embedding
spaces across languages, which could enable a number of cross-language NLP
applications. Performing the alignment using unsupervised cross-lingual
learning (UCL) is especially attractive as it requires little data and often
rivals supervised and semi-supervised approaches. Here, we analyze popular
methods for UCL and find that their objectives are often, intrinsically,
versions of the Wasserstein-Procrustes problem. Hence, we devise an approach to
solve Wasserstein-Procrustes in a direct way, which can be used to refine and
to improve popular UCL methods such as iterative closest point (ICP),
multilingual unsupervised and supervised embeddings (MUSE) and supervised
Procrustes methods. Our evaluation experiments on standard datasets show
sizable improvements over these approaches. We believe that our rethinking of
the Wasserstein-Procrustes problem could enable further research, thus helping
to develop better algorithms for aligning word embeddings across languages. Our
code and instructions to reproduce the experiments are available at
https://github.com/guillemram97/wp-hungarian
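To make the objective concrete, below is a minimal sketch of the alternating scheme behind Wasserstein-Procrustes: with the orthogonal map fixed, the best matching is a linear assignment problem (solvable by the Hungarian algorithm, echoed in the repository name above); with the matching fixed, the best orthogonal map is classical Procrustes via SVD. The toy data and solver are illustrative assumptions, not the paper's exact refinement procedure.

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import linear_sum_assignment

def wasserstein_procrustes(X, Y, n_iters=20):
    """Alternate between the optimal matching and the optimal orthogonal map."""
    W = np.eye(X.shape[1])
    cols = np.arange(X.shape[0])
    for _ in range(n_iters):
        # With W fixed, minimizing ||XW - PY||_F^2 over permutations P is a
        # linear assignment problem on the negated inner products.
        rows, cols = linear_sum_assignment(-(X @ W) @ Y.T)
        # With the matching fixed, the optimal orthogonal W is classical
        # Procrustes: SVD of the cross-covariance of the matched rows.
        U, _, Vt = np.linalg.svd(X[rows].T @ Y[cols])
        W = U @ Vt
    return W, cols

# Toy check: Y shuffled and rotated by a small hidden orthogonal map Q.
rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 20))
A = 0.1 * rng.normal(size=(20, 20))
Q = expm(A - A.T)                   # small rotation near the identity
perm = rng.permutation(100)
X = Y[perm] @ Q.T                   # so that X @ Q == Y[perm]
W, matching = wasserstein_procrustes(X, Y)
print((matching == perm).mean())    # fraction of correctly matched rows
```

With larger hidden rotations, this alternating scheme from an identity start is prone to poor local minima, which is precisely the initialization difficulty that refinements of the kind the paper proposes aim to address.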
InfoOT: Information Maximizing Optimal Transport
Optimal transport aligns samples across distributions by minimizing the
transportation cost between them, e.g., the geometric distances. Yet, it
ignores coherence structure in the data such as clusters, does not handle
outliers well, and cannot integrate new data points. To address these
drawbacks, we propose InfoOT, an information-theoretic extension of optimal
transport that maximizes the mutual information between domains while
minimizing geometric distances. The resulting objective can still be formulated
as a (generalized) optimal transport problem, and can be efficiently solved by
projected gradient descent. This formulation yields a new projection method
that is robust to outliers and generalizes to unseen samples. Empirically,
InfoOT improves the quality of alignments across benchmarks in domain
adaptation, cross-domain retrieval, and single-cell alignment.
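As background for how such an extension is set up, the sketch below runs plain entropic OT with the POT library and applies the standard barycentric projection that maps source points into the target domain. InfoOT replaces this purely geometric objective with one that also maximizes a kernel-based mutual information estimate and robustifies the projection; the data and parameters here are illustrative assumptions, not the paper's method.

```python
import numpy as np
import ot  # POT library: pip install pot

rng = np.random.default_rng(0)
Xs = rng.normal(size=(60, 2))             # source samples
Xt = rng.normal(loc=3.0, size=(80, 2))    # target samples

a = np.full(60, 1 / 60)                   # uniform source weights
b = np.full(80, 1 / 80)                   # uniform target weights
C = ot.dist(Xs, Xt)                       # squared Euclidean cost matrix

# Entropic OT plan via Sinkhorn; InfoOT augments this geometric objective
# with a mutual-information term and solves it by projected gradient descent.
T = ot.sinkhorn(a, b, C / C.max(), reg=0.05)

# Barycentric projection: send each source point to the T-weighted mean of
# target points. InfoOT's conditional projection is a robust variant of this
# that also generalizes to unseen samples.
Xs_mapped = (T @ Xt) / T.sum(axis=1, keepdims=True)
print(Xs_mapped.shape)                    # (60, 2)
```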
Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment
Word translation without parallel corpora has become feasible, rivaling the
performance of supervised methods. Recent findings have shown that the accuracy
and robustness of unsupervised word translation (UWT) can be improved by making
use of visual observations, which are universal representations across
languages. In this work, we investigate the potential of using not only visual
observations but also pretrained language-image models for enabling a more
efficient and robust UWT. Specifically, we develop a novel UWT method dubbed
Word Alignment using Language-Image Pretraining (WALIP), which leverages visual
observations via the shared embedding space of images and texts provided by
CLIP models (Radford et al., 2021). WALIP has a two-step procedure. First, we
retrieve word pairs with high confidence of similarity, computed using our
proposed image-based fingerprints, which define the initial pivot for the word
alignment. Second, we apply our robust Procrustes algorithm to estimate the
linear mapping between two embedding spaces, which iteratively corrects and
refines the estimated alignment. Our extensive experiments show that WALIP
improves upon the state-of-the-art performance of bilingual word alignment for
a few language pairs across different word embeddings, and displays strong
robustness to dissimilarity between the language pairs or the training corpora
of the two word embeddings.
Comment: In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings).
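As a rough illustration of the second step, the sketch below fits an orthogonal map by Procrustes on a handful of seed pairs and then re-induces the dictionary by nearest-neighbor retrieval in each round. WALIP's actual robust Procrustes additionally filters pairs using the CLIP-based fingerprint confidences from step one, so treat this as a simplified stand-in under toy assumptions.

```python
import numpy as np

def refine_procrustes(X, Y, seed_pairs, n_rounds=5):
    """X, Y: row-normalized embedding matrices; seed_pairs: list of (i, j)."""
    src, tgt = map(np.array, zip(*seed_pairs))
    for _ in range(n_rounds):
        # Orthogonal Procrustes on the current dictionary (seeds in round one).
        U, _, Vt = np.linalg.svd(X[src].T @ Y[tgt])
        W = U @ Vt
        # Re-induce the dictionary: nearest target neighbor of each mapped
        # source word (cosine similarity, since rows are normalized). WALIP
        # instead keeps only pairs passing a robust confidence filter.
        sims = X @ W @ Y.T
        tgt = sims.argmax(axis=1)
        src = np.arange(X.shape[0])
    return W
```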
Multilinguality from Static Embedding Spaces: Algorithmic, Geometric, and Data Considerations
To date, most work towards developing natural language processing (NLP) technologies has focused on the English language. At the same time, there are an estimated 7000+ living languages in existence, the majority of which lack adequate representation on the internet. To reach the world, NLP systems need to function for all users in their preferred language, and in low-data and low-compute scenarios. In this work, we study multilinguality arising from monolingual embedding spaces, considering their behavior under data, algorithmic, and geometric change.
We begin with a motivating example from unsupervised machine translation (MT), which is an encouraging paradigm for bringing language technologies to the world, as it can be built using monolingual data alone. This lack of reliance on translated parallel data is important because such data is scarce for most languages. We explore unsupervised MT that extracts translation pairs from monolingual embedding spaces (the bilingual lexicon induction [BLI] task), finding that the quality of separately-trained spaces is critical to performance in BLI and downstream MT, and we show that mismatched data conditions cause performance to deteriorate rapidly (data change).
From there, we analyze the BLI task further, comparing two very different approaches. We show that differing mathematical framings lead to divergent performance, depending on the supervision available, and bring a newly-developed algorithm from combinatorial optimization to the NLP literature (algorithmic change). We propose a combination system which capitalizes on the strengths of both framings for better performance than either alone, and expand the analysis to 40 language pairs.
Finally, we address a root cause of faulty cross-lingual mapping and BLI failure, namely that separately-trained monolingual word embedding spaces are non-isomorphic, by controlling their relative geometric similarity at training time. Our method is a very different way of approaching BLI, starting with embedding training itself, and improves downstream BLI performance under challenging data scenarios of algorithm and domain mismatch (geometric change).
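As a concrete instance of the BLI retrieval step described above, the following sketch induces translation pairs from two already-mapped embedding spaces using CSLS, the standard hubness correction for cross-lingual nearest-neighbor retrieval. It is an illustrative assumption that this exact retrieval rule matches the thesis's systems.

```python
import numpy as np

def csls_retrieve(X, Y, k=10):
    """X: mapped source embeddings; Y: target embeddings (rows L2-normalized)."""
    sims = X @ Y.T
    # Average similarity of each point to its k nearest cross-lingual
    # neighbors, used to penalize "hub" words that are близко to everything.
    r_x = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_y = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    csls = 2 * sims - r_x - r_y
    return csls.argmax(axis=1)   # induced translation index per source word
```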
Uncovering Challenges of Solving the Continuous Gromov-Wasserstein Problem
Recently, the Gromov-Wasserstein Optimal Transport (GWOT) problem has
attracted special attention from the ML community. In this problem, given two
distributions supported on two (possibly different) spaces, one has to find the
most isometric map between them. In the discrete variant of GWOT, the task is
to learn an assignment between given discrete sets of points. In the more
advanced continuous formulation, one aims at recovering a parametric mapping
between unknown continuous distributions based on i.i.d. samples derived from
them. The clear geometrical intuition behind the GWOT makes it a natural choice
for several practical use cases, giving rise to a number of proposed solvers.
Some of them claim to solve the continuous version of the problem. At the same
time, GWOT is notoriously hard, both theoretically and numerically. Moreover,
all existing continuous GWOT solvers still heavily rely on discrete techniques.
Natural questions arise: to what extent do existing methods solve the GWOT
problem, what difficulties do they encounter, and under which conditions are
they successful? Our benchmark paper is an attempt to answer these questions. We
specifically focus on the continuous GWOT as the most interesting and debatable
setup. We crash-test existing continuous GWOT approaches in different
scenarios, carefully record and analyze the obtained results, and identify
issues. Our findings experimentally demonstrate that the scientific community is
still missing a reliable continuous GWOT solver, which necessitates further
research efforts. As the first step in this direction, we propose a new
continuous GWOT method which does not rely on discrete techniques and partially
solves some of the problems of the competitors. Our code is available at
https://github.com/Ark-130994/GW-Solvers
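To make the discrete variant concrete, the snippet below computes a Gromov-Wasserstein coupling between point clouds living in spaces of different dimension using the POT library; as the paper argues, continuous solvers still tend to lean on discrete machinery of this kind. Shapes and data are toy assumptions.

```python
import numpy as np
import ot  # POT library: pip install pot

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # point cloud in a 3-D space
Y = rng.normal(size=(60, 5))   # point cloud in a different, 5-D space

# GW compares intra-space distance matrices, so no cross-space cost is needed.
C1 = ot.dist(X, X, metric='euclidean')
C2 = ot.dist(Y, Y, metric='euclidean')
p = np.full(50, 1 / 50)        # uniform weights on each cloud
q = np.full(60, 1 / 60)

# Discrete GW coupling: approximately the "most isometric" soft assignment.
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')
print(T.shape)                 # (50, 60)
```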
Preference-based Representation Learning for Collections
In this thesis, I make contributions to the development of representation learning in the setting of external constraints and noisy supervision. A setting of external constraints refers to the scenario in which the learner must output a latent representation of the given data points while enforcing particular conditions. These conditions can be geometric constraints, for example forcing the embedding vectors to be close to each other based on particular relations, or forcing them to lie on a particular manifold, such as the manifold of vectors whose elements sum to 1, or even more complex constraints. The objects of interest in this thesis are elements of a collection X in an abstract space endowed with a similarity function that quantifies how similar two objects are. A collection is defined as a set of items in which the order is ignored but the multiplicity is relevant. Various types of collections are used as inputs or outputs in machine learning; the most common are perhaps sequences and sets.
Besides studying representation learning approaches in the presence of external constraints, this thesis tackles the case in which the similarity function cannot be evaluated directly. In recent years, the machine learning setting in which only binary answers to comparisons between tuples of elements are available has gained interest. Learning good representations in a scenario where clear distance information cannot be obtained is of fundamental importance; this problem is the opposite of the standard machine learning setting, where the similarity function between elements can be evaluated directly. Moreover, we tackle the case in which the learner is given noisy supervision signals, with a certain probability for each label to be incorrect. Another research question studied in this thesis is how to assess the quality of the learned representations and how a learner can convey its uncertainty about them.
After the introductory Chapter 1, the thesis is structured in three main parts. In the first part, I present the results of representation learning based on data points that are sequences, focusing on sentences and permutations, two particular types of sequences. The first contribution of this part consists in enforcing analogical relations between sentences; the second is learning appropriate representations for permutations, which are particular mathematical objects, using neural networks. The second part of the thesis tackles the question of learning perceptual embeddings from binary and noisy comparisons; in machine learning, this is referred to as the ordinal embedding problem. This part contains two chapters that elaborate on two different aspects of the problem: appropriately conveying the uncertainty of the representation, and learning the embeddings from aggregated and noisy feedback. Finally, the third part of the thesis contains applications of the findings of the previous part, namely unsupervised alignment of clouds of embedding vectors and entity set extension.
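A minimal sketch of the ordinal embedding problem from the second part: recover point positions purely from noisy binary triplet answers ("is i closer to j than to k?") using a hinge loss and stochastic gradient steps. The oracle, loss, and step sizes below are illustrative assumptions, not the thesis's algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 2
Z = rng.normal(size=(n, d))                  # hidden ground-truth points

def answer(i, j, k, noise=0.1):
    """Noisy comparison oracle: True if i is closer to j than to k."""
    truth = np.sum((Z[i] - Z[j])**2) < np.sum((Z[i] - Z[k])**2)
    return truth ^ (rng.random() < noise)    # flip the label with prob `noise`

triplets = [tuple(rng.choice(n, 3, replace=False)) for _ in range(3000)]
labels = [answer(*t) for t in triplets]

X = 0.01 * rng.normal(size=(n, d))           # learned embedding
lr, margin = 0.01, 0.1
for epoch in range(50):
    for (i, j, k), closer in zip(triplets, labels):
        if not closer:                       # orient so j is the near item
            j, k = k, j
        violation = margin + np.sum((X[i] - X[j])**2) - np.sum((X[i] - X[k])**2)
        if violation > 0:                    # hinge active: take a step
            gi = 2 * (X[k] - X[j])           # gradient of the active hinge term
            gj = 2 * (X[j] - X[i])
            gk = 2 * (X[i] - X[k])
            X[i], X[j], X[k] = X[i] - lr * gi, X[j] - lr * gj, X[k] - lr * gk
```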
Characterizing Similarities and Divergences in Conversational Tones in Humans and LLMs by Sampling with People
Conversational tones -- the manners and attitudes in which speakers
communicate -- are essential to effective communication. Amidst the increasing
popularization of Large Language Models (LLMs) over recent years, it becomes
necessary to characterize the divergences in their conversational tones
relative to humans. However, existing investigations of conversational
modalities rely on pre-existing taxonomies or text corpora, which suffer from
experimenter bias and may not be representative of real-world distributions for
the studies' psycholinguistic domains. Inspired by methods from cognitive
science, we propose an iterative method for simultaneously eliciting
conversational tones and sentences, where participants alternate between two
tasks: (1) one participant identifies the tone of a given sentence and (2) a
different participant generates a sentence based on that tone. We run 100
iterations of this process with human participants and GPT-4, then obtain a
dataset of sentences and frequent conversational tones. In an additional
experiment, humans and GPT-4 annotated all sentences with all tones. With data
from 1,339 human participants, 33,370 human judgments, and 29,900 GPT-4
queries, we show how our approach can be used to create an interpretable
geometric representation of relations between conversational tones in humans
and GPT-4. This work demonstrates how combining ideas from machine learning and
cognitive science can address challenges in human-computer interactions.
Comment: Accepted to the Main Conference at ACL 2024.
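Schematically, the elicitation procedure is a chain that alternates the two tasks. The sketch below makes the loop explicit, with hypothetical identify_tone and generate_sentence callables standing in for the human or GPT-4 participants; in the actual study, consecutive tasks are routed to different participants.

```python
def elicitation_chain(seed_sentence, identify_tone, generate_sentence, n_iters=100):
    """Alternate tone identification and sentence generation, collecting samples."""
    sentences, tones = [seed_sentence], []
    for _ in range(n_iters):
        tone = identify_tone(sentences[-1])   # task 1: label the last sentence's tone
        sentence = generate_sentence(tone)    # task 2: write a sentence in that tone
        tones.append(tone)
        sentences.append(sentence)
    return sentences, tones
```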
