10 research outputs found

    A Theoretical Analysis of Contrastive Unsupervised Representation Learning

    Full text link
    Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging the availability of pairs of semantically "similar" data points and "negative samples," the learner forces the inner product of representations of similar pairs with each other to be higher on average than with negative samples. The current paper uses the term contrastive learning for such algorithms and presents a theoretical framework for analyzing them by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. This framework allows us to show provable guarantees on the performance of the learned representations on the average classification task comprised of a subset of the same latent classes. Our generalization bound also shows that learned representations can reduce (labeled) sample complexity on downstream tasks. We conduct controlled experiments in both the text and image domains to support the theory. Comment: 19 pages, 5 figures.
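
    To make the contrastive objective described above concrete, here is a minimal PyTorch sketch of a softmax-based contrastive loss that pushes the inner product of an anchor with its "similar" point above its inner products with negative samples. This is an illustrative instantiation, not necessarily the exact loss analyzed in the paper; the encoder, batch sizes, and number of negatives are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives):
    """Softmax-style contrastive loss: the anchor's inner product with its
    positive (similar) point should beat its inner products with negatives.

    anchor, positive: (batch, dim) representations of semantically similar pairs
    negatives:        (batch, k, dim) representations of k negative samples each
    """
    pos_score = (anchor * positive).sum(dim=-1, keepdim=True)   # (batch, 1)
    neg_score = torch.einsum("bd,bkd->bk", anchor, negatives)   # (batch, k)
    logits = torch.cat([pos_score, neg_score], dim=1)           # (batch, 1 + k)
    labels = torch.zeros(logits.size(0), dtype=torch.long)      # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy usage with a stand-in linear encoder and random data
f = torch.nn.Linear(32, 16)
x, x_pos = torch.randn(8, 32), torch.randn(8, 32)
x_neg = torch.randn(8, 5, 32)
loss = contrastive_loss(f(x), f(x_pos), f(x_neg))
loss.backward()
```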

    A Sample Complexity Separation between Non-Convex and Convex Meta-Learning

    Full text link
    One popular trend in meta-learning is to learn from many training tasks a common initialization for a gradient-based method that can be used to solve a new task with few samples. The theory of meta-learning is still in its early stages, with several recent learning-theoretic analyses of methods such as Reptile [Nichol et al., 2018] being restricted to convex models. This work shows that convex-case analysis might be insufficient to understand the success of meta-learning, and that even for non-convex models it is important to look inside the optimization black box, specifically at properties of the optimization trajectory. We construct a simple meta-learning instance that captures the problem of one-dimensional subspace learning. For the convex formulation of linear regression on this instance, we show that the new-task sample complexity of any initialization-based meta-learning algorithm is Ω(d), where d is the input dimension. In contrast, for the non-convex formulation of a two-layer linear network on the same instance, we show that both Reptile and multi-task representation learning can have new-task sample complexity of O(1), demonstrating a separation from convex meta-learning. Crucially, analyses of the training dynamics of these methods reveal that they can meta-learn the correct subspace onto which the data should be projected. Comment: 34 pages.
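
    The sketch below illustrates the kind of setup the abstract describes: Reptile meta-training of a two-layer linear network on regression tasks whose ground-truth weights all lie on a shared one-dimensional subspace. The task distribution, network parameterization, and hyperparameters here are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, inner_steps, inner_lr, outer_lr = 20, 5, 0.1, 0.5
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                        # shared 1-D subspace, unknown to the learner

# Two-layer linear network f(x) = a * (w @ x); the meta-learned initialization is (w, a)
w, a = rng.standard_normal(d) * 0.1, 0.1

def sample_task(n=32):
    """A linear-regression task whose ground-truth weight vector lies on the shared subspace."""
    c = rng.standard_normal()                 # task-specific coefficient
    X = rng.standard_normal((n, d))
    return X, X @ (c * u)

for _ in range(2000):                         # Reptile meta-training loop
    X, y = sample_task()
    w_t, a_t = w.copy(), a
    for _ in range(inner_steps):              # inner loop: plain gradient descent on squared loss
        pred = a_t * (X @ w_t)
        g = 2.0 * (pred - y) / len(y)         # d(loss)/d(pred)
        grad_w = a_t * (X.T @ g)
        grad_a = g @ (X @ w_t)
        w_t, a_t = w_t - inner_lr * grad_w, a_t - inner_lr * grad_a
    # Reptile outer update: move the initialization toward the task-adapted parameters
    w += outer_lr * (w_t - w)
    a += outer_lr * (a_t - a)

print("alignment of learned initialization with the true subspace:",
      abs(w @ u) / np.linalg.norm(w))
```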

    New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound

    Full text link
    Saliency methods compute heat maps that highlight portions of an input that were most important for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new masked input by retaining the k highest-ranked pixels of the original input and replacing the rest with "uninformative" pixels, then checking whether the net's output is mostly unchanged. This is usually seen as an explanation of the output, but the current paper highlights reasons why this inference of causality may be suspect. Inspired by the logic concepts of completeness & soundness, it observes that the above type of evaluation focuses on completeness of the explanation but ignores soundness. New evaluation metrics are introduced to capture both notions while staying in an intrinsic framework -- i.e., using the dataset and the net, but no separately trained nets, human evaluations, etc. A simple saliency method is described that matches or outperforms prior methods in the evaluations. Experiments also suggest new intrinsic justifications, based on soundness, for popular heuristic tricks such as TV regularization and upsampling. Comment: NeurIPS 2022 (Oral).
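
    The masking evaluation described in the abstract (keep the top-k pixels, replace the rest, check whether the prediction survives) can be sketched as below. This is a generic completeness-style check under assumed shapes and a constant fill value, not the paper's new soundness metrics or its saliency method.

```python
import torch

def masked_input(x, saliency, k, fill_value=0.0):
    """Keep the k highest-saliency pixels of x and replace the rest with an
    'uninformative' value (a constant here; a dataset mean is another common choice)."""
    flat = saliency.flatten()
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[flat.topk(k).indices] = True
    mask = mask.view_as(saliency)                       # (H, W) pixel mask
    return torch.where(mask, x, torch.full_like(x, fill_value))

def completeness_check(net, x, saliency, k):
    """Completeness-style evaluation: does the predicted label survive masking?"""
    with torch.no_grad():
        orig = net(x.unsqueeze(0)).argmax(dim=1)
        masked = net(masked_input(x, saliency, k).unsqueeze(0)).argmax(dim=1)
    return bool((orig == masked).item())

# Toy usage with a random stand-in net, image, and heat map
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.randn(3, 32, 32)
saliency = torch.rand(32, 32)                           # stand-in heat map, e.g. from gradients
print(completeness_check(net, x, saliency, k=200))
```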

    Provable Representation Learning for Imitation Learning via Bi-level Optimization

    Full text link
    A common strategy in modern learning systems is to learn a representation that is useful for many tasks, a.k.a. representation learning. We study this strategy in the imitation learning setting for Markov decision processes (MDPs) where multiple experts' trajectories are available. We formulate representation learning as a bi-level optimization problem where the "outer" optimization tries to learn the joint representation and the "inner" optimization encodes the imitation learning setup and tries to learn task-specific parameters. We instantiate this framework for the imitation learning settings of behavior cloning and observation-alone imitation. Theoretically, we show using our framework that representation learning can provide sample complexity benefits for imitation learning in both settings. We also provide proof-of-concept experiments to verify our theory. Comment: 26 pages.
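
    As a rough illustration of the bi-level structure in the behavior-cloning setting, the sketch below alternates an inner loop that fits task-specific policy heads for each expert with an outer update on a shared representation. It is a first-order sketch that does not differentiate through the inner solver; the architecture, data, and optimizers are hypothetical and not the paper's algorithm.

```python
import torch
import torch.nn.functional as F

state_dim, rep_dim, n_actions, n_experts = 8, 4, 3, 5
phi = torch.nn.Linear(state_dim, rep_dim)                                 # outer: shared representation
heads = [torch.nn.Linear(rep_dim, n_actions) for _ in range(n_experts)]   # inner: task-specific heads
outer_opt = torch.optim.Adam(phi.parameters(), lr=1e-3)

def behavior_cloning_loss(head, states, actions):
    """Inner objective: cross-entropy between the head's policy and the expert's actions."""
    return F.cross_entropy(head(phi(states)), actions)

# Toy expert trajectories: random states and actions for each expert
data = [(torch.randn(64, state_dim), torch.randint(0, n_actions, (64,)))
        for _ in range(n_experts)]

for step in range(200):
    outer_loss = 0.0
    for head, (states, actions) in zip(heads, data):
        # Inner optimization: a few gradient steps on the task-specific head only
        inner_opt = torch.optim.SGD(head.parameters(), lr=0.1)
        for _ in range(5):
            inner_opt.zero_grad()
            behavior_cloning_loss(head, states, actions).backward()
            inner_opt.step()
        # Outer objective: imitation loss of the adapted head, driving the shared phi
        outer_loss = outer_loss + behavior_cloning_loss(head, states, actions)
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
```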

    A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors

    Full text link
    Motivations like domain adaptation, transfer learning, and feature learning have fueled interest in inducing embeddings for rare or unseen words, n-grams, synsets, and other textual features. This paper introduces a la carte embedding, a simple and general alternative to the usual word2vec-based approaches for building such representations, based upon recent theoretical results for GloVe-like embeddings. Our method relies mainly on a linear transformation that is efficiently learnable using pretrained word vectors and linear regression. The transform can later be applied on the fly when a new text feature or rare word is encountered, even if only a single usage example is available. We introduce a new dataset showing how the a la carte method requires fewer examples of words in context to learn high-quality embeddings, and we obtain state-of-the-art results on a nonce task and some unsupervised document classification tasks. Comment: 11 pages, 2 figures, to appear in ACL 2018.
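
    A minimal sketch of the idea described above: fit a linear map from a word's average context vector to its pretrained embedding by (ridge) regression, then apply that map to the averaged context of any new or rare feature. The regularization strength and the random stand-in vectors are assumptions for illustration.

```python
import numpy as np

def learn_carte_transform(word_vectors, context_averages, reg=1e-1):
    """Fit the linear map A sending a word's average context vector to its pretrained
    embedding via ridge regression:  A^T = (X^T X + reg * I)^{-1} X^T V.

    word_vectors:     (n_words, dim) pretrained embeddings
    context_averages: (n_words, dim) average of pretrained vectors in each word's contexts
    """
    X, V = context_averages, word_vectors
    d = X.shape[1]
    A_T = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ V)   # (dim, dim)
    return A_T.T

def a_la_carte_embed(A, context_vectors):
    """Embed a new or rare feature from even a single usage: average its context
    vectors and apply the learned transform."""
    return A @ np.mean(context_vectors, axis=0)

# Toy usage with random stand-ins for pretrained vectors and context averages
rng = np.random.default_rng(0)
dim, n_words = 50, 1000
V = rng.standard_normal((n_words, dim))          # pretrained word vectors
X = rng.standard_normal((n_words, dim))          # average context vectors for the same words
A = learn_carte_transform(V, X)
new_feature_vec = a_la_carte_embed(A, rng.standard_normal((1, dim)))  # one usage example
```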

    Towards Understanding Self-Supervised Representation Learning

    No full text
    While supervised learning sparked the deep learning boom, it has some critical shortcomings: (1) it requires an abundance of expensive labeled data, and (2) it solves tasks from scratch rather than the human-like approach of leveraging knowledge and skills acquired from prior experiences. Pre-training has emerged as an effective alternative paradigm to overcome these shortcomings, whereby a model is first trained using easily acquirable data and later used to solve downstream tasks of interest with far fewer labeled examples than supervised learning. Pre-training using unlabeled data, a.k.a. self-supervised learning, has been especially revolutionary, with successes in diverse domains: text, vision, speech, etc. This raises an interesting and challenging question: why should pre-training on unlabeled data help with seemingly unrelated downstream tasks? In this thesis, we present works that initiate and build a theoretical framework to study why self-supervised learning is beneficial for downstream tasks. The framework is applied to methods like contrastive learning, auto-regressive language modeling, and self-prediction-based methods. Central to the framework is the idea that pre-training helps learn low-dimensional representations of data that subsequently help solve downstream tasks of interest with linear classifiers, requiring fewer labeled examples. A common theme is to formalize which properties of the unlabeled data distribution used to construct the self-supervised learning task are desirable. Under appropriate formalizations, it can be shown that approximately minimizing the right pre-training objectives can extract the downstream signal that is implicitly encoded in the unlabeled data distribution. Finally, it is shown that this signal can be decoded from the learned representations using linear classifiers, thus providing a formalization for the transference of “skills and knowledge” across tasks.
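
    The "decode with linear classifiers" protocol mentioned above is typically realized as a linear probe: freeze the pre-trained encoder and train only a linear classifier on a small labeled set. The sketch below uses a random stand-in encoder and data purely for illustration.

```python
import torch
import torch.nn.functional as F

def linear_probe(encoder, labeled_x, labeled_y, n_classes, steps=500, lr=0.1):
    """Train only a linear classifier on frozen pre-trained representations."""
    with torch.no_grad():
        feats = encoder(labeled_x)                 # frozen low-dimensional representations
    clf = torch.nn.Linear(feats.shape[1], n_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(clf(feats), labeled_y).backward()
        opt.step()
    return clf

# Toy usage: a random "pre-trained" encoder and a small labeled set
encoder = torch.nn.Sequential(torch.nn.Linear(64, 16), torch.nn.ReLU())
x, y = torch.randn(100, 64), torch.randint(0, 5, (100,))
clf = linear_probe(encoder, x, y, n_classes=5)
with torch.no_grad():
    acc = (clf(encoder(x)).argmax(dim=1) == y).float().mean()
```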