A Sample Complexity Separation between Non-Convex and Convex Meta-Learning
One popular trend in meta-learning is to learn from many training tasks a
common initialization for a gradient-based method that can be used to solve a
new task with few samples. The theory of meta-learning is still in its early
stages, with several recent learning-theoretic analyses of methods such as
Reptile [Nichol et al., 2018] being for convex models. This work shows that
convex-case analysis might be insufficient to understand the success of
meta-learning, and that even for non-convex models it is important to look
inside the optimization black-box, specifically at properties of the
optimization trajectory. We construct a simple meta-learning instance that
captures the problem of one-dimensional subspace learning. For the convex
formulation of linear regression on this instance, we show that the new task
sample complexity of any initialization-based meta-learning algorithm is
Ω(d), where d is the input dimension. In contrast, for the non-convex
formulation of a two-layer linear network on the same instance, we show that
both Reptile and multi-task representation learning can have new task sample
complexity of O(1), demonstrating a separation from convex
meta-learning. Crucially, analyses of the training dynamics of these methods
reveal that they can meta-learn the correct subspace onto which the data should
be projected.
Comment: 34 pages
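To make the setup concrete, here is a minimal numpy sketch of an initialization-based meta-learning loop in the style of Reptile, run on a toy task distribution where every task's regression parameter lies on a shared one-dimensional subspace. The task distribution, step sizes, and variable names are illustrative assumptions, not the paper's exact construction or its analyzed dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
d, inner_steps, inner_lr, outer_lr = 20, 5, 0.1, 0.5

def sample_task(n=10):
    """Toy task: linear regression whose optimum lies on a shared 1-D subspace."""
    u = np.zeros(d)
    u[0] = 1.0                                # shared direction (unknown to the learner)
    w_star = rng.normal() * u                 # task-specific parameter on that subspace
    X = rng.normal(size=(n, d))
    y = X @ w_star + 0.01 * rng.normal(size=n)
    return X, y

def adapt(w0, X, y):
    """Inner loop: a few gradient steps of least squares from the meta-initialization."""
    w = w0.copy()
    for _ in range(inner_steps):
        w -= inner_lr * X.T @ (X @ w - y) / len(y)
    return w

# Outer loop (Reptile-style): nudge the initialization toward each task's adapted parameters.
w_init = rng.normal(size=d)
for _ in range(2000):
    X, y = sample_task()
    w_init += outer_lr * (adapt(w_init, X, y) - w_init)

print("learned initialization (first coordinates):", np.round(w_init[:5], 3))
```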
Provable Representation Learning for Imitation Learning via Bi-level Optimization
A common strategy in modern learning systems is to learn a representation
that is useful for many tasks, a.k.a. representation learning. We study this
strategy in the imitation learning setting for Markov decision processes (MDPs)
where multiple experts' trajectories are available. We formulate representation
learning as a bi-level optimization problem where the "outer" optimization
tries to learn the joint representation and the "inner" optimization encodes
the imitation learning setup and tries to learn task-specific parameters. We
instantiate this framework for the imitation learning settings of behavior
cloning and observation-alone. Theoretically, we show using our framework that
representation learning can provide sample complexity benefits for imitation
learning in both settings. We also provide proof-of-concept experiments to
verify our theory.
Comment: 26 pages
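A rough picture of the bi-level structure described above, sketched in numpy with a linear representation and behavior cloning as the inner problem: the inner step fits a task-specific head for each expert on top of the shared features, and the outer step takes a first-order gradient step on the representation (ignoring the hypergradient through the inner solve). The toy demonstrations and all names are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, rep_dim, n_tasks, n_demos = 10, 2, 5, 50

# Toy expert demonstrations: each expert's action is linear in a shared 2-D feature of the state.
W_true = rng.normal(size=(rep_dim, state_dim))
tasks = []
for _ in range(n_tasks):
    theta_true = rng.normal(size=rep_dim)          # task-specific expert head
    S = rng.normal(size=(n_demos, state_dim))      # visited states
    A = S @ W_true.T @ theta_true                  # expert actions (cloning targets)
    tasks.append((S, A))

W = rng.normal(size=(rep_dim, state_dim))          # shared representation (outer variable)
lr = 0.01
for _ in range(500):
    grad = np.zeros_like(W)
    for S, A in tasks:
        Z = S @ W.T                                # features under the current representation
        theta, *_ = np.linalg.lstsq(Z, A, rcond=None)   # inner step: behavior-cloning head
        resid = Z @ theta - A
        grad += np.outer(theta, resid @ S) / len(A)     # outer gradient of the cloning loss
    W -= lr * grad / n_tasks

# Evaluate the shared representation by re-fitting heads and measuring cloning error.
final_loss = 0.0
for S, A in tasks:
    Z = S @ W.T
    theta, *_ = np.linalg.lstsq(Z, A, rcond=None)
    final_loss += np.mean((Z @ theta - A) ** 2) / n_tasks
print("average behavior-cloning loss with the learned representation:", final_loss)
```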
A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors
Motivations like domain adaptation, transfer learning, and feature learning
have fueled interest in inducing embeddings for rare or unseen words, n-grams,
synsets, and other textual features. This paper introduces a la carte
embedding, a simple and general alternative to the usual word2vec-based
approaches for building such representations that is based upon recent
theoretical results for GloVe-like embeddings. Our method relies mainly on a
linear transformation that is efficiently learnable using pretrained word
vectors and linear regression. This transform is applicable on the fly in the
future when a new text feature or rare word is encountered, even if only a
single usage example is available. We introduce a new dataset showing how the a
la carte method requires fewer examples of words in context to learn
high-quality embeddings and we obtain state-of-the-art results on a nonce task
and some unsupervised document classification tasks.
Comment: 11 pages, 2 figures. To appear in ACL 2018
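The core of the method, as the abstract describes it, is a single linear transform learned by regression from pretrained word vectors: regress each word's vector onto the average of its context vectors, then apply the learned transform to the context average of any new or rare feature. Below is a minimal numpy sketch, with a placeholder corpus and a random stand-in for pretrained GloVe/word2vec vectors, so the learned transform here is meaningless and only the procedure is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, window = 1000, 50, 5

V = rng.normal(size=(vocab_size, dim))                       # stand-in for pretrained word vectors
corpus = [rng.integers(0, vocab_size, size=20) for _ in range(200)]   # placeholder tokenized corpus

def context_average(word, corpus, V):
    """Average the pretrained vectors of all words co-occurring with `word` in a window."""
    ctx = []
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == word:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                ctx.extend(V[t] for j, t in enumerate(sent[lo:hi]) if lo + j != i)
    return np.mean(ctx, axis=0) if ctx else np.zeros(dim)

# Step 1: learn the linear transform A by regressing each word's pretrained vector
# onto the average of its context vectors (least squares over the vocabulary).
X = np.stack([context_average(w, corpus, V) for w in range(vocab_size)])
A, *_ = np.linalg.lstsq(X, V, rcond=None)                    # maps context averages to word vectors

# Step 2: induce an embedding for a new feature on the fly from a single usage example.
new_context = rng.integers(0, vocab_size, size=2 * window)   # context tokens around the new feature
v_new = np.mean(V[new_context], axis=0) @ A                  # a la carte embedding of the unseen feature
```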
A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs
Low-dimensional vector embeddings, computed using LSTMs or simpler techniques, are a popular approach for capturing the “meaning” of text and a form of unsupervised learning useful for downstream tasks. However, their power is not theoretically understood. The current paper derives a formal understanding by looking at the subcase of linear embedding schemes. Using the theory of compressed sensing, we show that representations combining the constituent word vectors are essentially information-preserving linear measurements of Bag-of-n-Grams (BonG) representations of text. This leads to a new theoretical result about LSTMs: low-dimensional embeddings derived from a low-memory LSTM are provably at least as powerful on classification tasks, up to small error, as a linear classifier over BonG vectors, a result that extensive empirical work has thus far been unable to show. Our experiments support these theoretical findings and establish strong, simple, and unsupervised baselines on standard benchmarks that in some cases are state of the art among word-level methods. We also show a surprising new property of embeddings such as GloVe and word2vec: they form a good sensing matrix for text that is more efficient than random matrices, the standard sparse recovery tool, which may explain why they lead to better representations in practice.
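The linear-measurement observation at the heart of the argument is easy to verify in the unigram (bag-of-words) case: summing the word vectors of a document is exactly the product of the transposed word-embedding matrix with the document's sparse count vector, i.e. a linear measurement of its BonG representation. A small numpy check, with a random stand-in for pretrained vectors (the paper's result covers n-grams and LSTM-derived embeddings, which this sketch does not):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5000, 300

V = rng.normal(size=(vocab_size, dim))          # stand-in for pretrained word vectors
doc = rng.integers(0, vocab_size, size=40)      # a document as a sequence of token ids

# Bag-of-words (the n = 1 case of Bag-of-n-Grams): sparse count vector over the vocabulary.
x = np.zeros(vocab_size)
np.add.at(x, doc, 1.0)

# The additive document embedding equals the linear measurement V^T x of that sparse vector.
emb_sum = V[doc].sum(axis=0)
emb_measure = V.T @ x
print(np.allclose(emb_sum, emb_measure))        # True: the embedding is a linear measurement of x
```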