Advances in Neural Information Processing Systems
Better understanding of the potential benefits of information transfer and representation learning is an important step towards the goal of building intelligent systems that are able to persist in the world and learn over time. In this work, we consider a setting where the learner encounters a stream of tasks but is able to retain only limited information from each encountered task, such as a learned predictor. In contrast to most previous works analyzing this scenario, we do not make any distributional assumptions on the task generating process. Instead, we formulate a complexity measure that captures the diversity of the observed tasks. We provide a lifelong learning algorithm with error guarantees for every observed task (rather than on average). We show sample complexity reductions, in terms of our task complexity measure, in comparison to solving every task in isolation. Further, our algorithmic framework can naturally be viewed as learning a representation from encountered tasks with a neural network.
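As a rough illustration of the retain-only-a-predictor setting, the sketch below fits a task-specific linear head on top of a fixed shared feature map, so that only the head needs to be kept once a task has been processed. The feature map (a tanh of a random linear map), the ridge loss, and all names are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def fit_task_predictor(X, y, rep, lam=1e-3):
    """Ridge-fit a task-specific linear head on shared features tanh(X @ rep).
    Only this head needs to be retained after the task is processed."""
    Z = np.tanh(X @ rep)                       # shared (fixed) representation
    k = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ y)

def predict(X, rep, head):
    """Predict on new data using a retained head and the shared representation."""
    return np.tanh(X @ rep) @ head
```

In a lifelong stream, `rep` would itself be improved from task to task; here it is held fixed only to keep the memory-footprint point visible.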
IST Austria Thesis
Traditionally, machine learning has focused on the problem of solving a single
task in isolation. While quite well understood, this approach disregards an
important aspect of human learning: when facing a new problem, humans are able to
exploit knowledge acquired from previously learned tasks. Intuitively, access to several
problems simultaneously or sequentially could also be advantageous for a machine
learning system, especially if these tasks are closely related. Indeed, results of many
empirical studies have provided justification for this intuition. However, theoretical
justifications of this idea are rather limited.
The focus of this thesis is to expand the understanding of potential benefits of information
transfer between several related learning problems. We provide theoretical
analysis for three scenarios of multi-task learning: multiple kernel learning, sequential
learning and active task selection. We also provide a PAC-Bayesian perspective on
lifelong learning and investigate how the task generation process influences the generalization
guarantees in this scenario. In addition, we show how some of the obtained
theoretical results can be used to derive principled multi-task and lifelong learning
algorithms and illustrate their performance on various synthetic and real-world datasets.
Learning-to-Learn Stochastic Gradient Descent with Biased Regularization
We study the problem of learning-to-learn: inferring a learning algorithm
that works well on tasks sampled from an unknown distribution. As the class of
algorithms we consider Stochastic Gradient Descent on the true risk regularized
by the squared Euclidean distance to a bias vector. We present an average excess
risk bound for such a learning algorithm. This result quantifies the potential
benefit of using a bias vector with respect to the unbiased case. We then
address the problem of estimating the bias from a sequence of tasks. We propose
a meta-algorithm which incrementally updates the bias as new tasks are
observed. The low space and time complexity of this approach makes it appealing
in practice. We provide guarantees on the learning ability of the
meta-algorithm. A key feature of our results is that, when the number of tasks
grows and their variance is relatively small, our learning-to-learn approach
has a significant advantage over learning each task in isolation by Stochastic
Gradient Descent without a bias term. We report on numerical experiments which
demonstrate the effectiveness of our approach. (Comment: 37 pages, 8 figures.)
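A minimal sketch of the two-level scheme described above: within-task SGD on a risk regularized by the squared Euclidean distance to a bias vector, and a meta-level that incrementally updates the bias as tasks arrive. The least-squares loss and the running-mean meta-update are illustrative assumptions, not the paper's exact meta-algorithm.

```python
import numpy as np

def sgd_biased(X, y, bias, lam=1.0, lr=0.1, epochs=5):
    """SGD on a least-squares risk plus (lam/2)*||w - bias||^2
    (the loss choice is illustrative; the bias term is the key ingredient)."""
    w = bias.copy()
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i] + lam * (w - bias)
            w -= lr * grad
    return w

def meta_learn(tasks, dim, lam=1.0):
    """Incrementally average the per-task solutions into the bias
    (a simple running-mean estimator, assumed here for illustration)."""
    bias = np.zeros(dim)
    for t, (X, y) in enumerate(tasks, start=1):
        w = sgd_biased(X, y, bias, lam)
        bias += (w - bias) / t   # running mean of task solutions
    return bias
```

When the tasks cluster around a common vector, the learned bias drifts toward that vector, which is exactly the regime in which the abstract's guarantees beat learning each task in isolation.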
A Gang of Adversarial Bandits
We consider running multiple instances of multi-armed bandit (MAB) problems in parallel. A main motivation for this study is online recommendation systems, in which each of N users is associated with a MAB problem and the goal is to exploit users' similarity in order to learn users' preferences over K items more efficiently. We consider the adversarial MAB setting, whereby an adversary is free to choose which user and which loss to present to the learner during the learning process. Users are in a social network and the learner is aided by a priori knowledge of the strengths of the social links between all pairs of users. It is assumed that if the social link between two users is strong then they tend to share the same action. The regret is measured relative to an arbitrary function which maps users to actions. The smoothness of the function is captured by a resistance-based dispersion measure Ψ. We present two learning algorithms, GABA-I and GABA-II, which exploit the network structure to bias towards functions of low Ψ values. We show that GABA-I has an expected regret bound of O(√(ln(NK/Ψ)ΨKT)) and per-trial time complexity of O(K ln(N)), whilst GABA-II has a weaker O(√(ln(N/Ψ) ln(NK/Ψ)ΨKT)) regret bound, but a better O(ln(K) ln(N)) per-trial time complexity. We highlight improvements of both algorithms over running independent standard MABs across users.
Meta-learning with Stochastic Linear Bandits
We investigate meta-learning procedures in the setting of stochastic linear
bandit tasks. The goal is to select a learning algorithm which works well on
average over a class of bandit tasks that are sampled from a
task-distribution. Inspired by recent work on learning-to-learn linear
regression, we consider a class of bandit algorithms that implement a
regularized version of the well-known OFUL algorithm, where the regularization
is a squared Euclidean distance to a bias vector. We first study the benefit of
the biased OFUL algorithm in terms of regret minimization. We then propose two
strategies to estimate the bias within the learning-to-learn setting. We show,
both theoretically and experimentally, that when the number of tasks grows and
the variance of the task-distribution is small, our strategies have a
significant advantage over learning the tasks in isolation.
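The biased regularization shared by this abstract and the previous one has a simple closed form for the underlying ridge-regression estimate, which the sketch below works out; the value of λ and the data are placeholder choices, and this is the standard regularized least-squares computation rather than the full bandit algorithm.

```python
import numpy as np

def biased_ridge(X, y, bias, lam=1.0):
    """Minimize ||X w - y||^2 + lam * ||w - bias||^2.
    Setting the gradient to zero gives (X^T X + lam I) w = X^T y + lam * bias."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y + lam * bias)
```

With few samples per task, an accurate bias effectively substitutes for missing data: the estimate is pulled toward the bias instead of toward zero, which is the mechanism behind the regret improvement the abstract describes.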
Online Parameter-Free Learning of Multiple Low Variance Tasks
We propose a method to learn a common bias vector for a growing sequence of
low-variance tasks. Unlike state-of-the-art approaches, our method does not
require tuning any hyper-parameter. Our approach is presented in the
non-statistical setting and comes in two variants. The "aggressive" one
updates the bias after each datapoint, while the "lazy" one updates the bias only at
the end of each task. We derive an across-tasks regret bound for the method.
When compared to state-of-the-art approaches, the aggressive variant attains
faster rates, while the lazy one recovers standard rates without the need to
tune hyper-parameters. We then adapt the methods to the statistical setting: the
aggressive variant becomes a multi-task learning method, the lazy one a
meta-learning method. Experiments confirm the effectiveness of our methods in
practice.
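The aggressive/lazy distinction is only about when the bias moves, which the toy sketch below makes concrete: an online within-task learner, with the bias updated either after every datapoint or only at task boundaries. The running-average bias update and the squared loss are illustrative stand-ins, not the paper's parameter-free rule.

```python
import numpy as np

def learn_bias(tasks, mode="lazy", lr=0.1):
    """Toy sketch of the two update schedules: 'aggressive' moves the bias
    after every datapoint, 'lazy' only at the end of each task."""
    d = tasks[0][0].shape[1]
    bias = np.zeros(d)
    count = 0
    for X, y in tasks:
        w = bias.copy()                            # warm-start from the bias
        for i in range(len(y)):
            w -= lr * (X[i] @ w - y[i]) * X[i]     # within-task online step
            if mode == "aggressive":
                count += 1
                bias += (w - bias) / count         # per-datapoint bias update
        if mode == "lazy":
            count += 1
            bias += (w - bias) / count             # per-task bias update
    return bias
```

On a low-variance sequence of tasks both schedules home in on the common vector; they differ in how quickly the very first tasks benefit.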
On the Sample Complexity of Representation Learning in Multi-task Bandits with Global and Local structure
We investigate the sample complexity of learning the optimal arm for
multi-task bandit problems. Arms consist of two components: one that is shared
across tasks (that we call representation) and one that is task-specific (that
we call predictor). The objective is to learn the optimal (representation,
predictor)-pair for each task, under the assumption that the optimal
representation is common to all tasks. Within this framework, efficient
learning algorithms should transfer knowledge across tasks. We consider the
best-arm identification problem for a fixed confidence, where, in each round,
the learner actively selects both a task, and an arm, and observes the
corresponding reward. We derive instance-specific sample complexity lower
bounds satisfied by any (δ_G, δ_H)-PAC algorithm (such an algorithm
identifies the best representation with probability at least 1 − δ_G, and
the best predictor for a task with probability at least 1 − δ_H). We
devise an algorithm OSRL-SC whose sample complexity approaches the lower bound,
and scales at most as K(G + H), with K, G and H
being, respectively, the number of tasks, representations and predictors. By
comparison, this scaling is significantly better than the classical best-arm
identification algorithm that scales as KGH. (Comment: Accepted at the
Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI-23.)
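To get a feel for the gap between the two scalings, the snippet below compares the leading factors, writing K, G and H for the number of tasks, representations and predictors; the symbol names and the example sizes are our own choices.

```python
def scaling_gap(K, G, H):
    """Leading factors: K*G*H for per-task best-arm identification over all
    (representation, predictor) pairs, versus K*(G + H) when the shared
    representation is identified once across tasks."""
    return K * G * H, K * (G + H)

# e.g. 20 tasks, 50 candidate representations, 50 candidate predictors
classical, shared = scaling_gap(20, 50, 50)
print(classical, shared)  # 50000 2000 -> a 25x reduction in the leading factor
```

The gap widens with G and H, since the product G*H is replaced by the sum G + H.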
Efficient Lifelong Learning Algorithms: Regret Bounds and Statistical Guarantees
We study the Meta-Learning paradigm where the goal is to select an algorithm in a prescribed family – usually denoted as inner or within-task algorithm – that is appropriate to address a class of learning problems (tasks) sharing specific similarities. More precisely, we aim at designing a procedure, called meta-algorithm, that is able to infer this tasks' relatedness from a sequence of observed tasks and to exploit such knowledge in order to return a within-task algorithm in the class that is best suited to solve a new similar task. We are interested in the online Meta-Learning setting, also known as Lifelong Learning. In this scenario the meta-algorithm receives the tasks sequentially and incrementally adapts the inner algorithm on the fly as the tasks arrive. In particular, we refer to the framework in which also the within-task data are processed sequentially by the inner algorithm as Online-Within-Online (OWO) Meta-Learning, while we use the term Online-Within-Batch (OWB) Meta-Learning to denote the setting in which the within-task data are processed in a single batch. In this work we propose an OWO Meta-Learning method based on primal-dual Online Learning. Our method is theoretically grounded and able to cover various types of tasks' relatedness and learning algorithms. More precisely, we focus on the family of inner algorithms given by a parametrized variant of Follow The Regularized Leader (FTRL) aiming at minimizing the within-task regularized empirical risk. The inner algorithm in this class is incrementally adapted by a FTRL meta-algorithm using the within-task minimum regularized empirical risk as the meta-loss. In order to keep the process fully online, we use the online inner algorithm to approximate the subgradients used by the meta-algorithm, and we show how to exploit an upper bound on this approximation error in order to derive a cumulative error bound for the proposed method.
Our analysis can be adapted to the statistical setting by two nested online-to-batch conversion steps. We also show how the proposed OWO method can provide statistical guarantees comparable to those of its natural, more expensive OWB variant, where the inner online algorithm is substituted by the batch minimizer of the regularized empirical risk. Finally, we apply our method to two important families of learning algorithms parametrized by a bias vector or a linear feature map.
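For linear losses and a quadratic regularizer centered at a bias vector, FTRL admits a one-line closed form, which the sketch below implements; this is a standard illustrative instance of the parametrized FTRL family mentioned above, not the paper's exact inner algorithm.

```python
import numpy as np

def ftrl_biased(grads, bias, lam=1.0):
    """FTRL iterates for linear losses with regularizer (lam/2)*||w - bias||^2:
    w_t = argmin_w <g_1 + ... + g_{t-1}, w> + (lam/2)*||w - bias||^2
        = bias - (g_1 + ... + g_{t-1}) / lam."""
    g_sum = np.zeros_like(bias)
    iterates = []
    for g in grads:
        iterates.append(bias - g_sum / lam)  # play before observing g
        g_sum += g
    return iterates
```

Note that before any gradients accumulate, the first iterate is exactly the bias, which is how a bias adapted by the meta-algorithm transfers warm-starts across tasks.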