    Advances in Neural Information Processing Systems

    A better understanding of the potential benefits of information transfer and representation learning is an important step towards the goal of building intelligent systems that are able to persist in the world and learn over time. In this work, we consider a setting where the learner encounters a stream of tasks but is able to retain only limited information from each encountered task, such as a learned predictor. In contrast to most previous works analyzing this scenario, we do not make any distributional assumptions on the task generating process. Instead, we formulate a complexity measure that captures the diversity of the observed tasks. We provide a lifelong learning algorithm with error guarantees for every observed task (rather than on average). We show sample complexity reductions, in comparison to solving every task in isolation, in terms of our task complexity measure. Further, our algorithmic framework can naturally be viewed as learning a representation from the encountered tasks with a neural network.

    IST Austria Thesis

    Traditionally, machine learning has focused on solving a single task in isolation. While quite well understood, this approach disregards an important aspect of human learning: when facing a new problem, humans are able to exploit knowledge acquired from previously learned tasks. Intuitively, access to several problems, simultaneously or sequentially, could also be advantageous for a machine learning system, especially if these tasks are closely related. Indeed, many empirical studies have provided justification for this intuition. However, theoretical justifications of this idea remain rather limited. The focus of this thesis is to expand the understanding of the potential benefits of information transfer between several related learning problems. We provide theoretical analysis for three scenarios of multi-task learning: multiple kernel learning, sequential learning and active task selection. We also provide a PAC-Bayesian perspective on lifelong learning and investigate how the task generation process influences the generalization guarantees in this scenario. In addition, we show how some of the obtained theoretical results can be used to derive principled multi-task and lifelong learning algorithms, and illustrate their performance on various synthetic and real-world datasets.

    Learning-to-Learn Stochastic Gradient Descent with Biased Regularization

    We study the problem of learning-to-learn: inferring a learning algorithm that works well on tasks sampled from an unknown distribution. As the class of algorithms, we consider Stochastic Gradient Descent on the true risk regularized by the squared Euclidean distance to a bias vector. We present an average excess risk bound for such a learning algorithm. This result quantifies the potential benefit of using a bias vector with respect to the unbiased case. We then address the problem of estimating the bias from a sequence of tasks. We propose a meta-algorithm which incrementally updates the bias as new tasks are observed. The low space and time complexity of this approach makes it appealing in practice. We provide guarantees on the learning ability of the meta-algorithm. A key feature of our results is that, when the number of tasks grows and their variance is relatively small, our learning-to-learn approach has a significant advantage over learning each task in isolation by Stochastic Gradient Descent without a bias term. We report on numerical experiments which demonstrate the effectiveness of our approach.
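
    The within-task algorithm described above lends itself to a compact sketch. The following is a minimal, illustrative Python version assuming linear models with squared loss; the function names, the schematic meta-step toward the latest task solution, and all parameter choices are ours, not the paper's.

```python
import numpy as np

def biased_sgd(X, y, h, lam=1.0, lr=0.01, epochs=5):
    """Within-task algorithm: SGD on the squared loss plus the bias
    term (lam/2) * ||w - h||^2, which pulls the iterate toward h."""
    w = h.copy()  # warm-start at the bias vector
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i] + lam * (w - h)
            w -= lr * grad
    return w

def update_bias(h, w_task, step=0.1):
    """Meta-step (schematic): move the bias toward the latest task's
    solution. The paper's meta-algorithm takes a gradient step on an
    excess-risk surrogate, which has this general form."""
    return h + step * (w_task - h)
```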

    A Gang of Adversarial Bandits

    We consider running multiple instances of multi-armed bandit (MAB) problems in parallel. A main motivation for this study is online recommendation systems, in which each of $N$ users is associated with a MAB problem and the goal is to exploit users' similarity in order to learn users' preferences over $K$ items more efficiently. We consider the adversarial MAB setting, whereby an adversary is free to choose which user and which loss to present to the learner during the learning process. Users are in a social network and the learner is aided by a priori knowledge of the strengths of the social links between all pairs of users. It is assumed that if the social link between two users is strong then they tend to share the same action. The regret is measured relative to an arbitrary function which maps users to actions. The smoothness of the function is captured by a resistance-based dispersion measure $\Psi$. We present two learning algorithms, GABA-I and GABA-II, which exploit the network structure to bias towards functions of low $\Psi$ values. We show that GABA-I has an expected regret bound of $\mathcal{O}(\sqrt{\ln(NK/\Psi)\,\Psi KT})$ and per-trial time complexity of $\mathcal{O}(K\ln(N))$, whilst GABA-II has a weaker $\mathcal{O}(\sqrt{\ln(N/\Psi)\ln(NK/\Psi)\,\Psi KT})$ regret, but a better $\mathcal{O}(\ln(K)\ln(N))$ per-trial time complexity. We highlight improvements of both algorithms over running independent standard MABs across users.
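
    GABA-I and GABA-II build a graph-based prior into an adversarial bandit learner; reproducing them faithfully is beyond a short sketch. The baseline they are compared against, however, is simple: one independent Exp3 instance per user. Below is a minimal standard Exp3 sketch with losses in [0, 1]; names and the exploration parameter are illustrative.

```python
import numpy as np

class Exp3:
    """One standard Exp3 learner per user is the independent baseline
    that GABA-I/II improve on by sharing information across the graph."""
    def __init__(self, K, gamma=0.1):
        self.K, self.gamma = K, gamma
        self.w = np.ones(K)

    def probs(self):
        # exponential weights mixed with uniform exploration
        p = self.w / self.w.sum()
        return (1 - self.gamma) * p + self.gamma / self.K

    def act(self, rng):
        return rng.choice(self.K, p=self.probs())

    def update(self, arm, loss):
        est = loss / self.probs()[arm]  # importance-weighted loss estimate
        self.w[arm] *= np.exp(-self.gamma * est / self.K)
```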

    Meta-learning with Stochastic Linear Bandits

    We investigate meta-learning procedures in the setting of stochastic linear bandit tasks. The goal is to select a learning algorithm which works well on average over a class of bandit tasks sampled from a task-distribution. Inspired by recent work on learning-to-learn linear regression, we consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector. We first study the benefit of the biased OFUL algorithm in terms of regret minimization. We then propose two strategies to estimate the bias within the learning-to-learn setting. We show, both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
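
    The biased estimator at the heart of such a regularized OFUL variant has a closed form. A minimal sketch follows, assuming a fixed confidence width beta rather than the theoretically calibrated radius; all names are illustrative.

```python
import numpy as np

def biased_ridge(X, y, h, lam=1.0):
    """Least squares with penalty lam * ||theta - h||^2, i.e.
    theta_hat = (X^T X + lam*I)^{-1} (X^T y + lam*h)."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y + lam * h), A

def optimistic_arm(arms, theta_hat, A, beta=1.0):
    """OFUL-style choice: maximize the upper confidence bound
    x^T theta_hat + beta * ||x||_{A^{-1}} over the arm set."""
    A_inv = np.linalg.inv(A)
    scores = [x @ theta_hat + beta * np.sqrt(x @ A_inv @ x) for x in arms]
    return int(np.argmax(scores))
```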

    Online Parameter-Free Learning of Multiple Low Variance Tasks

    We propose a method to learn a common bias vector for a growing sequence of low-variance tasks. Unlike state-of-the-art approaches, our method does not require tuning any hyper-parameter. Our approach is presented in the non-statistical setting and comes in two variants: the "aggressive" one updates the bias after each datapoint, the "lazy" one updates the bias only at the end of each task. We derive an across-tasks regret bound for the method. When compared to state-of-the-art approaches, the aggressive variant achieves faster rates, while the lazy one recovers standard rates without the need to tune hyper-parameters. We then adapt the methods to the statistical setting: the aggressive variant becomes a multi-task learning method, the lazy one a meta-learning method. Experiments confirm the effectiveness of our methods in practice.
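
    Schematically, the two variants differ only in when the bias moves. The sketch below uses a simple averaging/step-size schedule for illustration; the actual method derives its step sizes parameter-free, which is precisely the machinery omitted here.

```python
import numpy as np

def lazy_bias(task_solutions):
    """Lazy variant (schematic): the bias is updated only at the end
    of each task, here as a running mean of per-task solutions."""
    h = np.zeros_like(task_solutions[0], dtype=float)
    for t, w in enumerate(task_solutions, start=1):
        h += (w - h) / t  # incremental mean
    return h

def aggressive_bias(h, subgrad, t, scale=1.0):
    """Aggressive variant (schematic): the bias takes a small step
    after every single datapoint, using the within-task subgradient."""
    return h - scale * subgrad / np.sqrt(t)
```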

    On the Sample Complexity of Representation Learning in Multi-task Bandits with Global and Local structure

    We investigate the sample complexity of learning the optimal arm for multi-task bandit problems. Arms consist of two components: one that is shared across tasks (which we call the representation) and one that is task-specific (which we call the predictor). The objective is to learn the optimal (representation, predictor)-pair for each task, under the assumption that the optimal representation is common to all tasks. Within this framework, efficient learning algorithms should transfer knowledge across tasks. We consider the best-arm identification problem for a fixed confidence, where, in each round, the learner actively selects both a task and an arm, and observes the corresponding reward. We derive instance-specific sample complexity lower bounds satisfied by any $(\delta_G,\delta_H)$-PAC algorithm (such an algorithm identifies the best representation with probability at least $1-\delta_G$, and the best predictor for a task with probability at least $1-\delta_H$). We devise an algorithm, OSRL-SC, whose sample complexity approaches the lower bound and scales at most as $H(G\log(1/\delta_G) + X\log(1/\delta_H))$, with $X$, $G$, $H$ being, respectively, the number of tasks, representations and predictors. By comparison, this scaling is significantly better than that of the classical best-arm identification algorithm, which scales as $HGX\log(1/\delta)$.
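
    To make the gap concrete with illustrative numbers: taking $G = H = X = 10$ and $\delta_G = \delta_H = \delta$, the classical scaling gives $HGX\log(1/\delta) = 1000\log(1/\delta)$, while OSRL-SC's scaling gives $H(G\log(1/\delta_G) + X\log(1/\delta_H)) = 10\,(10+10)\log(1/\delta) = 200\log(1/\delta)$, a fivefold reduction.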

    Efficient Lifelong Learning Algorithms: Regret Bounds and Statistical Guarantees

    We study the Meta-Learning paradigm where the goal is to select an algorithm in a prescribed family – usually denoted as inner or within-task algorithm – that is appropriate to address a class of learning problems (tasks) sharing specific similarities. More precisely, we aim at designing a procedure, called a meta-algorithm, that is able to infer this tasks' relatedness from a sequence of observed tasks and to exploit such knowledge in order to return a within-task algorithm in the class that is best suited to solve a new similar task. We are interested in the online Meta-Learning setting, also known as Lifelong Learning. In this scenario the meta-algorithm receives the tasks sequentially and incrementally adapts the inner algorithm on the fly as the tasks arrive. In particular, we refer to the framework in which also the within-task data are processed sequentially by the inner algorithm as Online-Within-Online (OWO) Meta-Learning, while we use the term Online-Within-Batch (OWB) Meta-Learning to denote the setting in which the within-task data are processed in a single batch. In this work we propose an OWO Meta-Learning method based on primal-dual Online Learning. Our method is theoretically grounded and able to cover various types of tasks' relatedness and learning algorithms. More precisely, we focus on the family of inner algorithms given by a parametrized variant of Follow The Regularized Leader (FTRL) aiming at minimizing the within-task regularized empirical risk. The inner algorithm in this class is incrementally adapted by an FTRL meta-algorithm using the within-task minimum regularized empirical risk as the meta-loss. In order to keep the process fully online, we use the online inner algorithm to approximate the subgradients used by the meta-algorithm, and we show how to exploit an upper bound on this approximation error in order to derive a cumulative error bound for the proposed method. Our analysis can be adapted to the statistical setting by two nested online-to-batch conversion steps. We also show how the proposed OWO method can provide statistical guarantees comparable to its natural, more expensive OWB variant, where the inner online algorithm is substituted by the batch minimizer of the regularized empirical risk. Finally, we apply our method to two important families of learning algorithms, parametrized by a bias vector or a linear feature map.
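
    The online-within-online structure admits a compact sketch. Below, the inner algorithm is plain online gradient descent on a bias-regularized squared loss (a simple member of the FTRL family the work parametrizes), and the meta-step uses the fact that the gradient of the within-task minimum regularized risk with respect to the bias $h$ is $\lambda(h - w^*)$, with the online iterate standing in for the batch minimizer $w^*$. All names and step sizes are illustrative, not the paper's.

```python
import numpy as np

def inner_ogd(data, h, lam=1.0, lr=0.1):
    """Inner algorithm (schematic): online gradient descent on the
    bias-regularized losses  l_i(w) + (lam/2)*||w - h||^2, a simple
    member of the parametrized FTRL family described above."""
    w = h.copy()
    for x, y in data:
        g = (x @ w - y) * x + lam * (w - h)  # squared-loss subgradient
        w -= lr * g
    return w

def owo_meta(task_stream, dim, lam=1.0, meta_lr=0.05):
    """Meta-algorithm (schematic): one gradient step on the meta-loss
    per task. The gradient of the minimum regularized risk w.r.t. h
    is lam*(h - w*); the final online iterate w approximates w*."""
    h = np.zeros(dim)
    for data in task_stream:
        w = inner_ogd(data, h, lam=lam)
        h -= meta_lr * lam * (h - w)
    return h
```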