30 research outputs found

    Multitask learning without label correspondences

    Get PDF
    We propose an algorithm to perform multitask learning where each task has potentially distinct label sets and label correspondences are not readily available. This is in contrast with existing methods which either assume that the label sets shared by different tasks are the same or that there exists a label mapping oracle. Our method directly maximizes the mutual information among the labels, and we show that the resulting objective function can be efficiently optimized using existing algorithms. Our proposed approach has a direct application for data integration with different label spaces for the purpose of classification, such as integrating Yahoo! and DMOZ web directories

    Distribution Regression with Minimax-Optimal Guarantee

    Get PDF
    We focus on the distribution regression problem (DRP): we regress from probability measures to Hilbert-space valued outputs, where the input distributions are only available through samples (this is the 'two-stage sampled' setting). Several important statistical and machine learning problems can be phrased within this framework including point estimation tasks without analytical solution (such as hyperparameter or entropy estimation) and multi-instance learning. However, due to the two-stage sampled nature of the problem, the theoretical analysis becomes quite challenging: to the best of our knowledge the only existing method with performance guarantees to solve the DRP task requires density estimation (which often performs poorly in practise) and the distributions to be defined on a compact Euclidean domain. We present a simple, analytically tractable alternative to solve the DRP task: we embed the distributions to a reproducing kernel Hilbert space and perform ridge regression from the embedded distributions to the outputs. Our main contribution is to prove that this scheme is consistent in the two-stage sampled setup under mild conditions (on separable topological domains enriched with kernels): we present an exact computational-statistical efficiency tradeoff analysis showing that the studied estimator is able to match the one-stage sampled minimax-optimal rate. This result answers a 17-year-old open question, by establishing the consistency of the classical set kernel [Haussler, 1999; Gaertner et. al, 2002] in regression. We also cover consistency for more recent kernels on distributions, including those due to [Christmann and Steinwart, 2010]. The practical efficiency of the studied technique is illustrated in supervised entropy learning and aerosol prediction using multispectral satellite images

    Learning with Kernels

    Get PDF

    f-Divergence constrained policy improvement

    Full text link
    To ensure stability of learning, state-of-the-art generalized policy iteration algorithms augment the policy improvement step with a trust region constraint bounding the information loss. The size of the trust region is commonly determined by the Kullback-Leibler (KL) divergence, which not only captures the notion of distance well but also yields closed-form solutions. In this paper, we consider a more general class of f-divergences and derive the corresponding policy update rules. The generic solution is expressed through the derivative of the convex conjugate function to f and includes the KL solution as a special case. Within the class of f-divergences, we further focus on a one-parameter family of α\alpha-divergences to study effects of the choice of divergence on policy improvement. Previously known as well as new policy updates emerge for different values of α\alpha. We show that every type of policy update comes with a compatible policy evaluation resulting from the chosen f-divergence. Interestingly, the mean-squared Bellman error minimization is closely related to policy evaluation with the Pearson χ2\chi^2-divergence penalty, while the KL divergence results in the soft-max policy update and a log-sum-exp critic. We carry out asymptotic analysis of the solutions for different values of α\alpha and demonstrate the effects of using different divergence functions on a multi-armed bandit problem and on common standard reinforcement learning problems

    Kernel Exponential Family Estimation via Doubly Dual Embedding

    Get PDF
    We investigate penalized maximum log-likelihood estimation for exponential family distributions whose natural parameter resides in a reproducing kernel Hilbert space. Key to our approach is a novel technique, doubly dual embedding, that avoids computation of the partition function. This technique also allows the development of a flexible sampling strategy that amortizes the cost of Monte-Carlo sampling in the inference stage. The resulting estimator can be easily generalized to kernel conditional exponential families. We establish a connection between kernel exponential family estimation and MMD-GANs, revealing a new perspective for understanding GANs. Compared to the score matching based estimators, the proposed method improves both memory and time efficiency while enjoying stronger statistical properties, such as fully capturing smoothness in its statistical convergence rate while the score matching estimator appears to saturate. Finally, we show that the proposed estimator empirically outperforms state-of-the-artComment: 22 pages, 20 figures; AISTATS 201
    corecore