Multitask learning without label correspondences
We propose an algorithm to perform multitask learning where each task has potentially distinct label sets and label correspondences are not readily available. This is in contrast with existing methods, which either assume that the label sets shared by different tasks are the same or that there exists a label mapping oracle. Our method directly maximizes the mutual information among the labels, and we show that the resulting objective function can be efficiently optimized using existing algorithms. Our proposed approach has a direct application for data integration with different label spaces for the purpose of classification, such as integrating Yahoo! and DMOZ web directories.
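As a rough illustration of the quantity being maximized, the sketch below estimates the mutual information between two label assignments from their joint counts. It is not the paper's algorithm: it assumes a hypothetical set of items that carry a label in both directories, and the function name `mutual_information` and the toy labels are invented for this example.

```python
import numpy as np

def mutual_information(labels_a, labels_b):
    """Empirical mutual information (in nats) between two label sequences.

    Assumes item i carries labels_a[i] in one label space and labels_b[i]
    in the other (a simplifying assumption made for this sketch).
    """
    a_vals, a_idx = np.unique(np.asarray(labels_a), return_inverse=True)
    b_vals, b_idx = np.unique(np.asarray(labels_b), return_inverse=True)

    # Joint distribution over (label in space A, label in space B).
    joint = np.zeros((len(a_vals), len(b_vals)))
    np.add.at(joint, (a_idx, b_idx), 1.0)
    joint /= joint.sum()

    # Product of the marginals, same shape as the joint table.
    marg_a = joint.sum(axis=1, keepdims=True)
    marg_b = joint.sum(axis=0, keepdims=True)
    product = marg_a @ marg_b

    # Sum only over cells with positive mass (0 * log 0 is taken as 0).
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

# Toy labels: two directories that name the same categories differently.
dir_one = ["sports", "sports", "news", "news"]
dir_two = ["Athletics", "Athletics", "Current Events", "Current Events"]
mi_aligned = mutual_information(dir_one, dir_two)

# Labels that carry no information about each other.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

When the two label sets are renamings of each other, the estimate reaches the entropy of either labelling. When they are independent it drops to zero, which is why high mutual information can stand in for an explicit label correspondence.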
Distribution Regression with Minimax-Optimal Guarantee
We focus on the distribution regression problem (DRP): we regress from probability measures to Hilbert-space valued outputs, where the input distributions are only available through samples (this is the 'two-stage sampled' setting). Several important statistical and machine learning problems can be phrased within this framework, including point estimation tasks without analytical solution (such as hyperparameter or entropy estimation) and multi-instance learning. However, due to the two-stage sampled nature of the problem, the theoretical analysis becomes quite challenging: to the best of our knowledge, the only existing method with performance guarantees to solve the DRP task requires density estimation (which often performs poorly in practice) and requires the distributions to be defined on a compact Euclidean domain. We present a simple, analytically tractable alternative to solve the DRP task: we embed the distributions into a reproducing kernel Hilbert space and perform ridge regression from the embedded distributions to the outputs. Our main contribution is to prove that this scheme is consistent in the two-stage sampled setup under mild conditions (on separable topological domains enriched with kernels): we present an exact computational-statistical efficiency tradeoff analysis showing that the studied estimator is able to match the one-stage sampled minimax-optimal rate. This result answers a 17-year-old open question by establishing the consistency of the classical set kernel [Haussler, 1999; Gaertner et al., 2002] in regression. We also cover consistency for more recent kernels on distributions, including those due to [Christmann and Steinwart, 2010]. The practical efficiency of the studied technique is illustrated in supervised entropy learning and aerosol prediction using multispectral satellite images.
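The two-step scheme described above (embed each sampled distribution, then run ridge regression on the embeddings) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the Gaussian base kernel, the bandwidth, the regularization scaling `n * lam`, the helper names, and the toy task (predicting the mean of a Gaussian from a bag of its samples) are all choices made for illustration.

```python
import numpy as np

def gaussian_gram(X, Y, bandwidth=1.0):
    """Pairwise Gaussian kernel values between rows of X and rows of Y."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def set_kernel(bag_a, bag_b, bandwidth=1.0):
    """Inner product of the two empirical mean embeddings.

    This equals the average of the base kernel over all cross pairs.
    """
    return gaussian_gram(bag_a, bag_b, bandwidth).mean()

def fit_ridge(bags, targets, lam=1e-2, bandwidth=1.0):
    """Kernel ridge regression on bags; returns the dual coefficients."""
    n = len(bags)
    K = np.array([[set_kernel(a, b, bandwidth) for b in bags] for a in bags])
    return np.linalg.solve(K + n * lam * np.eye(n), targets)

def predict(train_bags, coef, test_bags, bandwidth=1.0):
    """Predict targets for new bags from the fitted dual coefficients."""
    K_test = np.array([[set_kernel(t, b, bandwidth) for b in train_bags]
                       for t in test_bags])
    return K_test @ coef

# Toy two-stage sampled data: each bag is drawn from N(m, 1), target is m.
rng = np.random.default_rng(0)
train_means = rng.uniform(-2.0, 2.0, size=40)
train_bags = [rng.normal(m, 1.0, size=(100, 1)) for m in train_means]
coef = fit_ridge(train_bags, train_means)

test_means = np.array([-1.5, 0.0, 1.5])
test_bags = [rng.normal(m, 1.0, size=(100, 1)) for m in test_means]
predictions = predict(train_bags, coef, test_bags)
```

The regressor never sees the underlying distributions, only bags of samples, which is the two-stage sampled setting. The linear kernel on mean embeddings used here corresponds to the set kernel mentioned in the abstract.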
f-Divergence constrained policy improvement
To ensure stability of learning, state-of-the-art generalized policy
iteration algorithms augment the policy improvement step with a trust region
constraint bounding the information loss. The size of the trust region is
commonly determined by the Kullback-Leibler (KL) divergence, which not only
captures the notion of distance well but also yields closed-form solutions. In
this paper, we consider a more general class of f-divergences and derive the
corresponding policy update rules. The generic solution is expressed through
the derivative of the convex conjugate function to f and includes the KL
solution as a special case. Within the class of f-divergences, we further focus
on a one-parameter family of α-divergences to study effects of the
choice of divergence on policy improvement. Previously known as well as new
policy updates emerge for different values of α. We show that every type
of policy update comes with a compatible policy evaluation resulting from the
chosen f-divergence. Interestingly, the mean-squared Bellman error minimization
is closely related to policy evaluation with the Pearson χ²-divergence
penalty, while the KL divergence results in the soft-max policy update and a
log-sum-exp critic. We carry out asymptotic analysis of the solutions for
different values of α and demonstrate the effects of using different
divergence functions on a multi-armed bandit problem and on standard
reinforcement learning problems.
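The KL special case mentioned above can be made concrete on a multi-armed bandit. The sketch below assumes a penalized form of the problem, maximizing expected reward minus `eta` times the KL divergence to the old policy, with a fixed temperature `eta`. That is a simplification of the constrained formulation in the abstract, and the function names and reward values are hypothetical.

```python
import numpy as np

def kl_policy_update(old_policy, rewards, eta):
    """Soft-max update: new(a) is proportional to old(a) * exp(r(a) / eta)."""
    logits = np.log(old_policy) + rewards / eta
    logits -= logits.max()          # stabilize the exponentials
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

def soft_value(old_policy, rewards, eta):
    """Log-sum-exp value: eta * log sum_a old(a) * exp(r(a) / eta)."""
    z = rewards / eta
    z_max = z.max()
    return eta * (z_max + np.log(np.sum(old_policy * np.exp(z - z_max))))

def penalized_objective(policy, old_policy, rewards, eta):
    """Expected reward minus eta * KL(policy || old_policy)."""
    support = policy > 0
    kl = np.sum(policy[support] * np.log(policy[support] / old_policy[support]))
    return float(policy @ rewards - eta * kl)

# Three-armed bandit with a uniform starting policy (toy numbers).
rewards = np.array([0.1, 0.5, 0.9])
old_policy = np.full(3, 1.0 / 3.0)

eta = 0.5
new_policy = kl_policy_update(old_policy, rewards, eta)
value = soft_value(old_policy, rewards, eta)
objective = penalized_objective(new_policy, old_policy, rewards, eta)

# A small temperature makes the update nearly greedy.
greedy_policy = kl_policy_update(old_policy, rewards, eta=0.01)
```

At the soft-max solution the penalized objective coincides with the log-sum-exp value, which is the sense in which the KL divergence pairs a soft-max policy update with a log-sum-exp critic. Other members of the f-divergence family would replace the exponential with a different function and are not covered by this sketch.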
Kernel Exponential Family Estimation via Doubly Dual Embedding
We investigate penalized maximum log-likelihood estimation for exponential
family distributions whose natural parameter resides in a reproducing kernel
Hilbert space. Key to our approach is a novel technique, doubly dual embedding,
that avoids computation of the partition function. This technique also allows
the development of a flexible sampling strategy that amortizes the cost of
Monte-Carlo sampling in the inference stage. The resulting estimator can be
easily generalized to kernel conditional exponential families. We establish a
connection between kernel exponential family estimation and MMD-GANs, revealing
a new perspective for understanding GANs. Compared to the score matching based
estimators, the proposed method improves both memory and time efficiency while
enjoying stronger statistical properties, such as fully capturing smoothness in
its statistical convergence rate while the score matching estimator appears to
saturate. Finally, we show that the proposed estimator empirically outperforms
state-of-the-art.
Comment: 22 pages, 20 figures; AISTATS 201