Multi-Output Learning via Spectral Filtering
In this paper we study a class of regularized kernel methods for vector-valued learning which are based on filtering the spectrum of the kernel matrix. The considered methods include Tikhonov regularization as a special case, as well as interesting alternatives such as vector-valued extensions of L2 boosting. Computational properties are discussed for various examples of kernels for vector-valued functions, and the benefits of iterative techniques are illustrated. Generalizing previous results for the scalar case, we prove finite-sample bounds on the excess risk of the obtained estimator; in turn, these results allow us to prove consistency for both regression and multi-category classification. Finally, we present promising results of the proposed algorithms on artificial and real data.
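To make the spectral-filtering viewpoint concrete, the sketch below contrasts the Tikhonov filter with a Landweber iteration (a vector-valued analogue of L2 boosting) on the same kernel matrix. The separable Gaussian kernel, the toy data, and the step-size rule are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # Scalar Gaussian kernel; combined with an identity coupling between the
    # outputs this yields a simple separable matrix-valued kernel.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def tikhonov_coefficients(K, Y, lam):
    # Tikhonov filter: C = (K + n*lam*I)^{-1} Y, applied column-wise to the
    # multi-output targets Y (n x T).
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), Y)

def landweber_coefficients(K, Y, n_iter=200, step=None):
    # Landweber iteration; the number of iterations plays the role of the
    # regularization parameter (early stopping).
    if step is None:
        step = 1.0 / np.linalg.eigvalsh(K).max()
    C = np.zeros_like(Y, dtype=float)
    for _ in range(n_iter):
        C = C + step * (Y - K @ C)
    return C

# Toy two-output regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 2))
Y = np.stack([np.sin(3 * X[:, 0]), np.cos(3 * X[:, 1])], axis=1)
K = gaussian_kernel(X, X)
C_tik = tikhonov_coefficients(K, Y, lam=1e-3)
C_lw = landweber_coefficients(K, Y)
# Predictions at new points X_new: gaussian_kernel(X_new, X) @ C
```

Early stopping of the iterative filter avoids the explicit matrix inversion, which is one of the computational benefits of iterative techniques mentioned above.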
Adaptive Kernel Methods Using the Balancing Principle
The choice of the regularization parameter is a fundamental problem in supervised learning, since the performance of most algorithms crucially depends on one or more such parameters. In particular, a central theoretical issue is how much prior knowledge about the problem is needed to choose the regularization parameter suitably and obtain learning rates. In this paper we present a strategy, the balancing principle, for choosing the regularization parameter without knowledge of the regularity of the target function; this choice adaptively achieves the best error rate. Our main result applies to regularization algorithms in reproducing kernel Hilbert spaces with the square loss, though we also study how a similar principle can be used in other situations. As a straightforward corollary, we derive adaptive parameter choices for several recently studied kernel methods. Numerical experiments with the proposed parameter-choice rules are also presented.
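As a schematic illustration of how such a rule can be implemented, the sketch below applies a Lepskii-type balancing rule to kernel ridge regression over a geometric grid of regularization parameters. The variance proxy S(lambda) proportional to 1/(sqrt(n)*lambda), the constant, and the empirical norm used to compare estimators are illustrative assumptions rather than the exact quantities analyzed in the paper.

```python
import numpy as np

def tikhonov_fit(K, y, lam):
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def balancing_lambda(K, y, lambdas, const=4.0):
    # Lepskii-type balancing rule (schematic): return the largest lambda whose
    # estimator stays within const * S(mu) of every estimator computed with a
    # smaller parameter mu, where S is a surrogate for the sample-error term.
    n = K.shape[0]
    lambdas = np.sort(np.asarray(lambdas))                 # ascending grid
    coeffs = [tikhonov_fit(K, y, lam) for lam in lambdas]
    preds = [K @ c for c in coeffs]

    def emp_dist(i, j):
        # Empirical L2 distance between two estimated functions.
        return np.sqrt(np.mean((preds[i] - preds[j]) ** 2))

    def S(lam):
        return 1.0 / (np.sqrt(n) * lam)                    # illustrative proxy

    best = lambdas[0]
    for i in range(len(lambdas)):
        if all(emp_dist(i, j) <= const * S(lambdas[j]) for j in range(i)):
            best = lambdas[i]
    return best

# Example usage on toy data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(np.pi * x[:, 0]) + 0.2 * rng.normal(size=100)
K = np.exp(-(x - x.T) ** 2 / 0.5)
lam_hat = balancing_lambda(K, y, lambdas=np.logspace(-6, 0, 20))
```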
What can be learnt with wide convolutional neural networks?
Understanding how convolutional neural networks (CNNs) can efficiently learn
high-dimensional functions remains a fundamental challenge. A popular belief is
that these models harness the local and hierarchical structure of natural data
such as images. Yet, we lack a quantitative understanding of how such structure
affects performance, e.g. the rate of decay of the generalisation error with
the number of training samples. In this paper, we study deep CNNs in the kernel
regime. First, we show that the spectrum of the corresponding kernel inherits
the hierarchical structure of the network, and we characterise its asymptotics.
Then, we use this result together with generalisation bounds to prove that deep
CNNs adapt to the spatial scale of the target function. In particular, we find
that if the target function depends on low-dimensional subsets of adjacent
input variables, then the rate of decay of the error is controlled by the
effective dimensionality of these subsets. Conversely, if the teacher function
depends on the full set of input variables, then the error rate is inversely
proportional to the input dimension. We conclude by computing the rate when a
deep CNN is trained on the output of another deep CNN with randomly initialised
parameters. Interestingly, we find that, despite their hierarchical structure,
the functions generated by deep CNNs are too rich to be efficiently learnable
in high dimension.
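A minimal sketch of the kind of object this analysis concerns: a kernel built hierarchically from local patches, whose spectrum can then be inspected empirically. The patch size, the depth-two construction, the base kernel, and the random inputs below are illustrative assumptions, not the deep convolutional kernels studied in the paper.

```python
import numpy as np

def patch_kernels(X1, X2, patch, sigma=1.0):
    # One Gaussian kernel matrix per local patch of adjacent input variables.
    d = X1.shape[1]
    mats = []
    for start in range(0, d, patch):
        P1, P2 = X1[:, start:start + patch], X2[:, start:start + patch]
        d2 = ((P1[:, None, :] - P2[None, :, :]) ** 2).sum(-1)
        mats.append(np.exp(-d2 / (2 * sigma ** 2)))
    return np.stack(mats)                       # (n_patches, n1, n2)

def hierarchical_kernel(X1, X2, patch=4, window=2):
    # Depth-two construction: average patch kernels over local windows, apply a
    # pointwise nonlinearity at the second level, then average the results.
    Ks = patch_kernels(X1, X2, patch)
    pooled = [np.exp(Ks[i:i + window].mean(axis=0))
              for i in range(0, len(Ks), window)]
    return np.mean(pooled, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))                   # illustrative random inputs
K = hierarchical_kernel(X, X)
eigvals = np.sort(np.linalg.eigvalsh(K))[::-1]
print(eigvals[:10] / eigvals[0])                 # inspect the spectral decay
```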
Adaptive Distributed Kernel Ridge Regression: A Feasible Distributed Learning Scheme for Data Silos
Data silos, arising mainly from privacy concerns and limited interoperability, significantly
constrain collaboration among organizations that hold similar data for
the same purpose. Distributed learning based on divide-and-conquer provides a
promising way to address data silos, but it faces several challenges,
including autonomy, privacy guarantees, and the necessity of collaborations.
This paper focuses on developing an adaptive distributed kernel ridge
regression (AdaDKRR) by taking autonomy in parameter selection, privacy in
communicating non-sensitive information, and the necessity of collaborations in
performance improvement into account. We provide both solid theoretical
verification and comprehensive experiments for AdaDKRR to demonstrate its
feasibility and effectiveness. Theoretically, we prove that under some mild
conditions, AdaDKRR performs similarly to running the optimal learning
algorithms on the whole data, verifying the necessity of collaborations and
showing that no other distributed learning scheme can essentially beat AdaDKRR
under the same conditions. Numerically, we test AdaDKRR on both toy simulations
and two real-world applications to show that AdaDKRR is superior to other
existing distributed learning schemes. All these results show that AdaDKRR is a
feasible scheme for overcoming data silos, which is highly desirable in numerous
application areas such as intelligent decision-making, price forecasting, and
performance prediction for products.
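As background for what the divide-and-conquer strategy looks like, here is a minimal sketch of plain distributed kernel ridge regression, in which each silo fits a local estimator and only predictions are aggregated. This is the non-adaptive baseline; AdaDKRR's adaptive parameter selection and communication scheme are not reproduced here, and the Gaussian kernel and simulated silos are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def local_krr(X_loc, y_loc, lam, X_query):
    # Each silo fits kernel ridge regression on its own data and only shares
    # predictions at the query points (non-sensitive information).
    n = len(y_loc)
    K = gaussian_kernel(X_loc, X_loc)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y_loc)
    return gaussian_kernel(X_query, X_loc) @ alpha

def distributed_krr(silos, lam, X_query):
    # Divide-and-conquer synthesis: average the local predictions.
    preds = [local_krr(Xj, yj, lam, X_query) for Xj, yj in silos]
    return np.mean(preds, axis=0)

# Toy illustration with three simulated silos (hypothetical data).
rng = np.random.default_rng(1)
silos = []
for _ in range(3):
    Xj = rng.uniform(-1, 1, size=(100, 1))
    yj = np.sin(np.pi * Xj[:, 0]) + 0.1 * rng.normal(size=100)
    silos.append((Xj, yj))
X_query = np.linspace(-1, 1, 50)[:, None]
y_hat = distributed_krr(silos, lam=1e-2, X_query=X_query)
```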
Kernel Instrumental Variable Regression
Instrumental variable (IV) regression is a strategy for learning causal
relationships in observational data. If measurements of input X and output Y
are confounded, the causal relationship can nonetheless be identified if an
instrumental variable Z is available that influences X directly, but is
conditionally independent of Y given X and the unmeasured confounder. The
classic two-stage least squares algorithm (2SLS) simplifies the estimation
problem by modeling all relationships as linear functions. We propose kernel
instrumental variable regression (KIV), a nonparametric generalization of 2SLS,
modeling relations among X, Y, and Z as nonlinear functions in reproducing
kernel Hilbert spaces (RKHSs). We prove the consistency of KIV under mild
assumptions, and derive conditions under which convergence occurs at the
minimax optimal rate for unconfounded, single-stage RKHS regression. In doing
so, we obtain an efficient ratio between training sample sizes used in the
algorithm's first and second stages. In experiments, KIV outperforms
state-of-the-art alternatives for nonparametric IV regression.
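The two-stage structure can be sketched with random Fourier features standing in for the RKHS feature maps; the feature dimensions, ridge parameters, sample split, and simulated confounded data below are illustrative assumptions, not the closed-form estimator or the tuning rules derived in the paper.

```python
import numpy as np

def rff(X, omega, b):
    # Random Fourier features approximating a Gaussian kernel.
    return np.sqrt(2.0 / omega.shape[1]) * np.cos(X @ omega + b)

def ridge(Phi, T, lam):
    # Solve min ||Phi B - T||^2 + lam ||B||^2 (T may have several columns).
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ T)

rng = np.random.default_rng(0)

# Simulated confounded data: Z -> X, (X, U) -> Y, with U unobserved.
n = 2000
Z = rng.uniform(-3, 3, size=(n, 1))
U = rng.normal(size=(n, 1))
X = Z + U + 0.1 * rng.normal(size=(n, 1))
Y = np.sin(X) + U + 0.1 * rng.normal(size=(n, 1))    # structural function sin(x)

D = 200
Om_x, b_x = rng.normal(size=(1, D)), rng.uniform(0, 2 * np.pi, size=D)
Om_z, b_z = rng.normal(size=(1, D)), rng.uniform(0, 2 * np.pi, size=D)

# Split the sample between the two stages (the paper studies the optimal ratio).
i1, i2 = np.arange(n // 2), np.arange(n // 2, n)

# Stage 1: learn the conditional mean of the X-features given the instrument Z.
B1 = ridge(rff(Z[i1], Om_z, b_z), rff(X[i1], Om_x, b_x), lam=1e-3)
# Stage 2: regress Y on the predicted X-features.
beta = ridge(rff(Z[i2], Om_z, b_z) @ B1, Y[i2], lam=1e-3)

# Estimated structural function on a grid.
x_grid = np.linspace(-3, 3, 100)[:, None]
f_hat = rff(x_grid, Om_x, b_x) @ beta
```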
Learning Sets with Separating Kernels
We consider the problem of learning a set from random samples. We show how
relevant geometric and topological properties of a set can be studied
analytically using concepts from the theory of reproducing kernel Hilbert
spaces. A new kind of reproducing kernel, which we call a separating kernel, plays
a crucial role in our study and is analyzed in detail. We prove a new analytic
characterization of the support of a distribution, which naturally leads to a
family of provably consistent regularized learning algorithms, and we discuss
the stability of these methods with respect to random sampling. Numerical
experiments show that the approach is competitive with, and often better than,
other state-of-the-art techniques.
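A schematic sketch of a set estimator in this spirit: a query point is declared inside the support when its kernel section is well approximated by the regularized span of the sample sections. The Gaussian kernel, the Tikhonov-style regularization of the projection, and the threshold are illustrative assumptions; the paper's separating kernels and consistency analysis are more specific than this.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=0.5):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def support_indicator(X_train, X_query, lam=1e-3, tau=0.5):
    # Declare x inside the estimated support when the squared residual
    #   r(x) = k(x, x) - k_x^T (K + n*lam*I)^{-1} k_x
    # of projecting its kernel section onto the (regularized) span of the
    # sample sections falls below a threshold tau.
    n = len(X_train)
    K = gaussian_kernel(X_train, X_train)
    Kq = gaussian_kernel(X_query, X_train)                 # (m, n)
    A = np.linalg.solve(K + n * lam * np.eye(n), Kq.T)     # (n, m)
    residual = 1.0 - np.sum(Kq.T * A, axis=0)              # k(x, x) = 1 here
    return residual < tau

# Toy example: samples drawn near a ring (hypothetical data).
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 400)
X_train = np.stack([np.cos(theta), np.sin(theta)], axis=1) \
          + 0.05 * rng.normal(size=(400, 2))
X_query = rng.uniform(-1.5, 1.5, size=(10, 2))
print(support_indicator(X_train, X_query))
```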
Amortised learning by wake-sleep
Models that employ latent variables to capture structure in observed data lie at the heart of many current unsupervised learning algorithms, but exact maximum-likelihood learning for powerful and flexible latent-variable models is almost always intractable. Thus, state-of-the-art approaches either abandon the maximum-likelihood framework entirely, or else rely on a variety of variational approximations to the posterior distribution over the latents. Here, we propose an alternative approach that we call amortised learning. Rather than computing an approximation to the posterior over latents, we use a wake-sleep Monte-Carlo strategy to learn a function that directly estimates the maximum-likelihood parameter updates. Amortised learning is possible whenever samples of latents and observations can be simulated from the generative model, treating the model as a “black box”. We demonstrate its effectiveness on a wide range of complex models, including those with latents that are discrete or supported on non-Euclidean spaces.
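To make the idea concrete, here is a toy sketch for a one-parameter linear-Gaussian model, where the joint log-likelihood gradient is available in closed form: the sleep phase simulates (z, x) pairs from the current model and fits a regressor from x to the per-sample gradient, and the wake phase applies that regressor to the observed data to obtain the maximum-likelihood update. The kernel ridge regressor, the model, the learning rate, and the initialization are assumptions for illustration; the paper handles far richer models and gradient estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, theta_true = 0.5, 2.0

# Observed data from the "true" model (hypothetical, for illustration).
z_hidden = rng.normal(size=2000)
x_obs = theta_true * z_hidden + sigma * rng.normal(size=2000)

def krr_1d(x, y, lam=1e-2, scale=1.0):
    # Gaussian-kernel ridge regressor used as the amortised gradient estimator.
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * scale ** 2))
    alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)
    return lambda q: np.exp(-(q[:, None] - x[None, :]) ** 2 / (2 * scale ** 2)) @ alpha

theta, lr = 0.5, 0.2          # initialise away from the saddle at theta = 0
for _ in range(100):
    # Sleep phase: simulate (z, x) from the current model and regress the joint
    # gradient d/dtheta log p_theta(x, z) = (x - theta*z) * z / sigma^2 onto x.
    z = rng.normal(size=400)
    x = theta * z + sigma * rng.normal(size=400)
    grad_joint = (x - theta * z) * z / sigma ** 2
    g = krr_1d(x, grad_joint)   # g(x) ~ E[grad | x] = d/dtheta log p_theta(x)
    # Wake phase: apply the amortised gradient estimator to the observed data.
    theta = theta + lr * g(x_obs).mean()

print(theta)   # moves toward |theta_true| (the marginal only identifies |theta|)
```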
Proximal Causal Learning with Kernels: Two-Stage Estimation and Moment Restriction
We address the problem of causal effect estimation in the presence of
unobserved confounding, but where proxies for the latent confounder(s) are
observed. We propose two kernel-based methods for nonlinear causal effect
estimation in this setting: (a) a two-stage regression approach, and (b) a
maximum moment restriction approach. We focus on the proximal causal learning
setting, but our methods can be used to solve a wider class of inverse problems
characterised by a Fredholm integral equation. In particular, we provide a
unifying view of two-stage and moment restriction approaches for solving this
problem in a nonlinear setting. We provide consistency guarantees for each
algorithm, and we demonstrate that these approaches achieve competitive results on
synthetic data and on data simulating a real-world task. In particular, our
approach outperforms earlier methods that are not suited to leveraging proxy
variables.
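To show how the two-stage idea carries over to the proximal setting, the sketch below uses random Fourier features and an additive bridge-function model: stage one learns the conditional mean of the outcome-proxy features given the treatment and the treatment proxy, and stage two regresses the outcome on those predicted features together with treatment features; the dose-response is then obtained by averaging the bridge function over the observed proxies. The additive form of the bridge, the feature dimensions, ridge parameters, and simulated data are simplifying assumptions; the paper works with full RKHS solutions and also develops the moment-restriction alternative.

```python
import numpy as np

def rff(X, omega, b):
    # Random Fourier features approximating a Gaussian kernel.
    return np.sqrt(2.0 / omega.shape[1]) * np.cos(X @ omega + b)

def ridge(Phi, T, lam):
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ T)

rng = np.random.default_rng(0)

# Simulated proximal setup: U is the hidden confounder, Z and W are treatment-
# and outcome-inducing proxies, A the treatment, Y the outcome.
n = 3000
U = rng.normal(size=(n, 1))
Z = U + 0.5 * rng.normal(size=(n, 1))
W = U + 0.5 * rng.normal(size=(n, 1))
A = U + 0.5 * rng.normal(size=(n, 1))
Y = 2.0 * A - U + 0.3 * rng.normal(size=(n, 1))   # true causal slope is 2.0

D = 100
Om_w, b_w = rng.normal(size=(1, D)), rng.uniform(0, 2 * np.pi, size=D)
Om_a, b_a = rng.normal(size=(1, D)), rng.uniform(0, 2 * np.pi, size=D)
Om_az, b_az = rng.normal(size=(2, D)), rng.uniform(0, 2 * np.pi, size=D)

AZ = np.hstack([A, Z])
i1, i2 = np.arange(n // 2), np.arange(n // 2, n)

# Stage 1: conditional mean of the W-features given (A, Z).
B1 = ridge(rff(AZ[i1], Om_az, b_az), rff(W[i1], Om_w, b_w), lam=1e-3)

# Stage 2: regress Y on predicted W-features and treatment features, giving an
# additive bridge function h(a, w) = beta_w' phi_w(w) + beta_a' phi_a(a).
Phi2 = np.hstack([rff(AZ[i2], Om_az, b_az) @ B1, rff(A[i2], Om_a, b_a)])
beta = ridge(Phi2, Y[i2], lam=1e-3)

def dose_response(a):
    # E[Y | do(A = a)] estimated by averaging the bridge over the observed W.
    Phi = np.hstack([rff(W, Om_w, b_w),
                     np.tile(rff(np.array([[a]]), Om_a, b_a), (n, 1))])
    return float(np.mean(Phi @ beta))

# Should be close to the true slope 2.0, up to approximation error.
print(dose_response(1.0) - dose_response(0.0))
```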