Fast learning rate of deep learning via a kernel perspective
We develop a new theoretical framework to analyze the generalization error of
deep learning, and derive a new fast learning rate for two representative
algorithms: empirical risk minimization and Bayesian deep learning. The series
of theoretical analyses of deep learning has revealed its high expressive power
and universal approximation capability. Although these analyses are highly
nonparametric, existing generalization error analyses have been developed
mainly for fixed-dimensional parametric models. To bridge this gap, we
develop an infinite dimensional model that is based on an integral form as
performed in the analysis of the universal approximation capability. This
allows us to define a reproducing kernel Hilbert space corresponding to each
layer. Our point of view is to deal with the ordinary finite dimensional deep
neural network as a finite approximation of the infinite dimensional one. The
approximation error is evaluated by the degree of freedom of the reproducing
kernel Hilbert space in each layer. To estimate a good finite dimensional
model, we consider both empirical risk minimization and Bayesian deep
learning. We derive generalization error bounds for both and show that a
bias-variance trade-off appears in terms of the number of parameters of the
finite-dimensional approximation. We show that the optimal width of the
internal layers can be determined through the degree of freedom, and that the
convergence rate can be faster than the rates established in existing
studies. Comment: 36 pages
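The idea of treating a finite-width layer as a finite approximation of a kernel-induced infinite-dimensional model can be illustrated with a toy sketch. This is not the paper's construction: it uses random cosine features (which approximate a Gaussian kernel) as the "finite-dimensional model", a 1-D regression target, and arbitrary choices for the frequency scale (3.0) and ridge parameter (1e-6), to show how test error varies with the width of the approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(X, W, b):
    """Finite-width random cosine features approximating a Gaussian-kernel
    (infinite-dimensional) model; width = number of columns of W."""
    return np.cos(X @ W + b) * np.sqrt(2.0 / W.shape[1])

# Toy 1-D regression data.
Xtr = rng.uniform(-1, 1, (200, 1))
ytr = np.sin(3 * Xtr[:, 0]) + 0.1 * rng.standard_normal(200)
Xte = np.linspace(-1, 1, 200)[:, None]
yte = np.sin(3 * Xte[:, 0])

test_mse = {}
for width in (2, 8, 32, 128):
    W = 3.0 * rng.standard_normal((1, width))   # assumed frequency scale
    b = rng.uniform(0, 2 * np.pi, width)
    Phi = features(Xtr, W, b)
    # Ridge-regularized least squares for the outer weights.
    w = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(width), Phi.T @ ytr)
    test_mse[width] = np.mean((features(Xte, W, b) @ w - yte) ** 2)
```

At small widths the approximation error of the infinite-dimensional model dominates; at larger widths the fit improves, mirroring the bias-variance trade-off in the number of parameters described above.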
Nonparametric regression using needlet kernels for spherical data
Needlets have been recognized as state-of-the-art tools to tackle spherical
data, due to their excellent localization properties in both spatial and
frequency domains.
This paper considers developing kernel methods associated with the needlet
kernel for nonparametric regression problems whose predictor variables are
defined on a sphere. Due to the localization property in the frequency domain,
we prove that the regularization parameter of the kernel ridge regression
associated with the needlet kernel can decrease arbitrarily fast. A natural
consequence is that the regularization term for the kernel ridge regression is
not necessary in the sense of rate optimality. Further, based on the excellent
localization property in the spatial domain, we also prove that all the
kernel regularization estimates associated with
the needlet kernel, including the kernel lasso estimate and the kernel bridge
estimate, possess almost the same generalization capability for a large range
of regularization parameters in the sense of rate optimality.
This finding tentatively reveals that, if the needlet kernel is utilized,
then the choice of regularization scheme might not have a strong impact on the
generalization capability in some modeling contexts. From this perspective, it
can be arbitrarily specified, or specified merely by non-generalization
criteria such as smoothness, computational complexity, or sparsity. Comment: 21 pages
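The kernel ridge regression estimator discussed above can be sketched in a few lines. This is only an illustration of the estimator, not of the paper's rate result: the needlet kernel is not reproduced here (a Gaussian kernel on the unit circle stands in for it), and the data, bandwidth (0.5), and near-zero regularization parameter are all toy choices.

```python
import numpy as np

def krr_fit(K, y, lam):
    """Kernel ridge regression: solve (K + n*lam*I) alpha = y."""
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

# Toy predictors on the unit circle (a 1-sphere stand-in for spherical data).
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 40)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
y = X[:, 0] ** 2 + 0.05 * rng.standard_normal(40)

# Gaussian kernel as a generic stand-in for the needlet kernel.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 0.5)

alpha = krr_fit(K, y, lam=1e-9)   # regularization parameter pushed toward zero
train_mse = np.mean((K @ alpha - y) ** 2)
```

Even with the regularization parameter driven nearly to zero, the solve remains numerically stable on this toy problem, loosely echoing the claim that the parameter can decrease very fast without harming the estimator.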
Learning rates of coefficient regularization learning with Gaussian kernel
Regularization is a well-recognized, powerful strategy for improving the
performance of a learning machine, and coefficient regularization schemes are
in widespread use. It is known that different penalties lead to different
properties of the deduced estimators: for example, l^2 regularization leads
to smooth estimators, while l^1 regularization leads to sparse estimators.
How, then, does the generalization capability of regularization learning vary
with the exponent q? In this paper, we study this problem in the framework of
statistical learning theory and show that implementing l^q coefficient
regularization schemes in the sample-dependent hypothesis space associated
with a Gaussian kernel can attain the same almost optimal learning rates for
all q. That is, the upper and lower bounds of the learning rates for l^q
regularization learning are asymptotically identical for all q. Our finding
tentatively reveals that, in some modeling contexts, the choice of q might
not have a strong impact with respect to the generalization capability. From
this perspective, q can be arbitrarily specified, or specified merely by
non-generalization criteria such as smoothness, computational complexity, or
sparsity. Comment: 26 pages, 3 figures
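Coefficient regularization in a sample-dependent hypothesis space means fitting f(x) = sum_j c_j K(x, x_j) with a penalty on the coefficient vector c. The sketch below contrasts the q = 2 case (closed-form ridge-type solution) with the q = 1 case (solved here with a few hundred ISTA iterations); the kernel bandwidth, penalty weight, and data are arbitrary toy choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = np.cos(2 * x) + 0.05 * rng.standard_normal(60)

# Sample-dependent hypothesis space: f(x) = sum_j c_j * K(x, x_j),
# with a Gaussian kernel (bandwidth 0.5 is an arbitrary choice here).
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.5)

lam = 1e-3

# q = 2: penalize ||c||_2^2 -- closed-form ridge-type solution.
c2 = np.linalg.solve(K.T @ K + lam * np.eye(60), K.T @ y)

# q = 1: penalize ||c||_1 -- solved here with a few hundred ISTA steps.
L = np.linalg.norm(K, 2) ** 2          # Lipschitz constant of the gradient
c1 = np.zeros(60)
for _ in range(500):
    z = c1 - (K.T @ (K @ c1 - y)) / L  # gradient step on the data-fit term
    c1 = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold

mse2 = np.mean((K @ c2 - y) ** 2)
mse1 = np.mean((K @ c1 - y) ** 2)
```

On this toy problem both estimators fit comparably well, which is in the spirit of (though of course no substitute for) the asymptotic result that the learning rates coincide for all q.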
Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes
It is widely observed that deep learning models with learned parameters
generalize well, even with many more parameters than training samples. We
systematically investigate the underlying reasons why deep neural networks
often generalize well, and reveal the difference between the minima (with the
same training error) that generalize well and those that don't. We show that
it is the characteristics of the loss landscape that explain the good
generalization capability. For deep networks, the volume of the basin of
attraction of good minima dominates that of poor minima, which guarantees
that optimization methods with random initialization converge to good minima.
We theoretically justify our findings by analyzing two-layer neural networks,
showing that low-complexity solutions have a small norm of the Hessian matrix
with respect to the model parameters. For deeper networks, extensive
numerical evidence supports our arguments.
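The two claims above -- flat (small-Hessian) minima generalize better, and their basins of attraction occupy more volume so random initialization tends to find them -- can be caricatured with a one-parameter toy loss. Everything here (the piecewise loss, the basin boundary at w = 1, the step size) is invented for illustration and has no connection to the paper's analysis.

```python
import numpy as np

# Toy 1-D loss with two zero-error minima: a flat one at w = -2 and a sharp
# one at w = +2; the basin boundary at w = 1 makes the flat basin larger.
def loss(w):
    return 0.1 * (w + 2) ** 2 if w < 1.0 else 0.9 * (w - 2) ** 2

def grad(w, eps=1e-6):
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

def hessian(w, eps=1e-4):
    return (loss(w + eps) - 2 * loss(w) + loss(w - eps)) / eps ** 2

# Curvature (the 1-D "Hessian norm") is much smaller at the flat minimum.
flat_curv, sharp_curv = hessian(-2.0), hessian(2.0)

# Gradient descent from random initializations: most runs end in the basin
# with the larger volume, i.e. at the flat minimum.
rng = np.random.default_rng(0)
endpoints = []
for w in rng.uniform(-5, 5, 500):
    for _ in range(300):
        w -= 0.5 * grad(w)
    endpoints.append(w)
flat_fraction = np.mean(np.abs(np.array(endpoints) + 2.0) < 0.1)
```

Both minima have the same (zero) training loss, yet the flat one attracts the majority of random initializations -- the volume-dominance mechanism the abstract describes, in miniature.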
On the Feasibility of Distributed Kernel Regression for Big Data
In modern scientific research, massive datasets with huge numbers of
observations are frequently encountered. To facilitate the computational
process, a divide-and-conquer scheme is often used for the analysis of big
data. In such a strategy, a full dataset is first split into several manageable
segments; the final output is then averaged from the individual outputs of the
segments. Despite its popularity in practice, it remains largely unknown that
whether such a distributive strategy provides valid theoretical inferences to
the original data. In this paper, we address this fundamental issue for the
distributed kernel regression (DKR), where the algorithmic feasibility is
measured by the generalization performance of the resulting estimator. To
justify DKR, a uniform convergence rate is needed for bounding the
generalization error over the individual outputs, which brings new and
challenging issues in the big data setup. Under mild conditions, we show that,
with a proper number of segments, DKR leads to an estimator that is
generalization consistent to the unknown regression function. The obtained
results justify the method of DKR and shed light on the feasibility of using
other distributed algorithms for processing big data. The promising
performance of the method is supported by both simulation and real-data
examples. Comment: 22 pages, 4 figures
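The divide-and-conquer scheme described above is easy to sketch: split the data into m segments, run kernel ridge regression on each, and average the segment predictions. The kernel, bandwidth, regularization value, and m below are arbitrary toy choices, not the conditions under which the paper proves consistency.

```python
import numpy as np

def gauss(A, B, h=0.5):
    """Gaussian kernel matrix between row-sets A and B (bandwidth assumed)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / h)

def krr_predict(Xtr, ytr, Xte, lam=1e-3):
    n = len(ytr)
    alpha = np.linalg.solve(gauss(Xtr, Xtr) + n * lam * np.eye(n), ytr)
    return gauss(Xte, Xtr) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (600, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(600)
Xte = np.linspace(-1, 1, 50)[:, None]

# Divide and conquer: split into m segments, fit each, average the outputs.
m = 6
preds = [krr_predict(Xs, ys, Xte)
         for Xs, ys in zip(np.array_split(X, m), np.array_split(y, m))]
y_dkr = np.mean(preds, axis=0)

max_err = np.max(np.abs(y_dkr - np.sin(np.pi * Xte[:, 0])))
```

Each segment only ever factorizes an (n/m) x (n/m) kernel matrix, which is the computational point of the scheme; the theoretical question the paper addresses is how large m may grow before the averaged estimator loses consistency.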
Discriminative Similarity for Clustering and Semi-Supervised Learning
Similarity-based clustering and semi-supervised learning methods separate the
data into clusters or classes according to the pairwise similarity between the
data, and the pairwise similarity is crucial for their performance. In this
paper, we propose a novel discriminative similarity learning framework which
learns discriminative similarity for either data clustering or semi-supervised
learning. The proposed framework learns a classifier from each hypothetical
labeling, and searches for the optimal labeling by minimizing the
generalization error of the learned classifiers associated with the
hypothetical labelings. A kernel classifier is employed in our framework. By
generalization analysis via Rademacher complexity, the generalization error
bound for the kernel classifier learned from hypothetical labeling is expressed
as the sum of pairwise similarity between the data from different classes,
parameterized by the weights of the kernel classifier. This pairwise
similarity serves as the discriminative similarity for the purpose of
clustering and semi-supervised learning, and a discriminative similarity of
similar form can also be induced by the integrated squared error bound for
kernel density classification. Based on the discriminative similarity induced
by the kernel classifier, we propose new clustering and semi-supervised
learning methods.
An Empirical Study on Regularization of Deep Neural Networks by Local Rademacher Complexity
Regularization of Deep Neural Networks (DNNs) for the sake of improving their
generalization capability is important and challenging. The development in this
line benefits theoretical foundation of DNNs and promotes their usability in
different areas of artificial intelligence. In this paper, we investigate the
role of Rademacher complexity in improving generalization of DNNs and propose a
novel regularizer rooted in Local Rademacher Complexity (LRC). While Rademacher
complexity is well known as a distribution-free complexity measure of a
function class that helps boost the generalization of statistical learning
methods, extensive study shows that LRC, its counterpart focusing on a
restricted function class, leads to sharper convergence rates and potentially
better generalization given finite training samples. Our LRC-based
regularizer is developed by estimating the complexity of the function class
centered at the minimizer of the empirical loss of DNNs. Experiments on
various types of network architectures demonstrate the effectiveness of LRC
regularization in improving generalization. Moreover, our method achieves a
state-of-the-art result on the CIFAR dataset with a network architecture
found by neural architecture search. Comment: Updated the link to the
open-source PaddlePaddle code of LRC regularization, as well as the author
list.
Learning through deterministic assignment of hidden parameters
Supervised learning frequently boils down to determining hidden and bright
parameters in a parameterized hypothesis space based on finite input-output
samples. The hidden parameters determine the attributions of hidden predictors
or the nonlinear mechanism of an estimator, while the bright parameters
characterize how hidden predictors are linearly combined or the linear
mechanism. In the traditional learning paradigm, hidden and bright parameters
are not distinguished and are trained simultaneously in one learning process.
Such one-stage learning (OSL) is amenable to theoretical analysis but suffers
from a high computational burden. To overcome this difficulty, a two-stage
learning (TSL) scheme featuring learning through deterministic assignment of
hidden parameters (LtDaHP) was proposed, which deterministically generates
the hidden parameters using minimal Riesz energy points on a sphere and
equally spaced points in an interval. We theoretically show that, with such a
deterministic assignment of hidden parameters, a neural network realization
of LtDaHP achieves almost the same generalization performance as OSL. We also
present a series of simulations and application examples to demonstrate the
superior performance of LtDaHP.
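The two-stage idea can be sketched in miniature: fix the hidden (nonlinear) parameters by a deterministic rule, then fit only the bright (linear output) parameters by least squares. The sketch uses equally spaced centers on an interval as a 1-D stand-in for minimal Riesz energy points on a sphere; the tanh activation, slope 5.0, and target function are all toy choices.

```python
import numpy as np

# Stage 1: assign hidden parameters deterministically -- equally spaced
# centers in [-1, 1] (a 1-D stand-in for minimal Riesz energy points).
centers = np.linspace(-1, 1, 15)

def hidden(X):
    """Hidden layer with fixed, deterministic parameters, plus a bias column."""
    return np.column_stack([np.ones(len(X)), np.tanh(5.0 * (X - centers))])

# Stage 2: fit only the bright (output) parameters by linear least squares.
X = np.linspace(-1, 1, 100)[:, None]
y = np.abs(X[:, 0])                          # toy target with a kink at 0
H = hidden(X)                                # (100, 16) design matrix
w, *_ = np.linalg.lstsq(H, y, rcond=None)
fit_mse = np.mean((H @ w - y) ** 2)
```

Because the second stage is an ordinary linear least-squares problem, the training cost is a single factorization rather than a nonconvex optimization over all parameters, which is the computational advantage the abstract attributes to TSL.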
Theoretical Analysis of Adversarial Learning: A Minimax Approach
Here we propose a general theoretical method for analyzing the risk bound in
the presence of adversaries. Specifically, we try to fit the adversarial
learning problem into the minimax framework. We first show that the original
adversarial learning problem can be reduced to a minimax statistical learning
problem by introducing a transport map between distributions. Then, we prove a
new risk bound for this minimax problem in terms of covering numbers under a
weak version of the Lipschitz condition. Our method can be applied to multi-class
classification problems and commonly used loss functions such as the hinge and
ramp losses. As some illustrative examples, we derive the adversarial risk
bounds for SVMs, deep neural networks, and PCA, and our bounds have two
data-dependent terms, which can be optimized for achieving adversarial
robustness. Comment: 27 pages, added some references
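For the simplest case the abstract mentions -- the hinge loss with a linear model -- the adversarial risk has a well-known closed form: a worst-case l2 perturbation of norm eps shifts the score margin by eps times the weight norm. The sketch below uses this classical identity (not the paper's minimax bound) on invented data, with an arbitrary classifier and budget.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.sign(X[:, 0] + 0.3)              # labels from a shifted linear rule
w = np.array([1.0, 0.0])                # a fixed linear classifier (toy choice)

def hinge(margin):
    return np.maximum(0.0, 1.0 - margin)

eps = 0.2                                # l2 perturbation budget
margins = y * (X @ w)
clean_risk = hinge(margins).mean()
# Worst-case l2 perturbation of x shifts a linear score by eps * ||w||_2,
# so the adversarial hinge risk reduces to a margin shift for linear models.
adv_risk = hinge(margins - eps * np.linalg.norm(w)).mean()
```

The adversarial risk is always at least the clean risk, and the gap grows with eps times the weight norm -- one concrete instance of the data-dependent terms that such bounds trade off against robustness.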
Large-scale Kernel-based Feature Extraction via Budgeted Nonlinear Subspace Tracking
Kernel-based methods enjoy powerful generalization capabilities in handling a
variety of learning tasks. When such methods are provided with sufficient
training data, broadly-applicable classes of nonlinear functions can be
approximated with desired accuracy. Nevertheless, inherent to the nonparametric
nature of kernel-based estimators are computational and memory requirements
that become prohibitive with large-scale datasets. In response to this
formidable challenge, the present work puts forward a low-rank, kernel-based,
feature extraction approach that is particularly tailored for online operation,
where data streams need not be stored in memory. A novel generative model is
introduced to approximate high-dimensional (possibly infinite) features via a
low-rank nonlinear subspace, the learning of which leads to a direct kernel
function approximation. Offline and online solvers are developed for the
subspace learning task, along with affordable versions, in which the number of
stored data vectors is confined to a predefined budget. Analytical results
provide performance bounds on how well the kernel matrix as well as
kernel-based classification and regression tasks can be approximated by
leveraging budgeted online subspace learning and feature extraction schemes.
Tests on synthetic and real datasets demonstrate and benchmark the efficiency
of the proposed method when linear classification and regression are applied
to the extracted features.
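A budgeted low-rank kernel feature map can be sketched with a Nystrom-style approximation: keep at most `budget` data vectors as landmarks and build finite-dimensional features whose inner products approximate the kernel. This is a generic stand-in, not the paper's generative subspace model or its online solvers; the budget rule (keep the first arrivals), bandwidth, and data are arbitrary.

```python
import numpy as np

def gauss(A, B, h=2.0):
    """Gaussian kernel matrix (bandwidth h is an arbitrary choice)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / h)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))

budget = 40                            # number of stored data vectors
landmarks = X[:budget]                 # crude budget rule: keep first arrivals

# Low-rank feature map phi with phi(x) . phi(x') ~= k(x, x') (Nystrom form).
Kmm = gauss(landmarks, landmarks)
U, s, _ = np.linalg.svd(Kmm)
W = U / np.sqrt(np.maximum(s, 1e-10))  # ~ Kmm^{-1/2}, tiny eigenvalues clipped
Phi = gauss(X, landmarks) @ W          # (300, 40) extracted features

K_full = gauss(X, X)
rel_err = np.linalg.norm(K_full - Phi @ Phi.T) / np.linalg.norm(K_full)
```

Downstream linear classifiers or regressors can then be trained on `Phi` at a memory cost fixed by the budget, which is the operating regime the abstract targets; the quality of such pipelines is governed by how well `Phi @ Phi.T` tracks the full kernel matrix.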