Optimal learning rates for least squares regularized regression with unbounded sampling
A standard assumption in the theoretical study of learning algorithms for regression is uniform boundedness of the output sample values. This excludes the common case of Gaussian noise. In this paper we investigate the learning algorithm for regression generated by the least squares regularization scheme in reproducing kernel Hilbert spaces without the assumption of uniformly bounded sampling. By imposing some incremental conditions on moments of the output variable, we derive learning rates in terms of the regularity of the regression function and the capacity of the hypothesis space. The novelty of our analysis is a new covering number argument for bounding the sample error.
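To make the scheme concrete, the following is a minimal sketch of the least squares regularization scheme in a reproducing kernel Hilbert space (kernel ridge regression) with unbounded, Gaussian-noise outputs; the Gaussian kernel, the regularization parameter, and the synthetic data are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # Gram matrix of the Gaussian (RBF) kernel, a common RKHS choice.
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / (2 * sigma**2))

def krr_fit(X, y, lam=1e-2, sigma=1.0):
    # Least squares regularization scheme in the RKHS:
    #   f_z = argmin_f (1/n) sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^2,
    # whose coefficients solve (K + n*lam*I) alpha = y.
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(X_train, alpha, X_test, sigma=1.0):
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

# Unbounded sampling: y = f*(x) + Gaussian noise, so the outputs are not
# uniformly bounded.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.standard_normal(200)
alpha = krr_fit(X, y, lam=1e-2, sigma=0.5)
y_hat = krr_predict(X, alpha, X, sigma=0.5)
```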
Differentially Private Stochastic Gradient Descent with Low-Noise
Modern machine learning algorithms aim to extract fine-grained information
from data to provide accurate predictions, which often conflicts with the goal
of privacy protection. It is therefore of practical and theoretical importance to develop privacy-preserving machine learning algorithms that achieve good performance. In this paper, we focus on the privacy and utility (measured by excess risk bounds) of
differentially private stochastic gradient descent (SGD) algorithms in the
setting of stochastic convex optimization. Specifically, we examine the
pointwise problem in the low-noise setting for which we derive sharper excess
risk bounds for the differentially private SGD algorithm. In the pairwise
learning setting, we propose a simple differentially private SGD algorithm
based on gradient perturbation. Furthermore, we develop novel utility bounds
for the proposed algorithm, proving that it achieves optimal excess risk rates
even for non-smooth losses. Notably, we establish fast learning rates for
privacy-preserving pairwise learning under the low-noise condition, which is
the first of its kind.
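As an illustration of the gradient-perturbation mechanism described above, here is a minimal sketch of differentially private SGD for pairwise learning; the pairwise logistic loss, clipping threshold, step size, and noise scale are illustrative assumptions, and the noise would have to be calibrated to a concrete (epsilon, delta) privacy budget rather than the fixed constant used here.

```python
import numpy as np

def dp_pairwise_sgd(X, y, T=500, eta=0.1, clip=1.0, sigma=2.0, seed=0):
    # Gradient-perturbation DP-SGD for pairwise learning: at each step sample a
    # pair of examples, compute the gradient of a pairwise (ranking) logistic
    # loss, clip it, and add Gaussian noise before the update.  The noise scale
    # sigma is only illustrative and not calibrated to a privacy budget.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        i, j = rng.choice(n, size=2, replace=False)
        s = np.sign(y[i] - y[j])            # which example should rank higher
        if s == 0:
            continue
        diff = X[i] - X[j]
        margin = s * (w @ diff)
        grad = -s * diff / (1.0 + np.exp(margin))                # pairwise logistic loss
        grad *= min(1.0, clip / (np.linalg.norm(grad) + 1e-12))  # gradient clipping
        grad += rng.normal(0.0, sigma * clip, size=d)            # Gaussian perturbation
        w -= eta / np.sqrt(t + 1) * grad                         # decaying step size
    return w
```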
Generalization Guarantees of Gradient Descent for Multi-Layer Neural Networks
Recently, significant progress has been made in understanding the
generalization of neural networks (NNs) trained by gradient descent (GD) using
the algorithmic stability approach. However, most of the existing research has
focused on one-hidden-layer NNs and has not addressed the impact of different
network scaling parameters. In this paper, we greatly extend the previous work
\cite{lei2022stability,richards2021stability} by conducting a comprehensive
stability and generalization analysis of GD for multi-layer NNs. For two-layer
NNs, our results are established under general network scaling parameters,
relaxing previous conditions. In the case of three-layer NNs, our technical
contribution lies in demonstrating their nearly co-coercive property by utilizing
a novel induction strategy that thoroughly explores the effects of
over-parameterization. As a direct application of our general findings, we derive the excess risk rate of GD algorithms in both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for under-parameterized and over-parameterized NNs trained by GD to attain the desired risk rate. Moreover, we demonstrate that as the scaling parameter increases or the network complexity decreases, less over-parameterization is required for GD to achieve the desired error rates. Additionally, under a low-noise condition, we obtain a fast risk rate for GD in both two-layer and three-layer NNs.

Comment: 38 pages, 2 figures
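For concreteness, the following is a minimal sketch of full-batch gradient descent on a two-layer network with an explicit scaling parameter, as studied in stability analyses of this kind; the architecture, scaling exponent, and the choice to train only the hidden-layer weights are illustrative assumptions, not the paper's exact setting.

```python
import numpy as np

def two_layer_gd(X, y, m=512, c=0.5, eta=0.05, T=200, seed=0):
    # Full-batch GD on a two-layer NN f(x) = m^{-c} * a^T relu(W x), where c is
    # a network scaling parameter (c = 1/2 is the standard NTK-style scaling).
    # Only the hidden weights W are trained, a common simplification in
    # stability analyses; this is an illustrative sketch.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    scale = m ** (-c)
    for _ in range(T):
        Z = X @ W.T                       # pre-activations, shape (n, m)
        H = np.maximum(Z, 0.0)            # ReLU activations
        pred = scale * H @ a              # network outputs
        resid = pred - y                  # squared-loss residuals
        # Gradient of (1/2n) * sum_i resid_i^2 with respect to W.
        grad_W = scale / n * ((resid[:, None] * (Z > 0)) * a[None, :]).T @ X
        W -= eta * grad_W
    return W, a
```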
Stability and Generalization for Markov Chain Stochastic Gradient Methods
Recently, a large amount of work has been devoted to the study of Markov chain stochastic gradient methods (MC-SGMs), mainly focusing on their convergence analysis for solving minimization problems. In this paper, we provide a
comprehensive generalization analysis of MC-SGMs for both minimization and
minimax problems through the lens of algorithmic stability in the framework of
statistical learning theory. For empirical risk minimization (ERM) problems, we
establish the optimal excess population risk bounds for both smooth and
non-smooth cases by introducing on-average argument stability. For minimax
problems, we develop a quantitative connection between on-average argument
stability and generalization error which extends the existing results for
uniform stability \cite{lei2021stability}. We further develop the first nearly
optimal convergence rates for convex-concave problems both in expectation and
with high probability, which, combined with our stability results, show that
the optimal generalization bounds can be attained for both smooth and
non-smooth cases. To the best of our knowledge, this is the first
generalization analysis of SGMs when the gradients are sampled from a Markov
process.
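The following minimal sketch illustrates the setting: a stochastic gradient method whose sampled index is driven by a Markov chain over the data rather than drawn i.i.d.; the specific chain, the least squares loss, and the step sizes are illustrative assumptions.

```python
import numpy as np

def markov_chain_sgd(X, y, T=1000, eta=0.05, p_local=0.5, seed=0):
    # SGD in which the sampled index follows a Markov chain over the data:
    # with probability p_local move to a neighbouring index (i +/- 1),
    # otherwise jump to a uniformly random index.  This replaces the usual
    # i.i.d. sampling with an ergodic Markov process; the chain and the
    # least squares loss are illustrative choices.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    i = rng.integers(n)
    for t in range(T):
        # One Markov transition of the index chain.
        if rng.random() < p_local:
            i = (i + rng.choice([-1, 1])) % n
        else:
            i = rng.integers(n)
        grad = (X[i] @ w - y[i]) * X[i]      # least squares gradient at sample i
        w -= eta / np.sqrt(t + 1) * grad     # decaying step size
    return w
```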
Adaptive Distributed Kernel Ridge Regression: A Feasible Distributed Learning Scheme for Data Silos
Data silos, arising mainly from privacy and interoperability concerns, significantly constrain collaboration among organizations that hold similar data for the same purpose. Distributed learning based on divide-and-conquer provides a promising way to break down such silos, but it suffers from several challenges,
including autonomy, privacy guarantees, and the necessity of collaborations.
This paper focuses on developing an adaptive distributed kernel ridge
regression (AdaDKRR) by taking autonomy in parameter selection, privacy in
communicating non-sensitive information, and the necessity of collaborations in
performance improvement into account. We provide both solid theoretical
verification and comprehensive experiments for AdaDKRR to demonstrate its
feasibility and effectiveness. Theoretically, we prove that under some mild
conditions, AdaDKRR performs similarly to running the optimal learning
algorithms on the whole data, verifying the necessity of collaborations and
showing that no other distributed learning scheme can essentially beat AdaDKRR
under the same conditions. Numerically, we test AdaDKRR on both toy simulations
and two real-world applications to show that AdaDKRR is superior to other
existing distributed learning schemes. Together, these results show that AdaDKRR is a feasible scheme for overcoming data silos, something highly desired in numerous application areas such as intelligent decision-making, pricing forecasting, and performance prediction for products.

Comment: 46 pages, 13 figures
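As a rough illustration of the divide-and-conquer idea behind schemes such as AdaDKRR, here is a minimal sketch in which each silo fits a local kernel ridge regression, selects its own regularization parameter on a local hold-out set, and communicates only predictions to a server that averages them; the kernel, parameter grid, and selection rule are illustrative assumptions, not the AdaDKRR algorithm itself.

```python
import numpy as np

def rbf(X1, X2, sigma=0.5):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / (2 * sigma**2))

def local_krr(X, y, lam, sigma=0.5):
    # Local kernel ridge regression on one silo's data.
    K = rbf(X, X, sigma)
    return np.linalg.solve(K + len(y) * lam * np.eye(len(y)), y)

def distributed_krr_predict(silos, X_test, lams=(1e-1, 1e-2, 1e-3), sigma=0.5, seed=0):
    # Divide-and-conquer KRR sketch: each silo keeps its raw data, selects its
    # own regularization parameter on a local hold-out split (autonomy), and
    # only its test predictions are communicated and averaged by the server.
    rng = np.random.default_rng(seed)
    preds = []
    for X, y in silos:
        idx = rng.permutation(len(y))
        tr, va = idx[: int(0.8 * len(y))], idx[int(0.8 * len(y)):]
        errs = []
        for lam in lams:
            a = local_krr(X[tr], y[tr], lam, sigma)
            errs.append(np.mean((rbf(X[va], X[tr], sigma) @ a - y[va]) ** 2))
        best = lams[int(np.argmin(errs))]            # locally chosen parameter
        alpha = local_krr(X, y, best, sigma)
        preds.append(rbf(X_test, X, sigma) @ alpha)  # only predictions leave the silo
    return np.mean(preds, axis=0)                    # server averages local estimates
```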