Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization
Stochastic variance-reduced gradient (SVRG) algorithms have been shown to
work favorably in solving large-scale learning problems. Despite the remarkable
success, the stochastic gradient complexity of SVRG-type algorithms usually
scales linearly with data size and thus could still be expensive for huge data.
To address this deficiency, we propose a hybrid stochastic-deterministic
minibatch proximal gradient (HSDMPG) algorithm for strongly-convex problems
that enjoys provably improved data-size-independent complexity guarantees. More
precisely, for a quadratic loss of $n$ components, we prove that HSDMPG can
attain an $\epsilon$-optimization-error within a number of stochastic gradient
evaluations that is independent of the data size and scales polynomially, up to
logarithmic factors, with the condition number $\kappa$ and with $1/\epsilon$. For
generic strongly convex loss functions, we prove a nearly identical complexity
bound though at the cost of slightly increased logarithmic factors. For
large-scale learning problems, our complexity bounds are superior to those of
the prior state-of-the-art SVRG algorithms with or without dependence on data
size. Particularly, when $\epsilon$ is at the order of the intrinsic excess
error bound of a learning model and thus sufficient for generalization, the
stochastic gradient complexity bounds of HSDMPG for quadratic and generic loss
functions both scale sublinearly in the data size $n$, which, to our best knowledge, for the first time
achieve optimal generalization in less than a single pass over data. Extensive
numerical results demonstrate the computational advantages of our algorithm
over the prior ones.
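A short worked decomposition (generic notation, not taken from the paper: $F_n$ is the empirical risk over $n$ samples, $F_{\mathrm{pop}}$ the population risk, $\hat{\theta}$ the optimizer's output, $\theta^{\star}$ the population risk minimizer) illustrates why an optimization error at the order of the intrinsic statistical error, typically $\mathcal{O}(1/\sqrt{n})$ for such learning problems, is already sufficient for generalization:
\[
F_{\mathrm{pop}}(\hat{\theta}) - F_{\mathrm{pop}}(\theta^{\star})
\;\le\;
\underbrace{F_n(\hat{\theta}) - \min_{\theta} F_n(\theta)}_{\text{optimization error}\;\le\;\epsilon}
\;+\;
\underbrace{2\,\sup_{\theta}\bigl|F_{\mathrm{pop}}(\theta) - F_n(\theta)\bigr|}_{\text{statistical error, typically } \mathcal{O}(1/\sqrt{n})}.
\]
Driving $\epsilon$ below the statistical term cannot improve the order of the excess risk, which is why a sub-single-pass optimization budget can already be generalization-optimal.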
Understanding Generalization and Optimization Performance of Deep CNNs
This work aims to provide understandings on the remarkable success of deep
convolutional neural networks (CNNs) by theoretically analyzing their
generalization performance and establishing optimization guarantees for
gradient descent based training algorithms. Specifically, for a CNN model
consisting of $l$ convolutional layers and one fully connected layer, we prove
that its generalization error is bounded by
$\mathcal{O}(\sqrt{\widetilde{d}\,\widetilde{\varrho}/n})$, where $\widetilde{d}$ denotes the degree of
freedom of the network parameters and
$\widetilde{\varrho}=\mathcal{O}\bigl(\log\bigl(\prod_{i=1}^{l} r_{w^{(i)}}(k_i-s_i+1)/p\bigr)+\log(r_f)\bigr)$
encapsulates architecture parameters including
the kernel size $k_i$, stride $s_i$, pooling size $p$ and parameter
magnitude $r_{w^{(i)}}$. To our best knowledge, this is the first generalization
bound that only depends on $\mathcal{O}\bigl(\log\bigl(\prod_{i=1}^{l+1} r_{w^{(i)}}\bigr)\bigr)$,
tighter than existing ones that all involve an exponential term like
$\mathcal{O}\bigl(\prod_{i=1}^{l+1} r_{w^{(i)}}\bigr)$. Besides, we prove that for an
arbitrary gradient descent algorithm, the approximate stationary point computed
by minimizing the empirical risk is also an approximate stationary point of the
population risk. This explains why gradient descent based training algorithms
usually perform well in practice. Furthermore, we prove the
one-to-one correspondence and convergence guarantees for the non-degenerate
stationary points between the empirical and population risks. This implies that
the computed local minimum for the empirical risk is also close to a local
minimum for the population risk, thus ensuring the good generalization
performance of CNNs.
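To see concretely why a logarithmic dependence on the product of layer magnitudes is exponentially tighter than a direct dependence on the product, consider a purely illustrative calculation with a uniform magnitude $b > 1$ assumed for every layer (an assumption made here for illustration, not a quantity from the paper):
\[
\log\Bigl(\prod_{i=1}^{l+1} r_{w^{(i)}}\Bigr) \;=\; \sum_{i=1}^{l+1} \log r_{w^{(i)}} \;=\; (l+1)\log b,
\qquad
\prod_{i=1}^{l+1} r_{w^{(i)}} \;=\; b^{\,l+1}.
\]
For example, with $b = 2$ and $l+1 = 10$ layers, the logarithmic quantity is about $6.9$ while the product is $1024$, and the gap widens exponentially with depth.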
Theoretical Deep Learning
Deep learning has long been criticised as a black-box model lacking a sound theoretical explanation. During my PhD, I explore and establish theoretical foundations for deep learning. In this thesis, I present my contributions, positioned upon the existing literature: (1) analysing the generalizability of neural networks with residual connections via complexity- and capacity-based hypothesis complexity measures; (2) modeling stochastic gradient descent (SGD) by stochastic differential equations (SDEs) and their dynamics, and further characterizing the generalizability of deep learning; (3) understanding the geometrical structures of the loss landscape that drive the trajectories of these dynamical systems, which sheds light on reconciling the over-representation and excellent generalizability of deep learning; and (4) discovering the interplay between generalization, privacy preservation, and adversarial robustness, which have seen rising concern in the deployment of deep learning.
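Regarding item (2), the commonly used continuous-time view of SGD (a generic textbook formulation given here for orientation, not necessarily the exact model adopted in the thesis) interprets the SGD update with learning rate $\eta$ and gradient-noise covariance $\Sigma(\theta)$ as an Euler--Maruyama discretization of an Itô SDE:
\[
\theta_{k+1} \;=\; \theta_k - \eta\,\hat{g}_k,
\qquad
\mathbb{E}[\hat{g}_k \mid \theta_k] = \nabla L(\theta_k),
\quad
\mathrm{Cov}(\hat{g}_k \mid \theta_k) = \Sigma(\theta_k),
\]
\[
\mathrm{d}\theta_s \;=\; -\nabla L(\theta_s)\,\mathrm{d}s \;+\; \sqrt{\eta\,\Sigma(\theta_s)}\,\mathrm{d}W_s,
\]
with one SGD step corresponding to advancing the SDE by time $\eta$; analysing the dynamics of such an SDE is the kind of approach on which the generalization results in item (2) build.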