
    Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization

    Stochastic variance-reduced gradient (SVRG) algorithms have been shown to work favorably in solving large-scale learning problems. Despite this remarkable success, the stochastic gradient complexity of SVRG-type algorithms usually scales linearly with the data size and can therefore still be expensive for huge datasets. To address this deficiency, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly convex problems that enjoys provably improved, data-size-independent complexity guarantees. More precisely, for a quadratic loss $F(\theta)$ of $n$ components, we prove that HSDMPG can attain an $\epsilon$-optimization error $\mathbb{E}[F(\theta)-F(\theta^*)]\leq\epsilon$ within $\mathcal{O}\Big(\frac{\kappa^{1.5}\epsilon^{0.75}\log^{1.5}(\frac{1}{\epsilon})+1}{\epsilon}\wedge\Big(\kappa\sqrt{n}\log^{1.5}\big(\frac{1}{\epsilon}\big)+n\log\big(\frac{1}{\epsilon}\big)\Big)\Big)$ stochastic gradient evaluations, where $\kappa$ is the condition number. For generic strongly convex loss functions, we prove a nearly identical complexity bound, though at the cost of slightly larger logarithmic factors. For large-scale learning problems, our complexity bounds are superior to those of the prior state-of-the-art SVRG algorithms, with or without dependence on data size. In particular, in the case of $\epsilon=\mathcal{O}\big(1/\sqrt{n}\big)$, which is of the order of the intrinsic excess error bound of a learning model and hence sufficient for generalization, the stochastic gradient complexity bounds of HSDMPG for quadratic and generic loss functions are respectively $\mathcal{O}(n^{0.875}\log^{1.5}(n))$ and $\mathcal{O}(n^{0.875}\log^{2.25}(n))$, which, to the best of our knowledge, achieve for the first time optimal generalization in less than a single pass over the data. Extensive numerical results demonstrate the computational advantages of our algorithm over prior ones.
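    To see where the sub-linear $n^{0.875}$ rate comes from, consider the quadratic-loss bound at $\epsilon = n^{-1/2}$. The abstract does not state how $\kappa$ scales with $n$; assuming, as is common for regularized ERM with a regularization parameter of order $1/\sqrt{n}$, that $\kappa = \Theta(\sqrt{n})$, the first branch of the minimum evaluates to
    \[
    \frac{\kappa^{1.5}\epsilon^{0.75}\log^{1.5}(\frac{1}{\epsilon})+1}{\epsilon}
    = \kappa^{1.5}\,\epsilon^{-0.25}\log^{1.5}\Big(\frac{1}{\epsilon}\Big)+\frac{1}{\epsilon}
    = \mathcal{O}\big(n^{0.75}\cdot n^{0.125}\log^{1.5}(n)+n^{0.5}\big)
    = \mathcal{O}\big(n^{0.875}\log^{1.5}(n)\big),
    \]
    while the second branch becomes $\mathcal{O}(n\log^{1.5}(n))$ under the same substitution, so the minimum is attained by the first branch and the overall complexity indeed stays below one pass ($n$ gradient evaluations) over the data.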

    Understanding Generalization and Optimization Performance of Deep CNNs

    This work aims to provide an understanding of the remarkable success of deep convolutional neural networks (CNNs) by theoretically analyzing their generalization performance and establishing optimization guarantees for gradient-descent-based training algorithms. Specifically, for a CNN model consisting of $l$ convolutional layers and one fully connected layer, we prove that its generalization error is bounded by $\mathcal{O}(\sqrt{\theta\widetilde{\varrho}/n})$, where $\theta$ denotes the degree of freedom of the network parameters and $\widetilde{\varrho}=\mathcal{O}\big(\log\big(\prod_{i=1}^{l} r_{w_i}(k_i-s_i+1)/p\big)+\log(r_f)\big)$ encapsulates architecture parameters including the kernel size $k_i$, stride $s_i$, pooling size $p$, and parameter magnitude $r_{w_i}$. To the best of our knowledge, this is the first generalization bound that depends only on $\mathcal{O}(\log(\prod_{i=1}^{l+1} r_{w_i}))$, which is tighter than existing bounds that all involve an exponential term of the form $\mathcal{O}(\prod_{i=1}^{l+1} r_{w_i})$. Besides, we prove that for an arbitrary gradient descent algorithm, the approximate stationary point computed by minimizing the empirical risk is also an approximate stationary point of the population risk. This explains why gradient descent training algorithms usually perform sufficiently well in practice. Furthermore, we prove a one-to-one correspondence and convergence guarantees for the non-degenerate stationary points of the empirical and population risks. This implies that a computed local minimum of the empirical risk is also close to a local minimum of the population risk, thus ensuring the good generalization performance of CNNs.
    Comment: This paper was accepted by ICML; 38 pages.
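    A quick illustration (with hypothetical numbers, not taken from the paper) of why the logarithmic dependence on the per-layer parameter magnitudes is the key improvement: for a network with $l+1=11$ weight layers, each with magnitude $r_{w_i}=2$,
    \[
    \prod_{i=1}^{l+1} r_{w_i} = 2^{11} = 2048,
    \qquad
    \log\Big(\prod_{i=1}^{l+1} r_{w_i}\Big) = 11\log 2 \approx 7.6,
    \]
    so a bound governed by the logarithm of the product grows only linearly with depth, whereas one governed by the product itself grows exponentially.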

    Theoretical Deep Learning

    Deep learning has long been criticised as a black-box model for lacking sound theoretical explanation. During my PhD, I explored and established theoretical foundations for deep learning. In this thesis, I present my contributions, positioned within the existing literature: (1) analysing the generalizability of neural networks with residual connections via complexity- and capacity-based hypothesis complexity measures; (2) modeling stochastic gradient descent (SGD) by stochastic differential equations (SDEs) and their dynamics, and thereby further characterizing the generalizability of deep learning; (3) understanding the geometrical structure of the loss landscape that drives the trajectories of these dynamical systems, which sheds light on reconciling the over-representation and the excellent generalizability of deep learning; and (4) discovering the interplay between generalization, privacy preservation, and adversarial robustness, all of which are of rising concern in the deployment of deep learning.
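    Contribution (2) builds on the standard continuous-time view of SGD as a stochastic differential equation. Below is a minimal, self-contained sketch of that view (an illustration under assumed quantities, not the thesis's specific model): SGD on an assumed quadratic loss is read as an Euler-Maruyama discretization of $d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\eta}\,\Sigma^{1/2}\,dW_t$, with a hand-picked noise covariance $\Sigma$ and learning rate $\eta$.

        import numpy as np

        # Sketch of the SDE view of SGD on a quadratic loss L(theta) = 0.5 * theta^T A theta.
        # The SGD iteration theta <- theta - eta * (grad + noise) is read as one
        # Euler-Maruyama step (dt = eta) of  d(theta) = -grad L dt + sqrt(eta) Sigma^{1/2} dW.
        rng = np.random.default_rng(0)
        A = np.diag([1.0, 10.0])          # assumed curvature of the quadratic loss
        Sigma_sqrt = 0.3 * np.eye(2)      # assumed square root of the gradient-noise covariance
        eta = 0.01                        # learning rate, also the SDE time step
        theta = np.array([2.0, -1.5])     # arbitrary initial point

        for _ in range(2000):
            grad = A @ theta                       # full-batch gradient of L
            z = rng.standard_normal(2)             # Gaussian surrogate for minibatch noise
            # Euler-Maruyama: drift -grad * dt plus diffusion sqrt(dt) * sqrt(eta) * Sigma^{1/2} z,
            # which with dt = eta collapses to the familiar noisy SGD update below.
            theta = theta - eta * grad + eta * (Sigma_sqrt @ z)

        # theta now fluctuates around the minimizer at the origin; the stationary spread
        # is controlled by eta and Sigma, the kind of dynamics studied in contribution (2).
        print(theta)

    The concrete loss, noise covariance, and step size above are assumptions chosen only to make the simulation runnable; the dynamics-to-generalization analysis itself is the subject of the thesis.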