7 research outputs found

    Randomized Stochastic Variance-Reduced Methods for Multi-Task Stochastic Bilevel Optimization

    In this paper, we consider non-convex stochastic bilevel optimization (SBO) problems that have many applications in machine learning. Although numerous studies have proposed stochastic algorithms for solving these problems, they are limited in two respects: (i) their sample complexities are high and do not match the state-of-the-art result for non-convex stochastic optimization; (ii) their algorithms are tailored to problems with only one lower-level problem. When there are many lower-level problems, it could be prohibitive to process all of them at each iteration. To address these limitations, this paper proposes fast randomized stochastic algorithms for non-convex SBO problems. First, we present a stochastic method for non-convex SBO with only one lower-level problem and establish its sample complexity of $O(1/\epsilon^3)$ for finding an $\epsilon$-stationary point under Lipschitz continuity conditions on the stochastic oracles, matching the lower bound for stochastic smooth non-convex optimization. Second, we present a randomized stochastic method for non-convex SBO with $m>1$ lower-level problems (multi-task SBO) that processes a constant number of lower-level problems at each iteration, and establish a sample complexity no worse than $O(m/\epsilon^3)$, which could be better than that of simply processing all $m$ lower-level problems at each iteration. Lastly, we establish even faster convergence results for gradient-dominant functions. To the best of our knowledge, this is the first work considering multi-task SBO and developing state-of-the-art sample complexity results.

    On Biased Stochastic Gradient Estimation

    We present a uniform analysis of biased stochastic gradient methods for minimizing convex, strongly convex, and non-convex composite objectives, and identify settings where bias is useful in stochastic gradient estimation. The framework we present allows us to extend proximal support to biased algorithms, including SAG and SARAH, for the first time in the convex setting. We also use our framework to develop a new algorithm, Stochastic Average Recursive GradiEnt (SARGE), that achieves the oracle complexity lower bound for non-convex, finite-sum objectives and requires strictly fewer calls to a stochastic gradient oracle per iteration than SVRG and SARAH. We support our theoretical results with numerical experiments that demonstrate the benefits of certain biased gradient estimators. Comment: journal version, 35 pages

    Accelerating Variance-Reduced Stochastic Gradient Methods

    Variance reduction is a crucial tool for improving the slow convergence of stochastic gradient descent. Only a few variance-reduced methods, however, have yet been shown to directly benefit from Nesterov's acceleration techniques to match the convergence rates of accelerated gradient methods. Such approaches rely on "negative momentum", a technique for further variance reduction that is generally specific to the SVRG gradient estimator. In this work, we show that negative momentum is unnecessary for acceleration and develop a universal acceleration framework that allows all popular variance-reduced methods to achieve accelerated convergence rates. The constants appearing in these rates, including their dependence on the number of functions $n$, scale with the mean-squared error and bias of the gradient estimator. In a series of numerical experiments, we demonstrate that versions of SAGA, SVRG, SARAH, and SARGE using our framework significantly outperform non-accelerated versions and compare favourably with algorithms using negative momentum. Comment: 33 pages

    ProxSARAH: An Efficient Algorithmic Framework for Stochastic Composite Nonconvex Optimization

    We propose a new stochastic first-order algorithmic framework to solve stochastic composite nonconvex optimization problems that covers both finite-sum and expectation settings. Our algorithms rely on the SARAH estimator introduced in (Nguyen et al., 2017) and consist of two steps, a proximal gradient step and an averaging step, which makes them different from existing nonconvex proximal-type algorithms. The algorithms only require an average smoothness assumption on the nonconvex objective term and an additional bounded variance assumption if applied to expectation problems. They work with both constant and adaptive step-sizes, while allowing single samples and mini-batches. In all these cases, we prove that our algorithms can achieve the best-known complexity bounds. One key step of our methods is new constant and adaptive step-sizes that help to achieve desired complexity bounds while improving practical performance. Our constant step-size is much larger than that of existing methods, including proximal SVRG schemes, in the single-sample case. We also specialize the algorithm to the non-composite case, where it covers existing state-of-the-art methods in terms of complexity bounds. Our update also allows one to trade off between step-sizes and mini-batch sizes to improve performance. We test the proposed algorithms on two composite nonconvex problems and neural networks using several well-known datasets. Comment: 45 pages, 8 figures, and 2 tables
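The two-step structure described in this abstract (a proximal gradient step driven by the SARAH estimator, followed by an averaging step) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the scalar least-squares components, the l1 regularizer with its soft-threshold proximal operator, the names `prox_sarah_sketch`, `eta`, `gamma`, `lam`, and the fixed step-size values.

```python
import random


def soft_threshold(z, tau):
    """Proximal operator of tau * |x| (an assumed choice of regularizer)."""
    if z > tau:
        return z - tau
    if z < -tau:
        return z + tau
    return 0.0


def grad_i(x, a, b):
    """Gradient of the toy component f_i(x) = 0.5 * (a * x - b) ** 2."""
    return a * (a * x - b)


def prox_sarah_sketch(data, x0, eta=0.05, gamma=0.5, lam=0.01,
                      epochs=20, inner=10, seed=0):
    """Illustrative SARAH-driven proximal step plus averaging (hypothetical)."""
    rng = random.Random(seed)
    n = len(data)
    x = x0
    for _ in range(epochs):
        # Snapshot: a full gradient restarts the recursive estimator.
        v = sum(grad_i(x, a, b) for a, b in data) / n
        x_prev = x
        x_hat = soft_threshold(x - eta * v, eta * lam)   # proximal gradient step
        x = (1 - gamma) * x + gamma * x_hat              # averaging step
        for _ in range(inner):
            a, b = data[rng.randrange(n)]
            # SARAH recursion: v_t = grad_i(x_t) - grad_i(x_{t-1}) + v_{t-1}
            v = grad_i(x, a, b) - grad_i(x_prev, a, b) + v
            x_prev = x
            x_hat = soft_threshold(x - eta * v, eta * lam)
            x = (1 - gamma) * x + gamma * x_hat
    return x


if __name__ == "__main__":
    toy = [(1.0, 2.0), (2.0, 4.0), (1.5, 3.0)]  # every component is minimized at x = 2
    print(prox_sarah_sketch(toy, x0=0.0))
```

On this toy problem the iterate should settle close to 2, shifted slightly toward zero by the small l1 penalty. The step sizes are hand-picked for the example and are not the constant or adaptive step-sizes the abstract refers to.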

    Faster Stochastic Quasi-Newton Methods

    Stochastic optimization methods have become a popular class of optimization tools in machine learning. In particular, stochastic gradient descent (SGD) has been widely used for machine learning problems such as training neural networks due to its low per-iteration computational complexity. Newton and quasi-Newton methods, which leverage second-order information, are able to achieve a better solution than first-order methods. Thus, stochastic quasi-Newton (SQN) methods have been developed to achieve a better solution more efficiently than stochastic first-order methods by utilizing approximate second-order information. However, the existing SQN methods still do not reach the best known stochastic first-order oracle (SFO) complexity. To fill this gap, we propose a novel faster stochastic quasi-Newton method (SpiderSQN) based on the variance reduction technique of SPIDER. We prove that our SpiderSQN method reaches the best known SFO complexity of $\mathcal{O}(n+n^{1/2}\epsilon^{-2})$ in the finite-sum setting to obtain an $\epsilon$-first-order stationary point. To further improve its practical performance, we incorporate different momentum schemes into SpiderSQN. Moreover, the proposed algorithms are generalized to the online setting, and the corresponding SFO complexity of $\mathcal{O}(\epsilon^{-3})$ is developed, which also matches the existing best result. Extensive experiments on benchmark datasets demonstrate that our new algorithms outperform state-of-the-art approaches for nonconvex optimization. Comment: 11 pages, accepted for publication by TNNLS. arXiv admin note: text overlap with arXiv:1902.02715 by other authors

    A Hybrid Stochastic Optimization Framework for Stochastic Composite Nonconvex Optimization

    We introduce a new approach to develop stochastic optimization algorithms for a class of stochastic composite and possibly nonconvex optimization problems. The main idea is to combine two stochastic estimators to create a new hybrid one. We first introduce our hybrid estimator and then investigate its fundamental properties to form a foundational theory for algorithmic development. Next, we apply our theory to develop several variants of stochastic gradient methods to solve both expectation and finite-sum composite optimization problems. Our first algorithm can be viewed as a variant of proximal stochastic gradient methods with a single loop, but can achieve an $\mathcal{O}(\sigma^3\varepsilon^{-1} + \sigma\varepsilon^{-3})$ oracle complexity bound, matching the best-known ones from state-of-the-art double-loop algorithms in the literature, where $\sigma > 0$ is the variance and $\varepsilon$ is a desired accuracy. Then, we consider two different variants of our method, adaptive step-size and restarting schemes, that have theoretical guarantees similar to those of our first algorithm. We also study two mini-batch variants of the proposed methods. In all cases, we achieve the best-known complexity bounds under standard assumptions. We test our methods on several numerical examples with real datasets and compare them with state-of-the-art methods. Our numerical experiments show that the new methods are comparable to, and in many cases outperform, their competitors. Comment: 49 pages, 2 tables, 9 figures
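The idea of combining two stochastic estimators into a hybrid one can be sketched as a convex combination. This sketch assumes the two ingredients are a SARAH-style recursive estimator and a plain unbiased stochastic gradient, mixed by a weight `beta` in [0, 1]. The abstract does not name the two estimators, so that pairing, the function name `hybrid_estimate`, and the argument names are all hypothetical.

```python
def hybrid_estimate(v_prev, g_new_xi, g_old_xi, g_new_zeta, beta):
    """Illustrative hybrid gradient estimate (assumed form, scalar case).

    v_prev     : previous estimate v_{t-1}
    g_new_xi   : stochastic gradient at x_t using sample xi
    g_old_xi   : stochastic gradient at x_{t-1} using the same sample xi
    g_new_zeta : stochastic gradient at x_t using a second sample zeta
    beta       : mixing weight in [0, 1] (hypothetical parameter)
    """
    # Recursive (biased, lower-variance) part, in the style of SARAH.
    sarah_part = v_prev + g_new_xi - g_old_xi
    # Convex combination with the unbiased stochastic gradient.
    return beta * sarah_part + (1.0 - beta) * g_new_zeta


if __name__ == "__main__":
    print(hybrid_estimate(1.0, 3.0, 2.0, 5.0, beta=0.5))
```

With `beta = 1` the sketch reduces to the recursive estimator alone, and with `beta = 0` it reduces to the plain stochastic gradient, which is one way to see how a single-loop method could interpolate between the two.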

    Momentum Schemes with Stochastic Variance Reduction for Nonconvex Composite Optimization

    Two new stochastic variance-reduced algorithms named SARAH and SPIDER have been recently proposed, and SPIDER has been shown to achieve a near-optimal gradient oracle complexity for nonconvex optimization. However, the theoretical advantage of SPIDER does not lead to substantial improvement of practical performance over SVRG. To address this issue, the momentum technique is a good candidate for improving the performance of SPIDER. However, existing momentum schemes used in variance-reduced algorithms are designed specifically for convex optimization and are not applicable to nonconvex scenarios. In this paper, we develop novel momentum schemes with flexible coefficient settings to accelerate SPIDER for nonconvex and nonsmooth composite optimization, and show that the resulting algorithms achieve the near-optimal gradient oracle complexity for achieving a generalized first-order stationary condition. Furthermore, we generalize our algorithm to online nonconvex and nonsmooth optimization, and establish an oracle complexity result that matches the state of the art. Our extensive experiments demonstrate the superior performance of our proposed algorithm over other stochastic variance-reduced algorithms. Comment: We are merging the results of this paper with another paper at arXiv:1810.10690. Therefore, we want to withdraw this paper