
    Convergence of Adam under Relaxed Assumptions

    In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimation (Adam) algorithm for a wide class of optimization objectives. Despite the popularity and efficiency of the Adam algorithm in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show convergence to stationary points. In this paper, we show that Adam provably converges to $\epsilon$-stationary points with $\mathcal{O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions. The key to our analysis is a new proof of boundedness of gradients along the optimization trajectory of Adam, under a generalized smoothness assumption according to which the local smoothness (i.e., the Hessian norm when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of $\mathcal{O}(\epsilon^{-3})$.
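
    For reference, a minimal sketch of the standard Adam update that this analysis targets is shown below; the learning rate, beta_1, beta_2, and eps values, as well as the toy objective, are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient and its
    square, bias-corrected, then a coordinate-wise scaled step."""
    m = beta1 * m + (1 - beta1) * grad            # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2 with noisy gradients.
rng = np.random.default_rng(0)
theta = rng.normal(size=10)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = theta + 0.01 * rng.normal(size=10)     # stochastic gradient of the toy objective
    theta, m, v = adam_step(theta, grad, m, v, t)
print(np.linalg.norm(theta))                      # gradient norm at the final iterate
```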

    Variance-reduced Clipping for Non-convex Optimization

    Gradient clipping is a standard training technique used in deep learning applications such as large-scale language modeling to mitigate exploding gradients. Recent experimental studies have demonstrated a rather distinctive behavior of the smoothness of the training objective along the optimization trajectory when training with gradient clipping: the smoothness grows with the gradient norm. This is in clear contrast to the well-established assumption in folklore non-convex optimization, a.k.a. $L$-smoothness, where the smoothness is assumed to be bounded by a constant $L$ globally. The recently introduced $(L_0, L_1)$-smoothness is a more relaxed notion that captures such behavior in non-convex optimization. In particular, it has been shown that under this relaxed smoothness assumption, SGD with clipping requires $O(\epsilon^{-4})$ stochastic gradient computations to find an $\epsilon$-stationary solution. In this paper, we employ a variance reduction technique, namely SPIDER, and demonstrate that for a carefully designed learning rate, this complexity is improved to $O(\epsilon^{-3})$, which is order-optimal. Our designed learning rate incorporates the clipping technique to mitigate the growing smoothness. Moreover, when the objective function is the average of $n$ components, we improve the existing $O(n\epsilon^{-2})$ bound on the stochastic gradient complexity to $O(\sqrt{n}\,\epsilon^{-2} + n)$, which is also order-optimal. In addition to being theoretically optimal, SPIDER with our designed parameters demonstrates comparable empirical performance against variance-reduced methods such as SVRG and SARAH in several vision tasks.
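
    Below is a minimal sketch of the general idea: the SPIDER recursive variance-reduced gradient estimator combined with a clipped step size, on a toy finite-sum problem. The epoch length, batch size, and the specific clipped step-size rule are illustrative assumptions, not the carefully designed parameters of the paper.

```python
import numpy as np

# Toy finite-sum objective: f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(0)
n, d = 200, 20
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_batch(x, idx):
    """Mini-batch gradient over the component indices `idx`."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

def spider_clipped(x, epochs=20, q=20, batch=10, eta=0.1, clip=1.0):
    """SPIDER estimator with a clipped step size eta_t = min(eta, clip / ||v_t||)."""
    v = grad_batch(x, np.arange(n))               # full gradient at the start
    x_prev = x.copy()
    for t in range(1, epochs * q + 1):
        step = min(eta, clip / (np.linalg.norm(v) + 1e-12))
        x_prev, x = x, x - step * v
        if t % q == 0:
            v = grad_batch(x, np.arange(n))       # periodic full-gradient refresh
        else:
            idx = rng.choice(n, size=batch, replace=False)
            v = grad_batch(x, idx) - grad_batch(x_prev, idx) + v   # recursive correction
    return x

x = spider_clipped(rng.normal(size=d))
print(np.linalg.norm(grad_batch(x, np.arange(n))))   # stationarity measure ||grad f(x)||
```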

    Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective

    This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the neural network's weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice
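
    A toy illustration of the kind of diagnostic behind this observation is sketched below: constant-step-size SGD on a noisy quadratic, where the loss plateaus while the gradient norm stays bounded away from zero. The model, noise level, and step size are illustrative assumptions, not the paper's large-scale setups.

```python
import numpy as np

# Constant-step-size SGD on a noisy quadratic: the loss stabilizes, but the
# iterates keep moving and never settle at a stationary point.
rng = np.random.default_rng(0)
d, sigma, lr = 50, 1.0, 0.1
w = rng.normal(size=d)

losses, grad_norms = [], []
for t in range(5000):
    grad = w + sigma * rng.normal(size=d)        # stochastic gradient of 0.5 * ||w||^2
    w -= lr * grad
    losses.append(0.5 * np.dot(w, w))            # true loss at the current iterate
    grad_norms.append(np.linalg.norm(w))         # true gradient norm ||grad f(w)|| = ||w||

# The loss has stabilized, yet the gradient norm does not shrink toward zero.
print(np.mean(losses[-500:]), np.mean(grad_norms[-500:]))
```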

    Convex and Non-Convex Optimization under Generalized Smoothness

    Classical analysis of convex and non-convex optimization methods often requires the Lipschitzness of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.
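
    In symbols, the two conditions contrasted above can be written as follows; this is a sketch in standard notation, and the paper's own generalization further relaxes the affine bound to a more general function of the gradient norm.

```latex
% Classical L-smoothness: the Hessian (when it exists) is uniformly bounded.
\[
  \|\nabla^2 f(x)\| \le L \quad \text{for all } x.
\]
% (L_0, L_1)-smoothness: the local smoothness may grow with the gradient norm.
\[
  \|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|.
\]
```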

    ESAM: Discriminative Domain Adaptation with Non-Displayed Items to Improve Long-Tail Performance

    Most ranking models are trained only with displayed items (mostly hot items), but they are used to retrieve items in the entire space, which consists of both displayed and non-displayed items (mostly long-tail items). Due to this sample selection bias, the long-tail items lack sufficient records to learn good feature representations, i.e., they suffer from data sparsity and cold-start problems. The resulting distribution discrepancy between displayed and non-displayed items causes poor long-tail performance. To this end, we propose an entire space adaptation model (ESAM) to address this problem from the perspective of domain adaptation (DA). ESAM regards displayed and non-displayed items as the source and target domains, respectively. Specifically, we design an attribute correlation alignment that considers the correlation between high-level attributes of the items to achieve distribution alignment. Furthermore, we introduce two effective regularization strategies, i.e., center-wise clustering and self-training, to improve the DA process. Without requiring any auxiliary information or auxiliary domains, ESAM transfers knowledge from displayed items to non-displayed items to alleviate the distribution inconsistency. Experiments on two public datasets and a large-scale industrial dataset collected from Taobao demonstrate that ESAM achieves state-of-the-art performance, especially in the long-tail space. Furthermore, we deploy ESAM in the Taobao search engine, leading to significant improvements in online performance. The code is available at https://github.com/A-bone1/ESAM.git
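
    One simple way the attribute correlation alignment idea can be instantiated is sketched below: penalizing the gap between the second-order (correlation) statistics of high-level item representations in the displayed (source) and non-displayed (target) domains. The correlation_alignment_loss name and the covariance-based formulation are assumptions for illustration, not ESAM's exact loss; consult the linked repository for the actual implementation.

```python
import numpy as np

def correlation_alignment_loss(src_feats, tgt_feats):
    """Penalize the gap between the attribute correlation matrices of the
    displayed (source) and non-displayed (target) item representations.
    NOTE: a hypothetical, CORAL-style stand-in, not ESAM's exact formulation."""
    def corr(f):
        f = f - f.mean(axis=0, keepdims=True)    # center each attribute
        return (f.T @ f) / (f.shape[0] - 1)      # attribute-by-attribute covariance
    diff = corr(src_feats) - corr(tgt_feats)
    return np.sum(diff ** 2) / (4 * src_feats.shape[1] ** 2)

# Toy usage with random high-level item features (batch x attribute_dim).
rng = np.random.default_rng(0)
displayed = rng.normal(size=(64, 16))                # items with display records
non_displayed = rng.normal(loc=0.5, size=(64, 16))   # long-tail items without records
print(correlation_alignment_loss(displayed, non_displayed))
```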

    Live poultry trading drives China's H7N9 viral evolution and geographical network propagation

    The ongoing reassortment, human-adapted mutations, and spillover events of novel A(H7N9) avian influenza viruses pose a significant challenge to public health in China and globally. However, our understanding of the factors that disseminate the viruses and drive their geographic distributions is limited. We applied phylogenetic analysis to examine the inter-subtype interactions between H7N9 viruses and the closest H9N2 lineages in China during 2010–2014. We reconstructed and compared the inter-provincial live poultry trading and viral propagation networks via a phylogeographic approach and a network similarity technique. The substitution rates of the viruses isolated in live poultry markets and the characteristics of localized viral evolution were also evaluated. We discovered that viral propagation was geographically structured and followed the live poultry trading network in China, with distinct north-to-east paths of spread and circular transmission between the eastern and southern regions. We also identified that the epicenter of H7N9 has moved from the Shanghai–Zhejiang region to Guangdong Province. In addition, a higher substitution rate was observed among isolates sampled from live poultry markets, especially for the H7N9 viruses. Live poultry trading in China may have driven the network-structured expansion of the novel H7N9 viruses. From this perspective, the long-distance geographic expansion of H7N9 was dominated by live poultry movements, while at local scales, diffusion was facilitated by live poultry markets with highly evolved viruses.