Convergence of Adam under Relaxed Assumptions
In this paper, we provide a rigorous proof of convergence of the Adaptive
Moment Estimate (Adam) algorithm for a wide class of optimization objectives.
Despite the popularity and efficiency of the Adam algorithm in training deep
neural networks, its theoretical properties are not yet fully understood, and
existing convergence proofs require unrealistically strong assumptions, such as
globally bounded gradients, to show the convergence to stationary points. In
this paper, we show that Adam provably converges to $\epsilon$-stationary
points with $\mathcal{O}(\epsilon^{-4})$ gradient complexity under far more
realistic conditions. The key to our analysis is a new proof of boundedness of
gradients along the optimization trajectory of Adam, under a generalized
smoothness assumption according to which the local smoothness (i.e., Hessian
norm when it exists) is bounded by a sub-quadratic function of the gradient
norm. Moreover, we propose a variance-reduced version of Adam with an
accelerated gradient complexity of $\mathcal{O}(\epsilon^{-3})$.
Comment: 33 pages
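For reference, the Adam update whose convergence the paper analyzes can be sketched in a few lines; the quadratic test objective, learning rate, and step count below are illustrative choices of ours, not values from the paper.

```python
import math

def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    Returns the parameter increment and the updated moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (running mean of squares)
    m_hat = m / (1 - beta1 ** t)                # bias correction for the warm-up phase
    v_hat = v / (1 - beta2 ** t)
    return -lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# minimize f(w) = 0.5 * w^2, whose gradient is simply w
w, m, v = 2.0, 0.0, 0.0
for t in range(1, 2001):
    step, m, v = adam_step(w, m, v, t, lr=0.05)
    w += step
```

With a constant learning rate the iterate settles into a small neighborhood of the minimizer rather than converging exactly, which is consistent with the stationarity (rather than last-iterate) guarantees discussed in the abstract.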
Variance-reduced Clipping for Non-convex Optimization
Gradient clipping is a standard training technique used in deep learning
applications such as large-scale language modeling to mitigate exploding
gradients. Recent experimental studies have demonstrated a fairly special
behavior in the smoothness of the training objective along its trajectory when
trained with gradient clipping. That is, the smoothness grows with the gradient
norm. This is in clear contrast to the well-established assumption in folklore
non-convex optimization, a.k.a. $L$-smoothness, where the smoothness is
assumed to be bounded by a constant globally. The recently introduced
$(L_0, L_1)$-smoothness is a more relaxed notion that captures such behavior in
non-convex optimization. In particular, it has been shown that under this
relaxed smoothness assumption, SGD with clipping requires $\mathcal{O}(\epsilon^{-4})$
stochastic gradient computations to find an $\epsilon$-stationary solution. In
this paper, we employ a variance reduction technique, namely SPIDER, and
demonstrate that for a carefully designed learning rate, this complexity is
improved to $\mathcal{O}(\epsilon^{-3})$, which is order-optimal. Our designed learning
rate comprises the clipping technique to mitigate the growing smoothness.
Moreover, when the objective function is the average of $n$ components, we
improve the existing $\mathcal{O}(n\epsilon^{-2})$ bound on the stochastic gradient
complexity to $\mathcal{O}(\sqrt{n}\,\epsilon^{-2})$, which is order-optimal as well.
In addition to being theoretically optimal, SPIDER with our designed parameters
demonstrates empirical performance comparable to variance-reduced methods
such as SVRG and SARAH in several vision tasks.
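The SPIDER estimator at the heart of this approach recursively corrects a running gradient estimate with single-component gradient differences, recomputing the full gradient only at the start of each epoch. The sketch below illustrates it on a toy finite-sum quadratic; the epoch length, step size, and clipping threshold are illustrative placeholders, not the learning-rate design from the paper.

```python
import random

# toy finite-sum objective: f(w) = (1/n) * sum_i 0.5 * (w - a[i])^2
# each component gradient is (w - a[i]); the minimizer is mean(a)
random.seed(0)
a = [random.uniform(-1, 1) for _ in range(100)]
mean_a = sum(a) / len(a)

def clip(g, tau):
    """Clip a scalar gradient estimate to magnitude at most tau."""
    return max(-tau, min(tau, g))

w, w_prev, v = 5.0, 5.0, 0.0
q, lr, tau = 10, 0.1, 1.0   # epoch length, step size, clip threshold (illustrative)
for k in range(500):
    if k % q == 0:
        v = w - mean_a                      # full gradient at the epoch start
    else:
        i = random.randrange(len(a))
        # SPIDER recursion: update the estimate with a gradient difference
        v = (w - a[i]) - (w_prev - a[i]) + v
    w_prev = w
    w -= lr * clip(v, tau)                  # clipped step tames growing smoothness
```

In this quadratic toy the gradient differences make the estimator exact, so the iterate contracts to the minimizer; in general the recursion only controls the estimator's variance between full-gradient refreshes.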
Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective
This work examines the deep disconnect between existing theoretical analyses
of gradient-based algorithms and the practice of training deep neural networks.
Specifically, we provide numerical evidence that in large-scale neural network
training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the
neural network's weights do not converge to stationary points where the
gradient of the loss is zero. Remarkably, however, we observe that even though
the weights do not converge to stationary points, the progress in minimizing
the loss function halts and training loss stabilizes. Inspired by this
observation, we propose a new perspective based on ergodic theory of dynamical
systems to explain it. Rather than studying the evolution of weights, we study
the evolution of the distribution of weights. We prove convergence of the
distribution of weights to an approximate invariant measure, thereby explaining
how the training loss can stabilize without weights necessarily converging to
stationary points. We further discuss how this perspective can better align
optimization theory with empirical observations in machine learning practice.
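The phenomenon the abstract describes, loss stabilizing while the iterates keep moving, can be seen in a deliberately simple toy (ours, not the paper's ImageNet or WT103 experiments): constant-step gradient descent on $f(w) = |w|$ enters a two-point cycle, so the empirical distribution of iterates converges even though the iterates themselves never do.

```python
# toy illustration: constant-step gradient descent on f(w) = |w|
# the iterate never converges, but the loss it attains stabilizes
def grad(w):
    return 1.0 if w > 0 else -1.0

eta = 0.3          # constant step size
w = 1.0
trajectory, losses = [], []
for _ in range(1000):
    w -= eta * grad(w)
    trajectory.append(w)
    losses.append(abs(w))
```

Here the trajectory settles into the cycle {0.1, -0.2}: consecutive iterates always differ by the full step size, yet the loss is permanently pinned near zero, the one-dimensional analogue of convergence to an invariant measure rather than to a stationary point.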
Convex and Non-Convex Optimization under Generalized Smoothness
Classical analysis of convex and non-convex optimization methods often
requires the Lipschitzness of the gradient, which limits the analysis to
functions bounded by quadratics. Recent work relaxed this requirement to a
non-uniform smoothness condition with the Hessian norm bounded by an affine
function of the gradient norm, and proved convergence in the non-convex setting
via gradient clipping, assuming bounded noise. In this paper, we further
generalize this non-uniform smoothness condition and develop a simple, yet
powerful analysis technique that bounds the gradients along the trajectory,
thereby leading to stronger results for both convex and non-convex optimization
problems. In particular, we obtain the classical convergence rates for
(stochastic) gradient descent and Nesterov's accelerated gradient method in the
convex and/or non-convex setting under this general smoothness condition. The
new analysis approach does not require gradient clipping and allows
heavy-tailed noise with bounded variance in the stochastic setting.
Comment: 39 pages
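For concreteness, the three smoothness regimes discussed in these abstracts can be written side by side; the name $\ell(\cdot)$ for the general bounding function is our notation for the condition described in the text.

```latex
% Classical L-smoothness: Hessian norm bounded by a global constant
\|\nabla^2 f(x)\| \le L
% (L_0, L_1)-smoothness: Hessian norm bounded by an affine
% function of the gradient norm
\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|
% generalized smoothness: \ell is a non-decreasing (e.g. sub-quadratic)
% function of the gradient norm
\|\nabla^2 f(x)\| \le \ell\big(\|\nabla f(x)\|\big)
```

Each condition strictly relaxes the one above it: functions like $\exp(w)$ violate $L$-smoothness but satisfy the affine condition, and the $\ell(\cdot)$ form further admits sub-quadratic growth of the local smoothness in the gradient norm.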
ESAM: Discriminative Domain Adaptation with Non-Displayed Items to Improve Long-Tail Performance
Most ranking models are trained only on displayed items (mostly hot
items), but they are used to retrieve items from the entire space, which
consists of both displayed and non-displayed items (mostly long-tail items).
Due to sample selection bias, long-tail items lack sufficient records
to learn good feature representations, i.e., the data sparsity and cold-start
problems. The resulting distribution discrepancy between displayed and
non-displayed items causes poor long-tail performance. To this end, we
propose an entire space adaptation model (ESAM) to address this problem from
the perspective of domain adaptation (DA). ESAM regards displayed and
non-displayed items as source and target domains respectively. Specifically, we
design an attribute correlation alignment that considers the correlation
between high-level attributes of items to achieve distribution alignment.
Furthermore, we introduce two effective regularization strategies, i.e.,
\textit{center-wise clustering} and \textit{self-training}, to improve the DA
process. Without requiring any auxiliary information or auxiliary domains,
ESAM transfers the knowledge from displayed items to non-displayed items for
alleviating the distribution inconsistency. Experiments on two public datasets
and a large-scale industrial dataset collected from Taobao demonstrate that
ESAM achieves state-of-the-art performance, especially in the long-tail space.
Besides, we deploy ESAM to the Taobao search engine, leading to significant
improvement on online performance. The code is available at
\url{https://github.com/A-bone1/ESAM.git}
Comment: Accepted by SIGIR-202
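The attribute correlation alignment described above matches second-order statistics of item features across the displayed (source) and non-displayed (target) domains. The sketch below is a minimal CORAL-style version of that idea with hypothetical feature values; ESAM's actual loss may differ in details, and the repository above is the authoritative implementation.

```python
# CORAL-style correlation alignment between displayed (source) and
# non-displayed (target) item features: penalize the squared Frobenius
# distance between the two feature covariance matrices.
# Pure-Python sketch with hypothetical 2-D features.

def covariance(feats):
    n, d = len(feats), len(feats[0])
    means = [sum(f[j] for f in feats) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for f in feats:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (f[i] - means[i]) * (f[j] - means[j])
    return [[cov[i][j] / (n - 1) for j in range(d)] for i in range(d)]

def correlation_alignment_loss(source, target):
    cs, ct = covariance(source), covariance(target)
    d = len(cs)
    return sum((cs[i][j] - ct[i][j]) ** 2
               for i in range(d) for j in range(d)) / (4 * d * d)

displayed = [[1.0, 0.5], [0.8, 0.4], [1.2, 0.7]]      # hypothetical source features
non_displayed = [[0.2, 0.9], [0.1, 1.1], [0.3, 0.8]]  # hypothetical target features
loss = correlation_alignment_loss(displayed, non_displayed)
```

Minimizing such a term pulls the target-domain feature statistics toward the source domain's, which is how knowledge learned on displayed items can transfer to non-displayed ones.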
Live poultry trading drives China's H7N9 viral evolution and geographical network propagation
The on-going reassortment, human-adapted mutations, and spillover events of novel A(H7N9) avian influenza viruses pose a significant challenge to public health in China and globally. However, our understanding of the factors that disseminate the viruses and drive their geographic distributions is limited. We applied phylogenetic analysis to examine the inter-subtype interactions between H7N9 viruses and the closest H9N2 lineages in China during 2010–2014. We reconstructed and compared the inter-provincial live poultry trading and viral propagation networks via a phylogeographic approach and a network similarity technique. The substitution rates of the viruses isolated in live poultry markets and the characteristics of localized viral evolution were also evaluated. We discovered that viral propagation was geographically structured and followed the live poultry trading network in China, with distinct north-to-east paths of spread and circular transmission between eastern and southern regions. We also identified that the epicenter of H7N9 had moved from the Shanghai–Zhejiang region to Guangdong Province. In addition, a higher substitution rate was observed among isolates sampled from live poultry markets, especially for the H7N9 viruses. Live poultry trading in China may have driven the network-structured expansion of the novel H7N9 viruses. From this perspective, long-distance geographic expansion of H7N9 was dominated by live poultry movements, while at local scales, diffusion was facilitated by live poultry markets with highly evolved viruses.