
    Single-Source Bottleneck Path Algorithm Faster than Sorting for Sparse Graphs

    In a directed graph G=(V,E) with a capacity on every edge, a bottleneck path (or widest path) between two vertices is a path maximizing the minimum capacity of edges in the path. For the single-source all-destination version of this problem in directed graphs, the previous best algorithm runs in O(m+n log n) time (m=|E| and n=|V|), by Dijkstra search with a Fibonacci heap [Fredman and Tarjan 1987]. We improve this time bound to O(m sqrt{log n}+sqrt{mn log n log log n}), which is O(n sqrt{log n log log n}) when m=O(n), making it the first algorithm to break the classic Fibonacci-heap bound when m=o(n sqrt{log n}). The algorithm is a Las Vegas randomized approach. By contrast, the s-t bottleneck path problem admits an algorithm with running time O(m beta(m,n)) [Chechik et al. 2016], where beta(m,n)=min{k >= 1: log^{(k)} n <= m/n}.
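    For orientation, the sketch below shows the classical Dijkstra-style single-source widest-path search that the O(m + n log n) baseline refers to, not the paper's randomized algorithm; the graph representation and function name are illustrative.

```python
import heapq

def widest_paths(graph, source):
    """Single-source bottleneck (widest) paths via a Dijkstra-style search.

    graph: dict mapping u -> list of (v, capacity) edges.
    Returns the best achievable bottleneck value for every reachable vertex.
    This is the classical baseline, not the paper's faster Las Vegas
    algorithm for sparse graphs.
    """
    best = {source: float('inf')}      # the empty path has bottleneck +infinity
    heap = [(-best[source], source)]   # max-heap via negated keys
    while heap:
        neg_width, u = heapq.heappop(heap)
        width = -neg_width
        if width < best[u]:
            continue                   # stale heap entry
        for v, cap in graph.get(u, []):
            w = min(width, cap)        # bottleneck of the path extended by (u, v)
            if w > best.get(v, float('-inf')):
                best[v] = w
                heapq.heappush(heap, (-w, v))
    return best
```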

    Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

    Recent works (e.g., Li and Arora, 2020) suggest that the use of popular normalization schemes (including Batch Normalization) in today's deep learning can move it far from a traditional optimization viewpoint, e.g., use of exponentially increasing learning rates. The current paper highlights other ways in which the behavior of normalized nets departs from traditional viewpoints, and then initiates a formal framework for studying their mathematics via suitable adaptation of the conventional framework, namely modeling the SGD-induced training trajectory via a suitable stochastic differential equation (SDE) with a noise term that captures gradient noise. This yields: (a) A new 'intrinsic learning rate' parameter that is the product of the normal learning rate and the weight decay factor. Analysis of the SDE shows how the effective speed of learning varies and equilibrates over time under the control of the intrinsic LR. (b) A challenge -- via theory and experiments -- to the popular belief that good generalization requires large learning rates at the start of training. (c) New experiments, backed by mathematical intuition, suggesting the number of steps to equilibrium (in function space) scales as the inverse of the intrinsic learning rate, as opposed to the exponential time convergence bound implied by SDE analysis. We name this the Fast Equilibrium Conjecture and suggest it holds the key to why Batch Normalization is effective. Comment: 25 pages, 12 figures. Accepted by the 34th Conference on Neural Information Processing Systems (NeurIPS 2020).
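    As a small illustration of the quantity the abstract defines (not code from the paper), the intrinsic learning rate is simply the product of the ordinary learning rate and the weight decay factor; the values below are illustrative.

```python
def intrinsic_lr(lr: float, weight_decay: float) -> float:
    # Intrinsic learning rate = learning rate * weight decay factor,
    # the parameter that (per the abstract) controls how learning equilibrates.
    return lr * weight_decay

# Illustrative values only: lr = 0.1, weight decay = 5e-4 -> intrinsic LR = 5e-5.
# The Fast Equilibrium Conjecture suggests the number of steps to equilibrium
# scales roughly like 1 / intrinsic LR.
print(intrinsic_lr(0.1, 5e-4))
```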

    A Quadratic Synchronization Rule for Distributed Deep Learning

    In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by allowing workers to compute locally for H steps without synchronizing with others, hence reducing communication frequency. While H has been viewed as a hyperparameter to trade optimization efficiency for communication cost, recent research indicates that setting a proper H value can lead to generalization improvement. Yet, selecting a proper H is elusive. This work proposes a theory-grounded method for determining H, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting H in proportion to 1/η^2 as the learning rate η decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies. Compared with standard data parallel training, QSR enables Local AdamW on ViT-B to cut the training time on 16 or 64 GPUs down from 26.7 to 20.2 hours or from 8.6 to 5.5 hours and, at the same time, achieves 1.16% or 0.84% higher top-1 validation accuracy.
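    A minimal sketch of how a synchronization period following the Quadratic Synchronization Rule could be derived from the current learning rate; the coefficient and clamping bounds below are illustrative assumptions, not the paper's exact recipe.

```python
def qsr_sync_period(lr, coeff=0.01, h_min=1, h_max=512):
    """Quadratic Synchronization Rule sketch: set H proportional to 1/lr^2.

    coeff, h_min, h_max are illustrative tuning constants (not from the paper);
    workers would run H local steps between synchronizations.
    """
    h = int(coeff / (lr ** 2))
    return max(h_min, min(h, h_max))

# As the learning rate decays, synchronization becomes less frequent:
for lr in (0.1, 0.03, 0.01):
    print(f"lr={lr}: H={qsr_sync_period(lr)}")
```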

    The Marginal Value of Momentum for Small Learning Rate SGD

    Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly over both short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.
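    For reference, a minimal sketch of the heavy-ball (momentum) SGD update being compared with plain SGD; parameter names are generic and the list-based implementation is purely illustrative.

```python
def sgd_step(w, grad, lr, momentum=0.0, buf=None):
    """One heavy-ball SGD step: buf <- momentum*buf + grad; w <- w - lr*buf.

    With momentum = 0 this reduces to plain SGD; the paper argues the two
    behave similarly when the learning rate is small and gradient noise dominates.
    """
    if buf is None:
        buf = [0.0] * len(w)
    buf = [momentum * b + g for b, g in zip(buf, grad)]
    w = [wi - lr * bi for wi, bi in zip(w, buf)]
    return w, buf
```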

    New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound

    Saliency methods compute heat maps that highlight portions of an input that were most important for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new masked input by retaining the k highest-ranked pixels of the original input, replacing the rest with "uninformative" pixels, and checking whether the net's output is mostly unchanged. This is usually seen as an explanation of the output, but the current paper highlights reasons why this inference of causality may be suspect. Inspired by the logic concepts of completeness and soundness, it observes that the above type of evaluation focuses on completeness of the explanation but ignores soundness. New evaluation metrics are introduced to capture both notions while staying in an intrinsic framework -- i.e., using the dataset and the net, but no separately trained nets, human evaluations, etc. A simple saliency method is described that matches or outperforms prior methods in the evaluations. Experiments also suggest new intrinsic justifications, based on soundness, for popular heuristic tricks such as TV regularization and upsampling. Comment: NeurIPS 2022 (Oral).
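    A sketch of the masked-input evaluation described above: keep the k pixels ranked highest by the saliency map and replace the rest with an 'uninformative' baseline; the array shapes and the constant baseline are assumptions, since methods differ on what counts as uninformative.

```python
import numpy as np

def masked_input(image, saliency, k, baseline=0.0):
    """Retain the k highest-saliency pixels; replace all others with a baseline value.

    image and saliency are arrays of the same shape (e.g., H x W). The masked
    input is then fed to the net to check whether its output is mostly unchanged.
    """
    flat_saliency = saliency.ravel()
    keep = np.argsort(flat_saliency)[-k:]   # indices of the top-k saliency scores
    mask = np.zeros(flat_saliency.shape, dtype=bool)
    mask[keep] = True
    out = np.full(image.shape, baseline, dtype=float)
    out.reshape(-1)[mask] = image.reshape(-1)[mask]
    return out
```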

    Fine-grained Complexity Meets IP = PSPACE

    In this paper we study the fine-grained complexity of finding exact and approximate solutions to problems in P. Our main contribution is showing reductions from exact to approximate solution for a host of such problems. As one (notable) example, we show that the Closest-LCS-Pair problem (given two sets of strings A and B, compute exactly the maximum LCS(a, b) with (a, b) in A × B) is equivalent to its approximation version (under near-linear time reductions, and with a constant approximation factor). More generally, we identify a class of problems, which we call BP-Pair-Class, comprising both exact and approximate solutions, and show that they are all equivalent under near-linear time reductions. Exploring this class and its properties, we also show: • Under the NC-SETH assumption (a significantly more relaxed assumption than SETH), solving any of the problems in this class requires essentially quadratic time. • Modest improvements on the running time of known algorithms (shaving log factors) would imply that NEXP is not in non-uniform NC^1. • Finally, we leverage our techniques to show new barriers for deterministic approximation algorithms for LCS. At the heart of these new results is a deep connection between interactive proof systems for bounded-space computations and the fine-grained complexity of exact and approximate solutions to problems in P. In particular, our results build on the proof techniques from the classical IP = PSPACE result.
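    As a concrete statement of the problem definition above (not a fine-grained-optimal algorithm), here is the naive baseline for Closest-LCS-Pair: a standard quadratic-time LCS dynamic program applied to every pair in A × B.

```python
def lcs_length(a, b):
    """Classic dynamic program for the length of a longest common subsequence of a and b."""
    dp = [0] * (len(b) + 1)
    for ca in a:
        prev = 0                       # holds the old dp[j-1] value
        for j, cb in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if ca == cb else max(dp[j], dp[j - 1])
            prev = cur
    return dp[len(b)]

def closest_lcs_pair(A, B):
    """Exact Closest-LCS-Pair: maximum LCS(a, b) over all (a, b) in A x B (brute force)."""
    return max(lcs_length(a, b) for a in A for b in B)

# Example: closest_lcs_pair(["abcde", "xyz"], ["ace", "zzy"]) == 3  (via "ace")
```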

    A combination of phospholipids and long chain polyunsaturated fatty acids supports neurodevelopmental outcomes in infants: a randomized, double-blind, controlled clinical trial

    Phospholipids (PLs) and long-chain polyunsaturated fatty acids (LCPUFAs) are naturally present in breast milk and play important roles in promoting the growth of the infant. Several studies have investigated the effects of the combination of PLs and LCPUFAs on neurodevelopment. However, data on the effectiveness of infant formula containing both PLs and LCPUFAs on the neurodevelopment of infants are still scarce. This randomized, double-blind, controlled clinical study was designed to evaluate the effect of an infant formula enriched with PLs and LCPUFAs on growth parameters and neurodevelopmental outcomes in term infants up to 365 days of age. Infants were enrolled within 30 days of birth and randomly assigned to either a control group (n = 150) or an investigational group (n = 150). Both groups received cow’s milk-based formulas that were essentially identical in composition, except that the investigational formula was additionally supplemented with PLs and LCPUFAs. The infants were followed for the first year of life. Breastfed infants served as the reference group (n = 150). The Bayley Scales of Infant Development [3rd edition (Bayley-III)], Carey Toddler Temperament Scales (TTS), MacArthur-Bates Communicative Development Inventories (CDI), and Single Object Attention and Free Play Tasks were used to evaluate neurodevelopmental outcomes of infants at 365 days of age. In addition, Ages and Stages Questionnaires (ASQ) were administered at 120, 180, and 275 days of age. Compared to breastfeeding, both infant formulas were well tolerated and provided adequate growth, with no adverse events reported throughout the study. Infants in the investigational group showed higher mean scores in Bayley-III cognitive performance (104.3 vs. 99.0, p < 0.05), language (106.9 vs. 104.5, p < 0.05), and motor skills (109.2 vs. 103.9, p < 0.05) compared with the control group. Similar results were reported for other developmental scales, including TTS and ASQ. Notably, the test scores of infants fed the investigational formula were similar to those of breastfed infants. Our results indicate that PL and LCPUFA supplementation may be beneficial for the neurodevelopment of infants throughout the first year of life. Further studies are needed to investigate the long-term effects of PLs and LCPUFAs on neurodevelopment in early life.

    Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction

    Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supporting experiments suggesting that normalization (together with the accompanying weight decay) encourages GD to reduce the sharpness of the loss surface. Here "sharpness" is carefully defined given that the loss is scale-invariant, a known consequence of normalization. Specifically, for a fairly broad class of neural nets with normalization, our theory explains how GD with a finite learning rate enters the so-called Edge of Stability (EoS) regime, and characterizes the trajectory of GD in this regime via a continuous sharpness-reduction flow. Comment: 68 pages, many figures.
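    Not the paper's analysis, but a small numerical illustration of the sharpness notion involved: the largest Hessian eigenvalue of the loss, estimated here by power iteration on finite-difference Hessian-vector products. The toy quadratic loss, step size, and iteration count are assumptions.

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-4):
    # Central finite-difference approximation of H @ v for the loss whose gradient is grad_fn.
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

def top_sharpness(grad_fn, w, iters=100, seed=0):
    """Estimate the largest Hessian eigenvalue (a common sharpness proxy) by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(grad_fn, w, v)
        lam = float(v @ hv)                     # Rayleigh quotient with the current unit vector
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam

# Toy quadratic loss L(w) = 0.5 * w^T diag(1, 10) w, whose sharpness is 10.
grad = lambda w: np.array([1.0, 10.0]) * w
print(top_sharpness(grad, np.array([1.0, 1.0])))   # ~10.0
```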