313 research outputs found

    On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms

    The stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks thanks to its scalability and efficiency in dealing with large-scale problems. In this paper, we focus on the shuffling version of SGD, which matches the mainstream practical heuristics. We show the convergence to a global solution of shuffling SGD for a class of non-convex functions under over-parameterized settings. Our analysis employs more relaxed non-convex assumptions than the previous literature. Nevertheless, we maintain the desired computational complexity that shuffling SGD has achieved in the general convex setting.
    Comment: The 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
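    For concreteness, here is a minimal sketch of a shuffling-type SGD loop as the abstract describes it: each epoch draws a fresh permutation of the n samples and takes one component-gradient step per sample. The names (`shuffling_sgd`, `grad_i`) and the toy interpolating least-squares problem are illustrative assumptions, not the paper's actual setup.

    ```python
    import numpy as np

    def shuffling_sgd(grad_i, w0, n, lr=0.1, epochs=10, seed=0):
        """Shuffling-type SGD (random reshuffling): each epoch visits every
        sample exactly once, in a freshly permuted order."""
        rng = np.random.default_rng(seed)
        w = np.asarray(w0, dtype=float)
        for _ in range(epochs):
            for i in rng.permutation(n):      # one pass over a random permutation
                w = w - lr * grad_i(w, i)     # step on the i-th component gradient
        return w

    # Toy realizable least-squares: f(w) = (1/n) * sum_i (x_i . w - y_i)^2,
    # with an interpolating solution w = (1, 1), mimicking the
    # over-parameterized (zero-loss) regime the analysis assumes.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, -1.0]])
    y = np.array([3.0, 3.0, 0.0])
    grad_i = lambda w, i: 2.0 * (X[i] @ w - y[i]) * X[i]
    w_hat = shuffling_sgd(grad_i, np.zeros(2), n=len(y), lr=0.05, epochs=200)
    print(w_hat)  # should approach the interpolating solution (1, 1)
    ```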

    RNN training along locally optimal trajectories via Frank-Wolfe algorithm

    We propose a novel and efficient training method for RNNs that iteratively seeks a local minimum of the loss surface within a small region and leverages this directional vector for the update in an outer loop. We propose to utilize the Frank-Wolfe (FW) algorithm in this context. Although FW implicitly involves normalized gradients, which can lead to a slow convergence rate, we develop a novel RNN training method for which, surprisingly, even with the additional cost, the overall training cost is empirically observed to be lower than that of backpropagation. Our method leads to a new Frank-Wolfe method that is in essence an SGD algorithm with a restart scheme. We prove that under certain conditions our algorithm has a sublinear convergence rate of O(1/ε) for ε error. We then conduct empirical experiments on several benchmark datasets, including ones that exhibit long-term dependencies, and show significant performance improvements. We also experiment with deep RNN architectures and show efficient training performance. Finally, we demonstrate that our training method is robust to noisy data.
    https://doi.org/10.1109/icpr48806.2021.9412188
    Accepted manuscript
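    To make the "FW inside a small region, then restart" idea concrete, here is a minimal sketch assuming an L2 trust region: over an L2 ball, the FW linear-minimization oracle returns a point along the normalized negative gradient, which is where the normalized-gradient behavior mentioned above comes from. The name `fw_small_region_step`, the ball radius, the step-size rule, and the toy objective are hypothetical illustrations, not the paper's actual algorithm.

    ```python
    import numpy as np

    def fw_small_region_step(grad, w, radius=0.5, inner_steps=5):
        """One outer iteration: run Frank-Wolfe inside a small L2 ball
        centered at w, then restart the next ball from the point found.
        Over an L2 ball the FW linear oracle is the normalized negative
        gradient direction, scaled to the ball boundary."""
        center, z = w.copy(), w.copy()
        for t in range(inner_steps):
            g = grad(z)
            s = center - radius * g / (np.linalg.norm(g) + 1e-12)  # FW vertex (LMO)
            gamma = 2.0 / (t + 2.0)                                # standard FW step size
            z = (1.0 - gamma) * z + gamma * s                      # convex combo stays in the ball
        return z  # restart: the next outer iteration recenters the ball here

    # Toy usage on a smooth non-convex objective
    # f(w) = w0^2 + w0*cos(w1) + 0.25*w1^2
    grad = lambda w: np.array([2.0 * w[0] + np.cos(w[1]),
                               -w[0] * np.sin(w[1]) + 0.5 * w[1]])
    w = np.array([1.0, 1.0])
    for _ in range(50):          # outer loop with restarts
        w = fw_small_region_step(grad, w)
    print(w)
    ```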