Fundamental Results on Asynchronous Parallel Optimization Algorithms
In this thesis, we present a body of work on the performance and convergence properties of asynchronous-parallel algorithms completed over the course of my doctorate (Hannah, Feng, and Wotao Yin 2018; Hannah and Wotao Yin 2017b; T. Sun, Hannah, and Wotao Yin 2017; Hannah and Wotao Yin 2017a). Asynchronous algorithms eliminate the costly synchronization penalty of traditional synchronous-parallel algorithms by having computing nodes use the most recently available information to compute updates. However, it is not immediately clear whether eliminating the synchronization penalty at the cost of using outdated information is a favorable trade-off.

We first give a comprehensive theoretical justification of the performance advantages of asynchronous algorithms, which we summarize as "Faster Iterations, Same Quality" (Hannah and Wotao Yin 2017a). Under a well-justified model, we show that asynchronous algorithms complete "Faster Iterations": using renewal theory, we demonstrate how network delays, heterogeneous sub-problem difficulty, and heterogeneous computing power greatly hinder synchronous algorithms but have no impact on their asynchronous counterparts. We next prove the first exact convergence-rate results for a variety of synchronous algorithms, including synchronous ARock and synchronous randomized block coordinate descent (sync-RBCD), which allows a fair comparison between these algorithms and their asynchronous counterparts. Finally, we show that a variety of asynchronous algorithms have convergence rates that essentially match the exact rates derived for their synchronous counterparts, so long as the delays are not too large. Hence asynchronous algorithms complete faster iterations that are of the "Same Quality" as those of synchronous algorithms, and we conclude that a wide variety of asynchronous algorithms will outcompete their synchronous counterparts whenever the delays are not too large, especially at scale.

Next, we present the first asynchronous Nesterov-accelerated algorithm that attains a speedup: A2BCD (Hannah, Feng, and Wotao Yin 2018). We first prove that A2BCD attains, to highest order, the complexity of NU_ACDM, a state-of-the-art accelerated coordinate descent algorithm (Allen-Zhu, Qu, et al. 2016). We then show that A2BCD and NU_ACDM both have optimal complexity. Because A2BCD has faster iterations and optimal complexity, it should be the fastest coordinate descent algorithm. We verify this with numerical experiments comparing A2BCD with NU_ACDM, finding that A2BCD is up to 4-5x faster, and hence conclude that our algorithm is currently the fastest existing coordinate descent algorithm. Finally, we derive a second-order ODE that is the continuous-time limit of A2BCD; the ODE analysis motivates and clarifies our proof strategy.

Lastly, we present earlier foundational work that forms the basis of the technical innovations behind the preceding results (Hannah and Wotao Yin 2017b). We show that ARock and its many special cases may converge even under unbounded delays, both stochastic and deterministic. These results sidestep longstanding impossibility results derived in the 1980s by making slightly stronger assumptions, and they were an early demonstration of the power of the meticulous Lyapunov-function construction techniques pioneered in this body of work.
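As a rough illustration of the update pattern these results concern, the following sketch simulates asynchronous randomized block coordinate descent on a least-squares problem, with each coordinate update computed from a possibly outdated iterate. The problem instance, delay model, and all parameter choices are illustrative assumptions, not the thesis's code.

```python
# Simulated asynchronous randomized (block) coordinate descent: each update
# reads a stale copy of the iterate, mimicking delayed reads in ARock-style
# schemes. Test problem and parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 10                          # samples, coordinates
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
L = np.linalg.norm(A, 2) ** 2 / n       # Lipschitz constant of grad f, f(x) = ||Ax - b||^2 / (2n)

def grad_block(x, j):
    """Gradient of the least-squares loss restricted to coordinate j."""
    return A[:, j] @ (A @ x - b) / n

x = np.zeros(d)
history = [x.copy()]                    # past iterates, used to emulate stale reads
max_delay, step = 5, 1.0 / L

for _ in range(2000):
    j = rng.integers(d)                             # pick a random coordinate
    delay = rng.integers(0, min(max_delay, len(history)))
    x_read = history[-1 - delay]                    # read a possibly outdated iterate
    x[j] -= step * grad_block(x_read, j)            # update using stale information
    history.append(x.copy())

print("residual norm:", np.linalg.norm(A @ x - b))
```

With a bounded delay and a conservative stepsize, the iterates behave much like their synchronous counterpart: this is the "Same Quality" half of the story, while in a real deployment the stale reads are what buy the "Faster Iterations".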
A Distributed Flexible Delay-tolerant Proximal Gradient Algorithm
We develop and analyze an asynchronous algorithm for distributed convex optimization when the objective is written as a sum of smooth functions, local to each worker, and a non-smooth function. Unlike many existing methods, our distributed algorithm is adjustable to various levels of communication cost, delays, machines' computational power, and functions' smoothness. A unique feature is that the stepsizes depend on neither the communication delays nor the number of machines, which is highly desirable for scalability. We prove that the algorithm converges linearly in the strongly convex case, and provide convergence guarantees for the non-strongly convex case. The obtained rates are the same as those of the vanilla proximal gradient algorithm over an introduced epoch sequence that subsumes the delays of the system. We provide numerical results on large-scale machine learning problems to demonstrate the merits of the proposed method.

Comment: to appear in SIAM Journal on Optimization
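To make the delayed-gradient structure concrete, here is a hypothetical single-process sketch in which each "worker" reports the gradient of its local smooth term computed at a stale copy of the master iterate, and the master applies the prox of an l1 regularizer. This illustrates the general delay-tolerant scheme only; it is not the paper's algorithm, and all names and parameters are assumptions.

```python
# Delayed proximal gradient sketch: minimize (1/m) * sum_i ||A_i x - b_i||^2 / 2 + lam * ||x||_1,
# where each worker's gradient may be computed at an outdated master iterate.
import numpy as np

rng = np.random.default_rng(1)
d, n_workers = 20, 4

As = [rng.standard_normal((30, d)) for _ in range(n_workers)]   # local data A_i
bs = [rng.standard_normal(30) for _ in range(n_workers)]        # local targets b_i
lam = 0.1                                                       # l1 weight (assumed)

def soft_threshold(v, t):
    """Prox of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

L = sum(np.linalg.norm(A, 2) ** 2 for A in As) / n_workers  # smoothness upper bound
step = 1.0 / (L * n_workers)        # conservative stepsize to absorb staleness

x = np.zeros(d)
stale = [x.copy() for _ in range(n_workers)]    # each worker's last-read iterate
grads = [A.T @ (A @ x - b) for A, b in zip(As, bs)]

for _ in range(3000):
    i = rng.integers(n_workers)     # only worker i reports this round
    grads[i] = As[i].T @ (As[i] @ stale[i] - bs[i])  # gradient at a stale point
    stale[i] = x.copy()             # worker i re-reads the master iterate
    g = sum(grads) / n_workers      # mix of fresh and delayed gradients
    x = soft_threshold(x - step * g, step * lam)

obj = sum(np.linalg.norm(A @ x - b) ** 2 / 2 for A, b in zip(As, bs)) / n_workers
print("objective:", obj + lam * np.abs(x).sum(), "nonzeros:", np.count_nonzero(x))
```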
Adaptiveness, Asynchrony, and Resource Efficiency in Parallel Stochastic Gradient Descent
Accelerated digitalization and sensor deployment in society in recent years pose critical challenges for the associated data processing and analysis infrastructure to scale, and the field of big data, targeting methods for storing, processing, and revealing patterns in huge data sets, has surged. Artificial Intelligence (AI) models are used diligently in standard big data pipelines due to their tremendous success across various data analysis tasks. However, the exponential growth in the Volume, Variety, and Velocity of big data (known as its three V's) in recent years requires corresponding complexity in the AI models that analyze it, as well as in the Machine Learning (ML) processes required to train them. To cope, parallelism in ML is standard nowadays, with the aim of better utilizing contemporary computing infrastructure, whether it be shared-memory multi-core CPUs or vast connected networks of IoT devices engaging in Federated Learning (FL).

Stochastic Gradient Descent (SGD) serves as the backbone of many of the most popular ML methods, including in particular Deep Learning. However, SGD has inherently sequential semantics and is not trivially parallelizable without imposing strict synchronization, with associated bottlenecks. Asynchronous SGD (AsyncSGD), which relaxes the original semantics, has gained significant interest in recent years due to promising results that show speedup in certain contexts. However, the relaxed semantics that asynchrony entails give rise to fundamental questions about AsyncSGD, relating particularly to its stability and convergence rate in practical applications.

This thesis explores vital knowledge gaps of AsyncSGD and contributes in particular to: theoretical frameworks, formalizing several key notions related to the impact of asynchrony on convergence and guiding future development of AsyncSGD implementations; and analytical results, providing asymptotic convergence bounds under realistic assumptions. Moreover, several technical solutions are proposed, targeting in particular: stability, reducing the number of non-converging executions and the associated wasted energy; speedup, improving convergence time and reliability with instance-based adaptiveness; and elasticity, achieving resource efficiency by avoiding over-parallelism, thereby improving stability and saving computing resources. The proposed methods are evaluated on several standard DL benchmarking applications and compared to relevant baselines from previous literature. Key results include: (i) persistent speedup compared to baselines, (ii) increased stability and reduced risk of non-converging executions, and (iii) a reduction in the overall memory footprint (up to 17%) as well as in the consumed computing resources (up to 67%).

In addition, along with this thesis an open-source implementation is published that connects high-level ML operations with asynchronous, fine-grained memory operations, supporting future research on the efficient adaptation of AsyncSGD for practical applications.
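For concreteness, the following is a minimal Hogwild!-style AsyncSGD sketch with lock-free shared-memory updates, illustrating the relaxed semantics discussed above. The quadratic objective, thread count, and stepsize are illustrative assumptions, and Python's GIL limits true parallelism here to the NumPy kernels; the racy read-modify-write pattern is the point.

```python
# Lock-free AsyncSGD sketch: several threads update a shared parameter vector
# with no synchronization, so each gradient may be computed at a stale x.
import threading
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

x = np.zeros(d)                         # shared parameters; no lock protects them
step, steps_per_worker = 1e-3, 5000

def worker(seed):
    local_rng = np.random.default_rng(seed)
    for _ in range(steps_per_worker):
        i = local_rng.integers(n)       # sample one data point
        g = A[i] * (A[i] @ x - b[i])    # stochastic gradient at a possibly stale x
        np.subtract(x, step * g, out=x) # racy in-place update (Hogwild!-style)

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("loss:", np.linalg.norm(A @ x - b) ** 2 / (2 * n))
```

The stability and elasticity questions studied in the thesis show up directly in such code: too many workers or too large a stepsize and the racy updates stop converging, wasting the very resources the parallelism was meant to exploit.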
Bridging the gap between algorithmic and learned index structures
Index structures such as B-trees and Bloom filters are the well-established petrol engines of database systems. However, these structures do not fully exploit patterns in the data distribution. To address this, researchers have suggested using machine learning models as electric engines that entirely replace index structures. Such a paradigm shift in data-system design, however, opens many unsolved design challenges: more research is needed to understand the theoretical guarantees and to design efficient support for insertion and deletion.
In this thesis, we adopt a different position: index algorithms are good enough, and instead of going back to the drawing board to fit data systems with learned models, we should develop lightweight hybrid engines that build on the benefits of both algorithmic and learned index structures. The indexes we suggest provide the theoretical performance guarantees and updatability of algorithmic indexes while using position-prediction models to exploit the data distribution and thereby improve performance. We investigate the potential for minimal modifications to algorithmic indexes that let them leverage the data distribution much as learned indexes do. In this regard, we propose and explore the use of helping models that boost classical index performance using techniques from machine learning. Our suggested approach inherits the performance guarantees of its algorithmic baseline index while also taking the data distribution into account to improve performance considerably. We study single-dimensional range indexes, spatial indexes, and stream indexing, and show that the suggested approach yields range indexes that outperform their algorithmic counterparts, perform comparably to read-only, fully learned indexes, and can hence be reliably used as a default index structure in a database engine.
We also consider the updatability of the indexes and suggest solutions for updating them, notably when the data distribution changes drastically over time (e.g., when indexing data streams). In particular, we propose a specific learning-augmented index for indexing a sliding window with timestamps in a data stream.
Additionally, we highlight the limitations of learned indexes for low-latency lookup on real-world data distributions. To tackle this issue, we suggest adding an algorithmic enhancement layer to a learned model to correct the prediction error with a small memory latency. This approach enables efficient modelling of the data distribution and resolves the local biases of a learned model at the cost of roughly one memory lookup.
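As a toy illustration of the "helping model" idea, the sketch below fits a linear model that predicts a key's position in a sorted array and then corrects the prediction with a binary search bounded by the model's maximum observed error; the names and structure are illustrative, not the thesis implementation.

```python
# Learned position prediction with an algorithmic correction layer: the model
# gives a guess, and a bounded binary search inside the error band finishes
# the lookup, preserving worst-case guarantees.
import bisect
import numpy as np

rng = np.random.default_rng(3)
keys = np.sort(rng.exponential(size=100_000))   # skewed key distribution

# Fit position ~ slope * key + intercept by least squares.
positions = np.arange(len(keys))
slope, intercept = np.polyfit(keys, positions, 1)
pred = np.clip(slope * keys + intercept, 0, len(keys) - 1)
max_err = int(np.ceil(np.max(np.abs(pred - positions))))  # worst-case miss

def lookup(key):
    """Predict a position, then binary-search only inside the error band."""
    guess = int(np.clip(slope * key + intercept, 0, len(keys) - 1))
    lo = max(0, guess - max_err)
    hi = min(len(keys), guess + max_err + 1)
    i = lo + bisect.bisect_left(keys[lo:hi], key)
    return i if i < len(keys) and keys[i] == key else -1

assert lookup(keys[12345]) == 12345
print("error band half-width:", max_err)
```

The lookup cost is the prediction plus a search over at most 2 * max_err + 1 slots, so a model that fits the distribution well shrinks the search to roughly one extra memory access while the bounded search keeps the baseline guarantee.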