19 research outputs found
Download and Access Trade-offs in Lagrange Coded Computing
Lagrange Coded Computing (LCC) is a recently
proposed technique for resilient, secure, and private computation
of arbitrary polynomials in distributed environments. By
mapping such computations to composition of polynomials, LCC
allows the master node to complete the computation by accessing
a minimal number of workers and downloading all of their
content, thus providing resiliency to the remaining stragglers.
However, in the most common case in which the number of
stragglers is less than in the worst case scenario, much of the
computational power of the system remains unexploited. To
amend this issue, in this paper we expand LCC by studying a
fundamental trade-off between download and access, and present
two contributions. In the first contribution, it is shown that
without any modification to the encoding process, the master
can decode the computations by accessing a larger number of
nodes, however downloading less information from each node in
comparison with LCC (i.e., trading access for download). This
scheme relies on decoding a particular polynomial in the ideal
that is generated by the polynomials of interest, a technique we
call Ideal Decoding. This new scheme also improves LCC in the
sense that for systems with adversaries, the overall downloaded
bandwidth is smaller than in LCC. In the second contribution
we study a real-time model of this trade-off, in which the data
from the workers is downloaded sequentially. By clustering nodes
of similar delays and encoding the function with Universally
Decodable Matrices, the master can decode once sufficient data is
downloaded from every cluster, regardless of the internal delays
within that cluster. This allows the master to utilize the partial
work that is done by stragglers, rather than to ignore it, a feature
that most past works in coded computing are lacking
Straggler-Resilient Distributed Computing
In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of University of Bergen's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink.Utbredelsen av distribuerte datasystemer har økt betydelig de siste årene. Dette skyldes først og fremst at behovet for beregningskraft øker raskere enn hastigheten til en enkelt datamaskin, slik at vi må bruke flere datamaskiner for å møte etterspørselen, og at det blir stadig mer vanlig at systemer er spredt over et stort geografisk område. Dette paradigmeskiftet medfører mange tekniske utfordringer. En av disse er knyttet til "straggler"-problemet, som er forårsaket av forsinkelsesvariasjoner i distribuerte systemer, der en beregning forsinkes av noen få langsomme noder slik at andre noder må vente før de kan fortsette. Straggler-problemet kan svekke effektiviteten til distribuerte systemer betydelig i situasjoner der en enkelt node som opplever en midlertidig overbelastning kan låse et helt system.
I denne avhandlingen studerer vi metoder for å gjøre beregninger av forskjellige typer motstandsdyktige mot slike problemer, og dermed gjøre det mulig for et distribuert system å fortsette til tross for at noen noder ikke svarer i tide. Metodene vi foreslår er skreddersydde for spesielle typer beregninger. Vi foreslår metoder tilpasset distribuert matrise-vektor-multiplikasjon (som er en grunnleggende operasjon i mange typer beregninger), distribuert maskinlæring og distribuert sporing av en tilfeldig prosess (for eksempel det å spore plasseringen til kjøretøy for å unngå kollisjon). De foreslåtte metodene utnytter redundans som enten blir introdusert som en del av metoden, eller som naturlig eksisterer i det underliggende problemet, til å kompensere for manglende delberegninger. For en av de foreslåtte metodene utnytter vi redundans for også å øke effektiviteten til kommunikasjonen mellom noder, og dermed redusere mengden data som må kommuniseres over nettverket. I likhet med straggler-problemet kan slik kommunikasjon begrense effektiviteten i distribuerte systemer betydelig. De foreslåtte metodene gir signifikante forbedringer i ventetid og pålitelighet sammenlignet med tidligere metoder.The number and scale of distributed computing systems being built have increased significantly in recent years. Primarily, that is because: i) our computing needs are increasing at a much higher rate than computers are becoming faster, so we need to use more of them to meet demand, and ii) systems that are fundamentally distributed, e.g., because the components that make them up are geographically distributed, are becoming increasingly prevalent. This paradigm shift is the source of many engineering challenges. Among them is the straggler problem, which is a problem caused by latency variations in distributed systems, where faster nodes are held up by slower ones. The straggler problem can significantly impair the effectiveness of distributed systems—a single node experiencing a transient outage (e.g., due to being overloaded) can lock up an entire system.
In this thesis, we consider schemes for making a range of computations resilient against such stragglers, thus allowing a distributed system to proceed in spite of some nodes failing to respond on time. The schemes we propose are tailored for particular computations. We propose schemes designed for distributed matrix-vector multiplication, which is a fundamental operation in many computing applications, distributed machine learning—in the form of a straggler-resilient first-order optimization method—and distributed tracking of a time-varying process (e.g., tracking the location of a set of vehicles for a collision avoidance system). The proposed schemes rely on exploiting redundancy that is either introduced as part of the scheme, or exists naturally in the underlying problem, to compensate for missing results, i.e., they are a form of forward error correction for computations. Further, for one of the proposed schemes we exploit redundancy to also improve the effectiveness of multicasting, thus reducing the amount of data that needs to be communicated over the network. Such inter-node communication, like the straggler problem, can significantly limit the effectiveness of distributed systems. For the schemes we propose, we are able to show significant improvements in latency and reliability compared to previous schemes.Doktorgradsavhandlin
Randomized Polar Codes for Anytime Distributed Machine Learning
We present a novel distributed computing framework that is robust to slow
compute nodes, and is capable of both approximate and exact computation of
linear operations. The proposed mechanism integrates the concepts of randomized
sketching and polar codes in the context of coded computation. We propose a
sequential decoding algorithm designed to handle real valued data while
maintaining low computational complexity for recovery. Additionally, we provide
an anytime estimator that can generate provably accurate estimates even when
the set of available node outputs is not decodable. We demonstrate the
potential applications of this framework in various contexts, such as
large-scale matrix multiplication and black-box optimization. We present the
implementation of these methods on a serverless cloud computing system and
provide numerical results to demonstrate their scalability in practice,
including ImageNet scale computations
Recommended from our members
DISTRIBUTED LEARNING ALGORITHMS: COMMUNICATION EFFICIENCY AND ERROR RESILIENCE
In modern day machine learning applications such as self-driving cars, recommender systems, robotics, genetics etc., the size of the training data has grown to the point that it has become essential to design distributed learning algorithms. A general framework for the distributed learning is \emph{data parallelism} where the data is distributed among the \emph{worker machines} for parallel processing and computation to speed up learning. With billions of devices such as cellphones, computers etc., the data is inherently distributed and stored locally in the users\u27 devices. Learning in this set up is popularly known as \emph{Federated Learning}. The speed-up due to distributed framework gets hindered by some fundamental problems such as straggler workers, communication bottleneck due to high communication overhead between workers and central server, adversarial failure popularly know as \emph{Byzantine failure}. In this thesis, we study and develop distributed algorithms that are error resilient and communication efficient.
First, we address the problem of straggler workers where the learning is delayed due to slow workers in the distributed setup. To mitigate the effect of the stragglers, we employ \textbf{LDPC} (low density parity check) code to encode the data and implement gradient descent algorithm in the distributed setup. Second, we present a family of vector quantization schemes \emph{vqSGD} (vector quantized Stochastic Gradient Descent ) that provides an asymptotic reduction in the communication cost with convergence guarantees in the first order distributed optimization. We also showed that \emph{vqSGD} provides strong privacy guarantee. Third, we address the problem of Byzantine failure together with communication-efficiency in the first order gradient descent algorithm. We consider a generic class of - approximate compressor for communication efficiency and employ a simple \emph{norm based thresholding} scheme to make the learning algorithm robust to Byzantine failures. We establish statistical error rate for non-convex smooth loss. Moreover, we analyze the compressed gradient descent algorithm with error feedback in a distributed setting and in the presence of Byzantine worker machines. Fourth, we employ the generic class of - approximate compressor to develop a communication efficient second order Newton-type algorithm and provide rate of convergence for smooth objective. Fifth, we propose \textbf{COMRADE} (COMmunication-efficient and Robust Approximate Distributed nEwton ), an iterative second order algorithm that is communication efficient as well as robust against Byzantine failures. Sixth, we propose a distributed \emph{cubic-regularized Newton } algorithm that can escape saddle points effectively for non-convex loss function and find a local minima . Furthermore, the proposed algorithm can resist the attack of the Byzantine machines, which may create \emph{fake local minima} near the saddle points of the loss function, also known as saddle-point attack
Securely Aggregated Coded Matrix Inversion
Coded computing is a method for mitigating straggling workers in a
centralized computing network, by using erasure-coding techniques. Federated
learning is a decentralized model for training data distributed across client
devices. In this work we propose approximating the inverse of an aggregated
data matrix, where the data is generated by clients; similar to the federated
learning paradigm, while also being resilient to stragglers. To do so, we
propose a coded computing method based on gradient coding. We modify this
method so that the coordinator does not access the local data at any point;
while the clients access the aggregated matrix in order to complete their
tasks. The network we consider is not centrally administrated, and the
communications which take place are secure against potential eavesdroppers.Comment: arXiv admin note: substantial text overlap with arXiv:2207.0627
Download and Access Trade-offs in Lagrange Coded Computing
Lagrange Coded Computing (LCC) is a recently
proposed technique for resilient, secure, and private computation
of arbitrary polynomials in distributed environments. By
mapping such computations to composition of polynomials, LCC
allows the master node to complete the computation by accessing
a minimal number of workers and downloading all of their
content, thus providing resiliency to the remaining stragglers.
However, in the most common case in which the number of
stragglers is less than in the worst case scenario, much of the
computational power of the system remains unexploited. To
amend this issue, in this paper we expand LCC by studying a
fundamental trade-off between download and access, and present
two contributions. In the first contribution, it is shown that
without any modification to the encoding process, the master
can decode the computations by accessing a larger number of
nodes, however downloading less information from each node in
comparison with LCC (i.e., trading access for download). This
scheme relies on decoding a particular polynomial in the ideal
that is generated by the polynomials of interest, a technique we
call Ideal Decoding. This new scheme also improves LCC in the
sense that for systems with adversaries, the overall downloaded
bandwidth is smaller than in LCC. In the second contribution
we study a real-time model of this trade-off, in which the data
from the workers is downloaded sequentially. By clustering nodes
of similar delays and encoding the function with Universally
Decodable Matrices, the master can decode once sufficient data is
downloaded from every cluster, regardless of the internal delays
within that cluster. This allows the master to utilize the partial
work that is done by stragglers, rather than to ignore it, a feature
that most past works in coded computing are lacking
Storage Codes with Flexible Number of Nodes
This paper presents flexible storage codes, a class of error-correcting codes
that can recover information from a flexible number of storage nodes. As a
result, one can make a better use of the available storage nodes in the
presence of unpredictable node failures and reduce the data access latency. Let
us assume a storage system encodes information symbols over a finite
field into nodes, each of size symbols. The code is
parameterized by a set of tuples ,
satisfying and , such that the information symbols can be reconstructed from any
nodes, each node accessing symbols. In other words, the code
allows a flexible number of nodes for decoding to accommodate the variance in
the data access time of the nodes. Code constructions are presented for
different storage scenarios, including LRC (locally recoverable) codes, PMDS
(partial MDS) codes, and MSR (minimum storage regenerating) codes. We analyze
the latency of accessing information and perform simulations on Amazon clusters
to show the efficiency of presented codes
Error-Correcting Codes for Networks, Storage and Computation
The advent of the information age has bestowed upon us three challenges related to the way we deal with data. Firstly, there is an unprecedented demand for transmitting data at high rates. Secondly, the massive amounts of data being collected from various sources needs to be stored across time. Thirdly, there is a need to process the data collected and perform computations on it in order to extract meaningful information out of it. The interconnected nature of modern systems designed to perform these tasks has unraveled new difficulties when it comes to ensuring their resilience against sources of performance degradation. In the context of network communication and distributed data storage, system-level noise and adversarial errors have to be combated with efficient error correction schemes. In the case of distributed computation, the heterogeneous nature of computing clusters can potentially diminish the speedups promised by parallel algorithms, calling for schemes that mitigate the effect of slow machines and communication delay.
This thesis addresses the problem of designing efficient fault tolerance schemes for the three scenarios just described. In the network communication setting, a family of multiple-source multicast networks that employ linear network coding is considered for which capacity-achieving distributed error-correcting codes, based on classical algebraic constructions, are designed. The codes require no coordination between the source nodes and are end to end: except for the source nodes and the destination node, the operation of the network remains unchanged.
In the context of data storage, balanced error-correcting codes are constructed so that the encoding effort required is balanced out across the storage nodes. In particular, it is shown that for a fixed row weight, any cyclic Reed-Solomon code possesses a generator matrix in which the number of nonzeros is the same across the columns. In the balanced and sparsest case, where each row of the generator matrix is a minimum distance codeword, the maximal encoding time over the storage nodes is minimized, a property that is appealing in write-intensive settings. Analogous constructions are presented for a locally recoverable code construction due to Tamo and Barg.
Lastly, the problem of mitigating stragglers in a distributed computation setup is addressed, where a function of some dataset is computed in parallel. Using Reed-Solomon coding techniques, a scheme is proposed that allows for the recovery of the function under consideration from the minimum number of machines possible. The only assumption made on the function is that it is additively separable, which renders the scheme useful in distributed gradient descent implementations. Furthermore, a theoretical model for the run time of the scheme is presented. When the return time of the machines is modeled probabilistically, the model can be used to optimally pick the scheme's parameters so that the expected computation time is minimized. The recovery is performed using an algorithm that runs in quadratic time and linear space, a notable improvement compared to state-of-the-art schemes.
The unifying theme of the three scenarios is the construction of error-correcting codes whose encoding functions adhere to certain constraints. It is shown that in many cases, these constraints can be satisfied by classical constructions. As a result, the schemes presented are deterministic, operate over small finite fields and can be decoded using efficient algorithms.</p