19 research outputs found

Download and Access Trade-offs in Lagrange Coded Computing

Author: Avestimehr Salman
Bruck Jehoshua
Raviv Netanel
Yu Qian
Publication venue: 'California Institute of Technology Library'
Publication date: 10/01/2019

Lagrange Coded Computing (LCC) is a recently proposed technique for resilient, secure, and private computation of arbitrary polynomials in distributed environments. By mapping such computations to a composition of polynomials, LCC allows the master node to complete the computation by accessing a minimal number of workers and downloading all of their content, thus providing resiliency to the remaining stragglers. However, in the most common case, in which the number of stragglers is smaller than in the worst-case scenario, much of the computational power of the system remains unexploited. To address this issue, in this paper we expand LCC by studying a fundamental trade-off between download and access, and present two contributions. In the first contribution, it is shown that without any modification to the encoding process, the master can decode the computations by accessing a larger number of nodes while downloading less information from each node than in LCC (i.e., trading access for download). This scheme relies on decoding a particular polynomial in the ideal that is generated by the polynomials of interest, a technique we call Ideal Decoding. This new scheme also improves on LCC in the sense that, for systems with adversaries, the overall downloaded bandwidth is smaller than in LCC. In the second contribution we study a real-time model of this trade-off, in which the data from the workers is downloaded sequentially. By clustering nodes of similar delays and encoding the function with Universally Decodable Matrices, the master can decode once sufficient data is downloaded from every cluster, regardless of the internal delays within that cluster. This allows the master to utilize the partial work that is done by stragglers rather than ignore it, a feature that most past works in coded computing lack.
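As a rough illustration of the baseline that this abstract builds on, the following sketch shows plain Lagrange coded computing without privacy or adversaries: the data is interpolated by a polynomial, each worker evaluates the target polynomial on one coded share, and the master interpolates the composed polynomial from any sufficiently large set of responders. The prime `P`, the example polynomial `f`, the evaluation points, and the choice of stragglers are all assumptions made for illustration; the sketch does not implement Ideal Decoding or the clustered real-time scheme.

```python
# Minimal sketch of baseline Lagrange coded computing (no privacy, no adversaries).
# All concrete values below are illustrative assumptions, not taken from the paper.
P = 2_147_483_647  # a prime modulus; arithmetic is done in the field of integers mod P


def lagrange_eval(xs, ys, z, p=P):
    """Evaluate at z the unique polynomial of degree < len(xs) through (xs[i], ys[i]) mod p."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (z - xj) % p
                den = den * (xi - xj) % p
        # pow(den, -1, p) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, p)) % p
    return total


def f(x, p=P):
    """Example polynomial of degree 2 that the master wants evaluated on every data item."""
    return (x * x + 3 * x + 1) % p


data = [5, 11, 42]            # K = 3 data items
K, deg_f, N = len(data), 2, 8  # N = 8 workers
betas = [1, 2, 3]             # interpolation points: u(betas[j]) == data[j]
alphas = list(range(4, 4 + N))  # one evaluation point per worker

# Encoding: worker i stores the coded share u(alphas[i]).
shares = [lagrange_eval(betas, data, a) for a in alphas]

# Each worker applies f to its share, i.e. it computes (f o u)(alphas[i]).
results = [f(s) for s in shares]

# f o u has degree deg_f * (K - 1), so this many results determine it.
needed = deg_f * (K - 1) + 1

# Suppose workers 0, 3 and 6 straggle; the master uses the other five.
responders = [1, 2, 4, 5, 7]
xs = [alphas[i] for i in responders]
ys = [results[i] for i in responders]

# Decoding: interpolate f o u and read it off at the betas.
decoded = [lagrange_eval(xs, ys, b) for b in betas]
print(decoded == [f(x) for x in data])
```

In this baseline the master downloads every responder's full result, which is the access-minimal end of the trade-off that the abstract describes.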

Crossref

Caltech Authors

Straggler-Resilient Distributed Computing

Author: Severinson Lars Albin
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2022

In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of University of Bergen's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink.

The number and scale of distributed computing systems being built have increased significantly in recent years. Primarily, that is because: i) our computing needs are increasing at a much higher rate than computers are becoming faster, so we need to use more of them to meet demand, and ii) systems that are fundamentally distributed, e.g., because the components that make them up are geographically distributed, are becoming increasingly prevalent. This paradigm shift is the source of many engineering challenges. Among them is the straggler problem, which is caused by latency variations in distributed systems, where faster nodes are held up by slower ones. The straggler problem can significantly impair the effectiveness of distributed systems: a single node experiencing a transient outage (e.g., due to being overloaded) can lock up an entire system. In this thesis, we consider schemes for making a range of computations resilient against such stragglers, thus allowing a distributed system to proceed in spite of some nodes failing to respond on time. The schemes we propose are tailored for particular computations.

We propose schemes designed for distributed matrix-vector multiplication, which is a fundamental operation in many computing applications; distributed machine learning, in the form of a straggler-resilient first-order optimization method; and distributed tracking of a time-varying process (e.g., tracking the location of a set of vehicles for a collision avoidance system). The proposed schemes rely on exploiting redundancy that is either introduced as part of the scheme, or exists naturally in the underlying problem, to compensate for missing results, i.e., they are a form of forward error correction for computations. Further, for one of the proposed schemes we exploit redundancy to also improve the effectiveness of multicasting, thus reducing the amount of data that needs to be communicated over the network. Such inter-node communication, like the straggler problem, can significantly limit the effectiveness of distributed systems. For the schemes we propose, we are able to show significant improvements in latency and reliability compared to previous schemes.

Doctoral thesis
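To make the idea of "forward error correction for computations" concrete, here is a minimal sketch of straggler-resilient matrix-vector multiplication using a simple single-parity code over three workers. This is a textbook-style construction chosen for brevity, not one of the schemes proposed in the thesis; the function names and the three-worker layout are assumptions.

```python
# Minimal sketch: coded matrix-vector multiplication that tolerates one straggler.
# Assumed setup: three workers, A has an even number of rows.

def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def encode(A):
    """Split A into two row blocks and add a parity block A1 + A2."""
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    parity = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return [A1, A2, parity]


def decode(results):
    """Recover A @ x from any two of the three worker results.

    `results` maps worker index (0, 1, or 2) to that worker's partial product.
    """
    if 0 in results and 1 in results:
        y1, y2 = results[0], results[1]
    elif 0 in results:
        y1 = results[0]
        y2 = [p - a for p, a in zip(results[2], y1)]  # (A1 + A2)x - A1 x
    else:
        y2 = results[1]
        y1 = [p - b for p, b in zip(results[2], y2)]  # (A1 + A2)x - A2 x
    return y1 + y2


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [2, 1]
blocks = encode(A)
all_results = {i: matvec(block, x) for i, block in enumerate(blocks)}

# Drop each worker in turn to simulate a straggler; decoding still succeeds.
for straggler in range(3):
    received = {i: r for i, r in all_results.items() if i != straggler}
    print(straggler, decode(received))
```

The cost of this resilience is redundancy: each worker handles half of the rows, but three workers are used instead of two.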

University of Bergen

NORA - Norwegian Open Research Archives

Randomized Polar Codes for Anytime Distributed Machine Learning

Author: Bartan Burak
Pilanci Mert
Publication date: 01/09/2023

We present a novel distributed computing framework that is robust to slow compute nodes, and is capable of both approximate and exact computation of linear operations. The proposed mechanism integrates the concepts of randomized sketching and polar codes in the context of coded computation. We propose a sequential decoding algorithm designed to handle real-valued data while maintaining low computational complexity for recovery. Additionally, we provide an anytime estimator that can generate provably accurate estimates even when the set of available node outputs is not decodable. We demonstrate the potential applications of this framework in various contexts, such as large-scale matrix multiplication and black-box optimization. We present the implementation of these methods on a serverless cloud computing system and provide numerical results to demonstrate their scalability in practice, including ImageNet-scale computations.

arXiv.org e-Print Archive

DISTRIBUTED LEARNING ALGORITHMS: COMMUNICATION EFFICIENCY AND ERROR RESILIENCE

Author: Maity Raj Kumar
Publication venue: ScholarWorks@UMass Amherst
Publication date: 18/03/2022

In modern machine learning applications such as self-driving cars, recommender systems, robotics, and genetics, the size of the training data has grown to the point that it has become essential to design distributed learning algorithms. A general framework for distributed learning is data parallelism, where the data is distributed among the worker machines for parallel processing and computation to speed up learning. With billions of devices such as cellphones and computers, the data is inherently distributed and stored locally on the users' devices. Learning in this setup is popularly known as Federated Learning. The speed-up due to the distributed framework is hindered by some fundamental problems such as straggler workers, a communication bottleneck due to high communication overhead between workers and the central server, and adversarial failure, popularly known as Byzantine failure. In this thesis, we study and develop distributed algorithms that are error resilient and communication efficient. First, we address the problem of straggler workers, where the learning is delayed due to slow workers in the distributed setup. To mitigate the effect of the stragglers, we employ LDPC (low-density parity-check) codes to encode the data and implement the gradient descent algorithm in the distributed setup. Second, we present a family of vector quantization schemes, vqSGD (vector quantized Stochastic Gradient Descent), that provides an asymptotic reduction in the communication cost with convergence guarantees in first-order distributed optimization. We also show that vqSGD provides a strong privacy guarantee. Third, we address the problem of Byzantine failure together with communication efficiency in the first-order gradient descent algorithm. We consider a generic class of $\delta$-approximate compressors for communication efficiency and employ a simple norm-based thresholding scheme to make the learning algorithm robust to Byzantine failures. We establish a statistical error rate for non-convex smooth loss. Moreover, we analyze the compressed gradient descent algorithm with error feedback in a distributed setting and in the presence of Byzantine worker machines. Fourth, we employ the generic class of $\delta$-approximate compressors to develop a communication-efficient second-order Newton-type algorithm and provide a rate of convergence for smooth objectives. Fifth, we propose COMRADE (COMmunication-efficient and Robust Approximate Distributed nEwton), an iterative second-order algorithm that is communication efficient as well as robust against Byzantine failures. Sixth, we propose a distributed cubic-regularized Newton algorithm that can escape saddle points effectively for non-convex loss functions and find a local minimum. Furthermore, the proposed algorithm can resist the attack of the Byzantine machines, which may create fake local minima near the saddle points of the loss function, also known as a saddle-point attack.
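The norm-based thresholding idea mentioned in this abstract can be sketched roughly as follows: the server ranks the received worker updates by Euclidean norm, discards a fixed fraction with the largest norms, and averages the rest. The function name, the `trim_fraction` parameter, and the example vectors are assumptions for illustration; the thesis combines thresholding with compression and gives the actual algorithm and guarantees, none of which are reproduced here.

```python
import math

# Minimal sketch of norm-based thresholding for Byzantine-robust aggregation.
# `trim_fraction` is a hypothetical parameter: the share of updates to discard.

def norm_threshold_mean(updates, trim_fraction):
    """Average the updates after dropping those with the largest Euclidean norms."""
    n_drop = math.floor(trim_fraction * len(updates))
    n_keep = len(updates) - n_drop
    if n_keep <= 0:
        raise ValueError("trim_fraction removes every update")
    # Sort by norm (ascending) and keep the n_keep smallest.
    ranked = sorted(updates, key=lambda u: math.sqrt(sum(v * v for v in u)))
    kept = ranked[:n_keep]
    dim = len(kept[0])
    return [sum(u[i] for u in kept) / n_keep for i in range(dim)]


# Three honest workers report similar gradients; one Byzantine worker reports a huge one.
updates = [
    [1.0, 1.0],
    [1.5, 0.5],
    [0.5, 1.5],
    [100.0, -100.0],  # adversarial update with a very large norm
]
aggregate = norm_threshold_mean(updates, trim_fraction=0.25)
print(aggregate)
```

An adversarial update with a small norm would pass this filter, so the rule only bounds how much damage a single retained update can do rather than identifying Byzantine workers.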

ScholarWorks@UMass Amherst

Securely Aggregated Coded Matrix Inversion

Author: Charalambides Neophytos
Hero Alfred
Pilanci Mert
Publication date: 04/09/2023

Coded computing is a method for mitigating straggling workers in a centralized computing network by using erasure-coding techniques. Federated learning is a decentralized model for training on data distributed across client devices. In this work we propose approximating the inverse of an aggregated data matrix, where the data is generated by clients, similar to the federated learning paradigm, while also being resilient to stragglers. To do so, we propose a coded computing method based on gradient coding. We modify this method so that the coordinator does not access the local data at any point, while the clients access the aggregated matrix in order to complete their tasks. The network we consider is not centrally administrated, and the communications which take place are secure against potential eavesdroppers. Comment: arXiv admin note: substantial text overlap with arXiv:2207.0627
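Since the proposed method is described as being based on gradient coding, a small sketch of generic gradient coding may help. It uses the well-known three-worker, one-straggler construction: each worker holds two of three data partitions and sends one linear combination of its partial gradients, and the sum of all three partial gradients is recoverable from any two messages. The coefficient tables and function names are illustrative assumptions; the secure aggregation and matrix inversion parts of the paper are not shown.

```python
# Minimal sketch of gradient coding with 3 workers tolerating 1 straggler.
# Row i of ENCODE gives worker i's coefficients on the partial gradients g0, g1, g2;
# a zero coefficient means the worker does not need that data partition.
ENCODE = [
    [0.5, 1.0, 0.0],   # worker 0 sends 0.5*g0 + g1
    [0.0, 1.0, -1.0],  # worker 1 sends g1 - g2
    [0.5, 0.0, 1.0],   # worker 2 sends 0.5*g0 + g2
]

# For each pair of responders, the weights that combine their messages into g0 + g1 + g2.
DECODE = {
    (0, 1): (2.0, -1.0),
    (0, 2): (1.0, 1.0),
    (1, 2): (1.0, 2.0),
}


def worker_message(i, partial_grads):
    """Linear combination that worker i sends to the coordinator."""
    dim = len(partial_grads[0])
    return [
        sum(ENCODE[i][k] * partial_grads[k][d] for k in range(3))
        for d in range(dim)
    ]


def decode(messages):
    """Recover the full gradient from the messages of any two workers.

    `messages` maps worker index to its message; only the first two
    responders (by index) are used.
    """
    a, b = sorted(messages)[:2]
    wa, wb = DECODE[(a, b)]
    return [wa * ma + wb * mb for ma, mb in zip(messages[a], messages[b])]


partial_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # g0, g1, g2
messages = {i: worker_message(i, partial_grads) for i in range(3)}

for straggler in range(3):
    received = {i: m for i, m in messages.items() if i != straggler}
    print(straggler, decode(received))
```

Each worker transmits a single vector of the same size as one partial gradient, so tolerating the straggler costs extra local computation rather than extra communication.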

arXiv.org e-Print Archive

Storage Codes with Flexible Number of Nodes

Author: Jafarkhani Hamid
Li Weiqi
Lu Taiting
Wang Zhiying
Publication date: 21/06/2021

This paper presents flexible storage codes, a class of error-correcting codes that can recover information from a flexible number of storage nodes. As a result, one can make better use of the available storage nodes in the presence of unpredictable node failures and reduce the data access latency. Let us assume a storage system encodes $k\ell$ information symbols over a finite field $\mathbb{F}$ into $n$ nodes, each of size $\ell$ symbols. The code is parameterized by a set of tuples $\{(R_j,k_j,\ell_j): 1 \le j \le a\}$, satisfying $k_1\ell_1=k_2\ell_2=\dots=k_a\ell_a$ and $k_1>k_2>\dots>k_a = k$, $\ell_a=\ell$, such that the information symbols can be reconstructed from any $R_j$ nodes, each node accessing $\ell_j$ symbols. In other words, the code allows a flexible number of nodes for decoding to accommodate the variance in the data access time of the nodes. Code constructions are presented for different storage scenarios, including LRC (locally recoverable) codes, PMDS (partial MDS) codes, and MSR (minimum storage regenerating) codes. We analyze the latency of accessing information and perform simulations on Amazon clusters to show the efficiency of the presented codes.
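A small numerical instance may make the parameter constraints easier to read. The values below are hypothetical and chosen only to satisfy the stated conditions; the $R_j$ are left unspecified because the abstract does not fix them. The final line is a simple counting bound, not a result quoted from the paper: the $R_j\ell_j$ accessed symbols must be at least as many as the $k\ell$ information symbols.

```latex
% Hypothetical parameters: k = 2, \ell = 6, a = 3
\begin{align*}
(k_1,\ell_1) &= (6,2), \qquad (k_2,\ell_2) = (4,3), \qquad (k_3,\ell_3) = (2,6), \\
k_1\ell_1 &= k_2\ell_2 = k_3\ell_3 = 12 = k\ell, \\
k_1 &> k_2 > k_3 = k = 2, \qquad \ell_3 = \ell = 6, \\
R_j\ell_j &\ge k\ell = k_j\ell_j \;\Longrightarrow\; R_j \ge k_j .
\end{align*}
```

In this instance a decoder can read 2 symbols from each of many nodes, 3 symbols from each of fewer nodes, or all 6 symbols from the fewest nodes, which is the flexibility the abstract describes.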

arXiv.org e-Print Archive

eScholarship - University of California

Error-Correcting Codes for Networks, Storage and Computation

Author: Halbawi Wael
Publication date: 01/01/2017

The advent of the information age has bestowed upon us three challenges related to the way we deal with data. Firstly, there is an unprecedented demand for transmitting data at high rates. Secondly, the massive amounts of data being collected from various sources need to be stored across time. Thirdly, there is a need to process the data collected and perform computations on it in order to extract meaningful information from it. The interconnected nature of modern systems designed to perform these tasks has revealed new difficulties when it comes to ensuring their resilience against sources of performance degradation. In the context of network communication and distributed data storage, system-level noise and adversarial errors have to be combated with efficient error correction schemes. In the case of distributed computation, the heterogeneous nature of computing clusters can potentially diminish the speedups promised by parallel algorithms, calling for schemes that mitigate the effect of slow machines and communication delay. This thesis addresses the problem of designing efficient fault tolerance schemes for the three scenarios just described. In the network communication setting, a family of multiple-source multicast networks that employ linear network coding is considered, for which capacity-achieving distributed error-correcting codes, based on classical algebraic constructions, are designed. The codes require no coordination between the source nodes and are end to end: except for the source nodes and the destination node, the operation of the network remains unchanged. In the context of data storage, balanced error-correcting codes are constructed so that the encoding effort required is balanced out across the storage nodes. In particular, it is shown that for a fixed row weight, any cyclic Reed-Solomon code possesses a generator matrix in which the number of nonzeros is the same across the columns.
In the balanced and sparsest case, where each row of the generator matrix is a minimum distance codeword, the maximal encoding time over the storage nodes is minimized, a property that is appealing in write-intensive settings. Analogous constructions are presented for a locally recoverable code construction due to Tamo and Barg. Lastly, the problem of mitigating stragglers in a distributed computation setup is addressed, where a function of some dataset is computed in parallel. Using Reed-Solomon coding techniques, a scheme is proposed that allows for the recovery of the function under consideration from the minimum number of machines possible. The only assumption made on the function is that it is additively separable, which renders the scheme useful in distributed gradient descent implementations. Furthermore, a theoretical model for the run time of the scheme is presented. When the return time of the machines is modeled probabilistically, the model can be used to optimally pick the scheme's parameters so that the expected computation time is minimized. The recovery is performed using an algorithm that runs in quadratic time and linear space, a notable improvement compared to state-of-the-art schemes. The unifying theme of the three scenarios is the construction of error-correcting codes whose encoding functions adhere to certain constraints. It is shown that in many cases, these constraints can be satisfied by classical constructions. As a result, the schemes presented are deterministic, operate over small finite fields and can be decoded using efficient algorithms.

Caltech Theses and Dissertations