19 research outputs found

    Download and Access Trade-offs in Lagrange Coded Computing

    Lagrange Coded Computing (LCC) is a recently proposed technique for resilient, secure, and private computation of arbitrary polynomials in distributed environments. By mapping such computations to composition of polynomials, LCC allows the master node to complete the computation by accessing a minimal number of workers and downloading all of their content, thus providing resiliency to the remaining stragglers. However, in the most common case, in which the number of stragglers is smaller than in the worst-case scenario, much of the computational power of the system remains unexploited. To remedy this, in this paper we expand LCC by studying a fundamental trade-off between download and access, and present two contributions. In the first contribution, it is shown that without any modification to the encoding process, the master can decode the computations by accessing a larger number of nodes while downloading less information from each node than in LCC (i.e., trading access for download). This scheme relies on decoding a particular polynomial in the ideal that is generated by the polynomials of interest, a technique we call Ideal Decoding. This new scheme also improves on LCC in the sense that for systems with adversaries, the overall downloaded bandwidth is smaller than in LCC. In the second contribution we study a real-time model of this trade-off, in which the data from the workers is downloaded sequentially. By clustering nodes of similar delays and encoding the function with Universally Decodable Matrices, the master can decode once sufficient data is downloaded from every cluster, regardless of the internal delays within that cluster. This allows the master to utilize the partial work that is done by stragglers rather than ignore it, a feature that most past works in coded computing lack.
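The baseline LCC idea that this abstract builds on can be sketched as follows. This is a minimal illustration of plain LCC without privacy or adversaries, not the paper's Ideal Decoding scheme. The prime modulus, the polynomial `f`, the data values, and the evaluation points `betas` and `alphas` are all assumptions chosen for the example.

```python
# Minimal LCC sketch over a prime field GF(P). All concrete values are assumptions.
P = 2_147_483_647  # a prime modulus, chosen only for illustration

def lagrange_eval(xs, ys, z, p=P):
    """Evaluate at z the unique polynomial of degree < len(xs) through (xs[i], ys[i])."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (z - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total

def f(x, p=P):
    """Hypothetical polynomial of interest, degree 2."""
    return (x * x + 3 * x + 1) % p

DEG_F = 2
data = [5, 11, 42]                  # K = 3 data symbols
betas = [1, 2, 3]                   # u(beta_i) = data[i]
alphas = [4, 5, 6, 7, 8, 9, 10]     # N = 7 workers, one evaluation point each

# Encoding: worker j stores u(alpha_j), where u interpolates the data.
shares = [lagrange_eval(betas, data, a) for a in alphas]

# Each worker applies f to its share, i.e. evaluates f(u(z)) at alpha_j.
results = {a: f(s) for a, s in zip(alphas, shares)}

# f(u(z)) has degree (K - 1) * deg f = 4, so any 5 results determine it.
need = (len(data) - 1) * DEG_F + 1

# Suppose the workers at alpha = 5 and alpha = 9 are stragglers.
fast = [a for a in alphas if a not in (5, 9)][:need]
decoded = [lagrange_eval(fast, [results[a] for a in fast], b) for b in betas]

print(decoded == [f(x) for x in data])
```

With these parameters the master needs 5 of the 7 workers, so up to 2 stragglers are tolerated. When fewer workers straggle, the extra results go unused in plain LCC, which is the inefficiency the abstract addresses.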

    Straggler-Resilient Distributed Computing

    The number and scale of distributed computing systems being built have increased significantly in recent years. This is primarily because (i) our computing needs are increasing at a much higher rate than computers are becoming faster, so we need to use more of them to meet demand, and (ii) systems that are fundamentally distributed, e.g., because the components that make them up are geographically distributed, are becoming increasingly prevalent. This paradigm shift is the source of many engineering challenges. Among them is the straggler problem, which is caused by latency variations in distributed systems, where faster nodes are held up by slower ones. The straggler problem can significantly impair the effectiveness of distributed systems: a single node experiencing a transient outage (e.g., due to being overloaded) can lock up an entire system. In this thesis, we consider schemes for making a range of computations resilient against such stragglers, thus allowing a distributed system to proceed in spite of some nodes failing to respond on time. The schemes we propose are tailored for particular computations. We propose schemes designed for distributed matrix-vector multiplication, which is a fundamental operation in many computing applications; distributed machine learning, in the form of a straggler-resilient first-order optimization method; and distributed tracking of a time-varying process (e.g., tracking the location of a set of vehicles for a collision avoidance system). The proposed schemes rely on exploiting redundancy that is either introduced as part of the scheme, or exists naturally in the underlying problem, to compensate for missing results, i.e., they are a form of forward error correction for computations. Further, for one of the proposed schemes we exploit redundancy to also improve the effectiveness of multicasting, thus reducing the amount of data that needs to be communicated over the network. Such inter-node communication, like the straggler problem, can significantly limit the effectiveness of distributed systems. For the schemes we propose, we are able to show significant improvements in latency and reliability compared to previous schemes. Doctoral thesis.
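The "forward error correction for computations" idea can be illustrated with a toy coded matrix-vector multiplication. This is a generic textbook-style sketch with a single parity block, not one of the thesis's schemes. The matrix `A`, the vector `x`, and the worker names are assumptions.

```python
# Toy straggler-resilient matrix-vector product: any 2 of 3 workers suffice.
# A, x and the worker names w1..w3 are hypothetical.

def matvec(rows, x):
    """Plain matrix-vector product on lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]

# Split A into two row blocks and add one parity block A1 + A2.
A1, A2 = A[:2], A[2:]
parity = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
tasks = {"w1": A1, "w2": A2, "w3": parity}

def decode(results):
    """Recover A @ x from the outputs of any two workers."""
    if "w1" in results and "w2" in results:
        return results["w1"] + results["w2"]
    if "w1" in results:
        y1 = results["w1"]
        y2 = [p - a for p, a in zip(results["w3"], y1)]  # (A1+A2)x - A1x
        return y1 + y2
    y2 = results["w2"]
    y1 = [p - b for p, b in zip(results["w3"], y2)]      # (A1+A2)x - A2x
    return y1 + y2

# Worker w1 straggles; the master decodes from w2 and w3 only.
results = {w: matvec(tasks[w], x) for w in ("w2", "w3")}
y = decode(results)
print(y == matvec(A, x))
```

The price of tolerating one straggler here is 50% extra computation: three half-size tasks instead of two.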

    Randomized Polar Codes for Anytime Distributed Machine Learning

    We present a novel distributed computing framework that is robust to slow compute nodes and is capable of both approximate and exact computation of linear operations. The proposed mechanism integrates the concepts of randomized sketching and polar codes in the context of coded computation. We propose a sequential decoding algorithm designed to handle real-valued data while maintaining low computational complexity for recovery. Additionally, we provide an anytime estimator that can generate provably accurate estimates even when the set of available node outputs is not decodable. We demonstrate the potential applications of this framework in various contexts, such as large-scale matrix multiplication and black-box optimization. We present the implementation of these methods on a serverless cloud computing system and provide numerical results to demonstrate their scalability in practice, including ImageNet-scale computations.

    Securely Aggregated Coded Matrix Inversion

    Coded computing is a method for mitigating straggling workers in a centralized computing network by using erasure-coding techniques. Federated learning is a decentralized model for training data distributed across client devices. In this work we propose approximating the inverse of an aggregated data matrix, where the data is generated by clients, similar to the federated learning paradigm, while also being resilient to stragglers. To do so, we propose a coded computing method based on gradient coding. We modify this method so that the coordinator does not access the local data at any point, while the clients access the aggregated matrix in order to complete their tasks. The network we consider is not centrally administrated, and the communications which take place are secure against potential eavesdroppers. Comment: arXiv admin note: substantial text overlap with arXiv:2207.0627
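For readers unfamiliar with gradient coding, the sketch below shows the well-known three-worker construction that tolerates one straggler. It illustrates only the underlying primitive, not the secure matrix-inversion scheme of this paper. The partial gradients `g1`, `g2`, `g3` are hypothetical scalars.

```python
# Gradient coding sketch: 3 workers, any 2 suffice to recover g1 + g2 + g3.
# The gradient values are hypothetical; real gradients would be vectors.
from fractions import Fraction

g1, g2, g3 = Fraction(4), Fraction(-1), Fraction(7)
half = Fraction(1, 2)

# Each worker holds two data partitions and sends one linear combination.
messages = {
    1: half * g1 + g2,
    2: g2 - g3,
    3: half * g1 + g3,
}

# Decoding coefficients for every pair of responding workers.
decoders = {
    frozenset({1, 2}): {1: 2, 2: -1},
    frozenset({1, 3}): {1: 1, 3: 1},
    frozenset({2, 3}): {2: 1, 3: 2},
}

def recover(received):
    """Combine the messages of the two workers that responded."""
    coeffs = decoders[frozenset(received)]
    return sum(coeffs[w] * received[w] for w in coeffs)

# Worker 3 straggles; the master uses workers 1 and 2.
total = recover({1: messages[1], 2: messages[2]})
print(total == g1 + g2 + g3)
```

Each worker processes two of the three partitions, so the redundancy factor is 2 in exchange for tolerating one straggler.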

    Storage Codes with Flexible Number of Nodes

    This paper presents flexible storage codes, a class of error-correcting codes that can recover information from a flexible number of storage nodes. As a result, one can make better use of the available storage nodes in the presence of unpredictable node failures and reduce the data access latency. Let us assume a storage system encodes $k\ell$ information symbols over a finite field $\mathbb{F}$ into $n$ nodes, each of size $\ell$ symbols. The code is parameterized by a set of tuples $\{(R_j,k_j,\ell_j): 1 \le j \le a\}$, satisfying $k_1\ell_1=k_2\ell_2=\dots=k_a\ell_a$ and $k_1>k_2>\dots>k_a=k$, $\ell_a=\ell$, such that the information symbols can be reconstructed from any $R_j$ nodes, each node accessing $\ell_j$ symbols. In other words, the code allows a flexible number of nodes for decoding to accommodate the variance in the data access time of the nodes. Code constructions are presented for different storage scenarios, including LRC (locally recoverable) codes, PMDS (partial MDS) codes, and MSR (minimum storage regenerating) codes. We analyze the latency of accessing information and perform simulations on Amazon clusters to show the efficiency of the presented codes.
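The parameter constraints stated in this abstract can be checked mechanically. The helper below is a hypothetical validator, not part of the paper. The example tuples assume $R_j = k_j$, which is an assumption made only for illustration.

```python
# Hypothetical validator for the flexible-code parameter constraints:
#   k_1*l_1 = ... = k_a*l_a,  k_1 > k_2 > ... > k_a = k,  l_a = l.

def is_valid_flexible_params(tuples, k, ell):
    """tuples is a list of (R_j, k_j, ell_j) for j = 1..a."""
    ks = [kj for _, kj, _ in tuples]
    ells = [lj for _, _, lj in tuples]
    same_product = all(kj * lj == k * ell for kj, lj in zip(ks, ells))
    decreasing = all(a > b for a, b in zip(ks, ks[1:]))
    ends_match = ks[-1] == k and ells[-1] == ell
    return same_product and decreasing and ends_match

# Example with k = 2, ell = 3 (6 information symbols). Assuming R_j = k_j,
# a reader could access 1 symbol from each of 6 nodes, 2 symbols from each
# of 3 nodes, or 3 symbols from each of 2 nodes.
params = [(6, 6, 1), (3, 3, 2), (2, 2, 3)]
print(is_valid_flexible_params(params, k=2, ell=3))
```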

    Error-Correcting Codes for Networks, Storage and Computation

    The advent of the information age has bestowed upon us three challenges related to the way we deal with data. Firstly, there is an unprecedented demand for transmitting data at high rates. Secondly, the massive amounts of data being collected from various sources need to be stored across time. Thirdly, there is a need to process the data collected and perform computations on it in order to extract meaningful information out of it. The interconnected nature of modern systems designed to perform these tasks has revealed new difficulties when it comes to ensuring their resilience against sources of performance degradation. In the context of network communication and distributed data storage, system-level noise and adversarial errors have to be combated with efficient error correction schemes. In the case of distributed computation, the heterogeneous nature of computing clusters can potentially diminish the speedups promised by parallel algorithms, calling for schemes that mitigate the effect of slow machines and communication delay. This thesis addresses the problem of designing efficient fault tolerance schemes for the three scenarios just described. In the network communication setting, a family of multiple-source multicast networks that employ linear network coding is considered, for which capacity-achieving distributed error-correcting codes, based on classical algebraic constructions, are designed. The codes require no coordination between the source nodes and are end to end: except for the source nodes and the destination node, the operation of the network remains unchanged. In the context of data storage, balanced error-correcting codes are constructed so that the encoding effort required is balanced out across the storage nodes. In particular, it is shown that for a fixed row weight, any cyclic Reed-Solomon code possesses a generator matrix in which the number of nonzeros is the same across the columns. In the balanced and sparsest case, where each row of the generator matrix is a minimum distance codeword, the maximal encoding time over the storage nodes is minimized, a property that is appealing in write-intensive settings. Analogous constructions are presented for a locally recoverable code construction due to Tamo and Barg. Lastly, the problem of mitigating stragglers in a distributed computation setup is addressed, where a function of some dataset is computed in parallel. Using Reed-Solomon coding techniques, a scheme is proposed that allows for the recovery of the function under consideration from the minimum number of machines possible. The only assumption made on the function is that it is additively separable, which renders the scheme useful in distributed gradient descent implementations. Furthermore, a theoretical model for the run time of the scheme is presented. When the return time of the machines is modeled probabilistically, the model can be used to optimally pick the scheme's parameters so that the expected computation time is minimized. The recovery is performed using an algorithm that runs in quadratic time and linear space, a notable improvement compared to state-of-the-art schemes. The unifying theme of the three scenarios is the construction of error-correcting codes whose encoding functions adhere to certain constraints. It is shown that in many cases, these constraints can be satisfied by classical constructions. As a result, the schemes presented are deterministic, operate over small finite fields and can be decoded using efficient algorithms.
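As a hedged illustration of what a probabilistic run-time model can look like, assume (the abstract does not state this) that the $n$ machines' return times are i.i.d. exponential with rate $\mu$. The expected wait for the $k$ fastest machines is then $(H_n - H_{n-k})/\mu$, where $H_m$ is the $m$-th harmonic number. The function names below are hypothetical.

```python
# Expected time until the k fastest of n machines return, under the ASSUMED
# model of i.i.d. exponential return times with the given rate. This is an
# illustration only; the thesis's actual model may differ.
from fractions import Fraction

def harmonic(m):
    """m-th harmonic number H_m as an exact fraction (H_0 = 0)."""
    return sum(Fraction(1, i) for i in range(1, m + 1))

def expected_wait(n, k, rate=1):
    """E[k-th order statistic of n i.i.d. Exp(rate) variables]."""
    return (harmonic(n) - harmonic(n - k)) / rate

# Waiting for all 10 machines versus only the fastest 8.
print(expected_wait(10, 10), expected_wait(10, 8))
```

This formula ignores the fact that tolerating more stragglers usually means more work per machine. A full parameter optimization would have to weigh the two effects against each other.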