165 research outputs found

    Efficient Task Replication for Fast Response Times in Parallel Computation

    Full text link
    One typical use case of large-scale distributed computing in data centers is to decompose a computation job into many independent tasks and run them in parallel on different machines, sometimes known as the "embarrassingly parallel" computation. For this type of computation, one challenge is that the time to execute a task for each machine is inherently variable, and the overall response time is constrained by the execution time of the slowest machine. To address this issue, system designers introduce task replication, which sends the same task to multiple machines, and obtains result from the machine that finishes first. While task replication reduces response time, it usually increases resource usage. In this work, we propose a theoretical framework to analyze the trade-off between response time and resource usage. We show that, while in general, there is a tension between response time and resource usage, there exist scenarios where replicating tasks judiciously reduces completion time and resource usage simultaneously. Given the execution time distribution for machines, we investigate the conditions for a scheduling policy to achieve optimal performance trade-off, and propose efficient algorithms to search for optimal or near-optimal scheduling policies. Our analysis gives insights on when and why replication helps, which can be used to guide scheduler design in large-scale distributed computing systems.Comment: Extended version of the 2-page paper accepted to ACM SIGMETRICS 201

    Straggler-Resilient Distributed Computing

    Get PDF
    In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of University of Bergen's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink.Utbredelsen av distribuerte datasystemer har økt betydelig de siste årene. Dette skyldes først og fremst at behovet for beregningskraft øker raskere enn hastigheten til en enkelt datamaskin, slik at vi må bruke flere datamaskiner for å møte etterspørselen, og at det blir stadig mer vanlig at systemer er spredt over et stort geografisk område. Dette paradigmeskiftet medfører mange tekniske utfordringer. En av disse er knyttet til "straggler"-problemet, som er forårsaket av forsinkelsesvariasjoner i distribuerte systemer, der en beregning forsinkes av noen få langsomme noder slik at andre noder må vente før de kan fortsette. Straggler-problemet kan svekke effektiviteten til distribuerte systemer betydelig i situasjoner der en enkelt node som opplever en midlertidig overbelastning kan låse et helt system. I denne avhandlingen studerer vi metoder for å gjøre beregninger av forskjellige typer motstandsdyktige mot slike problemer, og dermed gjøre det mulig for et distribuert system å fortsette til tross for at noen noder ikke svarer i tide. Metodene vi foreslår er skreddersydde for spesielle typer beregninger. Vi foreslår metoder tilpasset distribuert matrise-vektor-multiplikasjon (som er en grunnleggende operasjon i mange typer beregninger), distribuert maskinlæring og distribuert sporing av en tilfeldig prosess (for eksempel det å spore plasseringen til kjøretøy for å unngå kollisjon). De foreslåtte metodene utnytter redundans som enten blir introdusert som en del av metoden, eller som naturlig eksisterer i det underliggende problemet, til å kompensere for manglende delberegninger. For en av de foreslåtte metodene utnytter vi redundans for også å øke effektiviteten til kommunikasjonen mellom noder, og dermed redusere mengden data som må kommuniseres over nettverket. I likhet med straggler-problemet kan slik kommunikasjon begrense effektiviteten i distribuerte systemer betydelig. De foreslåtte metodene gir signifikante forbedringer i ventetid og pålitelighet sammenlignet med tidligere metoder.The number and scale of distributed computing systems being built have increased significantly in recent years. Primarily, that is because: i) our computing needs are increasing at a much higher rate than computers are becoming faster, so we need to use more of them to meet demand, and ii) systems that are fundamentally distributed, e.g., because the components that make them up are geographically distributed, are becoming increasingly prevalent. This paradigm shift is the source of many engineering challenges. Among them is the straggler problem, which is a problem caused by latency variations in distributed systems, where faster nodes are held up by slower ones. The straggler problem can significantly impair the effectiveness of distributed systems—a single node experiencing a transient outage (e.g., due to being overloaded) can lock up an entire system. In this thesis, we consider schemes for making a range of computations resilient against such stragglers, thus allowing a distributed system to proceed in spite of some nodes failing to respond on time. The schemes we propose are tailored for particular computations. We propose schemes designed for distributed matrix-vector multiplication, which is a fundamental operation in many computing applications, distributed machine learning—in the form of a straggler-resilient first-order optimization method—and distributed tracking of a time-varying process (e.g., tracking the location of a set of vehicles for a collision avoidance system). The proposed schemes rely on exploiting redundancy that is either introduced as part of the scheme, or exists naturally in the underlying problem, to compensate for missing results, i.e., they are a form of forward error correction for computations. Further, for one of the proposed schemes we exploit redundancy to also improve the effectiveness of multicasting, thus reducing the amount of data that needs to be communicated over the network. Such inter-node communication, like the straggler problem, can significantly limit the effectiveness of distributed systems. For the schemes we propose, we are able to show significant improvements in latency and reliability compared to previous schemes.Doktorgradsavhandlin
    corecore