77 research outputs found

Improved Latency-Communication Trade-Off for Map-Shuffle-Reduce Systems with Stragglers

Author: Simeone Osvaldo
Zhang Jingjing
Publication date: 21/08/2018

In a distributed computing system operating according to the map-shuffle-reduce framework, coding data prior to storage can be useful both to reduce the latency caused by straggling servers and to decrease the inter-server communication load in the shuffling phase. In prior work, a concatenated coding scheme was proposed for a matrix multiplication task. In this scheme, the outer Maximum Distance Separable (MDS) code is leveraged to correct erasures caused by stragglers, while the inner repetition code is used to improve the communication efficiency in the shuffling phase by means of coded multicasting. In this work, it is demonstrated that it is possible to leverage the redundancy created by repetition coding in order to increase the rate of the outer MDS code and hence to increase the multicasting opportunities in the shuffling phase. As a result, the proposed approach is shown to improve over the best known latency-communication overhead trade-off. Comment: 11 pages, 4 figures
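The erasure-correcting role of the outer MDS code can be illustrated with a minimal sketch. This is not the concatenated scheme from the paper: it omits the inner repetition code and the shuffling phase, and it uses a Vandermonde generator over the rationals as an illustrative stand-in for an MDS code. All function names (`encode`, `worker`, `decode`) and the toy matrix are assumptions made for this example.

```python
from fractions import Fraction

def generator_row(i, k):
    """Row i of a Vandermonde generator with evaluation point i + 1 (assumed construction)."""
    return [Fraction(i + 1) ** j for j in range(k)]

def encode(A, n):
    """Mix the k rows of A into n coded rows, one per worker."""
    k = len(A)
    coded = []
    for i in range(n):
        g = generator_row(i, k)
        coded.append([sum(g[j] * A[j][c] for j in range(k)) for c in range(len(A[0]))])
    return coded

def worker(coded_row, x):
    """Each worker computes one inner product on its coded row."""
    return sum(a * b for a, b in zip(coded_row, x))

def solve(M, y):
    """Solve M z = y exactly by Gauss-Jordan elimination over the rationals."""
    k = len(M)
    aug = [list(row) + [val] for row, val in zip(M, y)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [aug[r][k] for r in range(k)]

def decode(results, k):
    """Recover A @ x from the first k (worker_index, value) pairs received."""
    chosen = results[:k]
    M = [generator_row(i, k) for i, _ in chosen]
    y = [v for _, v in chosen]
    return solve(M, y)

A = [[1, 2], [3, 4], [5, 6]]  # k = 3 rows to protect
x = [1, 1]
n = 5                         # 5 workers, so any 2 may straggle
coded = encode(A, n)

# Workers 1 and 3 straggle; only workers 0, 2 and 4 respond.
responses = [(i, worker(coded[i], x)) for i in (0, 2, 4)]
recovered = decode(responses, k=len(A))
```

Because the evaluation points are distinct, any three rows of the generator are invertible, so the product can be recovered from whichever three workers answer first. Exact rational arithmetic is used here only to keep the toy example free of rounding issues.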

arXiv.org e-Print Archive

Crossref

King's Research Portal

Coded Federated Computing in Wireless Networks with Straggling Devices and Imperfect CSI

Author: Ha Sukjong
Kang Joonhyuk
Simeone Osvaldo
Zhang Jingjing
Publication date: 16/01/2019

Distributed computing platforms typically assume the availability of reliable and dedicated connections among the processors. This work considers an alternative scenario, relevant for wireless data centers and federated learning, in which the distributed processors, operating on generally distinct coded data, are connected via shared wireless channels accessed via full-duplex transmission. The study accounts for both wireless and computing impairments, including interference, imperfect Channel State Information, and straggling processors, and it assumes a Map-Shuffle-Reduce coded computing paradigm. The total latency of the system, obtained as the sum of computing and communication delays, is studied for different shuffling strategies, revealing the interplay between distributed computing, coding, and cooperative or coordinated transmission. Comment: Submitted for possible conference publication
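The latency decomposition mentioned in the abstract can be sketched in a few lines. This is a simplified illustration, not the paper's model: it assumes the computing delay is the time until the k fastest of n processors finish, and it treats the communication (shuffle) delay as a fixed number, whereas in the work described it depends on the shuffling strategy and channel conditions. The function name and the timing values are hypothetical.

```python
def total_latency(compute_times, k, shuffle_time):
    """Computing delay (wait for the k fastest processors) plus communication delay."""
    if not 1 <= k <= len(compute_times):
        raise ValueError("k must be between 1 and the number of processors")
    computing_delay = sorted(compute_times)[k - 1]  # k-th smallest finish time
    return computing_delay + shuffle_time

# Hypothetical finish times in milliseconds; the last processor straggles.
times = [10, 12, 9, 75]

wait_for_all = total_latency(times, k=4, shuffle_time=20)    # no straggler tolerance
wait_for_three = total_latency(times, k=3, shuffle_time=20)  # one straggler tolerated
```

With these made-up numbers, waiting for all four processors is dominated by the straggler, while waiting for only three removes it from the critical path. Holding `shuffle_time` fixed hides the trade-off the abstract studies, where the choice of coding also changes the communication delay.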

arXiv.org e-Print Archive

Crossref

King's Research Portal

Straggler-Resilient Distributed Computing

Author: Severinson Lars Albin
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2022

The number and scale of distributed computing systems being built have increased significantly in recent years. Primarily, that is because: i) our computing needs are increasing at a much higher rate than computers are becoming faster, so we need to use more of them to meet demand, and ii) systems that are fundamentally distributed, e.g., because the components that make them up are geographically distributed, are becoming increasingly prevalent. This paradigm shift is the source of many engineering challenges. Among them is the straggler problem, which is caused by latency variations in distributed systems, where faster nodes are held up by slower ones. The straggler problem can significantly impair the effectiveness of distributed systems: a single node experiencing a transient outage (e.g., due to being overloaded) can lock up an entire system. In this thesis, we consider schemes for making a range of computations resilient against such stragglers, thus allowing a distributed system to proceed in spite of some nodes failing to respond on time. The schemes we propose are tailored for particular computations.
We propose schemes designed for distributed matrix-vector multiplication, which is a fundamental operation in many computing applications; distributed machine learning, in the form of a straggler-resilient first-order optimization method; and distributed tracking of a time-varying process (e.g., tracking the location of a set of vehicles for a collision avoidance system). The proposed schemes rely on exploiting redundancy that is either introduced as part of the scheme, or exists naturally in the underlying problem, to compensate for missing results, i.e., they are a form of forward error correction for computations. Further, for one of the proposed schemes we exploit redundancy to also improve the effectiveness of multicasting, thus reducing the amount of data that needs to be communicated over the network. Such inter-node communication, like the straggler problem, can significantly limit the effectiveness of distributed systems. For the schemes we propose, we are able to show significant improvements in latency and reliability compared to previous schemes. Doctoral thesis.

University of Bergen

NORA - Norwegian Open Research Archives

Straggler mitigation in hadoop mapreduce framework: a review

Author: Abu Bakar Kamalrulnizam
Ajibade Lukuman Saheed
Aliyu Ahmed
Publication venue: The Science and Information Organization
Publication date: 01/01/2022

Processing huge and complex data to obtain useful information is challenging, even though several big data processing frameworks have been proposed and further enhanced. One of the prominent big data processing frameworks is MapReduce. The main concept of the MapReduce framework relies on distributed and parallel processing. However, the MapReduce framework faces serious performance degradation due to the slow execution of certain tasks, called stragglers. Failing to handle stragglers causes delay and increases the overall job execution time. Several straggler reduction techniques have been proposed to improve MapReduce performance. This study provides a comprehensive and qualitative review of the existing straggler mitigation solutions. In addition, a taxonomy of the available straggler mitigation solutions is presented. Critical research issues and future research directions are identified and discussed to guide researchers and scholars.

Universiti Teknologi Malaysia Institutional Repository