Latency Analysis of Coded Computation Schemes over Wireless Networks
Large-scale distributed computing systems face two major bottlenecks that
limit their scalability: straggler delay caused by the variability of
computation times at different worker nodes and communication bottlenecks
caused by shuffling data across many nodes in the network. Recently, it has
been shown that codes can provide significant gains in overcoming these
bottlenecks. In particular, optimal coding schemes for minimizing latency in
distributed computation of linear functions and mitigating the effect of
stragglers were proposed for a wired network, where the workers can
simultaneously transmit messages to a master node without interference. In this
paper, we focus on the problem of coded computation over a wireless
master-worker setup with straggling workers, where only one worker can transmit
the result of its local computation back to the master at a time. We consider three
asymptotic regimes (determined by how the communication and computation times
are scaled with the number of workers) and precisely characterize the total
run-time of the distributed algorithm and optimum coding strategy in each
regime. In particular, for the regime of practical interest where the
computation and communication times of the distributed computing algorithm are
comparable, we show that the total run-time approaches a simple lower bound
that decouples computation and communication, and demonstrate that coded
schemes are times faster than uncoded schemes
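
The wireless constraint described in this abstract (only one worker transmits at a time) can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the paper: computation times are modelled as a shifted exponential scaled by 1/k, each transmission takes `comm_scale / k`, workers transmit in the order they finish, and the master needs any k of the n results (k = n corresponds to the uncoded case). The names `run_time`, `simulate`, and `comm_scale` are hypothetical.

```python
import random


def run_time(compute_times, k, tx_time):
    """Time until the master has received k results over a shared channel.

    Assumption: workers transmit one at a time, in the order they finish,
    and every transmission takes tx_time.
    """
    finish = 0.0
    for ready in sorted(compute_times)[:k]:
        # A worker starts sending once it is done AND the channel is free.
        finish = max(finish, ready) + tx_time
    return finish


def simulate(n, k, comm_scale, trials=2000, seed=0):
    """Average run-time when any k of n worker results suffice.

    Assumed model (illustrative only): each worker handles a 1/k share
    of the job, so its computation time is (1 + Exp(1)) / k and its
    transmission time is comm_scale / k.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        times = [(1.0 + rng.expovariate(1.0)) / k for _ in range(n)]
        total += run_time(times, k, comm_scale / k)
    return total / trials


if __name__ == "__main__":
    n = 20
    for k in (n, n // 2):
        print(f"k={k}: average run-time {simulate(n, k, comm_scale=1.0):.3f}")
```

Varying `comm_scale` relative to the computation times gives a rough feel for how the balance between communication and computation changes the best choice of k; the regimes analysed in the paper are defined asymptotically and are not reproduced by this toy model.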
Near-Optimal Straggler Mitigation for Distributed Gradient Methods
Modern learning algorithms use gradient descent updates to train inferential
models that best explain data. Scaling these approaches to massive data sizes
requires proper distributed gradient descent schemes where distributed worker
nodes compute partial gradients based on their partial and local data sets, and
send the results to a master node where all the computations are aggregated
into a full gradient and the learning model is updated. However, a major
performance bottleneck that arises is that some of the worker nodes may run
slow. These nodes, a.k.a. stragglers, can significantly slow down computation, as
the slowest node may dictate the overall computational time. We propose a
distributed computing scheme, called Batched Coupon's Collector (BCC), to
alleviate the effect of stragglers in gradient methods. We prove that our BCC
scheme is robust to a near-optimal number of random stragglers. We also
empirically demonstrate that our proposed BCC scheme reduces the run-time by up
to 85.4% over Amazon EC2 clusters when compared with other straggler mitigation
strategies. We also generalize the proposed BCC scheme to minimize the
completion time when implementing gradient descent-based algorithms over
heterogeneous worker nodes.
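
The abstract does not spell out how BCC operates, so the following is only a sketch of the classical coupon collector's problem that the scheme's name refers to. It assumes, hypothetically, that the data set is split into `num_batches` batches, that each arriving worker result covers one batch chosen uniformly at random, and that the master can form the full gradient once every batch has been covered at least once. The function names are illustrative and are not taken from the paper.

```python
import random


def results_until_covered(num_batches, rng):
    """Count worker results received until every batch is covered once.

    Assumption: each arriving result corresponds to a batch drawn
    uniformly at random, independently of earlier results.
    """
    seen = set()
    count = 0
    while len(seen) < num_batches:
        seen.add(rng.randrange(num_batches))
        count += 1
    return count


def expected_results(num_batches):
    """Classical coupon-collector mean: B * (1 + 1/2 + ... + 1/B)."""
    return num_batches * sum(1.0 / i for i in range(1, num_batches + 1))


if __name__ == "__main__":
    rng = random.Random(0)
    batches = 10
    trials = 5000
    mean = sum(results_until_covered(batches, rng) for _ in range(trials)) / trials
    print(f"simulated mean: {mean:.2f}")
    print(f"closed form:    {expected_results(batches):.2f}")
```

Under these assumptions the master never has to wait for any particular worker, only for enough results to cover all batches, which is the intuition behind tolerating random stragglers.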
Coded Computation Against Processing Delays for Virtualized Cloud-Based Channel Decoding
The uplink of a cloud radio access network architecture is studied in which
decoding at the cloud takes place via network function virtualization on
commercial off-the-shelf servers. In order to mitigate the impact of straggling
decoders in this platform, a novel coding strategy is proposed, whereby the
cloud re-encodes the received frames via a linear code before distributing them
to the decoding processors. Transmission of a single frame is considered first,
and upper bounds on the resulting frame unavailability probability as a
function of the decoding latency are derived by assuming a binary symmetric
channel for uplink communications. Then, the analysis is extended to account
for random frame arrival times. In this case, the trade-off between average
decoding latency and the frame error rate is studied for two different queuing
policies, whereby the servers carry out per-frame decoding or continuous
decoding, respectively. Numerical examples demonstrate that the bounds are
useful tools for code design and that coding is instrumental in obtaining a
desirable compromise between decoding latency and reliability.
Comment: 11 pages and 12 figures, Submitte
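
The idea of re-encoding frames with a linear code before handing them to parallel processors can be sketched with the simplest possible example, a (3, 2) single-parity code over two frames. This is an illustration under strong assumptions and not the scheme analysed in the paper: channel noise and the channel decoding step itself are ignored, frames are equal-length byte strings, and the helper names `reencode` and `recover` are hypothetical.

```python
def xor_bytes(a, b):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def reencode(frame_a, frame_b):
    """Map two frames to three coded frames: a, b, and a XOR b."""
    return [frame_a, frame_b, xor_bytes(frame_a, frame_b)]


def recover(outputs):
    """Rebuild both frames from any two of the three processor outputs.

    outputs maps a processor index (0, 1, or 2) to the coded frame that
    processor returned; a straggling processor is simply absent.
    """
    if len(outputs) < 2:
        raise ValueError("need outputs from at least two processors")
    if 0 in outputs and 1 in outputs:
        return outputs[0], outputs[1]
    if 0 in outputs:
        # Processor 1 straggled: b = a XOR (a XOR b).
        return outputs[0], xor_bytes(outputs[0], outputs[2])
    # Processor 0 straggled: a = b XOR (a XOR b).
    return xor_bytes(outputs[1], outputs[2]), outputs[1]


if __name__ == "__main__":
    coded = reencode(b"frame-01", b"frame-02")
    # Pretend processor 0 is slow and only processors 1 and 2 have finished.
    print(recover({1: coded[1], 2: coded[2]}))
```

With this code the cloud can proceed as soon as any two of the three processors finish, at the cost of one extra unit of processing work.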