Near-Optimal Straggler Mitigation for Distributed Gradient Methods
Modern learning algorithms use gradient descent updates to train inferential
models that best explain data. Scaling these approaches to massive data sizes
requires well-designed distributed gradient descent schemes, in which worker nodes compute partial gradients from their local subsets of the data and
send the results to a master node where all the computations are aggregated
into a full gradient and the learning model is updated. However, a major performance bottleneck arises when some of the worker nodes run slowly. These nodes, known as stragglers, can significantly slow down computation, since
the slowest node may dictate the overall computational time. We propose a
distributed computing scheme, called Batched Coupon's Collector (BCC), to alleviate the effect of stragglers in gradient methods. We prove that our BCC scheme is robust to a near-optimal number of random stragglers. We also
empirically demonstrate that our proposed BCC scheme reduces the run-time by up
to 85.4% on Amazon EC2 clusters compared with other straggler mitigation
strategies. We also generalize the proposed BCC scheme to minimize the
completion time when implementing gradient descent-based algorithms over
heterogeneous worker nodes.
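As a rough, purely illustrative sketch of the master-worker gradient-aggregation pattern described above (this is not the authors' implementation; the batch assignment, timing model, and names such as simulate_bcc_round are hypothetical), one coupon-collector-style round might be simulated as follows:

    import numpy as np

    def simulate_bcc_round(num_workers=12, num_batches=4, dim=5, seed=0):
        """Toy simulation: each worker holds one randomly assigned data batch
        (a 'coupon'); the master aggregates a full gradient as soon as the
        fastest replies jointly cover every batch."""
        rng = np.random.default_rng(seed)
        # Hypothetical per-batch partial gradients (in practice, the gradient of
        # the loss evaluated on that batch of data).
        partial_grads = rng.normal(size=(num_batches, dim))
        # Each worker is independently handed one random batch -- the coupon-collector aspect.
        assignment = rng.integers(0, num_batches, size=num_workers)
        # Hypothetical finish times; stragglers correspond to the large draws.
        finish_times = rng.exponential(scale=1.0, size=num_workers)
        collected, wait_time = {}, None
        for w in np.argsort(finish_times):        # replies arrive in order of finish time
            collected.setdefault(int(assignment[w]), partial_grads[assignment[w]])
            if len(collected) == num_batches:     # every batch covered: stop waiting
                wait_time = float(finish_times[w])
                break
        # With random assignment there is a small chance some batch is never covered,
        # in which case wait_time stays None and the gradient is only partial.
        full_gradient = sum(collected.values())
        return full_gradient, wait_time

    grad, waited = simulate_bcc_round()
    print("aggregated gradient:", grad, "| master waited:", waited)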
TensorFlow Doing HPC
TensorFlow is a popular emerging open-source programming framework supporting
the execution of distributed applications on heterogeneous hardware. While TensorFlow was initially designed for developing Machine Learning (ML) applications, it aims to support a much broader range of applications beyond the ML domain, potentially including HPC applications. However, very few experiments have been conducted to evaluate TensorFlow performance when running HPC workloads on supercomputers. This work addresses this gap by implementing four traditional HPC
benchmark applications: STREAM, matrix-matrix multiply, Conjugate Gradient (CG)
solver and Fast Fourier Transform (FFT). We analyze their performance on two
supercomputers with accelerators and evaluate the potential of TensorFlow for
developing HPC applications. Our tests show that TensorFlow can fully take
advantage of high performance networks and accelerators on supercomputers.
Running our TensorFlow STREAM benchmark, we obtain over 50% of theoretical
communication bandwidth on our testing platform. We find an approximately 2x,
1.7x and 1.8x performance improvement when increasing the number of GPUs from
two to four in the matrix-matrix multiply, CG and FFT applications
respectively. All our performance results demonstrate that TensorFlow has high potential to emerge as an HPC programming framework for heterogeneous supercomputers.
Comment: Accepted for publication at The Ninth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES'19).
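As a minimal sketch of how a dense kernel such as matrix-matrix multiply can be expressed and timed in TensorFlow (an illustration in the spirit of the benchmarks above, not the authors' code; the problem size and device strings are placeholders):

    import time
    import tensorflow as tf

    n = 4096  # placeholder problem size
    # Run on an accelerator if one is visible, otherwise fall back to the CPU.
    device = "/GPU:0" if tf.config.list_physical_devices("GPU") else "/CPU:0"

    with tf.device(device):
        a = tf.random.normal([n, n])
        b = tf.random.normal([n, n])

    @tf.function
    def matmul_step(x, y):
        return tf.matmul(x, y)

    matmul_step(a, b)                # warm-up: trace the graph and touch the device
    start = time.perf_counter()
    c = matmul_step(a, b)
    _ = c.numpy()                    # force the computation to finish before stopping the timer
    elapsed = time.perf_counter() - start
    print(f"{2 * n ** 3 / elapsed / 1e9:.1f} GFLOP/s (rough estimate)")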
Revisiting Matrix Product on Master-Worker Platforms
This paper is aimed at designing efficient parallel matrix-product algorithms
for heterogeneous master-worker platforms. While matrix-product is
well understood for homogeneous 2D arrays of processors (e.g., the Cannon algorithm and the ScaLAPACK outer-product algorithm), there are three key hypotheses that
render our work original and innovative:
- Centralized data. We assume that all matrix files originate from, and must
be returned to, the master.
- Heterogeneous star-shaped platforms. We target fully heterogeneous
platforms, where computational resources have different computing powers.
- Limited memory. Because we investigate the parallelization of large
problems, we cannot assume that full matrix panels can be stored in the worker
memories and re-used for subsequent updates (as in ScaLAPACK).
We have devised efficient algorithms for resource selection (deciding which
workers to enroll) and communication ordering (both for input and result
messages), and we report a set of numerical experiments on various platforms at
Ecole Normale Superieure de Lyon and the University of Tennessee. However, we
point out that in this first version of the report, experiments are limited to
homogeneous platforms.
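A back-of-the-envelope sketch of the master-worker pattern discussed above (purely illustrative; the block size, worker speeds, and proportional task assignment are assumptions, not the paper's resource-selection or communication-ordering algorithms):

    import numpy as np

    def master_worker_matmul(A, B, worker_speeds, block=2):
        """Simulate a master that ships one (row-strip, column-strip) pair at a time,
        so each worker only ever holds a single pair of strips in memory."""
        m, _ = A.shape
        _, n = B.shape
        C = np.zeros((m, n))
        # Hypothetical resource selection: faster workers receive proportionally more blocks.
        speeds = np.asarray(worker_speeds, dtype=float)
        shares = speeds / speeds.sum()
        tasks = [(i, j) for i in range(0, m, block) for j in range(0, n, block)]
        counts = np.floor(shares * len(tasks)).astype(int)
        counts[0] += len(tasks) - counts.sum()    # give any remainder to worker 0
        cursor = 0
        for worker, cnt in enumerate(counts):
            for i, j in tasks[cursor:cursor + cnt]:
                # The master sends only the strips needed for this block (limited memory);
                # the worker multiplies them and returns the result block to the master.
                C[i:i + block, j:j + block] = A[i:i + block, :] @ B[:, j:j + block]
            cursor += cnt
        return C

    A = np.arange(16.0).reshape(4, 4)
    B = np.eye(4)
    print(np.allclose(master_worker_matmul(A, B, worker_speeds=[3, 1]), A @ B))  # True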
Latency Analysis of Coded Computation Schemes over Wireless Networks
Large-scale distributed computing systems face two major bottlenecks that
limit their scalability: straggler delay caused by the variability of
computation times at different worker nodes and communication bottlenecks
caused by shuffling data across many nodes in the network. Recently, it has
been shown that codes can provide significant gains in overcoming these
bottlenecks. In particular, optimal coding schemes for minimizing latency in
distributed computation of linear functions and mitigating the effect of
stragglers were proposed for a wired network, where the workers can
simultaneously transmit messages to a master node without interference. In this
paper, we focus on the problem of coded computation over a wireless
master-worker setup with straggling workers, where only one worker can transmit
the result of its local computation back to the master at a time. We consider 3
asymptotic regimes (determined by how the communication and computation times
are scaled with the number of workers) and precisely characterize the total
run-time of the distributed algorithm and optimum coding strategy in each
regime. In particular, for the regime of practical interest where the
computation and communication times of the distributed computing algorithm are
comparable, we show that the total run-time approaches a simple lower bound
that decouples computation and communication, and demonstrate that coded schemes can be many times faster than uncoded schemes.
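To make the coding idea concrete, the following is a small numerical sketch of MDS-style coded matrix-vector multiplication (an illustration of the general coded-computation technique, not the paper's wireless scheme; the Vandermonde encoder and the set of responding workers are assumptions):

    import numpy as np

    def coded_matvec(A, x, n_workers, k, responders):
        """Encode k row-blocks of A into n_workers coded blocks so that the
        results from any k responding workers suffice to recover A @ x."""
        rows = A.shape[0]
        assert rows % k == 0
        blocks = A.reshape(k, rows // k, A.shape[1])          # split A into k row-blocks
        G = np.vander(np.arange(1, n_workers + 1), k, increasing=True).astype(float)  # (n, k) encoder
        coded_blocks = np.einsum("ij,jrd->ird", G, blocks)    # worker i stores sum_j G[i, j] * A_j
        results = {i: coded_blocks[i] @ x for i in responders}  # each responder's local product
        S = sorted(results)[:k]                                # any k responses are enough (MDS property)
        Y = np.stack([results[i] for i in S])                  # (k, rows // k)
        Z = np.linalg.solve(G[S], Y)                           # invert the k x k sub-encoder
        return Z.reshape(rows)

    A = np.random.default_rng(1).normal(size=(6, 4))
    x = np.ones(4)
    # Workers 0 and 3 straggle; the master decodes from the other three responses.
    print(np.allclose(coded_matvec(A, x, n_workers=5, k=3, responders=[1, 2, 4]), A @ x))  # True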