The Parallel Persistent Memory Model
We consider a parallel computational model that consists of processors,
each with a fast local ephemeral memory of limited size, and sharing a large
persistent memory. The model allows for each processor to fault with bounded
probability, and possibly restart. On faulting all processor state and local
ephemeral memory are lost, but the persistent memory remains. This model is
motivated by upcoming non-volatile memories that are as fast as existing random
access memory, are accessible at the granularity of cache lines, and have the
capability of surviving power outages. It is further motivated by the
observation that in large parallel systems, failure of processors and their
caches is not unusual.
Within the model we present a framework for developing locality-efficient
parallel algorithms that are resilient to failures. There are several
challenges, including the need to recover from failures, the desire to do this
in an asynchronous setting (i.e., not blocking other processors when one
fails), and the need for synchronization primitives that are robust to
failures. We describe approaches to solve these challenges based on breaking
computations into what we call capsules, which have certain properties, and
developing a work-stealing scheduler that functions properly within the context
of failures. The scheduler guarantees a time bound of
$O(W/P_A + D(P/P_A) \lceil\log_{1/f} W\rceil)$ in expectation, where $W$ and $D$ are the work and
depth of the computation (in the absence of failures), $P_A$ is the average
number of processors available during the computation, and $f \le 1/2$ is the
probability that a capsule fails. Within the model and using the proposed
methods, we develop efficient algorithms for parallel sorting and other
primitives.
Comment: This paper is the full version of a paper at SPAA 2018 with the same
name.
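The restart behaviour described in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's scheduler: the names `run_capsules`, `checkpoint`, `sum_capsule`, and `square_capsule` are hypothetical, a dictionary stands in for persistent memory, and the commit of a capsule boundary is assumed to be atomic. Each capsule writes only to locations it never reads, so re-running it after a fault gives the same result.

```python
import random


class Fault(Exception):
    """Simulated processor fault: all ephemeral state is lost."""


def run_capsules(capsules, persistent, fail_prob, rng):
    """Run capsules in order, restarting the current one after every fault."""

    def checkpoint():
        # Hypothetical fault-injection point inside a capsule.
        if rng.random() < fail_prob:
            raise Fault()

    restarts = 0
    # The index of the next capsule is kept in persistent memory.
    persistent.setdefault("next", 0)
    while persistent["next"] < len(capsules):
        local = {}  # ephemeral memory, rebuilt from scratch on every attempt
        try:
            capsules[persistent["next"]](persistent, local, checkpoint)
        except Fault:
            restarts += 1  # `local` is discarded; `persistent` survives
            continue
        persistent["next"] += 1  # capsule boundary (assumed atomic here)
    return restarts


def sum_capsule(persistent, local, checkpoint):
    local["acc"] = 0
    for x in persistent["input"]:
        local["acc"] += x
        checkpoint()  # a fault may strike after any step
    persistent["sum"] = local["acc"]  # written, never read, by this capsule


def square_capsule(persistent, local, checkpoint):
    local["s"] = persistent["sum"]
    checkpoint()
    persistent["sum_squared"] = local["s"] ** 2


persistent = {"input": [1, 2, 3, 4]}
restarts = run_capsules(
    [sum_capsule, square_capsule], persistent, 0.3, random.Random(0)
)
print(persistent["sum"], persistent["sum_squared"], "restarts:", restarts)
```

Because a restarted capsule recomputes everything from persistent inputs, the final contents of persistent memory do not depend on how many faults occurred.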
Algorithmic Based Fault Tolerance Applied to High Performance Computing
We present a new approach to fault tolerance for High Performance Computing
systems. Our approach is based on a careful adaptation of the Algorithmic Based
Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel
distributed computation. We obtain a strongly scalable mechanism for fault
tolerance. We can also detect and correct errors (bit flips) on the fly during a
computation. To assess the viability of our approach, we have developed a fault
tolerant matrix-matrix multiplication subroutine and we propose some models to
predict its running time. Our parallel fault-tolerant matrix-matrix
multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov)
and returns a correct result even though one process failure occurred. This
represents 65% of the machine peak efficiency and less than 12% overhead with
respect to the fastest failure-free implementation. We predict (and have
observed) that, as we increase the processor count, the overhead of the fault
tolerance drops significantly.
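The checksum idea behind Algorithmic Based Fault Tolerance can be sketched for a small matrix product. This is a minimal single-process illustration of the classical checksum scheme, not the distributed subroutine the abstract describes. The helper names are hypothetical, integer entries are used so that checksums compare exactly (floating-point data would need a tolerance), and the error is assumed to be a single corrupted data entry rather than a corrupted checksum.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def detect_and_correct(c_full):
    """Repair one corrupted data entry of a full-checksum matrix in place.

    Returns the (row, col) of the repaired entry, or None if all checksums agree.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [
        j for j in range(m) if sum(c_full[i][j] for i in range(n)) != c_full[n][j]
    ]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern not handled by this sketch")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the entry from the row checksum and the other entries of its row.
    c_full[i][j] = c_full[i][m] - sum(c_full[i][k] for k in range(m) if k != j)
    return (i, j)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
# The product of the checksum-extended operands carries checksums of a @ b.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
c_full[0][1] ^= 4  # simulate a bit flip in one entry of the result
fixed = detect_and_correct(c_full)
c = [row[:2] for row in c_full[:2]]  # strip the checksum row and column
print(fixed, c)
```

The failing row checksum and the failing column checksum intersect at the corrupted entry, which is why one extra row and one extra column are enough to locate and repair a single error.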
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures
The QR factorization and the SVD are two fundamental matrix decompositions
with applications throughout scientific computing and data analysis. For
matrices with many more rows than columns, so-called "tall-and-skinny
matrices," there is a numerically stable, efficient, communication-avoiding
algorithm for computing the QR factorization. It has been used in traditional
high performance computing and grid computing environments. For MapReduce
environments, existing methods to compute the QR decomposition use a
numerically unstable approach that relies on indirectly computing the Q factor.
In the best case, these methods require only two passes over the data. In this
paper, we describe how to compute a stable tall-and-skinny QR factorization on
a MapReduce architecture in only slightly more than 2 passes over the data. We
can compute the SVD with only a small change and no difference in performance.
We present a performance comparison between our new direct TSQR method, a
standard unstable implementation for MapReduce (Cholesky QR), and the classic
stable algorithm implemented for MapReduce (Householder QR). We find that our
new stable method has a large performance advantage over the Householder QR
method. This holds both in a theoretical performance model and in an
actual implementation.
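The structure of a direct tall-and-skinny QR can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's MapReduce implementation: `direct_tsqr` and `local_qr` are hypothetical names, every row block is assumed to have at least as many rows as columns and full column rank, and the per-block QR uses modified Gram-Schmidt purely as a short stand-in for the Householder-based routine a real implementation would call.

```python
import math


def matmul(a, b):
    """Plain matrix product of two sequences of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def local_qr(a):
    """Thin QR of a small block by modified Gram-Schmidt (stand-in routine)."""
    rows, cols = len(a), len(a[0])
    q_cols = [[a[i][j] for i in range(rows)] for j in range(cols)]
    r = [[0.0] * cols for _ in range(cols)]
    for j in range(cols):
        r[j][j] = math.sqrt(sum(v * v for v in q_cols[j]))
        q_cols[j] = [v / r[j][j] for v in q_cols[j]]
        for k in range(j + 1, cols):
            r[j][k] = sum(u * v for u, v in zip(q_cols[j], q_cols[k]))
            q_cols[k] = [v - r[j][k] * u for u, v in zip(q_cols[j], q_cols[k])]
    q = [[q_cols[j][i] for j in range(cols)] for i in range(rows)]
    return q, r


def direct_tsqr(a, block_rows):
    """Compute Q and R of a tall matrix from independent row blocks."""
    cols = len(a[0])
    blocks = [a[i:i + block_rows] for i in range(0, len(a), block_rows)]
    # Step 1 (map-like): factor every row block independently.
    local = [local_qr(block) for block in blocks]
    # Step 2 (reduce-like): factor the small matrix of stacked R factors.
    stacked_r = [row for _, r_block in local for row in r_block]
    q2, r = local_qr(stacked_r)
    # Step 3 (map-like): each block of Q is its local Q times its slice of q2.
    q = []
    for idx, (q1, _) in enumerate(local):
        q.extend(matmul(q1, q2[idx * cols:(idx + 1) * cols]))
    return q, r


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0], [11.0, 13.0]]
q, r = direct_tsqr(a, block_rows=3)
residual = max(
    abs(x - y) for ra, rb in zip(matmul(q, r), a) for x, y in zip(ra, rb)
)
gram = matmul(list(zip(*q)), q)  # Q^T Q, which should be close to the identity
orth_error = max(
    abs(gram[i][j] - (1.0 if i == j else 0.0)) for i in range(2) for j in range(2)
)
print(residual, orth_error)
```

Q is formed directly from the per-block factors instead of being recovered indirectly from A and R. Only the small stacked matrix in step 2 is handled centrally, so replacing its QR with an SVD is one plausible reading of the "small change" mentioned for computing the SVD.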