Scheduling Data Flow Program in XKaapi: A New Affinity-Based Algorithm for Heterogeneous Architectures
Efficient implementations of parallel applications on heterogeneous hybrid
architectures require a careful balance between computations and communications
with accelerator devices. Even if most of the communication time can be
overlapped by computations, it is essential to reduce the total volume of
communicated data. The literature therefore abounds with ad hoc methods to
reach that balance, but these methods are architecture and application dependent. We
propose here a generic mechanism to automatically optimize the scheduling
between CPUs and GPUs, and compare two strategies within this mechanism: the
classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new,
parametrized, Distributed Affinity Dual Approximation algorithm (DADA), which
consists of grouping tasks by affinity before running a fast dual
approximation. We ran experiments on a heterogeneous parallel machine with six
CPU cores and eight NVIDIA Fermi GPUs. Three standard dense linear algebra
kernels from the PLASMA library have been ported on top of the XKaapi runtime,
and we report their performance. The results show that both HEFT and DADA perform
well under various experimental conditions, but that DADA performs better for
larger systems and numbers of GPUs and, in most cases, transfers much less data
than HEFT to achieve the same performance.
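
The classical HEFT rule compared here can be summarized in a few lines: each ready task, taken in rank order, is placed on the unit with the smallest estimated finish time (unit availability plus data-transfer plus predicted execution time). The sketch below illustrates that selection rule only; the cost values and task/unit counts are invented for the example, and DADA's affinity grouping is not shown.

```c
/* Minimal sketch of the classical HEFT selection rule: each ready task is
 * placed on the processing unit (CPU core or GPU) that minimizes its
 * estimated finish time.  Cost arrays and counts are illustrative, not
 * taken from the XKaapi implementation. */
#include <stdio.h>

#define N_TASKS 4
#define N_UNITS 3   /* e.g. 2 CPU cores + 1 GPU */

int main(void)
{
    /* exec_cost[t][u]: predicted run time of task t on unit u */
    double exec_cost[N_TASKS][N_UNITS] = {
        {4.0, 4.0, 1.0},
        {3.0, 3.0, 2.5},
        {5.0, 5.0, 1.5},
        {2.0, 2.0, 2.0},
    };
    /* comm_cost[t][u]: time to move t's inputs to unit u (0 if data is local) */
    double comm_cost[N_TASKS][N_UNITS] = {
        {0.0, 0.0, 0.8},
        {0.0, 0.0, 0.6},
        {0.0, 0.0, 1.2},
        {0.0, 0.0, 0.4},
    };
    double unit_ready[N_UNITS] = {0.0, 0.0, 0.0}; /* when each unit becomes free */

    /* Tasks are assumed already sorted by the HEFT upward rank. */
    for (int t = 0; t < N_TASKS; ++t) {
        int best = 0;
        double best_finish = 1e300;
        for (int u = 0; u < N_UNITS; ++u) {
            double start  = unit_ready[u] + comm_cost[t][u];
            double finish = start + exec_cost[t][u];
            if (finish < best_finish) { best_finish = finish; best = u; }
        }
        unit_ready[best] = best_finish;
        printf("task %d -> unit %d, finishes at %.1f\n", t, best, best_finish);
    }
    return 0;
}
```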
Fibers are not (P)Threads: The Case for Loose Coupling of Asynchronous Programming Models and MPI Through Continuations
Asynchronous programming models (APMs) are gaining more and more traction,
allowing applications to expose the available concurrency to a runtime system
tasked with coordinating the execution. While MPI has long provided support for
multi-threaded communication and non-blocking operations, it falls short of
adequately supporting APMs as correctly and efficiently handling MPI
communication in different models is still a challenge. Meanwhile, new
low-level implementations of light-weight, cooperatively scheduled execution
contexts (fibers, aka user-level threads (ULT)) are meant to serve as a basis
for higher-level APMs and their integration in MPI implementations has been
proposed as a replacement for traditional POSIX thread support to alleviate
these challenges.
In this paper, we first establish a taxonomy in an attempt to clearly
distinguish different concepts in the parallel software stack. We argue that
the proposed tight integration of fiber implementations with MPI is neither
warranted nor beneficial and instead is detrimental to the goal of MPI being a
portable communication abstraction. We propose MPI Continuations as an
extension to the MPI standard to provide callback-based notifications on
completed operations, leading to a clear separation of concerns by providing a
loose coupling mechanism between MPI and APMs. We show that this interface is
flexible and interacts well with different APMs, namely OpenMP detached tasks,
OmpSs-2, and Argobots. Published in the proceedings of EuroMPI/USA '20,
September 21-24, 2020, Austin, TX, USA.
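
The proposed extension attaches callbacks to MPI operations. The sketch below illustrates only the general idea of such a loose coupling, using plain MPI_Isend/MPI_Test plus a user-level callback structure driven by a runtime's progress loop; it does not reproduce the actual MPI Continuations interface proposed in the paper.

```c
/* Minimal sketch of the loose-coupling idea behind MPI Continuations:
 * a callback is attached to a non-blocking operation and fired once the
 * request completes.  The real proposal extends the MPI standard with
 * dedicated calls; here completion is emulated with plain MPI_Test so the
 * example stays within the existing MPI API. */
#include <mpi.h>
#include <stdio.h>

typedef void (*continuation_cb)(void *user_data);

struct continuation {
    MPI_Request     req;
    continuation_cb cb;
    void           *user_data;
    int             done;
};

/* Called from the task runtime's progress loop (e.g. between tasks). */
static void progress(struct continuation *c)
{
    int flag = 0;
    if (!c->done) {
        MPI_Test(&c->req, &flag, MPI_STATUS_IGNORE);
        if (flag) {              /* operation finished: run the continuation */
            c->done = 1;
            c->cb(c->user_data);
        }
    }
}

static void on_sent(void *user_data)
{
    printf("rank %d: send completed, releasing dependent task\n",
           *(int *)user_data);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size, value = 42;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    struct continuation c = { MPI_REQUEST_NULL, on_sent, &rank, 0 };
    int peer = (rank + 1) % size;
    MPI_Isend(&value, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &c.req);

    int recv_buf;
    MPI_Request rreq;
    MPI_Irecv(&recv_buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &rreq);
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);

    while (!c.done)              /* runtime keeps polling until the callback ran */
        progress(&c);

    MPI_Finalize();
    return 0;
}
```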
Correlated Set Coordination in Fault Tolerant Message Logging Protocols
Based on our current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint-restart techniques, the message logging approach, is also the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes, but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
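
The core idea can be pictured as a decision made at send time: determinants (events) are always logged, but the payload is copied only when sender and receiver belong to different correlated sets. The sketch below is purely illustrative; the set layout (one set per node) and the logging routines are placeholders, not the paper's protocol.

```c
/* Illustrative sketch of the central idea: message payloads only need to be
 * logged when sender and receiver belong to *different* correlated sets
 * (e.g. different nodes), because processes in the same set fail and
 * recover together.  Set layout and logging routines are placeholders. */
#include <stdio.h>

#define RANKS_PER_NODE 4   /* assumed correlated set = all ranks on one node */

static int correlated_set(int rank) { return rank / RANKS_PER_NODE; }

static void log_event(int src, int dst, int msg_id)
{
    /* event (determinant) logging is always needed for message-logging recovery */
    printf("event log:   %d -> %d (msg %d)\n", src, dst, msg_id);
}

static void log_payload(int src, int dst, int msg_id)
{
    printf("payload log: %d -> %d (msg %d)\n", src, dst, msg_id);
}

static void on_send(int src, int dst, int msg_id)
{
    log_event(src, dst, msg_id);
    if (correlated_set(src) != correlated_set(dst))
        log_payload(src, dst, msg_id);   /* cross-set: payload must be kept */
    /* intra-set: coordinated recovery replays the exchange, no payload copy */
}

int main(void)
{
    on_send(0, 1, 1);   /* same node: event only */
    on_send(0, 5, 2);   /* different nodes: event + payload */
    return 0;
}
```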
Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
Several recovery techniques for parallel iterative methods are presented.
First, the implementation of checkpoints in parallel iterative
methods is described and analyzed. Then, a simple checkpoint-free fault tolerant
scheme for parallel iterative methods, the lossy approach, is presented.
When one processor fails and all its data is lost, the system is
recovered by computing a new approximate solution using the data of the
non-failed processors. The iterative method is then restarted with this
new vector. The main advantage of the lossy approach over standard
checkpoint algorithms is that it does not increase the computational cost
of the iterative solver, when no failure occurs. Experiments are presented
that compare the different techniques. The fault-tolerant FT-MPI library
is used. Both iterative linear solvers and eigensolvers are considered.
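
As an illustration of the lossy approach (not the paper's FT-MPI implementation), the sketch below runs a small Jacobi iteration, simulates the loss of one entry of the iterate, and rebuilds that entry from the surviving data by solving the corresponding equation before the iteration resumes. The matrix and sizes are invented for the example.

```c
/* Illustrative sketch of lossy recovery for a Jacobi iteration: each
 * "process" owns one entry of x; when an entry is lost it is re-approximated
 * from the surviving entries by solving that row's equation, and the
 * iteration simply continues. */
#include <stdio.h>

#define N 4

static const double A[N][N] = {
    {10, 1, 1, 1},
    { 1,12, 1, 1},
    { 1, 1, 9, 1},
    { 1, 1, 1,11},
};
static const double b[N] = {13, 15, 12, 14};   /* exact solution is x = (1,1,1,1) */

/* One Jacobi sweep: x_new[i] = (b[i] - sum_{j!=i} A[i][j]*x[j]) / A[i][i] */
static void jacobi_sweep(const double x[N], double x_new[N])
{
    for (int i = 0; i < N; ++i) {
        double s = b[i];
        for (int j = 0; j < N; ++j)
            if (j != i) s -= A[i][j] * x[j];
        x_new[i] = s / A[i][i];
    }
}

/* Lossy recovery of entry f: same formula, applied once using surviving data. */
static void lossy_recover(double x[N], int f)
{
    double s = b[f];
    for (int j = 0; j < N; ++j)
        if (j != f) s -= A[f][j] * x[j];
    x[f] = s / A[f][f];
}

int main(void)
{
    double x[N] = {0, 0, 0, 0}, tmp[N];

    for (int it = 0; it < 20; ++it) {
        if (it == 5) {            /* simulate the failure of "process" 2 */
            x[2] = 0.0;           /* its part of the iterate is lost ...  */
            lossy_recover(x, 2);  /* ... and rebuilt from the survivors   */
        }
        jacobi_sweep(x, tmp);
        for (int i = 0; i < N; ++i) x[i] = tmp[i];
    }
    printf("x = %.4f %.4f %.4f %.4f\n", x[0], x[1], x[2], x[3]);
    return 0;
}
```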
A Current Task-Based Programming Paradigms Analysis
Task-based paradigm models can be an alternative to MPI. The user defines atomic tasks with defined inputs and outputs, along with the dependencies between them. The runtime can then schedule the tasks and data migrations efficiently over all the available cores while reducing the waiting time between tasks. This paper focuses on comparing several task-based programming models, using LU factorization as the benchmark. HPX, PaRSEC, Legion and YML+XMP are task-based programming models which schedule data movement and computational tasks on distributed resources allocated to the application. YML+XMP supports parallel and distributed tasks with XcalableMP, a PGAS language. Their performance and scalability are compared to ScaLAPACK, a highly optimized library which uses MPI to perform communications between processes, on up to 64 nodes. We performed a block-based LU factorization with the task-based programming models on matrices of size up to 49512 × 49512. HPX performs better than PaRSEC, Legion and YML+XMP, but not better than ScaLAPACK. YML+XMP has better scalability than HPX, Legion and PaRSEC. Regent has trouble scaling from 32 to 64 nodes with our algorithm.
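
To illustrate the paradigm itself (not any of the runtimes compared in the paper), the sketch below expresses a tiled LU factorization without pivoting as a DAG of getrf/trsm/gemm tasks using OpenMP depend clauses; the tile sizes and the test matrix are invented for the example.

```c
/* Illustrative sketch of how a tiled LU factorization (no pivoting) maps to
 * a DAG of tasks with data dependencies, here expressed with OpenMP
 * "depend" clauses; the runtime orders getrf/trsm/gemm tasks from the
 * declared tile accesses.  Compile with: cc -fopenmp lu_tasks.c */
#include <stdio.h>

#define NT 3            /* tiles per dimension */
#define TS 2            /* tile size */
typedef double tile[TS][TS];

static void getrf(tile a)               /* in-place LU of a diagonal tile */
{
    for (int k = 0; k < TS; ++k)
        for (int i = k + 1; i < TS; ++i) {
            a[i][k] /= a[k][k];
            for (int j = k + 1; j < TS; ++j)
                a[i][j] -= a[i][k] * a[k][j];
        }
}

static void trsm_row(tile a, tile u)    /* solve L * X = tile (unit lower L in a) */
{
    for (int k = 0; k < TS; ++k)
        for (int i = k + 1; i < TS; ++i)
            for (int j = 0; j < TS; ++j)
                u[i][j] -= a[i][k] * u[k][j];
}

static void trsm_col(tile a, tile l)    /* solve X * U = tile (upper U in a) */
{
    for (int k = 0; k < TS; ++k)
        for (int r = 0; r < TS; ++r) {
            l[r][k] /= a[k][k];
            for (int i = k + 1; i < TS; ++i)
                l[r][i] -= l[r][k] * a[k][i];
        }
}

static void gemm(tile l, tile u, tile c)  /* trailing update: c -= l * u */
{
    for (int i = 0; i < TS; ++i)
        for (int j = 0; j < TS; ++j)
            for (int k = 0; k < TS; ++k)
                c[i][j] -= l[i][k] * u[k][j];
}

int main(void)
{
    static tile A[NT][NT];
    for (int i = 0; i < NT * TS; ++i)     /* diagonally dominant test matrix */
        for (int j = 0; j < NT * TS; ++j)
            A[i / TS][j / TS][i % TS][j % TS] = (i == j) ? 2.0 * NT * TS : 1.0;

    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; ++k) {
        #pragma omp task depend(inout: A[k][k][0][0])
        getrf(A[k][k]);
        for (int j = k + 1; j < NT; ++j) {
            #pragma omp task depend(in: A[k][k][0][0]) depend(inout: A[k][j][0][0])
            trsm_row(A[k][k], A[k][j]);
            #pragma omp task depend(in: A[k][k][0][0]) depend(inout: A[j][k][0][0])
            trsm_col(A[k][k], A[j][k]);
        }
        for (int i = k + 1; i < NT; ++i)
            for (int j = k + 1; j < NT; ++j) {
                #pragma omp task depend(in: A[i][k][0][0], A[k][j][0][0]) depend(inout: A[i][j][0][0])
                gemm(A[i][k], A[k][j], A[i][j]);
            }
    }
    printf("tiled LU finished, A[0][0][0][0] = %.1f\n", A[0][0][0][0]);
    return 0;
}
```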
Analysis of the Component Architecture Overhead
Component architectures provide a useful framework for developing an extensible and maintainable code base upon which large-scale software projects can be built. Component methodologies have only recently been incorporated into applications by the High Performance Computing community, in part because of the perception that component architectures necessarily incur an unacceptable performance penalty. The Open MPI project is creating a new implementation of the Message Passing Interface standard, based on a custom component architecture – the Modular Component Architecture (MCA) – to enable straightforward customization of a high-performance MPI implementation. This paper reports on a detailed analysis of the performance overhead in Open MPI introduced by the MCA. We compare the MCA-based implementation of Open MPI with a modified version that bypasses the component infrastructure. The overhead of the MCA is shown to be low, on the order of 1%, for both latency and bandwidth microbenchmarks as well as for the NAS Parallel Benchmark suite.
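
The indirection a component architecture introduces can be pictured as calls made through a struct of function pointers selected at run time rather than direct function calls; that indirection is the kind of overhead the paper measures. The component names and interface below are invented for the example and are not Open MPI's actual MCA types.

```c
/* Minimal illustration of component-based dispatch: the caller goes through
 * a module struct of function pointers chosen at run time instead of a
 * direct call.  Names are hypothetical, not Open MPI's MCA frameworks. */
#include <stdio.h>
#include <string.h>

struct transport_module {           /* invented component interface */
    const char *name;
    int (*send)(const void *buf, int len);
};

static int tcp_send(const void *buf, int len)
{
    (void)buf;
    printf("tcp component: sending %d bytes\n", len);
    return 0;
}

static int shmem_send(const void *buf, int len)
{
    (void)buf;
    printf("shared-memory component: sending %d bytes\n", len);
    return 0;
}

static struct transport_module modules[] = {
    { "tcp", tcp_send   },
    { "sm",  shmem_send },
};

/* Component selection: pick a module by name at run time. */
static struct transport_module *select_module(const char *name)
{
    for (unsigned i = 0; i < sizeof modules / sizeof modules[0]; ++i)
        if (strcmp(modules[i].name, name) == 0)
            return &modules[i];
    return NULL;
}

int main(void)
{
    char payload[256] = {0};
    struct transport_module *m = select_module("sm");
    if (m)
        m->send(payload, sizeof payload);   /* indirect call through the component */
    return 0;
}
```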