45 research outputs found
Compiler Techniques for Optimizing Communication and Data Distribution for Distributed-Memory Computers
Advanced Research Projects Agency (ARPA)National Aeronautics and Space AdministrationOpe
Automatic Selection of Dynamic Data Partitioning Schemes for Distributed Memory Multicomputers
Coordinated Science Laboratory was formerly known as Control Systems LaboratoryNational Aeronautics and Space Administration / NASA NAG 1-613Advanced Research Projects Agency (ARPA) / DAA-H04-94-G-027
Recommended from our members
Double standards: bringing task parallelism to HPF via the message passing interface
High Performance Fortran (HPF) does not allow efficient expression of mixed task/data-parallel computations or the coupling of separately compiled data-parallel modules. In this paper, we show how a coordination library implementing the Message Passing Interface (MPI) can be used to represent these common parallel program structures. This library allows data-parallel tasks to exchange distributed data structures using calls to simple communication functions. We present microbenchmark results that characterize the performance of this library and that quantify the impact of optimizations that allow reuse of communication schedules in common situations. In addition, results from two-dimensional FFT, convolution, and multiblock programs demonstrate that the HPF/MPI library can provide performance superior to that of pure HPF. WE conclude that this synergistic combination of two parallel programming standards represents a useful approach to task parallelism in a data-parallel framework, increasing the range of problems addressable in HPF without requiring complex compiler technology
Task-based Runtime Optimizations Towards High Performance Computing Applications
The last decades have witnessed a rapid improvement of computational capabilities in high-performance computing (HPC) platforms thanks to hardware technology scaling. HPC architectures benefit from mainstream advances on the hardware with many-core systems, deep hierarchical memory subsystem, non-uniform memory access, and an ever-increasing gap between computational power and memory bandwidth. This has necessitated continuous adaptations across the software stack to maintain high hardware utilization. In this HPC landscape of potentially million-way parallelism, task-based programming models associated with dynamic runtime systems are becoming more popular, which fosters developers’ productivity at extreme scale by abstracting the underlying hardware complexity.
In this context, this dissertation highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by HPC applications., i.e., data redistribution, geospatial modeling and 3D unstructured mesh deformation here. Data redistribution aims to reshuffle data to optimize some objective for an algorithm, whose objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing the efficiency and therefore reducing the time-to-solution for the algorithm. Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Meshing the deformable contour of moving 3D bodies is an expensive operation that can cause huge computational challenges in fluid-structure interaction (FSI) applications. Therefore, in this dissertation, Redistribute-PaRSEC, ExaGeoStat-PaRSEC and HiCMA-PaRSEC are proposed to efficiently tackle these HPC applications respectively at extreme scale, and they are evaluated on multiple HPC clusters, including AMD-based, Intel-based, Arm-based CPU systems and IBM-based multi-GPU system. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware system for servicing the next-generation scientific applications
Memory-efficient array redistribution through portable collective communication
Modern large-scale deep learning workloads highlight the need for parallel
execution across many devices in order to fit model data into hardware
accelerator memories. In these settings, array redistribution may be required
during a computation, but can also become a bottleneck if not done efficiently.
In this paper we address the problem of redistributing multi-dimensional array
data in SPMD computations, the most prevalent form of parallelism in deep
learning. We present a type-directed approach to synthesizing array
redistributions as sequences of MPI-style collective operations. We prove
formally that our synthesized redistributions are memory-efficient and perform
no excessive data transfers. Array redistribution for SPMD computations using
collective operations has also been implemented in the context of the XLA SPMD
partitioner, a production-grade tool for partitioning programs across
accelerator systems. We evaluate our approach against the XLA implementation
and find that our approach delivers a geometric mean speedup of ,
with maximum speedups as a high as , while offering provable memory
guarantees, making our system particularly appealing for large-scale models.Comment: minor errata fixe
Recommended from our members
High-performance data-parallel input/output
Existing parallel file systems are proving inadequate in two important arenas:
programmability and performance. Both of these inadequacies can largely be traced
to the fact that nearly all parallel file systems evolved from Unix and rely on a Unix-oriented,
single-stream, block-at-a-time approach to file I/O. This one-size-fits-all
approach to parallel file systems is inadequate for supporting applications running
on distributed-memory parallel computers.
This research provides a migration path away from the traditional approaches
to parallel I/O at two levels. At the level seen by the programmer, we show how
file operations can be closely integrated with the semantics of a parallel language.
Principles for this integration are illustrated in their application to C*, a virtual-processor-
oriented language. The result is that traditional C file operations with
familiar semantics can be used in C* where the programmer works--at the virtual
processor level. To facilitate high performance within this framework, machine-independent
modes are used. Modes change the performance of file operations,
not their semantics, so programmers need not use ambiguous operations found in
many parallel file systems. An automatic mode detection technique is presented
that saves the programmer from extra syntax and low-level file system details. This
mode detection system ensures that the most commonly encountered file operations
are performed using high-performance modes.
While the high-performance modes allow fast collective movement of file data,
they must include optimizations for redistribution of file data, a common operation
in production scientific code. This need is addressed at the file system level, where
we provide enhancements to Disk-Directed I/O for redistributing file data. Two
enhancements are geared to speeding fine-grained redistributions. One uses a two-phase,
or indirect, approach to redistributing data among compute nodes. The
other relies on I/O nodes to guide the redistribution by building packets bound for
compute nodes. We model the performance of these enhancements and determine
the key parameters determining when each approach should be used. Finally, we
introduce the notion of collective prefetching and identify its performance benefits
and implementation tradeoffs
Automatic Data and Computation Mapping for Distributed-Memory Machines.
Distributed memory parallel computers offer enormous computation power, scalability and flexibility. However, these machines are difficult to program and this limits their widespread use. An important characteristic of these machines is the difference in the access time for data in local versus non-local memory; non-local memory accesses are much slower than local memory accesses. This is also a characteristic of shared memory machines but to a less degree. Therefore it is essential that as far as possible, the data that needs to be accessed by a processor during the execution of the computation assigned to it reside in its local memory rather than in some other processor\u27s memory. Several research projects have concluded that proper mapping of data is key to realizing the performance potential of distributed memory machines. Current language design efforts such as Fortran D and High Performance Fortran (HPF) are based on this. It is our thesis that for many practical codes, it is possible to derive good mappings through a combination of algorithms and systematic procedures. We view mapping as consisting of wo phases, alignment followed by distribution. For the alignment phase we present three constraint-based methods--one based on a linear programming formulation of the problem; the second formulates the alignment problem as a constrained optimization problem using Lagrange multipliers; the third method uses a heuristic to decide which constraints to leave unsatisfied (based on the penalty of increased communication incurred in doing so) in order to find a mapping. In addressing the distribution phase, we have developed two methods that integrate the placement of computation--loop nests in our case--with the mapping of data. For one distributed dimension, our approach finds the best combination of data and computation mapping that results in low communication overhead; this is done by choosing a loop order that allows message vectorization. In the second method, we introduce the distribution preference graph and the operations on this graph allow us to integrate loop restructuring transformations and data mapping. These techniques produce mappings that have been used in efficient hand-coded implementations of several benchmark codes