251 research outputs found
Run-time Support for Parallelization of Data-Parallel Applications on Adaptive and Nonuniform Computational Environments
In this paper we discuss the runtime support required for the parallelization of unstructured data parallel applications on nonuniform and adaptive environments. The approach presented is reasonably general and is applicable to a wide variety of regular as well as irregular applications. We present performance results for the solution of an unstructured mesh on a cluster of heterogeneous workstations
Integrating Algorithmic and Systemic Load Balancing Strategies in Parallel Scientific Applications
Load imbalance is a major source of performance degradation in parallel scientific applications. Load balancing increases the efficient use of existing resources and improves performance of parallel applications running in distributed environments. At a coarse level of granularity, advances in runtime systems for parallel programs have been proposed in order to control available resources as efficiently as possible by utilizing idle resources and using task migration. At a finer granularity level, advances in algorithmic strategies for dynamically balancing computational loads by data redistribution have been proposed in order to respond to variations in processor performance during the execution of a given parallel application. Algorithmic and systemic load balancing strategies have complementary sets of advantages. An integration of these two techniques is possible, and it should result in a system that delivers advantages over each technique used in isolation. This thesis presents the design and implementation of a system that combines an algorithmic fine-grained data-parallel load balancing strategy called Fractiling with a systemic coarse-grained task-parallel load balancing system called Hector. It also reports on experimental results of running N-body simulations under this integrated system. The experimental results indicate that a distributed runtime environment that combines both algorithmic and systemic load balancing strategies can provide performance advantages with little overhead, underscoring the importance of this approach in large complex scientific applications
Distributed memory compiler methods for irregular problems: Data copy reuse and runtime partitioning
Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on iPSC/860 to demonstrate the usefulness of our methods
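The reuse of off-processor data copies described in this abstract can be illustrated with a small single-process sketch. It is not the authors' compiler or runtime interface: the `GhostCache` class, its `fetch` callback, and the `remote` dictionary standing in for another processor's memory are all hypothetical names introduced here. The idea shown is only that a cached copy is reused across loops until the owning value is known to have changed.

```python
class GhostCache:
    """Keeps local copies of off-processor values until they are invalidated."""

    def __init__(self, fetch):
        # fetch is a hypothetical callable: list of global indices -> list of values.
        # In a real system it would stand for a communication step.
        self.fetch = fetch
        self.copies = {}       # global index -> cached value
        self.fetch_count = 0   # number of values actually communicated

    def gather(self, indices):
        # Fetch only indices with no valid local copy (duplicates fetched once).
        missing = [i for i in dict.fromkeys(indices) if i not in self.copies]
        if missing:
            self.copies.update(zip(missing, self.fetch(missing)))
            self.fetch_count += len(missing)
        return [self.copies[i] for i in indices]

    def invalidate(self, indices):
        # Called when the owner may have modified these locations.
        for i in indices:
            self.copies.pop(i, None)


# 'remote' stands in for memory owned by another processor (assumption).
remote = {10: 1.5, 11: 2.5, 12: 3.5}
cache = GhostCache(lambda idx: [remote[i] for i in idx])

first = cache.gather([10, 11, 10])   # first loop: fetches 10 and 11
second = cache.gather([11, 12])      # second loop: reuses 11, fetches 12
remote[11] = 9.0                     # owner modifies a value
cache.invalidate([11])               # so the stale copy must be dropped
third = cache.gather([11])           # refetched after invalidation
```

In this sketch the second loop communicates one value instead of two, which is the saving the abstract attributes to reuse; the hard part in practice, verifying that values remain unmodified, is reduced here to an explicit `invalidate` call.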
SpECTRE: A Task-based Discontinuous Galerkin Code for Relativistic Astrophysics
We introduce a new relativistic astrophysics code, SpECTRE, that combines a discontinuous Galerkin method with a task-based parallelism model. SpECTRE's goal is to achieve more accurate solutions for challenging relativistic astrophysics problems such as core-collapse supernovae and binary neutron star mergers. The robustness of the discontinuous Galerkin method allows for the use of high-resolution shock capturing methods in regions where (relativistic) shocks are found, while exploiting high-order accuracy in smooth regions. A task-based parallelism model allows efficient use of the largest supercomputers for problems with a heterogeneous workload over disparate spatial and temporal scales. We argue that the locality and algorithmic structure of discontinuous Galerkin methods will exhibit good scalability within a task-based parallelism framework. We demonstrate the code on a wide variety of challenging benchmark problems in (non)-relativistic (magneto)-hydrodynamics. We demonstrate the code's scalability including its strong scaling on the NCSA Blue Waters supercomputer up to the machine's full capacity of 22,380 nodes using 671,400 threads. Comment: 41 pages, 13 figures, and 7 tables. Ancillary data contains simulation input file
Run-time and compile-time support for adaptive irregular problems
In adaptive irregular problems the data arrays are accessed via indirection arrays, and data access patterns change during computation. Implementing such problems on distributed memory machines requires support for dynamic data partitioning, efficient preprocessing and fast data migration. This research presents efficient runtime primitives for such problems. This new set of primitives is part of the CHAOS library. It subsumes the previous PARTI library which targeted only static irregular problems. To demonstrate the efficacy of the runtime support, two real adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a particle-in-cell code (DSMC). The paper also proposes extensions to Fortran D which can allow compilers to generate more efficient code for adaptive problems. These language extensions have been implemented in the Syracuse Fortran 90D/HPF prototype compiler. The performance of the compiler parallelized codes is compared with the hand parallelized versions
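Access through indirection arrays on a distributed memory machine is commonly handled with an inspector/executor pattern: an inspector scans the indirection array to find which elements live on other processors, and an executor then gathers them and runs the loop. The sketch below simulates this in a single process under an assumed block distribution. The function names (`block_owner`, `inspector`, `executor`) are illustrative and are not CHAOS or PARTI primitives.

```python
def block_owner(global_index, n, nprocs):
    """Return (owner rank, local offset) under a block distribution."""
    block = -(-n // nprocs)  # ceiling division
    return global_index // block, global_index % block


def inspector(indirection, my_rank, n, nprocs):
    """Build a schedule: for each remote owner, the unique indices needed."""
    schedule = {}
    for g in indirection:
        owner, _ = block_owner(g, n, nprocs)
        if owner != my_rank:
            schedule.setdefault(owner, set()).add(g)
    return {owner: sorted(needed) for owner, needed in schedule.items()}


def executor(indirection, schedule, local_parts, my_rank, n, nprocs):
    """Gather off-processor values per the schedule, then evaluate x[ia[i]]."""
    ghost = {}
    for owner, needed in schedule.items():
        for g in needed:
            _, offset = block_owner(g, n, nprocs)
            ghost[g] = local_parts[owner][offset]  # stand-in for a message
    result = []
    for g in indirection:
        owner, offset = block_owner(g, n, nprocs)
        result.append(local_parts[my_rank][offset] if owner == my_rank else ghost[g])
    return result


n, nprocs = 8, 2
x = [float(i) for i in range(n)]
local_parts = [x[0:4], x[4:8]]   # rank 0 owns 0..3, rank 1 owns 4..7
ia = [1, 6, 6, 3, 4]             # indirection array used by rank 0

sched = inspector(ia, 0, n, nprocs)
gathered = executor(ia, sched, local_parts, 0, n, nprocs)
```

In a static irregular problem the schedule is built once and reused. In an adaptive problem, where `ia` changes during the computation, the inspector has to be rerun after each change, which is why the abstract stresses efficient preprocessing.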
Performance and Memory Space Optimizations for Embedded Systems
Embedded systems have three common principles: real-time performance, low power consumption, and low price (limited hardware). Embedded computers use chip multiprocessors (CMPs) to meet these expectations. However, one of the major problems is lack of efficient software support for CMPs; in particular, automated code parallelizers are needed.
The aim of this study is to explore various ways to increase performance and to reduce resource usage and energy consumption in embedded systems. We use code restructuring, loop scheduling, data transformation, code and data placement, and scratch-pad memory (SPM) management as our tools in different embedded system scenarios. The majority of our work is focused on loop scheduling. The main contributions of our work are:
We propose a memory saving strategy that exploits the value locality in array data by storing arrays in a compressed form. Based on the compressed forms of the input arrays, our approach automatically determines the compressed forms of the output arrays and also automatically restructures the code.
We propose and evaluate a compiler-directed code scheduling scheme, which considers both parallelism and data locality. It analyzes the code using a locality parallelism graph representation, and assigns the nodes of this graph to processors. We also introduce an Integer Linear Programming based formulation of the scheduling problem.
We propose a compiler-based SPM conscious loop scheduling strategy for array/loop based embedded applications. The method is to distribute loop iterations across parallel processors in an SPM-conscious manner. The compiler identifies potential SPM hits and misses, and distributes loop iterations such that the processors have close execution times.
We present an SPM management technique using Markov chain based data access.
We propose a compiler directed integrated code and data placement scheme for 2-D mesh based CMP architectures. Using a Code-Data Affinity Graph (CDAG) to represent the relationship between loop iterations and array data, it assigns the sets of loop iterations to processing cores and sets of data blocks to on-chip memories.
We present a memory bank aware dynamic loop scheduling scheme for array intensive applications. The goal is to minimize the number of memory banks needed for executing the group of loop iterations
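The SPM-conscious loop scheduling contribution listed above distributes iterations so that processors finish at similar times, given predicted SPM hits and misses. A rough sketch of that balancing goal follows. It is not the thesis's algorithm: it substitutes a generic greedy longest-processing-time heuristic, and the `HIT` and `MISS` cycle costs, the per-iteration miss counts, and the function name are all assumptions made for illustration.

```python
import heapq


def schedule_iterations(costs, nprocs):
    """Greedy balancing: give the next most expensive iteration
    to the processor with the smallest load so far."""
    heap = [(0, p) for p in range(nprocs)]      # (current load, processor id)
    assignment = [[] for _ in range(nprocs)]
    for it in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, p = heapq.heappop(heap)
        assignment[p].append(it)
        heapq.heappush(heap, (load + costs[it], p))
    loads = [sum(costs[i] for i in iters) for iters in assignment]
    return assignment, loads


HIT, MISS = 1, 10             # assumed access costs in cycles
ACCESSES = 4                  # assumed array accesses per iteration
misses = [2, 0, 1, 3, 0, 1]   # assumed predicted SPM misses per iteration

# Estimated cost of each iteration from its predicted hits and misses.
costs = [m * MISS + (ACCESSES - m) * HIT for m in misses]

assignment, loads = schedule_iterations(costs, nprocs=2)
```

With these assumed numbers the two processors end up with estimated loads of 44 and 43 cycles, whereas splitting the six iterations into two contiguous halves would give 39 and 48.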
Distributed Memory Compiler Methods for Irregular Problems -- Data Copy Reuse and Runtime Partitioning
This paper outlines two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on an iPSC/860 to demonstrate the usefulness of our methods
Run-time and Compile-time Support for Adaptive Irregular Problems
In adaptive irregular problems the data arrays are accessed via indirection arrays, and data access patterns change during computation. Implementing such problems on distributed memory machines requires support for dynamic data partitioning, efficient preprocessing and fast data migration. This research presents efficient runtime primitives for such problems. This new set of primitives is part of the CHAOS library. It subsumes the previous PARTI library which targeted only static irregular problems. To demonstrate the efficacy of the runtime support, two real adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a particle-in-cell code (DSMC). The paper also proposes extensions to Fortran D which can allow compilers to generate more efficient code for adaptive problems. These language extensions have been implemented in the Syracuse Fortran 90D/HPF prototype compiler. The performance of the compiler parallelized codes is compared with the hand parallelized versions.
(Also cross-referenced as UMIACS-TR-94-55
- …