Interprocedural Compilation of Irregular Applications for Distributed Memory Machines
Data parallel languages like High Performance Fortran (HPF) are emerging
as the architecture independent mode of programming distributed memory
parallel machines. In this paper, we present the interprocedural
optimizations required for compiling applications with irregular data
access patterns written in such data parallel languages. We have
developed an Interprocedural Partial Redundancy Elimination (IPRE)
algorithm for optimized placement of the runtime preprocessing and
collective communication routines inserted to manage communication in
such codes. We also present three new interprocedural optimizations:
placement of scatter routines, deletion of data structures and use of
coalescing and incremental routines. We then describe how program slicing
can be used for further applying IPRE in more complex scenarios. We have
done a preliminary implementation of the schemes presented here using the
Fortran D compilation system as the underlying infrastructure. We present
experimental results from two codes compiled using our system to
demonstrate the efficacy of the presented schemes.
(Also cross-referenced as UMIACS-TR-95-43)
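For readers unfamiliar with the runtime preprocessing and collective communication routines this placement problem targets, the following single-process C sketch shows the inspector/executor structure they implement for an irregular loop y(i) = y(i) + x(ia(i)). The sizes, names, and the local stand-in for the gather are illustrative assumptions, not the Fortran D runtime interface.

    #include <stdio.h>

    #define N       8            /* local loop iterations            */
    #define NLOCAL  8            /* locally owned elements of x      */
    #define NGLOBAL 16           /* global size of x                 */

    /* Inspector: scan the indirection array once, record which referenced
       elements are off-processor, and assign each a slot in a ghost buffer. */
    static int inspector(const int *ia, int n, int *offproc, int *slot)
    {
        int nghost = 0;
        for (int i = 0; i < n; i++) {
            if (ia[i] >= NLOCAL) {       /* off-processor reference        */
                offproc[nghost] = ia[i];
                slot[i] = nghost++;      /* executor reads ghost[slot[i]]  */
            } else {
                slot[i] = -1;            /* local reference                */
            }
        }
        return nghost;
    }

    int main(void)
    {
        double x[NLOCAL], ghost[N], y[N] = {0};
        double x_global[NGLOBAL];        /* stand-in for remote memory     */
        int ia[N] = {0, 3, 9, 12, 3, 15, 1, 9};
        int offproc[N], slot[N];

        for (int g = 0; g < NGLOBAL; g++) x_global[g] = (double)g;
        for (int l = 0; l < NLOCAL; l++)  x[l] = x_global[l];

        /* Runtime preprocessing (inspector): deciding where this call goes
           is exactly the placement problem solved interprocedurally above. */
        int nghost = inspector(ia, N, offproc, slot);

        /* Collective communication (gather), simulated here by local copies. */
        for (int g = 0; g < nghost; g++) ghost[g] = x_global[offproc[g]];

        /* Executor: the original irregular loop, now purely local. */
        for (int i = 0; i < N; i++)
            y[i] += (slot[i] < 0) ? x[ia[i]] : ghost[slot[i]];

        for (int i = 0; i < N; i++) printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }

Because the inspector and the gather depend only on ia and x, IPRE-style placement can hoist both calls out of enclosing loops or across procedure boundaries whenever those inputs are unchanged.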
Distributed memory compiler methods for irregular problems: Data copy reuse and runtime partitioning
Outlined here are two methods that we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on the iPSC/860 to demonstrate the usefulness of our methods.
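A rough illustration of the copy-reuse mechanism described above, under an assumed runtime interface (the schedule type, flag, and gather stub are not the paper's actual library): the gathered off-processor values carry a validity flag, the fetch is skipped while the flag holds, and the flag is cleared wherever the remote locations may be modified.

    #include <stdbool.h>
    #include <stddef.h>

    #define NGLOBAL 16
    static double x_global[NGLOBAL];     /* stand-in for remote memory      */

    typedef struct {
        int    *offproc_idx;  /* remote elements this schedule fetches      */
        size_t  nghost;
        double *ghost;        /* local copy of the fetched values           */
        bool    valid;        /* false once the remote values may change    */
    } gather_schedule;

    /* Stand-in for the runtime's actual message exchange. */
    static void gather(gather_schedule *s)
    {
        for (size_t g = 0; g < s->nghost; g++)
            s->ghost[g] = x_global[s->offproc_idx[g]];
    }

    /* Emitted before each irregular loop; the fetch is skipped whenever the
       copied values are verified to be unmodified. */
    void ensure_ghosts(gather_schedule *s)
    {
        if (!s->valid) {
            gather(s);
            s->valid = true;
        }
    }

    /* Emitted at any point where the remote locations may be written,
       e.g. after a scatter or a store the compiler cannot analyze. */
    void invalidate_ghosts(gather_schedule *s)
    {
        s->valid = false;
    }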
Optimal Compilation of HPF Remappings
Applications with varying array access patterns require array mappings to be changed dynamically on distributed-memory parallel machines. HPF (High Performance Fortran) provides such remappings, on data that can be replicated, explicitly through the realign and redistribute directives and implicitly at procedure calls and returns. However, such features are left out of the HPF subset and of the currently discussed HPF kernel for efficiency reasons. This paper presents a new compilation technique to handle HPF remappings for message-passing parallel architectures. The first phase is global and removes all useless remappings that appear naturally in procedures. The code generated by the second phase takes advantage of replication to shorten the remapping time. It is proved optimal: a minimal number of messages, containing only the required data, is sent over the network. The technique is fully implemented in HPFC, our prototype HPF compiler. Experiments were performed on a DEC Alpha farm.
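As a concrete, if simplified, picture of why a remapping does not have to move the whole array, the C sketch below computes old and new owners for a 1-D array remapped from a BLOCK to a CYCLIC distribution over P processors; only elements whose owner changes must be communicated. The distributions and sizes are assumptions for the example, and the technique above additionally exploits replication and handles remappings at procedure boundaries.

    #include <stdio.h>

    #define N 16                         /* array size            */
    #define P 4                          /* number of processors  */

    static int block_owner(int i)  { return i / ((N + P - 1) / P); }
    static int cyclic_owner(int i) { return i % P; }

    int main(void)
    {
        int moved = 0;
        for (int i = 0; i < N; i++) {
            int from = block_owner(i), to = cyclic_owner(i);
            if (from != to) {            /* only these elements travel */
                printf("element %2d: processor %d -> %d\n", i, from, to);
                moved++;
            }
        }
        printf("%d of %d elements need to be sent\n", moved, N);
        return 0;
    }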
PIPS Is not (just) Polyhedral Software: Adding GPU Code Generation in PIPS
Parallel and heterogeneous computing are growing in audience thanks to the increased performance brought by ubiquitous manycores and GPUs. However, available programming models, like OpenCL or CUDA, are far from being straightforward to use. As a consequence, several automated or semi-automated approaches have been proposed to automatically generate hardware-level codes from high-level sequential sources. Polyhedral models are becoming more popular because of their combination of expressiveness, compactness, and accurate abstraction of the data-parallel behaviour of programs. These models provide automatic or semi-automatic parallelization and code transformation capabilities that target such modern parallel architectures. PIPS is a quarter-century-old source-to-source transformation framework that initially targeted parallel machines but then evolved to include other targets. PIPS uses abstract interpretation on an integer polyhedral lattice to represent program code, allowing linear relation analysis on integer variables in an interprocedural way. The same representation is used for the dependence test and the convex array region analysis. The polyhedral model is also more classically used to schedule code from linear constraints. In this paper, we illustrate the features of this compiler infrastructure on a hypothetical input code, demonstrating the combination of polyhedral and non-polyhedral transformations. PIPS interprocedural polyhedral analyses are used to generate data transfers and are combined with non-polyhedral transformations to achieve efficient CUDA code generation.
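The sketch below, written in plain C against the CUDA runtime API, shows the kind of host-side code such a flow can generate around an outlined loop nest: the convex array regions read and written by the nest bound the copy-in and copy-out transfers. The SAXPY loop, array names, and the kernel wrapper are hypothetical and are not actual PIPS output.

    #include <cuda_runtime_api.h>
    #include <stddef.h>

    /* Hypothetical wrapper around the generated CUDA kernel for
       for (i = 0; i < n; i++) y[i] = a * x[i] + y[i];                      */
    extern void saxpy_kernel_launch(double *d_y, const double *d_x,
                                    double a, int n);

    void saxpy_offloaded(double *y, const double *x, double a, int n)
    {
        double *d_x, *d_y;
        size_t bytes = (size_t)n * sizeof(double);

        cudaMalloc((void **)&d_x, bytes);
        cudaMalloc((void **)&d_y, bytes);

        /* Copy-in: the regions the loop reads, x[0..n-1] and y[0..n-1]. */
        cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

        saxpy_kernel_launch(d_y, d_x, a, n);

        /* Copy-out: only the region the loop writes, y[0..n-1]. */
        cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_x);
        cudaFree(d_y);
    }

The generated kernel itself (not shown) would execute the loop body with the iteration space mapped onto CUDA threads.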
Compiler Techniques for Optimizing Communication and Data Distribution for Distributed-Memory Computers
Advanced Research Projects Agency (ARPA); National Aeronautics and Space Administration
Data locality and parallelism optimization using a constraint-based approach
Embedded applications are becoming increasingly complex and are processing ever-increasing datasets. In
the context of data-intensive embedded applications, there have been two complementary approaches to
enhancing application behavior, namely, data locality optimizations and improving loop-level parallelism.
Data locality needs to be enhanced to maximize the number of data accesses satisfied from the higher
levels of the memory hierarchy. On the other hand, compiler-based code parallelization schemes require
a fresh look for chip multiprocessors as interprocessor communication is much cheaper than off-chip
memory accesses. Therefore, a compiler needs to minimize the number of off-chip memory accesses. This
can be achieved by considering multiple loop nests simultaneously. Although compilers address these two
problems, there is an inherent difficulty in optimizing both data locality and parallelism simultaneously.
Therefore, an integrated approach that combines these two can generate much better results than each
individual approach. Based on these observations, this paper proposes a constraint network (CN)-based
formulation for data locality optimization and code parallelization. The paper also presents experimental
evidence, demonstrating the success of the proposed approach, and compares our results with those
obtained through previously proposed approaches. The experiments from our implementation indicate
that the proposed approach is very effective in enhancing data locality and parallelization.
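The fragment below is not the paper's constraint-network formulation; it only illustrates, on two assumed loop nests, the kind of joint decision such a framework automates: nests that traverse the same array are fused so the shared data is reused while it is still in cache, and the remaining outer loop stays parallel.

    #define N 1024

    /* Before: two separate nests, so a[] and b[] are streamed through the
       cache twice. */
    void separate(double *a, double *b, double *c)
    {
        for (int i = 0; i < N; i++) a[i] = 2.0 * b[i];
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
    }

    /* After: one parallel, locality-friendly nest; each a[i] and b[i] is
       reused immediately after it is produced or loaded. */
    void fused(double *a, double *b, double *c)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * b[i];
            c[i] = a[i] + b[i];
        }
    }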
Interprocedural Partial Redundancy Elimination and its Application to Distributed Memory Compilation
Partial Redundancy Elimination (PRE) is a general scheme for suppressing
partial redundancies that encompasses traditional optimizations such as
loop-invariant code motion and redundant code elimination. In this paper we
address the problem of performing this optimization interprocedurally. We
use interprocedural partial redundancy elimination for placement of
communication and communication preprocessing statements while compiling
for distributed memory parallel machines.
(Also cross-referenced as UMIACS-TR-95-42)
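A minimal sketch of the effect of this placement on communication, with illustrative routine names rather than the Fortran D runtime interface: the gather inside the callee is partially redundant across the caller's time-step loop whenever nothing in the loop modifies the gathered values, so interprocedural placement emits it once, ahead of the loop.

    static void gather_ghosts(void) { /* stand-in for the communication call */ }
    static void compute_step(void)  { /* stand-in for purely local work      */ }

    /* Original callee: communicates on every invocation. */
    void sweep(void)
    {
        gather_ghosts();
        compute_step();
    }

    /* After IPRE-guided placement: the caller gathers once, and the loop
       body keeps only the local work from sweep(). */
    void timestep_loop(int nsteps)
    {
        gather_ghosts();                 /* hoisted across the call boundary */
        for (int t = 0; t < nsteps; t++)
            compute_step();
    }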
An Interprocedural Framework for Placement of Asynchronous I/O Operations
Overlapping memory accesses with computations is a standard
technique for improving performance on modern architectures, which have
deep memory hierarchies. In this paper, we present a compiler technique
for overlapping accesses to secondary memory (disks) with computation. We
have developed an Interprocedural Balanced Code Placement (IBCP)
framework, which performs analysis on arbitrary recursive procedures and
arbitrary control flow and replaces synchronous I/O operations with a
balanced pair of asynchronous operations. We demonstrate how this
analysis is useful for applications that perform frequent, large accesses
to secondary memory, including applications that snapshot or checkpoint
their computations, as well as out-of-core applications.
(Also cross-referenced as UMIACS-TR-95-114)
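As an example of the transformation such a framework is meant to place, the sketch below splits a blocking read into an issue point placed as early as the analysis allows and a wait point just before the data is used, so the disk access overlaps the computation in between. It uses POSIX AIO; the file name, chunk size, and the chosen placement points are assumptions.

    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)

    int main(void)
    {
        char *buf = malloc(CHUNK);
        int fd = open("snapshot.dat", O_RDONLY);
        if (fd < 0 || buf == NULL) return 1;

        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = CHUNK;
        cb.aio_offset = 0;

        /* Issue point: start the read asynchronously. */
        if (aio_read(&cb) != 0) return 1;

        /* ... computation that does not depend on buf runs here ... */

        /* Wait point: block only when the data is actually needed. */
        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);
        ssize_t got = aio_return(&cb);
        printf("read %zd bytes while computing\n", got);

        close(fd);
        free(buf);
        return 0;
    }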
Efficient Machine-Independent Programming of High-Performance Multiprocessors
Parallel computing is regarded by most computer scientists as the most
likely approach for significantly improving computing power for scientists
and engineers. Advances in programming languages and parallelizing
compilers are making parallel computers easier to use by providing
a high-level portable programming model that protects software
investment. However, experience has shown that simply finding
parallelism is not always sufficient for obtaining good performance
from today's multiprocessors. The goal of this project is to develop
advanced compiler analysis of data and computation decompositions,
thread placement, communication, synchronization, and memory system
effects needed to take advantage of the performance-critical elements of
modern parallel architectures.