8 research outputs found
Interprocedural Partial Redundancy Elimination and its Application to Distributed Memory Compilation
Partial Redundancy Elimination (PRE) is a general scheme for suppressing
partial redundancies that encompasses traditional optimizations such as
loop-invariant code motion and redundant code elimination. In this paper we
address the problem of performing this optimization interprocedurally. We
use interprocedural partial redundancy elimination for placement of
communication and communication preprocessing statements while compiling
for distributed memory parallel machines.
(Also cross-referenced as UMIACS-TR-95-42)
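The effect of partial redundancy elimination can be illustrated with a minimal sketch of its best-known special case, loop-invariant code motion. The function names below are illustrative, not from the paper:

```python
# A minimal sketch of what PRE does in the loop-invariant case: the
# expression `a * b` does not change inside the loop, so it is computed
# once at an earlier program point instead of on every iteration.

def before_pre(a, b, xs):
    # `a * b` is recomputed on every iteration (partially redundant).
    return [a * b + x for x in xs]

def after_pre(a, b, xs):
    # PRE hoists the invariant computation to a single earlier point.
    t = a * b            # computed once
    return [t + x for x in xs]

print(before_pre(2, 3, [1, 2, 3]))  # [7, 8, 9]
print(after_pre(2, 3, [1, 2, 3]))   # [7, 8, 9]
```

The interprocedural version applies the same idea across procedure boundaries, which the paper uses to hoist communication and preprocessing statements out of loops and callees.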
A Global Communication Optimization Technique Based on Data-Flow Analysis and Linear Algebra
Reducing communication overhead is extremely important in distributed-memory message-passing architectures. In this paper, we present a technique to improve communication that considers data access patterns of the entire program. Our approach is based on a combination of traditional data-flow analysis and a linear algebra framework, and works on structured programs with conditional statements and nested loops but without arbitrary goto statements. The distinctive features of the solution are the accuracy in keeping communication set information, support for general alignments and distributions including block-cyclic distributions, and the ability to simulate some of the previous approaches with suitable modifications. We also show how optimizations such as message vectorization, message coalescing, and redundancy elimination are supported by our framework. Experimental results on several benchmarks show that our technique is effective in reducing the number of messages (an average of 32% reduction), the volume of the data communicated (an average of 37% reduction), and the execution time (an average of 26% reduction).
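Message vectorization, one of the optimizations the framework supports, replaces per-element messages with a single message carrying all the elements a remote processor needs. The sketch below only models the idea; the `Channel` class is a stand-in for a real message-passing layer, not an API from the paper:

```python
# Hedged sketch of message vectorization: instead of sending one array
# element per loop iteration, the elements destined for a remote
# processor are gathered and sent as one message.

class Channel:
    """Illustrative stand-in for a network link; counts messages sent."""
    def __init__(self):
        self.messages = []          # each entry models one network message
    def send(self, payload):
        self.messages.append(payload)

def elementwise(data, ch):
    for x in data:                  # one message per element
        ch.send([x])

def vectorized(data, ch):
    ch.send(list(data))             # one message carrying all elements

a, b = Channel(), Channel()
elementwise(range(4), a)
vectorized(range(4), b)
print(len(a.messages), len(b.messages))  # 4 1
```

Message coalescing extends the same idea by merging messages that carry overlapping data sets, which is where accurate communication set information pays off.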
Contention elimination by replication of sequential sections in distributed shared memory programs
In shared memory programs contention often occurs at the transition between a sequential and a parallel section of the code. As all threads start executing the parallel section, they often access data just modified by the thread that executed the sequential section, causing a flurry of data requests to converge on that processor. We address this problem in a software distributed shared memory system by replicating the execution of the sequential sections on all processors. Communication during this replicated sequential execution is reduced by using multicast. We have implemented replicated sequential execution with multicast support in OpenMP/NOW, a version of OpenMP that runs on networks of workstations. We do not rely on compile-time data analysis, and therefore we can handle irregular and pointer-based applications. We show significant improvement for two pointer-based applications that suffer from severe contention without replicated sequential execution.
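The core idea can be sketched in a few lines: each worker re-executes the sequential section on its own data rather than fetching the result from the one processor that computed it. This sketch uses threads only to mirror the structure; the actual system operates on a software DSM over a network of workstations:

```python
# Sketch of replicated sequential execution: every worker recomputes the
# (deterministic) sequential section locally, so no worker has to fetch
# the result from a single producer and no contention hot spot forms.

from concurrent.futures import ThreadPoolExecutor

def sequential_section(n):
    # Deterministic work: every replica computes the same value.
    return sum(range(n))

def worker(n):
    local = sequential_section(n)   # replicated execution, no remote fetch
    return local + 1                # parallel section uses the local copy

with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(worker, [100] * 4))
print(results)  # [4951, 4951, 4951, 4951]
```

Replication trades redundant computation for the elimination of the request flurry; the paper's multicast support further cuts the communication the replicas themselves need.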
Combining Compile-Time and Run-Time Support for Efficient Distributed Shared Memory
We describe an integrated compile-time and run-time system for efficient shared memory parallel computing on distributed memory machines. The combined system presents the user with a shared memory programming model, with its well-known benefits in terms of ease of use. The run-time system implements a consistent shared memory abstraction using memory access detection and automatic data caching. The compiler improves the efficiency of the shared memory implementation by directing the run-time system to exploit the message passing capabilities of the underlying hardware. To do so, the compiler analyzes shared memory accesses, and transforms the code to insert calls to the run-time system that provide it with the access information computed by the compiler. The run-time system is augmented with the appropriate entry points to use this information to implement bulk data transfer and to reduce the overhead of run-time consistency maintenance. In those cases where the compiler analysis succeeds for the entire program, we demonstrate that the combined system achieves performance comparable to that produced by compilers that directly target message passing. If the compiler analysis is successful only for parts of the program, for instance, because of irregular accesses to some of the arrays, the resulting optimizations can be applied to those parts for which the analysis succeeds. If the compiler analysis fails entirely, we rely on the run-time's maintenance of shared memory, and thereby avoid the complexity and the limitations of compilers that directly target message passing. The result is a single system that combines efficient support for both regular and irregular memory access patterns.
An Interprocedural Framework for Placement of Asynchronous I/O Operations
Overlapping memory accesses with computations is a standard
technique for improving performance on modern architectures, which have
deep memory hierarchies. In this paper, we present a compiler technique
for overlapping accesses to secondary memory (disks) with computation. We
have developed an Interprocedural Balanced Code Placement (IBCP)
framework, which performs analysis on arbitrary recursive procedures and
arbitrary control flow and replaces synchronous I/O operations with a
balanced pair of asynchronous operations. We demonstrate how this
analysis is useful for applications which perform frequent and large
accesses to secondary memory, including applications which snapshot or
checkpoint their computations or out-of-core applications.
(Also cross-referenced as UMIACS-TR-95-114)
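The transformation the framework performs, splitting one synchronous I/O call into a balanced start/wait pair so the wait overlaps with computation, can be sketched as follows. This uses a plain thread as a stand-in for an asynchronous I/O facility; the function names are illustrative, not the paper's API:

```python
# Sketch of balanced asynchronous I/O placement: the read is *started*
# as early as possible and *completed* just before the data is needed,
# so the I/O latency overlaps with useful computation. Each start is
# matched by exactly one wait (the "balanced pair").

import os
import tempfile
import threading

def async_read_start(path, out):
    """Start half of the pair: issue the read in the background."""
    def run():
        with open(path, "rb") as f:
            out["data"] = f.read()
    t = threading.Thread(target=run)
    t.start()
    return t

def async_read_wait(t):
    """Stop half of the pair: exactly one wait per start."""
    t.join()

# Prepare a small file standing in for a snapshot/checkpoint.
fd, path = tempfile.mkstemp()
os.write(fd, b"snapshot")
os.close(fd)

buf = {}
handle = async_read_start(path, buf)  # issue I/O early
busy = sum(range(1000))               # computation overlapped with the read
async_read_wait(handle)               # complete I/O just before use
print(buf["data"], busy)
os.remove(path)
```

The interprocedural analysis is what lets the start and the wait land in different procedures while keeping the pair balanced on every control-flow path.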
Interprocedural Compilation of Irregular Applications for Distributed Memory Machines
Data parallel languages like High Performance Fortran (HPF) are emerging
as the architecture independent mode of programming distributed memory
parallel machines. In this paper, we present the interprocedural
optimizations required for compiling applications having irregular data
access patterns, when coded in such data parallel languages. We have
developed an Interprocedural Partial Redundancy Elimination (IPRE)
algorithm for optimized placement of runtime preprocessing routines and
collective communication routines inserted for managing communication in
such codes. We also present three new interprocedural optimizations:
placement of scatter routines, deletion of data structures and use of
coalescing and incremental routines. We then describe how program slicing
can be used for further applying IPRE in more complex scenarios. We have
done a preliminary implementation of the schemes presented here using the
Fortran D compilation system as the necessary infrastructure. We present
experimental results from two codes compiled using our system to
demonstrate the efficacy of the presented schemes.
(Also cross-referenced as UMIACS-TR-95-43)
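The payoff of interprocedural placement can be sketched with a toy example: a preprocessing step invoked inside a routine on every call is hoisted to the caller once analysis shows its input does not change across calls. All names here are illustrative, not from the Fortran D system:

```python
# Sketch of IPRE-style hoisting across a call boundary: the preprocessing
# routine (standing in for a runtime communication-scheduling step) runs
# once in the caller instead of on every invocation of the callee.

calls = {"preprocess": 0}

def preprocess(access_pattern):
    calls["preprocess"] += 1
    return sorted(access_pattern)   # stand-in for schedule construction

def compute(data, schedule):
    return sum(schedule) + len(data)

def unoptimized(data, n):
    # Preprocessing is redone on every iteration.
    return [compute(data, preprocess(data)) for _ in range(n)]

def optimized(data, n):
    sched = preprocess(data)        # hoisted: placed once in the caller
    return [compute(data, sched) for _ in range(n)]

calls["preprocess"] = 0
r1 = unoptimized([3, 1, 2], 5); c1 = calls["preprocess"]
calls["preprocess"] = 0
r2 = optimized([3, 1, 2], 5); c2 = calls["preprocess"]
print(r1 == r2, c1, c2)  # True 5 1
```

In irregular codes the hoisted step is a runtime preprocessing or collective communication routine, which is far more expensive than this toy `sorted` call, so eliminating the redundant invocations matters.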
Give-N-Take -- A Balanced Code Placement Framework
GIVE-N-TAKE is a code placement framework that uses a general producer-consumer concept. An advantage of GIVE-N-TAKE over existing partial redundancy elimination techniques is its concept of production regions, instead of single locations, which can be beneficial for general latency hiding. GIVE-N-TAKE guarantees balanced production, that is, each production will be started and stopped exactly once. The framework can also take advantage of production coming "for free," as induced by side effects, without disturbing balance. GIVE-N-TAKE can place production either before or after consumption, and it also provides the option to hoist code out of potentially zero-trip loop (nest) constructs. GIVE-N-TAKE uses a fast elimination method based on Tarjan intervals, with a complexity linear in the program size in most cases.