Compiler Techniques for Optimizing Communication and Data Distribution for Distributed-Memory Computers
Advanced Research Projects Agency (ARPA); National Aeronautics and Space Administration; Ope
Compilation techniques for multicomputers
This thesis considers problems in process and data partitioning when compiling
programs for distributed-memory parallel computers (or multicomputers). These
partitions may be specified by the user through the use of language constructs,
or automatically determined by the compiler.
Data and process partitioning techniques are developed for two models of
compilation. The first compilation model focusses on the loop nests present in a
serial program. Executing the iterations of these loop nests in parallel accounts for
a significant amount of the parallelism which can be exploited in these programs.
The parallelism is exploited by applying a set of transformations to the loop
nests. The iterations of the transformed loop nests are in a form which can be
readily distributed amongst the processors of a multicomputer. The manner in
which the arrays, referenced within these loop nests, are partitioned between the
processors is determined by the distribution of the loop iterations. The second
compilation model is based on the data parallel paradigm, in which operations
are applied to many different data items collectively. High Performance Fortran
is used as an example of this paradigm.
Novel collective communication routines are developed, and are applied to
provide the communication associated with the data partitions for both compilation
models. Furthermore, it is shown that by using these routines the
communication associated with partitioning data on a multicomputer is greatly
simplified. These routines are developed as part of this thesis.
The experimental context for this thesis is the development of a compiler for
the Fujitsu AP1000 multicomputer. A prototype compiler is presented. Experimental
results for a variety of applications are included.
Compiling Fortran 90D/HPF for distributed memory MIMD computers
This paper describes the design of the Fortran 90D/HPF compiler, a source-to-source parallel compiler for distributed memory systems being developed at Syracuse University. Fortran 90D/HPF is a data parallel language with special directives to specify data alignment and distributions. A systematic methodology to process distribution directives of Fortran 90D/HPF is presented. Furthermore, techniques for data and computation partitioning, communication detection and generation, and the run-time support for the compiler are discussed. Finally, initial performance results for the compiler are presented. We believe that the methodology to process data distribution, computation partitioning, communication system design and the overall compiler design can be used by the implementors of compilers for HPF.
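As a rough illustration of what a distribution directive decides, the sketch below maps array indices to owning processors under the two standard HPF distribution formats, BLOCK and CYCLIC. It is not taken from the Fortran 90D/HPF compiler; the function names and the 0-based indexing are assumptions made for the example (Fortran arrays are 1-based by default).

```python
from math import ceil

def block_owner(i, n, p):
    """Owner of index i (0-based, an assumption) when n elements are
    distributed BLOCK over p processors; block size is ceil(n / p)."""
    return i // ceil(n / p)

def cyclic_owner(i, p):
    """Owner of index i (0-based) under a CYCLIC distribution over p processors."""
    return i % p

def local_indices(n, p, owner):
    """Hypothetical helper: list the indices each processor owns."""
    return [[i for i in range(n) if owner(i) == r] for r in range(p)]

n, p = 10, 4
block = local_indices(n, p, lambda i: block_owner(i, n, p))
cyclic = local_indices(n, p, lambda i: cyclic_owner(i, p))
```

With 10 elements on 4 processors, BLOCK gives the last processor a single element, while CYCLIC deals elements out round-robin. A compiler can use such an owner function to decide which processor executes an assignment and which references need communication.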
On Extracting Coarse-Grained Function Parallelism from C Programs
To efficiently utilize the emerging heterogeneous multi-core architecture, it is essential to exploit the inherent coarse-grained parallelism in applications. In addition to data parallelism, applications like telecommunication, multimedia, and gaming can also benefit from the exploitation of coarse-grained function parallelism. To exploit coarse-grained function parallelism, the common wisdom is to rely on programmers to explicitly express the coarse-grained data-flow between coarse-grained functions using data-flow or streaming languages.
This research explores another approach to exploiting coarse-grained function parallelism: relying on the compiler to extract coarse-grained data-flow from imperative programs. We believe imperative languages and the von Neumann programming model will remain the dominant programming languages and programming model in the future.
This dissertation discusses the design and implementation of a memory data-flow analysis system which extracts coarse-grained data-flow from C programs. The memory data-flow analysis system partitions a C program into a hierarchy of program regions. It then traverses the program region hierarchy from the bottom up, summarizing the exposed memory access patterns for each program region while deriving conservative producer-consumer relations between program regions. An ensuing top-down traversal of the program region hierarchy refines the producer-consumer relations by pruning spurious relations.
We built an in-lining based prototype of the memory data-flow analysis system on top of the IMPACT compiler infrastructure. We applied the prototype to analyze the memory data-flow of several MediaBench programs. The experimental results showed that while the prototype performed reasonably well for the tested programs, the in-lining based implementation may not be efficient for larger programs. Also, there is still room for improving the effectiveness of the memory data-flow analysis system. We performed a root cause analysis of the inaccuracy in the memory data-flow analysis results, which provided us with insights on how to improve the memory data-flow analysis system in the future.
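The two traversals can be pictured with a heavily simplified sketch. It is not the IMPACT-based prototype: the `Region` class, the set-of-names memory model, and the assumption that every recorded write is a definite write are all invented for illustration. The bottom-up pass summarizes exposed reads and writes, a conservative pass relates every earlier writer to every later reader, and a pruning pass drops relations whose value is overwritten in between.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    name: str
    reads: frozenset = frozenset()    # exposed reads (locations read before written)
    writes: frozenset = frozenset()   # definite writes (assumption of this sketch)
    children: list = field(default_factory=list)

def summarize(region):
    """Bottom-up pass: derive a region's exposed reads and writes
    from its children, taken in program order."""
    if not region.children:
        return region.reads, region.writes
    exposed, written = set(), set()
    for child in region.children:
        r, w = summarize(child)
        exposed |= r - written        # reads not satisfied inside the region
        written |= w
    region.reads, region.writes = frozenset(exposed), frozenset(written)
    return region.reads, region.writes

def conservative_relations(regions):
    """Relate every earlier writer of a location to every later reader."""
    return {(prod.name, cons.name, loc)
            for j, cons in enumerate(regions)
            for prod in regions[:j]
            for loc in prod.writes & cons.reads}

def prune(regions, relations):
    """Drop relations whose location is overwritten between producer and consumer."""
    pos = {r.name: k for k, r in enumerate(regions)}
    return {(p, c, loc) for p, c, loc in relations
            if not any(loc in regions[k].writes
                       for k in range(pos[p] + 1, pos[c]))}

a = Region("A", writes=frozenset({"x"}))
b = Region("B", writes=frozenset({"x"}))
c = Region("C", reads=frozenset({"x", "y"}))
siblings = [a, b, c]
loose = conservative_relations(siblings)
tight = prune(siblings, loose)
parent = Region("P", children=siblings)
summarize(parent)
```

Here the conservative pass reports both A and B as producers of `x` for C, and pruning keeps only B because B overwrites A's value. The parent summary exposes only `y`, since `x` is produced inside the region.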
Automatic Data and Computation Mapping for Distributed-Memory Machines.
Distributed memory parallel computers offer enormous computation power, scalability and flexibility. However, these machines are difficult to program and this limits their widespread use. An important characteristic of these machines is the difference in the access time for data in local versus non-local memory; non-local memory accesses are much slower than local memory accesses. This is also a characteristic of shared memory machines but to a lesser degree. Therefore it is essential that, as far as possible, the data that needs to be accessed by a processor during the execution of the computation assigned to it reside in its local memory rather than in some other processor's memory. Several research projects have concluded that proper mapping of data is key to realizing the performance potential of distributed memory machines. Current language design efforts such as Fortran D and High Performance Fortran (HPF) are based on this. It is our thesis that for many practical codes, it is possible to derive good mappings through a combination of algorithms and systematic procedures. We view mapping as consisting of two phases, alignment followed by distribution. For the alignment phase we present three constraint-based methods--one based on a linear programming formulation of the problem; the second formulates the alignment problem as a constrained optimization problem using Lagrange multipliers; the third method uses a heuristic to decide which constraints to leave unsatisfied (based on the penalty of increased communication incurred in doing so) in order to find a mapping. In addressing the distribution phase, we have developed two methods that integrate the placement of computation--loop nests in our case--with the mapping of data. For one distributed dimension, our approach finds the best combination of data and computation mapping that results in low communication overhead; this is done by choosing a loop order that allows message vectorization.
In the second method, we introduce the distribution preference graph, and the operations on this graph allow us to integrate loop restructuring transformations and data mapping. These techniques produce mappings that have been used in efficient hand-coded implementations of several benchmark codes.
Automated parallel application creation and execution tool for clusters
This research investigated an automated approach to re-writing traditional sequential computer programs into parallel programs for networked computers. A tool was designed and developed for generating parallel programs automatically and also executing these parallel programs on a network of computers. Performance is maximized by utilising all idle resources.
The projector algorithm: a simple parallel algorithm for computing Voronoi diagrams and Delaunay graphs
The Voronoi diagram is a certain geometric data structure which has numerous
applications in various scientific and technological fields. The theory of
algorithms for computing 2D Euclidean Voronoi diagrams of point sites is rich
and useful, with several different and important algorithms. However, this
theory has been quite steady during the last few decades in the sense that no
essentially new algorithms have entered the game. In addition, most of the
known algorithms are serial in nature and hence pose inherent difficulties for
computing the diagram in parallel. In this paper we present
the projector algorithm: a new and simple algorithm which enables the
(combinatorial) computation of 2D Voronoi diagrams. The algorithm is
significantly different from previous ones and some of the involved concepts in
it are in the spirit of linear programming and optics. Parallel implementation
is naturally supported since each Voronoi cell can be computed independently of
the other cells. A new combinatorial structure for representing the cells (and
any convex polytope) is described along the way and the computation of the
induced Delaunay graph is obtained almost automatically.
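The point that each Voronoi cell can be computed independently can be illustrated with a small sketch. This is not the projector algorithm: it swaps in a plain brute-force method that clips a bounding box against the bisector half-plane of every other site, and all function names and the bounding box are assumptions made for the example. Because no cell depends on another, the per-site work can be handed to a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor

def clip_halfplane(poly, a, b):
    """Keep the part of a convex polygon where a[0]*x + a[1]*y <= b."""
    out = []
    for i in range(len(poly)):
        cur, nxt = poly[i], poly[(i + 1) % len(poly)]
        dc = a[0] * cur[0] + a[1] * cur[1] - b
        dn = a[0] * nxt[0] + a[1] * nxt[1] - b
        if dc <= 0:
            out.append(cur)
        if (dc < 0 < dn) or (dn < 0 < dc):   # edge crosses the boundary line
            t = dc / (dc - dn)
            out.append((cur[0] + t * (nxt[0] - cur[0]),
                        cur[1] + t * (nxt[1] - cur[1])))
    return out

def voronoi_cell(site, sites, box):
    """Brute-force cell of one site, clipped to a bounding box (assumption)."""
    px, py = site
    poly = list(box)
    for qx, qy in sites:
        if (qx, qy) == (px, py):
            continue
        # Points closer to site than to q satisfy 2(q - p).x <= |q|^2 - |p|^2.
        a = (2 * (qx - px), 2 * (qy - py))
        b = qx * qx + qy * qy - px * px - py * py
        poly = clip_halfplane(poly, a, b)
        if not poly:
            break
    return poly

def all_cells(sites, box):
    """Compute every cell independently; threads are used here only for illustration."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda s: voronoi_cell(s, sites, box), sites))

def polygon_area(poly):
    """Shoelace formula for the area of a simple polygon."""
    n = len(poly)
    return abs(sum(poly[i][0] * poly[(i + 1) % n][1]
                   - poly[(i + 1) % n][0] * poly[i][1] for i in range(n))) / 2

unit_box = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
sites = [(0.25, 0.5), (0.75, 0.5)]
cells = all_cells(sites, unit_box)
```

For the two sites above the bisector is the line x = 0.5, so each cell is half of the unit box. The brute-force method does quadratic work overall and is meant only to show the independence of the cells.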
Strategies and tools for the exploitation of massively parallel computer systems
The aim of this thesis is to develop software and strategies for the exploitation of parallel computer hardware, in particular distributed memory systems, and to embed these strategies within a parallelisation tool to allow their automatic generation.
The parallelisation of four structured mesh codes using the Computer Aided Parallelisation Tools provided a good initial parallelisation of the codes. However, investigation revealed that simple optimisation of the communications within these codes provided an even better improvement in performance. The dominant factor within the communications was the data transfer time with communication start-up latencies also significant. This was significant throughout the codes but especially in sections of pipelined code where there were large amounts of communication present.
This thesis describes the development and testing of the methods used to increase the performance of these communications by overlapping them with unrelated calculation. This method of overlapping the communications was applied to the exchange of data communications as well as the pipelined communications.
The successful application by hand provided the motivation for these methods to be incorporated and automatically generated within the Computer Aided Parallelisation Tools. These methods were integrated within these tools as an additional stage of the parallelisation. This required a generic algorithm that made use of many of the symbolic algebra tests and symbolic variable manipulation routines within the tools.
The automatic generation of overlapped communications was applied to the four codes previously parallelised as well as a further three codes, one of which was a real world Computational Fluid Dynamics code.
The methods to apply automatic generation of overlapped communications to unstructured mesh codes are also discussed. These methods are similar to those applied to the structured mesh codes, and their automation is expected to follow a similar approach.