36 research outputs found
Supporting Irregular Distributions in FORTRAN 90D/HPF Compilers
This paper presents methods that make it possible to efficiently support irregular problems using data parallel languages. The approach involves the use of a portable, compiler-independent, runtime support library called CHAOS. The CHAOS runtime support library contains procedures that (1) support static and dynamic distributed array partitioning, (2) partition loop iterations and indirection arrays, (3) remap arrays from one distribution to another, and (4) carry out index translation, buffer allocation and communication schedule generation. The CHAOS runtime procedures are used by a prototype Fortran 90D compiler as runtime support for irregular problems. This paper also presents performance results of compiler-generated and hand-parallelized versions of two stripped down applications codes. The first code is derived from an unstructured mesh computational fluid dynamics flow solver and the second is derived from the molecular dynamics code CHARMM. A method is described that makes it possible to emulate irregular distributions in HPF by reordering elements of data arrays and renumbering indirection arrays. The results suggest that HPF compiler could use reordering and renumbering extrinsic functions to obtain performance comparable to that achieved by a compiler for a language (such as Fortran 90D) that directly supports irregular distributions
Supporting Irregular Distributions Using Data-Parallel Languages
Languages such as Fortran D provide irregular distribution schemes that can efficiently support irregular problems. Irregular distributions can also be emulated in HPF. Compilers can incorporate runtime procedures to automatically support these distributions
Code-Optimierung im Polyedermodell - Effizienzsteigerung von parallelen Schleifensätzen
A safe basis for automatic loop parallelization is the polyhedron model which represents the iteration domain of a loop nest as a polyhedron in . However, turning the parallel loop program in the model to efficient code meets with several obstacles, due to which performance may deteriorate seriously -- especially on distributed memory architectures. We introduce a fine-grained model of the computation performed and show how this model can be applied to create efficient code
Emerging trends proceedings of the 17th International Conference on Theorem Proving in Higher Order Logics: TPHOLs 2004
technical reportThis volume constitutes the proceedings of the Emerging Trends track of the 17th International Conference on Theorem Proving in Higher Order Logics (TPHOLs 2004) held September 14-17, 2004 in Park City, Utah, USA. The TPHOLs conference covers all aspects of theorem proving in higher order logics as well as related topics in theorem proving and verification. There were 42 papers submitted to TPHOLs 2004 in the full research cate- gory, each of which was refereed by at least 3 reviewers selected by the program committee. Of these submissions, 21 were accepted for presentation at the con- ference and publication in volume 3223 of Springer?s Lecture Notes in Computer Science series. In keeping with longstanding tradition, TPHOLs 2004 also offered a venue for the presentation of work in progress, where researchers invite discussion by means of a brief introductory talk and then discuss their work at a poster session. The work-in-progress papers are held in this volume, which is published as a 2004 technical report of the School of Computing at the University of Utah
Understanding retargeting compilation techniques for network processors
Mémoire numérisé par la Direction des bibliothèques de l'Université de Montréal
On expressing different concurrency paradigms on virtual execution environment
Virtual execution environments (VEE) such as the Java Virtual Machine (JVM) and the Microsoft Common Language Runtime (CLR) have been designed when the dominant computer architecture featured a Von-Neumann interface to programs: a single processor hiding all the complexity of parallel computations inside its design. Programs are expressed in an intermediate form that is executed by the VEE that defines an abstract computational model in which the concurrency model has been influenced by these design choices and it basically exposes the multi-threading model of the underlying operating system. Recently computer systems have introduced computational units in which concurrency is explicit and under program control. Relevant examples are the Graphical Processing Units (GPU such as Nvidia or AMD) and the Cell BE architecture which allow for explicit control of single processing unit, local memories and communication channels. Unfortunately programs designed for Virtual Machines cannot access to these resources since are not available through the abstractions provided by the VEE. A major redesign of VEEs seems to be necessary in order to bridge this gap. In this thesis we study the problem of exposing non-von Neumann computing resources within the Virtual Machine without need for a redesign of the whole execution infrastructure. In this work we express parallel computations relying on extensible meta-data and reflection to encode information. Meta-programming techniques are then used to rewrite the program into an equivalent one using the special purpose underlying architecture. We provide a case study in which this approach is applied to compiling Common Intermediate Language (CIL) methods to multi-core GPUs; we show that it is possible to access these non-standard computing resources without any change to the virtual machine design
Data layout types : a type-based approach to automatic data layout transformations for improved SIMD vectorisation
The increasing complexity of modern hardware requires sophisticated programming
techniques for programs to run efficiently. At the same time, increased power of
modern hardware enables more advanced analyses to be included in compilers. This
thesis focuses on one particular optimisation technique that improves utilisation
of vector units. The foundation of this technique is the ability to chose memory
mappings for data structures of a given program.
Usually programming languages use a fixed layout for logical data structures
in physical memory. Such a static mapping often has a negative effect on usability
of vector units. In this thesis we consider a compiler for a programming language
that allows every data structure in a program to have its own data layout. We
make sure that data layouts across the program are sound, and most importantly
we solve a problem of automatic data layout reconstruction. To consistently do this,
we formulate this as a type inference problem, where type encodes a data layout
for a given structure as well as implied program transformations. We prove that
type-implied transformations preserve semantics of the original programs and we
demonstrate significant performance improvements when targeting SIMD-capable
architectures
Doctor of Philosophy
dissertationIn the static analysis of functional programs, control- ow analysis (k-CFA) is a classic method of approximating program behavior as a infinite state automata. CFA2 and abstract garbage collection are two recent, yet orthogonal improvements, on k-CFA. CFA2 approximates program behavior as a pushdown system, using summarization for the stack. CFA2 can accurately approximate arbitrarily-deep recursive function calls, whereas k-CFA cannot. Abstract garbage collection removes unreachable values from the store/heap. If unreachable values are not removed from a static analysis, they can become reachable again, which pollutes the final analysis and makes it less precise. Unfortunately, as these two techniques were originally formulated, they are incompatible. CFA2's summarization technique for managing the stack obscures the stack such that abstract garbage collection is unable to examine the stack for reachable values. This dissertation presents introspective pushdown control-flow analysis, which manages the stack explicitly through stack changes (pushes and pops). Because this analysis is able to examine the stack by how it has changed, abstract garbage collection is able to examine the stack for reachable values. Thus, introspective pushdown control-flow analysis merges successfully the benefits of CFA2 and abstract garbage collection to create a more precise static analysis. Additionally, the high-performance computing community has viewed functional programming techniques and tools as lacking the efficiency necessary for their applications. Nebo is a declarative domain-specific language embedded in C++ for discretizing partial differential equations for transport phenomena. For efficient execution, Nebo exploits a version of expression templates, based on the C++ template system, which is a type-less, completely-pure, Turing-complete functional language with burdensome syntax. Nebo's declarative syntax supports functional tools, such as point-wise lifting of complex expressions and functional composition of stencil operators. Nebo's primary abstraction is mathematical assignment, which separates what a calculation does from how that calculation is executed. Currently Nebo supports single-core execution, multicore (thread-based) parallel execution, and GPU execution. With single-core execution, Nebo performs on par with the loops and code that it replaces in Wasatch, a pre-existing high-performance simulation project. With multicore (thread-based) execution, Nebo can linearly scale (with roughly 90% efficiency) up to 6 processors, compared to its single-core execution. Moreover, Nebo's GPU execution can be up to 37x faster than its single-core execution. Finally, Wasatch (the pre-existing high-performance simulation project which uses Nebo) can scale up to 262K cores
Indexed dependence metadata and its applications in software performance optimisation
To achieve continued performance improvements, modern microprocessor design is tending to concentrate
an increasing proportion of hardware on computation units with less automatic management
of data movement and extraction of parallelism. As a result, architectures increasingly include multiple
computation cores and complicated, software-managed memory hierarchies. Compilers have
difficulty characterizing the behaviour of a kernel in a general enough manner to enable automatic
generation of efficient code in any but the most straightforward of cases.
We propose the concept of indexed dependence metadata to improve application development and
mapping onto such architectures. The metadata represent both the iteration space of a kernel and the
mapping of that iteration space from a given index to the set of data elements that iteration might
use: thus the dependence metadata is indexed by the kernel’s iteration space. This explicit mapping
allows the compiler or runtime to optimise the program more efficiently, and improves the program
structure for the developer. We argue that this form of explicit interface specification reduces the need
for premature, architecture-specific optimisation. It improves program portability, supports intercomponent
optimisation and enables generation of efficient data movement code.
We offer the following contributions: an introduction to the concept of indexed dependence metadata
as a generalisation of stream programming, a demonstration of its advantages in a component
programming system, the decoupled access/execute model for C++ programs, and how indexed dependence
metadata might be used to improve the programming model for GPU-based designs. Our
experimental results with prototype implementations show that indexed dependence metadata supports
automatic synthesis of double-buffered data movement for the Cell processor and enables aggressive
loop fusion optimisations in image processing, linear algebra and multigrid application case
studies