113 research outputs found
Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation
Massively parallel accelerators such as GPGPUs, manycores and FPGAs represent
a powerful and affordable tool for scientists who look to speed up simulations
of complex systems. However, porting code to such devices requires a detailed
understanding of heterogeneous programming tools and effective strategies for
parallelization. In this paper we present a source to source compilation
approach with whole-program analysis to automatically transform single-threaded
FORTRAN 77 legacy code into OpenCL-accelerated programs with parallelized
kernels.
The main contributions of our work are: (1) whole-source refactoring to allow
any subroutine in the code to be offloaded to an accelerator. (2) Minimization
of the data transfer between the host and the accelerator by eliminating
redundant transfers. (3) Pragmatic auto-parallelization of the code to be
offloaded to the accelerator by identification of parallelizable maps and
reductions.
We have validated the code transformation performance of the compiler on the
NIST FORTRAN 78 test suite and several real-world codes: the Large Eddy
Simulator for Urban Flows, a high-resolution turbulent flow model; the shallow
water component of the ocean model Gmodel; the Linear Baroclinic Model, an
atmospheric climate model and Flexpart-WRF, a particle dispersion simulator.
The automatic parallelization component has been tested on as 2-D Shallow
Water model (2DSW) and on the Large Eddy Simulator for Urban Flows (UFLES) and
produces a complete OpenCL-enabled code base. The fully OpenCL-accelerated
versions of the 2DSW and the UFLES are resp. 9x and 20x faster on GPU than the
original code on CPU, in both cases this is the same performance as manually
ported code.Comment: 12 pages, 5 figures, submitted to "Computers and Fluids" as full
paper from ParCFD conference entr
Design and implementation of an array language for computational science on a heterogeneous multicore architecture
The packing of multiple processor cores onto a single chip has become a mainstream solution to fundamental physical issues relating to the microscopic scales employed in the manufacture of semiconductor components. Multicore architectures provide lower clock speeds per core, while aggregate floating-point capability continues to increase.
Heterogeneous multicore chips, such as the Cell Broadband Engine (CBE) and modern graphics chips, also address the related issue of an increasing mismatch between high processor speeds, and huge latency to main memory. Such chips tackle this memory wall by the provision of addressable caches; increased bandwidth to main memory; and fast thread context switching. An associated cost is often reduced functionality of the individual accelerator cores; and the increased complexity involved in their programming.
This dissertation investigates the application of a programming language supporting the first-class use of arrays; and capable of automatically parallelising array expressions; to the heterogeneous multicore domain of the CBE, as found in the Sony PlayStation 3 (PS3). The language is a pre-existing and well-documented proper subset of Fortran, known as the ‘F’ programming language. A bespoke compiler, referred to as E , is developed to support this aim, and written in the Haskell programming language.
The output of the compiler is in an extended C++ dialect known as Offload C++, which targets the PS3. A significant feature of this language is its use of multiple, statically typed, address spaces. By focusing on generic, polymorphic interfaces for both the generated and hand constructed code, a number of interesting design patterns relating to the memory locality are introduced.
A suite of medium-sized (100-700 lines), real-world benchmark programs are used to evaluate the performance, correctness, and scalability of the compiler technology. Absolute speedup values, well in excess of one, are observed for all of the programs.
The work ultimately demonstrates that an array language can significantly reduce the effort expended to utilise a parallel heterogeneous multicore architecture, while retaining high performance. A substantial, related advantage in using standard ‘F’ is that any Fortran compiler can create debuggable, and competitively performing serial programs
A metadata-enhanced framework for high performance visual effects
This thesis is devoted to reducing the interactive latency of image processing computations in
visual effects. Film and television graphic artists depend upon low-latency feedback to receive
a visual response to changes in effect parameters. We tackle latency with a domain-specific optimising
compiler which leverages high-level program metadata to guide key computational and
memory hierarchy optimisations. This metadata encodes static and dynamic information about
data dependence and patterns of memory access in the algorithms constituting a visual effect –
features that are typically difficult to extract through program analysis – and presents it to the
compiler in an explicit form. By using domain-specific information as a substitute for program
analysis, our compiler is able to target a set of complex source-level optimisations that a vendor
compiler does not attempt, before passing the optimised source to the vendor compiler for
lower-level optimisation.
Three key metadata-supported optimisations are presented. The first is an adaptation of
space and schedule optimisation – based upon well-known compositions of the loop fusion and
array contraction transformations – to the dynamic working sets and schedules of a runtimeparameterised
visual effect. This adaptation sidesteps the costly solution of runtime code generation
by specialising static parameters in an offline process and exploiting dynamic metadata to
adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second
optimisation comprises a set of transformations to generate SIMD ISA-augmented source code.
Our approach differs from autovectorisation by using static metadata to identify parallelism, in
place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable
parameters for optimal aligned memory access. The third optimisation comprises a related set
of transformations to generate code for SIMT architectures, such as GPUs. Static dependence
metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads.
Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by identifying
inter-thread and intra-core data sharing opportunities in memory access metadata.
A detailed performance analysis of these optimisations is presented for two industrially developed
visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD
multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations
of these two effects. Programmability is enhanced by automating the generation of
SIMD and SIMT implementations from a single programmer-managed scalar representation
A parallel functional language compiler for message-passing multicomputers
The research presented in this thesis is about the design and implementation of Naira, a parallel, parallelising compiler for a rich, purely functional programming language. The source language of the compiler is a subset of Haskell 1.2. The front end of Naira is written entirely in the Haskell subset being compiled. Naira has been successfully parallelised and it is the largest successfully parallelised Haskell program having achieved good absolute speedups on a network of SUN workstations. Having the same basic structure as other production compilers of functional languages, Naira's parallelisation technology should carry forward to other functional language compilers. The back end of Naira is written in C and generates parallel code in the C language which is envisioned to be run on distributed-memory machines. The code generator is based on a novel compilation scheme specified using a restricted form of Milner's 7r-calculus which achieves asynchronous communication. We present the first working implementation of this scheme on distributed-memory message-passing multicomputers with split-phase transactions. Simulated assessment of the generated parallel code indicates good parallel behaviour. Parallelism is introduced using explicit, advisory user annotations in the source' program and there are two major aspects of the use of annotations in the compiler. First, the front end of the compiler is parallelised so as to improve its efficiency at compilation time when it is compiling input programs. Secondly, the input programs to the compiler can themselves contain annotations based on which the compiler generates the multi-threaded parallel code. These, therefore, make Naira, unusually and uniquely, both a parallel and a parallelising compiler. We adopt a medium-grained approach to granularity where function applications form the unit of parallelism and load distribution. We have experimented with two different task distribution strategies, deterministic and random, and have also experimented with thread-based and quantum- based scheduling policies. Our experiments show that there is little efficiency difference for regular programs but the quantum-based scheduler is the best in programs with irregular parallelism. The compiler has been successfully built, parallelised and assessed using both idealised and realistic measurement tools: we obtained significant compilation speed-ups on a variety of simulated parallel architectures. The simulated results are supported by the best results obtained on real hardware for such a large program: we measured an absolute speedup of 2.5 on a network of 5 SUN workstations. The compiler has also been shown to have good parallelising potential, based on popular test programs. Results of assessing Naira's generated unoptimised parallel code are comparable to those produced by other successful parallel implementation projects
Natural Language Processing
The subject of Natural Language Processing can be considered in both broad and narrow senses. In the broad sense, it covers processing issues at all levels of natural language understanding, including speech recognition, syntactic and semantic analysis of sentences, reference to the discourse context (including anaphora, inference of referents, and more extended relations of discourse coherence and narrative structure), conversational inference and implicature, and discourse planning and generation. In the narrower sense, it covers the syntactic and semantic processing sentences to deliver semantic objects suitable for referring, inferring, and the like. Of course, the results of inference and reference may under some circumstances play a part in processing in the narrow sense. But the processes that are characteristic of these other modules are not the primary concern
Granularity in Large-Scale Parallel Functional Programming
This thesis demonstrates how to reduce the runtime of large non-strict functional programs using parallel evaluation. The parallelisation of several programs shows the importance of granularity, i.e. the computation costs of program expressions. The aspect of granularity is studied both on a practical level, by presenting and measuring runtime granularity improvement mechanisms, and at a more formal level, by devising a static granularity analysis. By parallelising several large functional programs this thesis demonstrates for the first time the advantages of combining lazy and parallel evaluation on a large scale: laziness aids modularity, while parallelism reduces runtime. One of the parallel programs is the Lolita system which, with more than 47,000 lines of code, is the largest existing parallel non-strict functional program. A new mechanism for parallel programming, evaluation strategies, to which this thesis contributes, is shown to be useful in this parallelisation. Evaluation strategies simplify parallel programming by separating algorithmic code from code specifying dynamic behaviour. For large programs the abstraction provided by functions is maintained by using a data-oriented style of parallelism, which defines parallelism over intermediate data structures rather than inside the functions. A highly parameterised simulator, GRANSIM, has been constructed collaboratively and is discussed in detail in this thesis. GRANSIM is a tool for architecture-independent parallelisation and a testbed for implementing runtime-system features of the parallel graph reduction model. By providing an idealised as well as an accurate model of the underlying parallel machine, GRANSIM has proven to be an essential part of an integrated parallel software engineering environment. Several parallel runtime- system features, such as granularity improvement mechanisms, have been tested via GRANSIM. It is publicly available and in active use at several universities worldwide. In order to provide granularity information this thesis presents an inference-based static granularity analysis. This analysis combines two existing analyses, one for cost and one for size information. It determines an upper bound for the computation costs of evaluating an expression in a simple strict higher-order language. By exposing recurrences during cost reconstruction and using a library of recurrences and their closed forms, it is possible to infer the costs for some recursive functions. The possible performance improvements are assessed by measuring the parallel performance of a hand-analysed and annotated program
PolyAPM: Vergleichende Parallelprogrammierung mit Abstrakten Parallelen Maschinen
A parallelising compilation consists of many translation and optimisation stages. The programmer may steer the compiler through these stages by supplying directives with the source code or setting compiler switches. However, for an evaluation of the effects of individual stages, their selection and their best order, this approach is not optimal. To solve this problem, we propose the following method. The compilation is cast as a sequence of program transformations. Each intermediate program runs on an Abstract Parallel Machine (APM), while the program generated by the final transformation runs on the target architecture. Our intermediate programs are all in the same language, Haskell. Thus, each program is executable and still abstract enough to be legible, which enables the evaluation of the transformation that generated it. This evaluation is supported by a cost model, which makes a performance prediction of the abstract program for a real machine. Our project, PolyAPM, provides an acyclic directed graph -- usually a tree -- of APMs whose traversal specifies different combinations and orders of transformations. From one source program, several target programs can be constructed. Their run time characteristics can be evaluated and compared. The goal of PolyAPM is not to support the one-off construction of parallel application programs. For the method's overhead to pay off, the project aims rather at supporting the construction and comparison of many similar variations of a parallel program and a comparative evaluation of parallelisation techniques. With the automation of transformations, PolyAPM can also be used to construct semi-automatic compilation systems.Eine parallelisierende Compilation besteht aus vielen Übersetzungs- und Optimierungsstufen. Der Programmierer kann den Compiler in diesen Stufen steuern, in dem er im Quellcode Anweisungen einfügt oder Compileroptionen verwendet. Für eine Bewertung der Auswirkungen der einzelnen Stufen, der Auswahl der Stufen und ihrer besten Reihenfolge ist der Ansatz aber nicht geeignet. Um dieses Problem zu lösen, schlagen wir folgende Methode vor. Eine Compilation wird als Abfolge von Programmtransformationen betrachtet. Jedes Zwischenprogramm gehört jeweils zu einer Abstrakten Parallelen Maschine (APM), während das durch die letzte Transformation erzeugte Program für die Zielarchitektur bestimmt ist. Alle Zwischenprogramme sind in der Sprache Haskell geschrieben. Dadurch ist jedes Programm ausführbar und trotzdem abstrakt genug, um gut lesbar zu sein. Durch diese Ausführbarkeit kann die Transformation, durch die das Programm erzeugt wird, bewertet werden. Diese Bewertung wird durch ein Kostenmodell unterstützt, das eine Performance-Vorhersage des abstrakten Programms, bezogen auf eine reale Maschine, ermöglicht. Unser Projekt PolyAPM liefert einen azyklischen, gerichteten Graphen - in der Regel einen Baum - aus APMs, dessen Traversierungen jeweils bestimmte Kombinationen und Reihenfolgen von Transformationen definieren. Aus einem Quellprogramm können verschiedene Zielprogramme erzeugt werden, deren Laufzeitverhalten bewert- und vergleichbar ist. Das Ziel von PolyAPM liegt nicht in der Erzeugung eines einzelnen, parallelen Programms. Damit sich der zusätzliche Aufwand der Methode auszahlt, richtet sich das Projekt eher auf die Entwicklung und den Vergleich vieler, ähnlicher Variationen eines parallelen Programms und der vergleichenden Bewertung von Parallelisierungstechniken. Mit der Automatisierung von Transformationen kann PolyAPM dazu benutzt werden, halbautomatische Compilations-Systeme zu bauen
The exploitation of parallelism on shared memory multiprocessors
PhD ThesisWith the arrival of many general purpose shared memory multiple processor
(multiprocessor) computers into the commercial arena during the mid-1980's, a
rift has opened between the raw processing power offered by the emerging
hardware and the relative inability of its operating software to effectively deliver
this power to potential users. This rift stems from the fact that, currently, no
computational model with the capability to elegantly express parallel activity is
mature enough to be universally accepted, and used as the basis for programming
languages to exploit the parallelism that multiprocessors offer. To add to this,
there is a lack of software tools to assist programmers in the processes of designing
and debugging parallel programs.
Although much research has been done in the field of programming languages,
no undisputed candidate for the most appropriate language for programming
shared memory multiprocessors has yet been found. This thesis examines why this
state of affairs has arisen and proposes programming language constructs,
together with a programming methodology and environment, to close the ever
widening hardware to software gap.
The novel programming constructs described in this thesis are intended for use
in imperative languages even though they make use of the synchronisation
inherent in the dataflow model by using the semantics of single assignment when
operating on shared data, so giving rise to the term shared values. As there are
several distinct parallel programming paradigms, matching flavours of shared
value are developed to permit the concise expression of these paradigms.The Science and Engineering Research Council
- …