32 research outputs found

    Strategies and tools for the exploitation of massively parallel computer systems

    The aim of this thesis is to develop software and strategies for exploiting parallel computer hardware, in particular distributed memory systems, and to embed these strategies within a parallelisation tool so that they can be generated automatically. Parallelising four structured mesh codes with the Computer Aided Parallelisation Tools gave a good initial parallelisation. However, investigation revealed that simple optimisation of the communications within these codes improved performance further. The dominant factor within the communications was the data transfer time, with communication start-up latencies also significant. This cost was felt throughout the codes, but especially in sections of pipelined code where large amounts of communication were present. This thesis describes the development and testing of methods that increase the performance of these communications by overlapping them with unrelated calculation. The overlapping method was applied to data exchange communications as well as to pipelined communications. The successful application by hand provided the motivation for incorporating these methods into the Computer Aided Parallelisation Tools, where they are generated automatically as an additional stage of the parallelisation. This required a generic algorithm that made use of many of the symbolic algebra tests and symbolic variable manipulation routines within the tools. The automatic generation of overlapped communications was applied to the four codes previously parallelised and to a further three codes, one of which was a real-world Computational Fluid Dynamics code. Methods for applying automatic generation of overlapped communications to unstructured mesh codes are also discussed. These methods are similar to those applied to the structured mesh codes, and their automation is expected to proceed in a similar fashion.
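
The idea of overlapping a communication with unrelated calculation can be illustrated with a minimal sketch. It is not taken from the thesis or from the Computer Aided Parallelisation Tools: `fetch_halo`, `smooth_overlapped` and `smooth_blocking` are hypothetical names, the "communication" is simulated with a sleep on a worker thread, and a periodic one-dimensional three-point stencil is assumed. Interior points are updated while the halo transfer is in flight, and the program waits only when the boundary points need the halo values.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_halo(block):
    """Hypothetical stand-in for a halo exchange with neighbouring processes."""
    time.sleep(0.01)  # simulated start-up latency plus data transfer time
    # Assumption: periodic wrap-around, so the halos are the opposite edges.
    return block[-1], block[0]  # (left halo, right halo)


def smooth_blocking(block):
    """Reference version: wait for the halo, then update every point."""
    left, right = fetch_halo(block)
    padded = [left] + list(block) + [right]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(block) + 1)]


def smooth_overlapped(block):
    """Start the exchange, update interior points, then finish the edges."""
    n = len(block)  # assumes n >= 2
    out = [0.0] * n
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_halo, block)  # communication starts here
        for i in range(1, n - 1):  # interior points need no halo data
            out[i] = (block[i - 1] + block[i] + block[i + 1]) / 3.0
        left, right = pending.result()  # wait only when the halo is needed
    out[0] = (left + block[0] + block[1]) / 3.0
    out[-1] = (block[-2] + block[-1] + right) / 3.0
    return out
```

Both functions produce the same values; the overlapped version merely hides the simulated transfer time behind the interior update. In a real distributed memory code the same pattern would typically use non-blocking sends and receives rather than a thread.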

    Quick and practical run-time evaluation of multiple program optimizations

    This article aims to make iterative optimization practical and usable by speeding up the evaluation of a large range of optimizations. Instead of using a full run to evaluate a single program optimization, we take advantage of periods of stable performance, called phases. For that purpose, we propose a low-overhead phase detection scheme geared toward fast optimization space pruning, using code instrumentation and versioning implemented in a production compiler. Our approach is driven by simplicity and practicality. We show that a simple phase detection scheme can be sufficient for optimization space pruning. We also show that it is possible to search for complex optimizations at run-time without resorting to sophisticated dynamic compilation frameworks. Beyond iterative optimization, our approach also enables one to quickly design self-tuned applications. Considering 5 representative SpecFP2000 benchmarks, our approach speeds up iterative search for the best program optimizations by a factor of 32 to 962. Phase prediction is 99.4% accurate on average, with an overhead of only 2.6%. The resulting self-tuned implementations bring an average speed-up of 1.4.
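
As a rough illustration of evaluating several code versions inside one stable phase, the sketch below waits until the last few timings of a baseline version agree, then times each candidate once and re-checks the baseline to confirm the phase has not ended. This is an assumption-laden toy, not the article's scheme: `is_stable`, `best_version_in_phase`, the window size and the tolerance are all hypothetical, and `run` stands for any caller-supplied function that executes one instrumented invocation of a named version and returns its time.

```python
def is_stable(window, tolerance):
    """True when every timing lies within `tolerance` of the window mean."""
    mean = sum(window) / len(window)
    return all(abs(t - mean) <= tolerance * mean for t in window)


def best_version_in_phase(run, baseline, candidates,
                          window=3, tolerance=0.05, max_calls=100):
    """Return the fastest version measured within one stable phase, or None."""
    history = []
    for _ in range(max_calls):
        history.append(run(baseline))
        if len(history) >= window and is_stable(history[-window:], tolerance):
            break  # a stable phase has been detected
    else:
        return None  # no stable phase within the call budget

    reference = sum(history[-window:]) / window
    timings = {baseline: reference}
    for version in candidates:
        timings[version] = run(version)
        # Re-run the baseline: if it drifted, the phase ended and the
        # measurements are no longer comparable.
        if abs(run(baseline) - reference) > tolerance * reference:
            return None
    return min(timings, key=timings.get)
```

Because each candidate is judged on a single invocation rather than a full program run, many versions can be compared during one execution, which is the source of the speed-up in search time that the abstract reports.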

    A Language for Specifying Compiler Optimizations for Generic Software


    An Active-Library Based Investigation into the Performance Optimisation of Linear Algebra and the Finite Element Method

    In this thesis, I explore an approach called "active libraries". These are libraries that take part in their own optimisation, enabling both high-performance code and the presentation of intuitive abstractions. I investigate the use of active libraries in two domains: first, dense and sparse linear algebra, particularly the solution of linear systems of equations; second, the specification and solution of finite element problems. Extending my earlier (MEng) thesis work, I describe the modifications to my linear algebra library "Desola" required to perform sparse-matrix code generation. I show that optimisations easily applied in the dense case using code transformation must be applied at a higher level of abstraction in the sparse case. I present performance results for sparse linear system solvers generated using Desola and compare against an implementation using the Intel Math Kernel Library. I also present improved dense linear-algebra performance results. Next, I explore the active-library approach by developing a finite element library that captures runtime representations of basis functions, variational forms and sequences of operations between discretised operators and fields. Using captured representations of variational forms and basis functions, I demonstrate optimisations to cell-local integral assembly that this approach enables, and compare against the state of the art. As part of my work on optimising local assembly, I extend the work of Hosangadi et al. on common sub-expression elimination and factorisation of polynomials. I improve the weight function presented by Hosangadi et al., increasing the number of factorisations found. I present an implementation of an optimised branch-and-bound algorithm inspired by reformulating the original matrix-covering problem as a maximal graph biclique search problem. I evaluate the algorithm's effectiveness on the expressions generated by our finite element solver.
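
The notion of a library that captures a runtime representation of the operations requested of it can be sketched in a few lines. The classes below are hypothetical and are not Desola's API: operators build an expression tree instead of computing immediately, and `evaluate` then walks the whole expression in a single pass over the indices, so no intermediate vector is allocated for sub-expressions.

```python
class Expr:
    """Base class: operators record the computation instead of running it."""

    def __add__(self, other):
        return Add(self, other)

    def __rmul__(self, scalar):
        return Scale(scalar, self)


class Vec(Expr):
    def __init__(self, data):
        self.data = list(data)

    def size(self):
        return len(self.data)

    def element(self, i):
        return self.data[i]


class Add(Expr):
    def __init__(self, left, right):
        if left.size() != right.size():
            raise ValueError("operands must have the same length")
        self.left, self.right = left, right

    def size(self):
        return self.left.size()

    def element(self, i):
        return self.left.element(i) + self.right.element(i)


class Scale(Expr):
    def __init__(self, factor, operand):
        self.factor, self.operand = factor, operand

    def size(self):
        return self.operand.size()

    def element(self, i):
        return self.factor * self.operand.element(i)


def evaluate(expr):
    """Force a captured expression with one fused loop over the indices."""
    return Vec(expr.element(i) for i in range(expr.size()))
```

For example, `evaluate(2.0 * x + y)` computes each output element directly from `x` and `y`. A real active library would go further and generate or transform code from the captured tree, which this sketch does not attempt.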

    Hybrid eager and lazy evaluation for efficient compilation of Haskell

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (p. 208-220). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. The advantage of a non-strict, purely functional language such as Haskell lies in its clean equational semantics. However, lazy implementations of Haskell fall short: they cannot express tail recursion gracefully without annotation. We describe resource-bounded hybrid evaluation, a mixture of strict and lazy evaluation, and its realization in Eager Haskell. From the programmer's perspective, Eager Haskell is simply another implementation of Haskell with the same clean equational semantics. Iteration can be expressed using tail recursion, without the need to resort to program annotations. Under hybrid evaluation, computations are ordinarily executed in program order just as in a strict functional language. When particular stack, heap, or time bounds are exceeded, suspensions are generated for all outstanding computations. These suspensions are re-started in a demand-driven fashion from the root. The Eager Haskell compiler translates λ_C, the compiler's intermediate representation, to efficient C code. We use an equational semantics for λ_C to develop simple correctness proofs for program transformations, and connect actions in the run-time system to steps in the hybrid evaluation strategy. The focus of compilation is efficiency in the common case of straight-line execution; the handling of non-strictness and suspension are left to the run-time system. Several additional contributions have resulted from the implementation of hybrid evaluation. Eager Haskell is the first eager compiler to use a call stack. Our generational garbage collector uses this stack as an additional predictor of object lifetime. Objects above a stack watermark are assumed to be likely to die; we avoid promoting them. Those below are likely to remain untouched and therefore are good candidates for promotion. To avoid eagerly evaluating error checks, they are compiled into special bottom thunks, which are treated specially by the run-time system. The compiler identifies error handling code using a mixture of strictness and type information. This information is also used to avoid inlining error handlers, and to enable aggressive program transformation in the presence of error handling. By Jan-Willem Maessen. Ph.D.
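
A toy model may help show how resource-bounded evaluation differs from purely strict or purely lazy evaluation. The sketch below is a heavy simplification and is not Eager Haskell's run-time system: `Evaluator`, `Suspension`, `ints_from` and `take` are hypothetical names, the only resource tracked is a step budget, and a suspension is created just for the call that hits the bound rather than for all outstanding computations. Calls run eagerly in program order while the budget lasts; afterwards they are suspended and restarted on demand with a fresh budget.

```python
class Suspension:
    """A deferred call, created when the evaluation budget runs out."""

    def __init__(self, fn, args):
        self.fn, self.args = fn, args
        self.done = False
        self.value = None


class Evaluator:
    def __init__(self, budget):
        self.budget = budget      # eager steps allowed per burst
        self.remaining = budget

    def call(self, fn, *args):
        """Evaluate eagerly while budget remains, otherwise suspend."""
        if self.remaining <= 0:
            return Suspension(fn, args)
        self.remaining -= 1
        return fn(self, *args)

    def force(self, value):
        """Demand a value, restarting suspensions with a fresh budget."""
        while isinstance(value, Suspension):
            if not value.done:
                self.remaining = self.budget
                value.value = value.fn(self, *value.args)
                value.done = True
            value = value.value
        return value


def ints_from(ev, n):
    """Conceptually infinite list n, n+1, ... built as nested (head, tail) pairs."""
    return (n, ev.call(ints_from, n + 1))


def take(ev, k, cell):
    """Collect the first k elements, forcing suspended tails as needed."""
    out = []
    while len(out) < k:
        head, cell = ev.force(cell)
        out.append(head)
    return out
```

Under strict evaluation `ints_from` would never return; under this bounded scheme a few cells are built eagerly per burst and the rest are produced only when `take` demands them, so the non-strict meaning of the program is preserved.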

    On the classification and evaluation of prefetching schemes

    Abstract available: p. [2

    Workshop - Systems Design Meets Equation-based Languages


    Directive-based Approach to Heterogeneous Computing

    The world of high-performance computing is undergoing major changes that markedly increase its complexity. The inability of single-processor, or even multiprocessor, systems to sustain the growth in computing power needed by the scientific community has forced the emergence of massively parallel hardware architectures and of specialised units for performing particular operations. GPUs (graphics processing units) are a good example of this kind of device. Traditionally dedicated to graphics programming, these devices have recently become an ideal platform for implementing massively parallel computations. The combination of GPUs for compute-intensive tasks with multiprocessors for tasks that are less intensive but have more complex control logic has become in recent years one of the most common platforms for low-cost scientific computing, since the power delivered can in many cases match that of small or medium-sized clusters, at a notably lower initial and maintenance cost. Incorporating GPUs into clusters has also made it possible to increase the capacity of those clusters. However, the complexity of GPU programming, and of integrating it with existing codes, makes it very difficult to introduce these technologies among less expert users. In this thesis we explore the use of directive-based programming models for this kind of environment (multi-core, many-core, GPUs and clusters), where the average user sees their productivity notably reduced by the difficulty of programming. To explore the best way of applying directives in these environments, we have developed a set of highly flexible software tools (a compiler and a runtime) that make it possible to explore various techniques with relatively little effort. The arrival of the OpenACC directive-based programming standard allowed us to demonstrate the capability of these tools by producing an experimental implementation of the standard (accULL) in very little time and with far from negligible performance. The computational results provided allow us to demonstrate: (a) the reduction in programming effort that directive-based approaches allow, (b) the capability and flexibility of the tools designed during this thesis for exploring these approaches, and finally (c) the potential for future development of accULL as an experimental OpenACC tool, based on the performance currently obtained compared with that of other commercial approaches.

    Just-in-time Hardware generation for abstracted reconfigurable computing

    This thesis addresses the use of reconfigurable hardware in computing platforms, in order to harness the performance benefits of dedicated hardware whilst maintaining the flexibility associated with software. Although the reconfigurable computing concept is not new, the low-level nature of the supporting tools normally used, together with the consequent limited level of abstraction and resultant lack of backwards compatibility, has prevented the widespread adoption of this technology. In addition, bandwidth and architectural limitations have seriously constrained the potential improvements in performance. A review of existing approaches and tool flows is conducted to highlight the current problems being faced in this field. The objective of the work presented in this thesis is to introduce a radically new approach to reconfigurable computing tool flows. The runtime-based tool flow introduces complete abstraction between the application developer and the underlying hardware. This new technique eliminates the ease-of-use and backwards compatibility issues that have plagued the reconfigurable computing concept, and could pave the way for viable mainstream reconfigurable computing platforms. An easy-to-use, cycle-accurate behavioural modelling system is also presented, which was used extensively during the early exploration of new concepts and architectures. Some performance improvements produced by the new reconfigurable computing tool flow, when applied to both a MIPS-based embedded platform and the Cray XD1, are also presented. These results are then analysed, and the hardware and software factors affecting the performance increases obtained are discussed, together with potential techniques that could be used to further increase the performance of the system. Lastly, a heterogeneous computing concept is proposed, in which a computer system contains multiple types of computational resource (e.g. DSPs, CPUs, FPGAs), each with its own strengths and weaknesses. A revolutionary new method of fully exploiting the potential of such a system, whilst maintaining scalability, backwards compatibility, and ease of use, is also presented.