
    Comparison of Parallelisation Approaches, Languages, and Compilers for Unstructured Mesh Algorithms on GPUs

    Efficiently exploiting GPUs is increasingly essential in scientific computing, as many current and upcoming supercomputers are built around them. To facilitate this, there are a number of programming approaches, such as CUDA, OpenACC and OpenMP 4, supporting different programming languages (mainly C/C++ and Fortran). There are also several compiler suites (clang, nvcc, PGI, XL), each supporting different combinations of languages. In this study, we take a detailed look at some of the currently available options, and carry out a comprehensive analysis and comparison using computational loops and applications from the domain of unstructured mesh computations. Beyond runtimes and achieved bandwidth (GB/s), we explore factors that influence performance, such as register counts, occupancy, usage of different memory types, instruction counts, and algorithmic differences. The results show that clang's CUDA compiler frequently outperforms NVIDIA's nvcc, that directive-based approaches suffer performance issues on complex kernels, and that OpenMP 4 support is maturing in clang and XL, currently running around 10% slower than CUDA.
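
    A minimal sketch of the kind of loop being compared, assuming a hypothetical edge-to-node mapping `edge2node` and field names that are not from the paper: the same indirect unstructured-mesh access pattern written once as a CUDA kernel and once with OpenMP 4 target directives.

```cpp
// Illustrative only: the indirect "gather" pattern typical of unstructured-mesh
// loops, expressed in two of the programming approaches compared above.
// edge2node, x and flux are hypothetical names.

__global__ void edge_flux_cuda(int nedges, const int *edge2node,
                               const double *x, double *flux) {
  int e = blockIdx.x * blockDim.x + threadIdx.x;
  if (e < nedges) {
    int a = edge2node[2 * e];      // indirect accesses like these drive the
    int b = edge2node[2 * e + 1];  // register counts and memory behaviour
    flux[e] = x[b] - x[a];         // examined in the study
  }
}

// Directive-based equivalent: the compiler, not the programmer, maps
// iterations to GPU threads, which is where complex kernels can lose
// performance relative to hand-written CUDA.
void edge_flux_omp4(int nedges, int nnodes, const int *edge2node,
                    const double *x, double *flux) {
#pragma omp target teams distribute parallel for \
    map(to: edge2node[0:2*nedges], x[0:nnodes]) map(from: flux[0:nedges])
  for (int e = 0; e < nedges; ++e)
    flux[e] = x[edge2node[2 * e + 1]] - x[edge2node[2 * e]];
}
```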

    Abstractions and performance optimisations for finite element methods

    Finding numerical solutions to partial differential equations (PDEs) is an essential task in scientific computing. In designing software tools for this task, one of the ultimate goals is to balance the needs for generality, ease of use and high performance. Domain-specific systems based on code generation techniques, such as Firedrake, attempt to address this problem with a design consisting of a hierarchy of abstractions, where users specify the mathematical problem via a high-level, descriptive interface, which is progressively lowered through the intermediate abstractions. Well-designed abstraction layers are essential for performing code transformations and optimisations robustly and efficiently, generating high-performance code without user intervention. This thesis discusses several topics in the design of Firedrake's abstraction layers, and demonstrates the benefit of its software architecture with examples of optimising code transformations applied at the appropriate abstraction layers. In particular, we discuss the advantage of describing the local assembly stage of a finite element solver in an intermediate representation based on symbolic tensor algebra. We successfully lift specific loop optimisations, previously implemented by rewriting the ASTs of the local assembly kernels, to this higher-level tensor language, improving compilation speed and optimisation effectiveness. The global assembly phase involves applying local assembly kernels to a collection of entities of an unstructured mesh. We redesign the abstraction to express the global assembly loop nests using tools and concepts based on the polyhedral model. This enables us to implement a cross-element vectorisation algorithm that automatically delivers stable vectorisation performance on CPUs. This abstraction also improves the portability of Firedrake, as we demonstrate by targeting GPU devices transparently from the same software stack.
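
    To make the cross-element vectorisation idea concrete, here is a small sketch (not Firedrake's generated code; the element size, batch width and names are invented): instead of vectorising the short loops inside one element's kernel, elements are processed in batches so that each SIMD lane handles a different element.

```cpp
#include <cstddef>

constexpr int NDOF = 3;  // degrees of freedom per element (e.g. a P1 triangle)
constexpr int VEC  = 4;  // SIMD batch width

// One element at a time: the trip count NDOF is too short to vectorise well.
void assemble_plain(std::size_t nelem, const double *coeff, double *out) {
  for (std::size_t e = 0; e < nelem; ++e)
    for (int i = 0; i < NDOF; ++i)
      out[e * NDOF + i] += 0.5 * coeff[e];  // stand-in for the real kernel body
}

// Cross-element version: the innermost loop runs over a batch of VEC
// elements (data interleaved accordingly), so vectorisation no longer
// depends on the small element size and stays stable across kernels.
void assemble_batched(std::size_t nbatch, const double *coeff, double *out) {
  for (std::size_t b = 0; b < nbatch; ++b)
    for (int i = 0; i < NDOF; ++i)
#pragma omp simd
      for (int v = 0; v < VEC; ++v)
        out[(b * NDOF + i) * VEC + v] += 0.5 * coeff[b * VEC + v];
}
```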

    Productivity, performance, and portability for computational fluid dynamics applications

    Hardware trends over the last decade show increasing complexity and heterogeneity in high performance computing architectures, which presents developers of CFD applications with three key challenges: achieving good performance; portability, so that current and future hardware can be utilised; and doing all this in a productive manner. These three appear to contradict each other when using traditional programming approaches, but in recent years several strategies, such as template libraries and Domain Specific Languages, have emerged as a potential solution: by giving up generality and focusing on a narrower domain of problems, all three can be achieved. This paper gives an overview of the state of the art for delivering performance, portability, and productivity to CFD applications, ranging from high-level libraries that allow the symbolic description of PDEs to low-level techniques that target individual algorithmic patterns. We discuss the advantages and challenges of each approach, and review the performance benchmarking literature that compares implementations across hardware architectures and programming methods, giving an overview of key applications and their comparative performance.
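
    The "give up generality" strategy can be illustrated in a few lines (in the spirit of template-library approaches such as Kokkos or RAJA, but not their actual APIs; all names here are invented): the application expresses only the loop body, and a swappable backend decides how it runs.

```cpp
#include <cstddef>

// Two interchangeable backends for the same parallel-loop abstraction.
struct Serial {
  template <class Body>
  static void parallel_for(std::size_t n, Body body) {
    for (std::size_t i = 0; i < n; ++i) body(i);
  }
};

struct OpenMP {
  template <class Body>
  static void parallel_for(std::size_t n, Body body) {
#pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) body(i);
  }
};

// Application code is written once against the abstraction...
template <class Backend>
void axpy(std::size_t n, double a, const double *x, double *y) {
  Backend::parallel_for(n, [=](std::size_t i) { y[i] += a * x[i]; });
}
// ...and retargeted by choosing a backend: axpy<OpenMP>(n, 2.0, x, y);
```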

    Parallel Optimisations of Perceived Quality Simulations

    Processor architectures have changed significantly, with fast single-core processors replaced by a diverse range of multicore processors. New architectures require code to be executed in parallel to realise these performance gains. This is straightforward for small applications, where analysis and refactoring are simple or existing tools can parallelise automatically; however, the organic growth and large, complicated data structures common in mature industrial applications can make parallelisation more difficult. One such application is studied: a mature Windows C++ application used for the visualisation of Perceived Quality (PQ). PQ simulations enable the visualisation of how manufacturing variations affect the look of the final product. The application is widely used, but suffers from performance issues, and previous parallelisation attempts have failed. The issues associated with parallelising a mature industrial application are investigated, and a methodology to investigate, analyse and evaluate the available methods and tools is produced. The shortfalls of these methods and tools are identified, and the ways they were overcome are explained. The parallel version of the software is evaluated for performance, with case studies centring on the significant use cases of the application to help understand the impact on the user. Automated compilers provided no parallelism, while manual parallelisation using OpenMP required significant refactoring, and a number of data dependency issues left some code serialised. Performance scaled with the number of physical cores for certain problems, but the unresolved bottlenecks produced mixed results for users: those using the application for verification benefited, while those in early design stages did not. Without tools to aid the analysis of complex data structures, parallelism could remain out of reach for industrial applications. Methods used successfully here, such as code isolation and serialisation, could be applied effectively by such tools.
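
    A toy pair of loops (not code from the PQ application) shows the kind of data dependency issue that left parts of the code serialised: the first loop parallelises cleanly with an OpenMP reduction, while the second carries a dependency between iterations and produces wrong results if naively parallelised.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: safe to parallelise with a reduction.
double total_deviation(const std::vector<double> &d) {
  double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
  for (long i = 0; i < (long)d.size(); ++i)
    sum += d[i];
  return sum;
}

// Loop-carried dependency: each iteration reads the value the previous
// iteration wrote, so this must stay serial or be restructured first.
void smooth_in_place(std::vector<double> &d) {
  for (std::size_t i = 1; i < d.size(); ++i)
    d[i] = 0.5 * (d[i] + d[i - 1]);
}
```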

    Data structure abstraction and parallelisation of multi-material hydrodynamic applications

    The aim of High Performance Computing (HPC) is to achieve the best performance for an application, in order to execute it as quickly as possible. This has often been achieved through iterative improvements in Central Processing Unit (CPU) technology: including more circuitry by shrinking components or enlarging the processor; making the processor run faster by increasing the clock speed; or increasing the amount of parallelism. Recently, there has been increasing diversity in how HPC systems achieve these performance improvements: the use of Graphics Processing Unit (GPU) processors has become more common, and there has been growing interest in high bandwidth memory. This has led to a need for performance-portable code, so that programs may be written once but compiled and run on a range of differing systems with minimal impact on performance. As memory becomes a major focus, so too should the data structure used by an application: a poorly designed data structure can hurt performance. It is key, however, that this is done in a performance-portable way, where the data structure can be altered and optimised without the application being rewritten. To this end, a data structure abstraction library called Warwick Data Store (WDS) was developed. The library provides objects that give access to data without the application needing to know the details of the underlying data structure, together with additional functionality that would otherwise be difficult and time-consuming to implement, such as converting a variable or a collection of variables from one data structure to another. The performance impact of the library is shown to be minimal, especially at larger problem sizes. Because of the library's flexibility, data structures for specialised cases can be implemented in WDS without affecting the performance of other data structures, and the overhead of these specialised data structures is likewise shown to be minimal.
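
    A minimal sketch of the abstraction idea (illustrative only, not the actual WDS API): application code accesses fields through a small interface, and the storage layout behind it, array-of-structs or struct-of-arrays, can be swapped without rewriting the application.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs layout.
struct AoS {
  struct Cell { double density, energy; };
  std::vector<Cell> cells;
  double &density(std::size_t i) { return cells[i].density; }
  double &energy(std::size_t i)  { return cells[i].energy; }
};

// Struct-of-arrays layout: same interface, different memory behaviour.
struct SoA {
  std::vector<double> densities, energies;
  double &density(std::size_t i) { return densities[i]; }
  double &energy(std::size_t i)  { return energies[i]; }
};

// Application code is written once against the accessor interface;
// the layout becomes a choice that can be tuned per platform.
template <class Store>
void apply_eos(Store &s, std::size_t n, double gamma) {
  for (std::size_t i = 0; i < n; ++i)
    s.energy(i) = s.density(i) * (gamma - 1.0);  // placeholder physics
}
```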