
    Data layout types : a type-based approach to automatic data layout transformations for improved SIMD vectorisation

    The increasing complexity of modern hardware requires sophisticated programming techniques for programs to run efficiently. At the same time, the increased power of modern hardware enables more advanced analyses to be included in compilers. This thesis focuses on one particular optimisation technique that improves utilisation of vector units. The foundation of this technique is the ability to choose memory mappings for the data structures of a given program. Programming languages usually use a fixed layout for logical data structures in physical memory. Such a static mapping often has a negative effect on the usability of vector units. In this thesis we consider a compiler for a programming language that allows every data structure in a program to have its own data layout. We make sure that data layouts across the program are sound and, most importantly, we solve the problem of automatic data layout reconstruction. To do this consistently, we formulate it as a type inference problem, where a type encodes a data layout for a given structure as well as the implied program transformations. We prove that type-implied transformations preserve the semantics of the original programs, and we demonstrate significant performance improvements when targeting SIMD-capable architectures.
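To make the idea of a per-structure data layout concrete, here is a minimal Python sketch contrasting an array-of-structures layout with a structure-of-arrays layout for the same logical data. It is an illustration only: the function names and the three-field point record are assumptions, and the thesis derives such layouts automatically through type inference in a compiler rather than by hand as done here.

```python
from array import array

def aos_to_soa(points):
    """Convert a list of (x, y, z) records (array-of-structures)
    into three contiguous arrays (structure-of-arrays)."""
    xs = array("d", (p[0] for p in points))
    ys = array("d", (p[1] for p in points))
    zs = array("d", (p[2] for p in points))
    return xs, ys, zs

def soa_to_aos(xs, ys, zs):
    """Rebuild the list of (x, y, z) records from the three arrays."""
    return list(zip(xs, ys, zs))

def scale_x_aos(points, k):
    # AoS: every x is separated from the next x by a y and a z,
    # so a vector unit would need strided loads to gather the x values.
    return [(k * x, y, z) for (x, y, z) in points]

def scale_x_soa(xs, ys, zs, k):
    # SoA: all x values are adjacent, so the update walks one
    # contiguous array with unit stride.
    return array("d", (k * x for x in xs)), ys, zs
```

Both versions compute the same logical result; only the physical arrangement of the data differs, which is the property a layout-changing transformation has to preserve.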

    An investigation of the performance portability of OpenCL

    This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation

    Mini-app driven optimisation of inertial confinement fusion codes

    In September 2013, the large laser-based inertial confinement fusion device housed in the National Ignition Facility at Lawrence Livermore National Laboratory was widely acclaimed to have achieved a milestone in controlled fusion – successfully initiating a reaction that resulted in the release of more energy than the fuel absorbed. Despite this success, we remain some distance from being able to create controlled, self-sustaining fusion reactions. Inertial Confinement Fusion (ICF) represents one leading design for the generation of energy by nuclear fusion. Since the 1950s, ICF has been supported by computer simulations, providing the mathematical foundations for pulse shaping, lasers, and material shells needed to ensure effective and efficient implosion. The research presented here focuses on one such simulation code, EPOCH, a fully relativistic particle-in-cell plasma physics code, developed by a leading network of over 30 UK researchers. A significant challenge in developing large codes like EPOCH is maintaining effective scientific delivery on successive generations of high-performance computing architecture. To support this process, we adopt the use of mini-applications – small code proxies that encapsulate important computational properties of their larger parent counterparts. Through the development of a mini-app for EPOCH (called miniEPOCH), we investigate known timestep scaling issues within EPOCH and explore possible optimisations: (i) employing loop fission to increase levels of vectorisation; (ii) enforcing particle ordering to allow the exploitation of domain-specific knowledge; and (iii) changing the underlying data storage to improve memory locality. When applied to EPOCH, these improvements represent a 2.02× speed-up in the core algorithm and a 1.55× speed-up in overall application runtime when executed on EPCC’s Cray XC30 ARCHER platform.
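Loop fission, the first optimisation listed above, can be sketched in a few lines of Python. The loop body, array names and arithmetic below are hypothetical and are not taken from EPOCH or miniEPOCH; the sketch only shows the shape of the transformation. Python itself does not auto-vectorise, so the benefit would appear in a compiled language where the first split loop has no loop-carried dependence.

```python
def fused_loop(pos, vel, dt):
    """One loop mixing an independent per-element update with a
    running sum that carries a dependence between iterations."""
    n = len(pos)
    new_pos = [0.0] * n
    running = [0.0] * n
    total = 0.0
    for i in range(n):
        new_pos[i] = pos[i] + vel[i] * dt  # independent of other iterations
        total += vel[i] * vel[i]           # depends on the previous iteration
        running[i] = total
    return new_pos, running

def fissioned_loops(pos, vel, dt):
    """The same work split into two loops (loop fission)."""
    n = len(pos)
    # Loop 1: no loop-carried dependence, so a compiler could vectorise it.
    new_pos = [0.0] * n
    for i in range(n):
        new_pos[i] = pos[i] + vel[i] * dt
    # Loop 2: the recurrence is isolated and no longer blocks loop 1.
    running = [0.0] * n
    total = 0.0
    for i in range(n):
        total += vel[i] * vel[i]
        running[i] = total
    return new_pos, running
```

The two functions perform the same floating-point operations in the same order per element, so they return identical results; fission changes only how the work is grouped into loops.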

    Big data, modeling, simulation, computational platform and holistic approaches for the fourth industrial revolution

    Traditionally, the mathematical process starts by proving the existence and uniqueness of a solution, using theorems, corollaries, lemmas and propositions applied to a simple model. Existence and uniqueness are guaranteed over an infinite set of candidate solutions, but implementation has been limited to small-scale simulation on a single desktop CPU, where accuracy, consistency and stability are easily controlled because the data scale is small. The fourth industrial revolution (4iR), however, can be described as the advent of cyber-physical systems involving entirely new capabilities for researchers and machines (Xing, 2017). From a numerical perspective, 4iR requires a transition from simple models and small-scale simulation to complex models and big data in order to visualise real-world applications digitally. Big data analytics and classification are therefore a way of addressing these limitations. Some applications of 4iR highlight extensions in terms of models, derivatives and discretization, dimensions of space and time, behavior of initial and boundary conditions, grid generation, data extraction, numerical methods and high-resolution image processing. In statistics, big data depends on data growth; from a numerical perspective, a few classification strategies will be investigated, each dealing with a specific classifier tool. This paper investigates a conceptual framework for big data classification, covering the governing mathematical model, selection of the superior numerical method, handling of large sparse simulations and parallel computing on high performance computing (HPC) platforms.
The conceptual framework will benefit big data providers, algorithm providers and system analysts by classifying and recommending specific strategies for generating, handling and analyzing big data. All of these perspectives take a holistic view of technology, and the conceptual framework is described in holistic terms. 4iR makes it possible to take a holistic approach to explaining the importance of big data, complex modeling, large sparse simulation and high performance computing platforms. Numerical analysis and parallel performance evaluation are the indicators used to investigate the performance of the classification strategy. This research will help to obtain accurate decisions, predictions and trends in practice on how to obtain approximate solutions for science and engineering applications. In conclusion, the classification strategies support generating a fine granular mesh and identifying the root causes of failures and issues in real-time solutions. Furthermore, big data-driven development and the evolution of data transfer towards high-speed technology transfer boost economic and social development for 4iR (Xing, 2017; Marwala et al., 2017).

    A metadata-enhanced framework for high performance visual effects

    This thesis is devoted to reducing the interactive latency of image processing computations in visual effects. Film and television graphic artists depend upon low-latency feedback to receive a visual response to changes in effect parameters. We tackle latency with a domain-specific optimising compiler which leverages high-level program metadata to guide key computational and memory hierarchy optimisations. This metadata encodes static and dynamic information about data dependence and patterns of memory access in the algorithms constituting a visual effect – features that are typically difficult to extract through program analysis – and presents it to the compiler in an explicit form. By using domain-specific information as a substitute for program analysis, our compiler is able to target a set of complex source-level optimisations that a vendor compiler does not attempt, before passing the optimised source to the vendor compiler for lower-level optimisation. Three key metadata-supported optimisations are presented. The first is an adaptation of space and schedule optimisation – based upon well-known compositions of the loop fusion and array contraction transformations – to the dynamic working sets and schedules of a runtime-parameterised visual effect. This adaptation sidesteps the costly solution of runtime code generation by specialising static parameters in an offline process and exploiting dynamic metadata to adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second optimisation comprises a set of transformations to generate SIMD ISA-augmented source code. Our approach differs from autovectorisation by using static metadata to identify parallelism, in place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable parameters for optimal aligned memory access. The third optimisation comprises a related set of transformations to generate code for SIMT architectures, such as GPUs.
Static dependence metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads. Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by identifying inter-thread and intra-core data sharing opportunities in memory access metadata. A detailed performance analysis of these optimisations is presented for two industrially developed visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations of these two effects. Programmability is enhanced by automating the generation of SIMD and SIMT implementations from a single programmer-managed scalar representation
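The composition of loop fusion and array contraction mentioned in this abstract can be illustrated with a small Python sketch. The two-stage pixel pipeline, its gain and bias parameters and the function names are assumptions chosen for illustration; they are not the visual effects or the compiler described in the thesis.

```python
def unfused(pixels, gain, bias):
    """Two separate passes with a full-size temporary between them."""
    tmp = [gain * p for p in pixels]          # stage 1 fills a whole array
    return [min(t + bias, 1.0) for t in tmp]  # stage 2 reads it back

def fused_contracted(pixels, gain, bias):
    """Both stages fused into one loop; the temporary array is
    contracted to a single scalar that lives for one iteration."""
    out = [0.0] * len(pixels)
    for i, p in enumerate(pixels):
        t = gain * p                  # contracted temporary
        out[i] = min(t + bias, 1.0)   # stage 2 consumes it immediately
    return out
```

Fusion lets each intermediate value be consumed as soon as it is produced, and contraction then shrinks the working set from one temporary per pixel to one scalar, which is the kind of saving that matters for the memory hierarchy.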

    Evaluating the performance of legacy applications on emerging parallel architectures

    The gap between a supercomputer's theoretical maximum ("peak") floating-point performance and that actually achieved by applications has grown wider over time. Today, a typical scientific application achieves only 5–20% of any given machine's peak processing capability, and this gap leaves room for significant improvements in execution times. This problem is most pronounced for modern "accelerator" architectures – collections of hundreds of simple, low-clocked cores capable of executing the same instruction on dozens of pieces of data simultaneously. This is a significant change from the low number of high-clocked cores found in traditional CPUs, and effective utilisation of accelerators typically requires extensive code and algorithmic changes. In many cases, the best way in which to map a parallel workload to these new architectures is unclear. The principal focus of the work presented in this thesis is the evaluation of emerging parallel architectures (specifically, modern CPUs, GPUs and Intel MIC) for two benchmark codes – the LU benchmark from the NAS Parallel Benchmark Suite and Sandia's miniMD benchmark – which exhibit complex parallel behaviours that are representative of many scientific applications. Using combinations of low-level intrinsic functions, OpenMP, CUDA and MPI, we demonstrate performance improvements of up to 7x for these workloads. We also detail a code development methodology that permits application developers to target multiple architecture types without maintaining completely separate implementations for each platform. Using OpenCL, we develop performance portable implementations of the LU and miniMD benchmarks that are faster than the original codes, and at most 2x slower than versions highly tuned for particular hardware.
Finally, we demonstrate the importance of evaluating architectures at scale (as opposed to on single nodes) through performance modelling techniques, highlighting the problems associated with strong-scaling on emerging accelerator architectures

    Performance modelling and optimisation of inertial confinement fusion simulation codes

    Legacy code performance has failed to keep up with that of modern hardware. Many new hardware features remain under-utilised, with the majority of code bases still unable to make use of accelerated or heterogeneous architectures. Code maintainers now accept that they can no longer rely solely on hardware improvements to drive code performance, and that changes at the software engineering level need to be made. The principal focus of the work presented in this thesis is an analysis of the changes legacy Inertial Confinement Fusion (ICF) codes need to make in order to efficiently use current and future parallel architectures. We discuss the process of developing a performance model, and demonstrate the ability of such a model to make accurate predictions about code performance for code variants on a range of architectures. We build on the knowledge gained from such a process, and examine how Particle-in-Cell (PIC) codes must change in order to move towards the levels of portable and future-proof performance needed to leverage the capabilities of modern hardware. As part of this investigation, we present an OpenCL port of the legacy code EPOCH, as well as a fully featured mini-app representing EPOCH. Finally, as a direct consequence of these investigations, we apply these performance optimisations to the production version of EPOCH, culminating in a speedup of over 2x for the core algorithm