277 research outputs found

    Developing performance-portable molecular dynamics kernels in Open CL

    Get PDF
    This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia’s miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs. We demonstrate that the performance bottlenecks of miniMD’s short-range force calculation kernel are the same across these architectures, and detail a number of platform- agnostic optimisations that improve its performance by at least 2x on all hardware considered. Our complete code is shown to be 1.7x faster than the original miniMD, and at most 2x slower than implementations individually hand-tuned for a specific architecture

    Towards a portable and future-proof particle-in-cell plasma physics code

    Get PDF
    We present the first reported OpenCL implementation of EPOCH3D, an extensible particle-in-cell plasma physics code developed at the University of Warwick. We document the challenges and successes of this porting effort, and compare the performance of our implementation executing on a wide variety of hardware from multiple vendors. The focus of our work is on understanding the suitability of existing algorithms for future accelerator-based architectures, and identifying the changes necessary to achieve performance portability for particle-in-cell plasma physics codes. We achieve good levels of performance with limited changes to the algorithmic behaviour of the code. However, our results suggest that a fundamental change to EPOCH3D’s current accumulation step (and its dependency on atomic operations) is necessary in order to fully utilise the massive levels of parallelism supported by emerging parallel architectures

    Design and optimization of a portable LQCD Monte Carlo code using OpenACC

    Full text link
    The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core GPUs, exploiting aggressive data-parallelism and delivering higher performances for streaming computing applications. In this scenario, code portability (and performance portability) become necessary for easy maintainability of applications; this is very relevant in scientific computing where code changes are very frequent, making it tedious and prone to error to keep different code versions aligned. In this work we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance-portability can be reached.Comment: 26 pages, 2 png figures, preprint of an article submitted for consideration in International Journal of Modern Physics

    Evaluating the performance of legacy applications on emerging parallel architectures

    Get PDF
    The gap between a supercomputer's theoretical maximum (\peak") oatingpoint performance and that actually achieved by applications has grown wider over time. Today, a typical scientific application achieves only 5{20% of any given machine's peak processing capability, and this gap leaves room for significant improvements in execution times. This problem is most pronounced for modern \accelerator" architectures { collections of hundreds of simple, low-clocked cores capable of executing the same instruction on dozens of pieces of data simultaneously. This is a significant change from the low number of high-clocked cores found in traditional CPUs, and effective utilisation of accelerators typically requires extensive code and algorithmic changes. In many cases, the best way in which to map a parallel workload to these new architectures is unclear. The principle focus of the work presented in this thesis is the evaluation of emerging parallel architectures (specifically, modern CPUs, GPUs and Intel MIC) for two benchmark codes { the LU benchmark from the NAS Parallel Benchmark Suite and Sandia's miniMD benchmark { which exhibit complex parallel behaviours that are representative of many scientific applications. Using combinations of low-level intrinsic functions, OpenMP, CUDA and MPI, we demonstrate performance improvements of up to 7x for these workloads. We also detail a code development methodology that permits application developers to target multiple architecture types without maintaining completely separate implementations for each platform. Using OpenCL, we develop performance portable implementations of the LU and miniMD benchmarks that are faster than the original codes, and at most 2x slower than versions highly-tuned for particular hardware. Finally, we demonstrate the importance of evaluating architectures at scale (as opposed to on single nodes) through performance modelling techniques, highlighting the problems associated with strong-scaling on emerging accelerator architectures

    Type-driven automated program transformations and cost modelling for optimising streaming programs on FPGAs

    Get PDF
    In this paper we present a novel approach to program optimisation based on compiler-based type-driven program transformations and a fast and accurate cost/performance model for the target architecture. We target streaming programs for the problem domain of scientific computing, such as numerical weather prediction. We present our theoretical framework for type-driven program transformation, our target high-level language and intermediate representation languages and the cost model and demonstrate the effectiveness of our approach by comparison with a commercial toolchain

    A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Applications

    Get PDF
    Heterogeneous High-Performance Computing (HPC) platforms present a significant programming challenge, especially because the key users of HPC resources are scientists, not parallel programmers. We contend that compiler technology has to evolve to automatically create the best program variant by transforming a given original program. We have developed a novel methodology based on type transformations for generating correct-by-construction design variants, and an associated light-weight cost model for evaluating these variants for implementation on FPGAs. In this paper we present a key enabler of our approach, the cost model. We discuss how we are able to quickly derive accurate estimates of performance and resource-utilization from the design’s representation in our intermediate language. We show results confirming the accuracy of our cost model by testing it on three different scientific kernels. We conclude with a case-study that compares a solution generated by our framework with one from a conventional high-level synthesis tool, showing better performance and power-efficiency using our cost model based approach

    Towards Energy Efficiency in Heterogeneous Processors: Findings on Virtual Screening Methods

    Get PDF
    The integration of the latest breakthroughs in computational modeling and high performance computing (HPC) has leveraged advances in the fields of healthcare and drug discovery, among others. By integrating all these developments together, scientists are creating new exciting personal therapeutic strategies for living longer that were unimaginable not that long ago. However, we are witnessing the biggest revolution in HPC in the last decade. Several graphics processing unit architectures have established their niche in the HPC arena but at the expense of an excessive power and heat. A solution for this important problem is based on heterogeneity. In this paper, we analyze power consumption on heterogeneous systems, benchmarking a bioinformatics kernel within the framework of virtual screening methods. Cores and frequencies are tuned to further improve the performance or energy efficiency on those architectures. Our experimental results show that targeted low‐cost systems are the lowest power consumption platforms, although the most energy efficient platform and the best suited for performance improvement is the Kepler GK110 graphics processing unit from Nvidia by using compute unified device architecture. Finally, the open computing language version of virtual screening shows a remarkable performance penalty compared with its compute unified device architecture counterpart.Ingeniería, Industria y Construcció
    corecore