260 research outputs found
Developing performance-portable molecular dynamics kernels in Open CL
This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia’s miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs.
We demonstrate that the performance bottlenecks of miniMD’s short-range force calculation kernel are the same across these architectures, and detail a number of platform- agnostic optimisations that improve its performance by at least 2x on all hardware considered. Our complete code is shown to be 1.7x faster than the original miniMD, and at most 2x slower than implementations individually hand-tuned for a specific architecture
WMTrace : a lightweight memory allocation tracker and analysis framework
The diverging gap between processor and memory performance has been a well discussed aspect of computer architecture literature for some years. The use of multi-core processor designs has, however, brought new problems to the design of memory architectures - increased core density without matched improvement in memory capacity is reduc- ing the available memory per parallel process. Multiple cores accessing memory simultaneously degrades performance as a result of resource con- tention for memory channels and physical DIMMs. These issues combine to ensure that memory remains an on-going challenge in the design of parallel algorithms which scale. In this paper we present WMTrace, a lightweight tool to trace and analyse memory allocation events in parallel applications. This tool is able to dynamically link to pre-existing application binaries requiring no source code modification or recompilation. A post-execution analysis stage enables in-depth analysis of traces to be performed allowing memory allocations to be analysed by time, size or function. The second half of this paper features a case study in which we apply WMTrace to five parallel scientific applications and benchmarks, demonstrating its effectiveness at recording high-water mark memory consumption as well as memory use per-function over time. An in-depth analysis is provided for an unstructured mesh benchmark which reveals significant memory allocation imbalance across its participating processes
Parallelising wavefront applications on general-purpose GPU devices
Pipelined wavefront applications form a large portion of the high performance scientific computing workloads at supercomputing centres. This paper investigates the viability of graphics processing units (GPUs) for the acceleration of these codes, using NVIDIA's Compute Unified Device Architecture (CUDA). We identify the optimisations suitable for this new architecture and quantify the characteristics of those wavefront codes that are likely to experience speedups
Experiences with porting and modelling wavefront algorithms on many-core architectures
We are currently investigating the viability of many-core architectures for the acceleration of wavefront applications and this report focuses on graphics processing units (GPUs) in particular. To this end, we have implemented NASA’s LU benchmark – a real world production-grade application – on GPUs employing NVIDIA’s Compute Unified Device Architecture (CUDA).
This GPU implementation of the benchmark has been used to investigate the performance of a selection of GPUs, ranging from workstation-grade commodity GPUs to the HPC "Tesla” and "Fermi” GPUs. We have also compared the performance of the GPU solution at scale to that of traditional high perfor- mance computing (HPC) clusters based on a range of multi- core CPUs from a number of major vendors, including Intel (Nehalem), AMD (Opteron) and IBM (PowerPC).
In previous work we have developed a predictive “plug-and-play” performance model of this class of application running on such clusters, in which CPUs communicate via the Message Passing Interface (MPI). By extending this model to also capture the performance behaviour of GPUs, we are able to: (1) comment on the effects that architectural changes will have on the performance of single-GPU solutions, and (2) make projections regarding the performance of multi-GPU solutions at larger scale
Towards a portable and future-proof particle-in-cell plasma physics code
We present the first reported OpenCL implementation of EPOCH3D, an extensible particle-in-cell plasma physics code developed at the University of Warwick. We document the challenges and successes of this porting effort, and compare the performance of our implementation executing on a wide variety of hardware from multiple vendors. The focus of our work is on understanding the suitability of existing algorithms for future accelerator-based architectures, and identifying the changes necessary to achieve performance portability for particle-in-cell plasma physics codes.
We achieve good levels of performance with limited changes to the algorithmic behaviour of the code. However, our results suggest that a fundamental change to EPOCH3D’s current accumulation step (and its dependency on atomic operations) is necessary in order to fully utilise the massive levels of parallelism supported by emerging parallel architectures
An investigation of the performance portability of OpenCL
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation
On the acceleration of wavefront applications using distributed many-core architectures
In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications—a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures
Anomalous Hall magnetoresistance in a ferromagnet
The anomalous Hall effect, observed in conducting ferromagnets with broken
time-reversal symmetry, offers the possibility to couple spin and orbital
degrees of freedom of electrons in ferromagnets. In addition to charge, the
anomalous Hall effect also leads to spin accumulation at the surfaces
perpendicular to both the current and magnetization direction. Here we
experimentally demonstrate that the spin accumulation, subsequent spin
backflow, and spin-charge conversion can give rise to a different type of spin
current related magnetoresistance, dubbed here as the anomalous Hall
magnetoresistance, which has the same angular dependence as the recently
discovered spin Hall magnetoresistance. The anomalous Hall magnetoresistance is
observed in four types of samples: co-sputtered (Fe1-xMnx)0.6Pt0.4, Fe1-xMnx
and Pt multilayer, Fe1-xMnx with x = 0.17 to 0.65 and Fe, and analyzed using
the drift-diffusion model. Our results provide an alternative route to study
charge-spin conversion in ferromagnets and to exploit it for potential
spintronic applications
Dopants adsorbed as single atoms prevent degradation of catalysts
The design of catalysts with desired chemical and thermal properties is
viewed as a grand challenge for scientists and engineers. For operation at high
temperatures, stability against structural transformations is a key
requirement. Although doping has been found to impede degradation, the lack of
atomistic understanding of the pertinent mechanism has hindered optimization.
For example, porous gamma-Al2O3, a widely used catalyst and catalytic support,
transforms to non-porous alpha-Al2O3 at ~1,100C. Doping with La raises the
transformation temperature to ~1,250C, but it has not been possible to
establish if La atoms enter the bulk, adsorb on surfaces as single atoms or
clusters, or form surface compounds. Here, we use direct imaging by
aberration-corrected Z-contrast scanning transmission electron microscopy
coupled with extended X-ray absorption fine structure and first-principles
calculations to demonstrate that, contrary to expectations, stabilization is
achieved by isolated La atoms adsorbed on the surface. Strong binding and
mutual repulsion of La atoms effectively pin the surface and inhibit both
sintering and the transformation to alpha-Al2O3. The results provide the first
guidelines for the choice of dopants to prevent thermal degradation of
catalysts and other porous materials.Comment: RevTex4, 4 pages, 4 JPEG figures, published in Nature Material
- …