Dwarfs on Accelerators: Enhancing OpenCL Benchmarking for Heterogeneous Computing Architectures
For reasons of both performance and energy efficiency, high-performance
computing (HPC) hardware is becoming increasingly heterogeneous. The OpenCL
framework supports portable programming across a wide range of computing
devices and is gaining influence in programming next-generation accelerators.
Characterizing the performance of these devices across a range of applications
requires a diverse, portable and configurable benchmark suite, and OpenCL is an
attractive programming model for this purpose. We present an extended and
enhanced version of the OpenDwarfs OpenCL benchmark suite, with a strong focus
on application robustness and on curating additional benchmarks with increased
emphasis on correctness of results and choice of problem
size. Preliminary results and analysis are reported for eight benchmark codes
on a diverse set of architectures -- three Intel CPUs, five Nvidia GPUs, six
AMD GPUs and a Xeon Phi.
Comment: 10 pages, 5 figures
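As a hedged illustration (not code from the suite itself), the host-side first step of any portable OpenCL benchmark is enumerating every platform and device present, so that the same binary can target CPUs, GPUs and a Xeon Phi alike:

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_uint nplat = 0;
    clGetPlatformIDs(0, NULL, &nplat);            /* count platforms */
    if (nplat > 16) nplat = 16;
    cl_platform_id plats[16];
    clGetPlatformIDs(nplat, plats, NULL);

    for (cl_uint p = 0; p < nplat; ++p) {
        cl_uint ndev = 0;
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 0, NULL, &ndev);
        if (ndev > 16) ndev = 16;
        cl_device_id devs[16];
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, ndev, devs, NULL);

        for (cl_uint d = 0; d < ndev; ++d) {
            char name[256];
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof name, name, NULL);
            printf("platform %u, device %u: %s\n", p, d, name);
        }
    }
    return 0;
}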
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the
last two decades it has evolved from small-scale efficiency to advanced
heterogeneous acceleration and multi-level parallelism targeting some of the
largest supercomputers in the world. Here, we describe some of the ways we have
been able to realize this through the use of parallelization on all levels,
combined with a constant focus on absolute performance. Release 4.6 of GROMACS
uses SIMD acceleration on a wide range of architectures, GPU offloading
acceleration, and both OpenMP and MPI parallelism within and between nodes,
respectively. The recent work on acceleration made it necessary to revisit the
fundamental algorithms of molecular simulation, including the concept of
neighbor searching, and we discuss the present and future challenges we see for
exascale simulation - in particular a very fine-grained task parallelism. We
also discuss the software management, code peer review and continuous
integration testing required for a project of this complexity.
Comment: EASC 2014 conference proceedings
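A minimal sketch of the hybrid pattern the abstract describes, with MPI ranks between nodes and OpenMP threads within each rank; this is illustrative boilerplate, not GROMACS source:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* hybrid MPI+OpenMP runs need at least MPI_THREAD_FUNNELED */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* each thread would work on a slice of this rank's domain */
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}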
Evaluating kernels on Xeon Phi to accelerate Gysela application
This work describes the challenges presented by porting parts of the Gysela
code to the Intel Xeon Phi coprocessor, as well as techniques for
optimization, vectorization and tuning that can be applied to other
applications. We evaluate the performance of some generic micro-benchmarks on
the Phi versus Intel Sandy Bridge. Several interpolation kernels useful for
the Gysela application are analyzed and their performance is reported. Some
memory-bound and compute-bound kernels are accelerated by a factor of 2 on the
Phi device compared to the Sandy Bridge architecture. Nevertheless, it is
hard, if not impossible, to reach a large fraction of the peak performance on
the Phi device, especially for real-life applications such as Gysela. A
collateral benefit of this optimization and tuning work is that the execution
time of Gysela (using 4D advections) has also decreased on a standard
architecture such as Intel Sandy Bridge.
Comment: submitted to ESAIM proceedings for CEMRACS 2014 summer school, version reviewed
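As a sketch of the kind of interpolation kernel discussed (assuming a uniform 1D grid and simple linear interpolation; the actual Gysela kernels are more involved), a vectorizable loop with restrict-qualified pointers and an OpenMP SIMD hint:

#include <stddef.h>

/* out[i] = linear interpolation of f at fractional position x[i];
 * each x[i] is assumed to lie in [0, nf-2] for a grid of nf points */
void interp_linear(const double *restrict f, const double *restrict x,
                   double *restrict out, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        size_t j = (size_t)x[i];        /* left grid index */
        double t = x[i] - (double)j;    /* offset within the cell */
        out[i] = (1.0 - t) * f[j] + t * f[j + 1];
    }
}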
Study of Parallel Programming Models on Computer Clusters with Accelerators
In order to reach exascale computing capability, accelerators have become a crucial part of developing supercomputers. This work examines the potential of two of the latest acceleration technologies, the Intel Many Integrated Core (MIC) architecture and Graphics Processing Units (GPUs). This thesis applies three benchmarks under three different configurations: MPI+CPU, MPI+GPU, and MPI+MIC. The benchmarks comprise an intensely communicating application, a loosely communicating application, and an embarrassingly parallel application. This thesis also carries out a detailed study of the scalability and performance of MIC processors under two programming models, the offload model and the native model, on the Beacon computer cluster.
Across these benchmarks, the results demonstrate different performance and scalability for GPU and MIC. (1) For the embarrassingly parallel case, the GPU-based implementation on the Keeneland computer cluster outperforms the other accelerators, while the MIC-based implementation shows better scalability; the native and offload models on MIC perform nearly identically. (2) For the loosely communicating case, the performance on GPU and MIC is very close, and the MIC-based implementation still demonstrates strong scalability when using 120 MIC processors. (3) For the intensely communicating case, the MPI implementations on CPUs and GPUs both show strong scalability, and GPUs consistently outperform the other accelerators, whereas the MIC-based implementation does not scale well. Here the relative performance of the two MIC models differs from the embarrassingly parallel case: the native model consistently outperforms the offload model by roughly 10 times, and allocating more MIC processors brings little additional gain, because the increased communication cost offsets the performance gain from the reduced workload on each MIC core. This work also tests performance and scalability by varying the number of threads on each MIC card from 10 to 60 for the intensely communicating case, which reveals the limits of the MIC-based offload model: scalability holds as the thread count grows from 10 to 30, computation time decreases at a smaller rate from 30 to 50 threads, and at 60 threads computation time increases, because the communication overhead offsets the performance gain when 60 threads are deployed on a single MIC card.
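For concreteness, a hedged sketch of the offload model compared in the thesis, using the Intel compiler's offload pragma; the array names and sizes here are invented for illustration, and in the native model the same loop would instead be compiled for and launched directly on the MIC card:

#include <stdio.h>

#define N 1000000
static float a[N], b[N];

int main(void) {
    for (int i = 0; i < N; ++i)
        a[i] = (float)i;

    /* offload model: copy a to the coprocessor, run the loop there
     * across its threads, then copy b back to the host */
    #pragma offload target(mic:0) in(a) out(b)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            b[i] = 2.0f * a[i];
    }

    printf("b[42] = %f\n", b[42]);
    return 0;
}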