
    An investigation of the performance portability of OpenCL

    This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions made during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single-source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL's device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation.
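
    Device fission lets a single OpenCL device be partitioned into sub-devices, each with its own context and command queue. As a point of reference, here is a minimal host-side sketch of the core OpenCL 1.2 call (earlier implementations exposed the same capability as the cl_ext_device_fission extension); the CPU device type and partition count are illustrative assumptions, not taken from the paper:

```c
/* Minimal sketch: partition a CPU device into four equal sub-devices
 * via OpenCL 1.2 device fission. Error handling is elided; device
 * type and partition count are illustrative. */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    /* Split the device into 4 sub-devices with equal compute units. */
    cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_EQUALLY, 4, 0
    };
    cl_device_id sub[4];
    cl_uint num_sub = 0;
    cl_int err = clCreateSubDevices(device, props, 4, sub, &num_sub);
    printf("created %u sub-devices (err=%d)\n", num_sub, err);

    /* Each sub-device can now host its own context and queue, e.g.
     * to bind work to a NUMA region or reserve cores for MPI. */
    return 0;
}
```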

    On the acceleration of wavefront applications using distributed many-core architectures

    In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications, a ubiquitous class of parallel algorithms used in a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C, and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions far exceeds that of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures.
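
    For readers unfamiliar with the pattern, a pipelined wavefront sweep updates a 3D grid in which each cell depends on its three upwind neighbours, so every cell on a hyperplane i+j+k = f is independent and the hyperplanes must be processed in order. A serial sketch of that dependency structure follows; the grid size and update rule are placeholders, not the LU solver's actual mathematics:

```c
/* Dependency skeleton of a pipelined wavefront sweep: cell (i,j,k)
 * depends on (i-1,j,k), (i,j-1,k) and (i,j,k-1), so hyperplanes
 * i+j+k = f are processed in order and are internally parallel.
 * N and the update rule are illustrative placeholders. */
#define N 64
static double g[N][N][N];

int main(void) {
    for (int f = 0; f <= 3 * (N - 1); ++f) {   /* hyperplanes, in order */
        for (int i = 0; i < N; ++i)            /* cells on a hyperplane */
            for (int j = 0; j < N; ++j) {      /* are independent       */
                int k = f - i - j;
                if (k < 0 || k >= N) continue;
                double up_i = (i > 0) ? g[i-1][j][k] : 0.0;
                double up_j = (j > 0) ? g[i][j-1][k] : 0.0;
                double up_k = (k > 0) ? g[i][j][k-1] : 0.0;
                g[i][j][k] = 0.25 * (g[i][j][k] + up_i + up_j + up_k);
            }
    }
    return 0;
}
```

    On a GPU the two inner loops form the parallel dimension; as we read the abstract, the proposed k-blocking trades some of this parallelism for fewer, larger PCIe and network transfers by processing the k-dimension in blocks rather than one plane at a time.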

    Challenges in using GPUs for the real-time reconstruction of digital hologram images

    In-line holography has recently made the transition from silver-halide-based recording media, with laser reconstruction, to recording with large-area pixel detectors and computer-based reconstruction. This form of holographic imaging is an established technique for the study of fine particulates, such as cloud or fuel droplets, marine plankton and alluvial sediments, and enables a true 3D object field to be recorded at high resolution over a considerable depth. The move to digital holography promises rapid, if not instantaneous, feedback as it avoids the need for the time-consuming chemical development of plates or film and a dedicated replay system, but with the growing use of video-rate holographic recording, and the desire to fully reconstruct every frame, the computational challenge becomes considerable. To replay a digital hologram, a 2D FFT must be calculated for every depth slice desired in the replayed image volume. A typical hologram of ~100 μm particles over a depth of a few hundred millimetres will require O(10^3) 2D FFT operations to be performed on an image of typically a few million pixels. In this paper we discuss the technical challenges in converting our existing reconstruction code to make efficient use of NVIDIA CUDA-based GPU cards, and show how near real-time video slice reconstruction can be obtained with holograms as large as 4096 by 4096 pixels. Our performance to date on a number of different NVIDIA GPUs, running under both Linux and Microsoft Windows, is presented. The recent availability of GPUs in portable computers is discussed, and a new code for interactive replay of digital holograms is presented.
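
    The computational core described here is standard angular-spectrum replay: one forward FFT of the recorded hologram, then, for each depth slice, a multiplication by a depth-dependent propagation kernel followed by an inverse FFT. The CPU-side sketch below uses FFTW purely to show that structure; the paper's GPU code uses CUDA FFTs, and the wavelength, pixel pitch, grid size and slice spacing are illustrative assumptions:

```c
/* Angular-spectrum hologram replay: FFT once, then one kernel
 * multiply and one inverse FFT per depth slice. FFTW stands in
 * for the paper's GPU FFTs; all physical parameters are assumed. */
#include <complex.h>        /* before fftw3.h: fftw_complex = C99 complex */
#include <math.h>
#include <fftw3.h>

#define NX 1024
#define NY 1024

int main(void) {
    const double lambda = 532e-9;   /* wavelength [m], assumed  */
    const double pitch  = 7.4e-6;   /* pixel pitch [m], assumed */
    fftw_complex *holo  = fftw_alloc_complex(NX * NY);
    fftw_complex *spec  = fftw_alloc_complex(NX * NY);
    fftw_complex *filt  = fftw_alloc_complex(NX * NY);
    fftw_complex *slice = fftw_alloc_complex(NX * NY);
    /* ... load the recorded hologram into holo ... */

    fftw_plan fwd = fftw_plan_dft_2d(NY, NX, holo, spec,
                                     FFTW_FORWARD,  FFTW_ESTIMATE);
    fftw_plan inv = fftw_plan_dft_2d(NY, NX, filt, slice,
                                     FFTW_BACKWARD, FFTW_ESTIMATE);
    fftw_execute(fwd);              /* forward FFT done once, reused below */

    for (int s = 0; s < 1000; ++s) {          /* O(10^3) depth slices */
        double z = 0.1 + s * 0.5e-3;          /* replay depth [m], assumed */
        for (int v = 0; v < NY; ++v)
            for (int u = 0; u < NX; ++u) {
                double fx = ((u < NX / 2) ? u : u - NX) / (NX * pitch);
                double fy = ((v < NY / 2) ? v : v - NY) / (NY * pitch);
                double arg = 1.0 - lambda * lambda * (fx * fx + fy * fy);
                double complex H = (arg > 0.0)   /* drop evanescent waves */
                    ? cexp(I * 2.0 * M_PI * z * sqrt(arg) / lambda)
                    : 0.0;
                filt[v * NX + u] = spec[v * NX + u] * H;
            }
        fftw_execute(inv);          /* slice = complex field at depth z */
        /* ... search this slice for in-focus particles ... */
    }

    fftw_destroy_plan(fwd); fftw_destroy_plan(inv);
    fftw_free(holo); fftw_free(spec); fftw_free(filt); fftw_free(slice);
    return 0;
}
```

    Each slice costs an elementwise multiply plus one 2D FFT, both bandwidth-bound and highly data-parallel, which is why this workload maps naturally onto GPUs.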

    Many-core applications to online track reconstruction in HEP experiments

    Interest in parallel architectures applied to real-time selections is growing in High Energy Physics (HEP) experiments. In this paper we describe performance measurements of Graphics Processing Units (GPUs) and the Intel Many Integrated Core (MIC) architecture when applied to a typical HEP online task: the selection of events based on the trajectories of charged particles. As a benchmark we use a scaled-up version of the algorithm used at the CDF experiment at the Tevatron for online track reconstruction, the SVT algorithm, as a realistic test case for low-latency trigger systems using new computing architectures for LHC experiments. We examine the complexity/performance trade-off in porting existing serial algorithms to many-core devices. Measurements of both data processing and data transfer latency are shown, considering different I/O strategies to and from the parallel devices.
    Comment: Proceedings for the 20th International Conference on Computing in High Energy and Nuclear Physics (CHEP); missing acks added.
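
    Since the abstract highlights data transfer latency and I/O strategy, a sketch of the basic measurement may be useful: timing a pinned-memory asynchronous host-to-device copy with CUDA events. The buffer size and names are assumptions, and the MIC side would use a different offload API:

```c
/* Timing one I/O strategy: asynchronous host-to-device copy from
 * pinned memory, measured with CUDA events. Compile with nvcc;
 * the buffer size is an illustrative stand-in for an event record. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 1 << 20;   /* 1 MiB of event data, assumed */
    float *h_buf, *d_buf;
    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault); /* pinned */
    cudaMalloc((void **)&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0, stream);
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(t1, stream);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("H2D: %.3f ms (%.2f GB/s)\n", ms, (double)bytes / ms / 1e6);

    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf); cudaFree(d_buf);
    return 0;
}
```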

    Developing performance-portable molecular dynamics kernels in OpenCL

    This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia's miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs. We demonstrate that the performance bottlenecks of miniMD's short-range force calculation kernel are the same across these architectures, and detail a number of platform-agnostic optimisations that improve its performance by at least 2× on all hardware considered. Our complete code is shown to be 1.7× faster than the original miniMD, and at most 2× slower than implementations individually hand-tuned for a specific architecture.
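
    The short-range force calculation referred to above is, in miniMD, a neighbour-list Lennard-Jones loop. An illustrative OpenCL kernel with one work-item per atom is sketched below; the buffer layout, reduced units (sigma = epsilon = 1) and names are our assumptions, not the paper's actual code:

```c
/* Illustrative miniMD-style short-range force kernel: one work-item
 * per atom, Lennard-Jones in reduced units over a precomputed,
 * flattened neighbour list. Layout and names are assumptions. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void lj_force(__global const double4 *pos,   /* x,y,z in .xyz */
                       __global const int *neigh,     /* i*maxneigh + jj */
                       __global const int *numneigh,
                       const int maxneigh,
                       const double cutsq,            /* cutoff squared */
                       __global double4 *force)
{
    int i = get_global_id(0);
    double4 xi = pos[i];
    double4 f = (double4)(0.0);
    for (int jj = 0; jj < numneigh[i]; ++jj) {
        int j = neigh[i * maxneigh + jj];
        double4 d = xi - pos[j];
        d.w = 0.0;                                    /* .w is padding */
        double rsq = d.x * d.x + d.y * d.y + d.z * d.z;
        if (rsq < cutsq) {
            double sr2 = 1.0 / rsq;
            double sr6 = sr2 * sr2 * sr2;             /* (1/r)^6 */
            double fpair = 48.0 * sr6 * (sr6 - 0.5) * sr2;
            f += fpair * d;                           /* accumulate force */
        }
    }
    force[i] = f;
}
```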

    Computational Physics on Graphics Processing Units

    The use of graphics processing units for scientific computations is an emerging strategy that can significantly speed up a variety of algorithms. In this review, we discuss advances made in the field of computational physics, focusing on classical molecular dynamics and on quantum simulations for electronic structure calculations using density functional theory, wave function techniques, and quantum field theory.
    Comment: Proceedings of the 11th International Conference, PARA 2012, Helsinki, Finland, June 10-13, 2012.