GPU Accelerated Particle Visualization with Splotch

Author: Dolag Klaus
Dykes Tim
Gheller Claudio
Krokos Mel
Rivi Marzia
Publication venue: 'Elsevier BV'
Publication date: 23/03/2014

Splotch is a rendering algorithm for exploration and visual discovery in particle-based datasets from astronomical observations or numerical simulations. The strengths of the approach are the production of high-quality imagery and support for very large-scale datasets through an effective mix of the OpenMP and MPI parallel programming paradigms. This article reports our experiences in re-designing Splotch to exploit emerging HPC architectures, which are increasingly populated with GPUs. A performance model is introduced for data transfers, computations and memory access, to guide our re-factoring of Splotch. A number of parallelization issues are discussed, in particular relating to race conditions and workload balancing, with a view to achieving optimal performance. Our implementation uses the CUDA programming paradigm. Our strategy is founded on novel schemes achieving optimized data organisation and classification of particles. We deploy a reference simulation to present performance results on acceleration gains and scalability. We finally outline our vision for future developments, including possibilities for further optimisations and exploitation of emerging technologies. Comment: 25 pages, 9 figures. Astronomy and Computing (2014

arXiv.org e-Print Archive

Portsmouth University Research Portal (Pure)

Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines

Author: Cooper Lee
Kong Jun
Kurc Tahsin
Pan Tony
Saltz Joel
Teodoro George
Publication date: 14/09/2012

In this paper, we address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU-accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and Euclidean distance transform. Our results show significant performance improvements on GPUs. The use of multiple CPUs and GPUs cooperatively attains speedups of 50x and 85x with respect to single-core CPU executions for morphological reconstruction and Euclidean distance transform, respectively. Comment: 37 pages, 16 figure
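The propagation rule described in this abstract can be illustrated with a minimal sequential sketch of morphological reconstruction by dilation. This is not the authors' implementation: it uses a single FIFO queue on one CPU core rather than the multi-level GPU queue or the tile-based strategy, and the function name `morph_reconstruct`, the 4-connected neighborhood and the list-of-lists image layout are assumptions made for illustration.

```python
from collections import deque

def morph_reconstruct(marker, mask):
    """Grayscale reconstruction by dilation on a 4-connected grid (illustrative sketch)."""
    rows, cols = len(mask), len(mask[0])
    # Clamp the marker under the mask so the result never exceeds the mask.
    out = [[min(marker[r][c], mask[r][c]) for c in range(cols)] for r in range(rows)]
    # Simplest possible seeding: every pixel starts in the wavefront.
    wavefront = deque((r, c) for r in range(rows) for c in range(cols))
    while wavefront:
        r, c = wavefront.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # Propagation condition: the neighbor can be raised, limited by its mask value.
                candidate = min(out[r][c], mask[nr][nc])
                if candidate > out[nr][nc]:
                    out[nr][nc] = candidate
                    # A pixel that received the wave joins the wavefront.
                    wavefront.append((nr, nc))
    return out

if __name__ == "__main__":
    mask = [[1, 1, 0, 1],
            [1, 1, 0, 1]]
    marker = [[1, 0, 0, 0],
              [0, 0, 0, 0]]
    print(morph_reconstruct(marker, mask))
```

The queue only ever holds pixels whose value may still affect a neighbor, which is what makes the access pattern irregular: the amount and location of work depend on the image content rather than on a fixed loop structure.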

arXiv.org e-Print Archive

CiteSeerX

Enabling a High Throughput Real Time Data Pipeline for a Large Radio Telescope Array with GPUs

Author: Bracewell
Corey
D.A. Mitchell
DeBoer
Deboer
Furlanetto
H. Pfister
Hey
Högbom
K. Dale
L.J. Greenhill
Lonsdale
M.A. Clark
Mitchell
Momjian
R.B. Wayth
R.G. Edgar
S.M. Ord
Salah
Schwab
Terzian
Thompson
Varbanescu
Wayth
Publication venue: 'Elsevier BV'
Publication date: 01/01/2010

The Murchison Widefield Array (MWA) is a next-generation radio telescope currently under construction in the remote Western Australia Outback. Raw data will be generated continuously at 5 GiB/s, grouped into 8 s cadences. This high throughput motivates the development of on-site, real-time processing and reduction in preference to archiving, transport and off-line processing. Each batch of 8 s data must be completely reduced before the next batch arrives. Maintaining real-time operation will require a sustained performance of around 2.5 TFLOP/s (including convolutions, FFTs, interpolations and matrix multiplications). We describe a scalable heterogeneous computing pipeline implementation, exploiting both the high computing density and FLOP-per-Watt ratio of modern GPUs. The architecture is highly parallel within and across nodes, with all major processing elements performed by GPUs. Necessary scatter-gather operations along the pipeline are loosely synchronized between the nodes hosting the GPUs. The MWA will be a frontier scientific instrument and a pathfinder for planned peta- and exascale facilities. Comment: Version accepted by Comp. Phys. Com
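The figures quoted in this abstract imply a per-batch budget that is easy to work out. The sketch below is plain arithmetic on the three stated numbers (5 GiB/s, 8 s, about 2.5 TFLOP/s); the derived quantities and the helper name `batch_budget` are not taken from the paper and are only a way of reading the stated rates.

```python
GIB = 2 ** 30  # bytes in one gibibyte

def batch_budget(rate_gib_per_s, cadence_s, sustained_tflops):
    """Derive per-batch data volume and compute budget from the quoted rates."""
    bytes_per_s = rate_gib_per_s * GIB
    flop_per_s = sustained_tflops * 1e12
    return {
        # Data that accumulates during one cadence.
        "batch_gib": rate_gib_per_s * cadence_s,
        # Floating-point operations available before the next batch arrives.
        "batch_tflop": sustained_tflops * cadence_s,
        # Average operations spent per input byte at the sustained rate.
        "flop_per_byte": flop_per_s / bytes_per_s,
    }

if __name__ == "__main__":
    for name, value in batch_budget(5, 8, 2.5).items():
        print(f"{name}: {value:.1f}")
```

Under these numbers each 8 s batch holds 40 GiB of raw data and must be reduced within a budget of about 20 TFLOP, which corresponds to several hundred floating-point operations per input byte.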

arXiv.org e-Print Archive

Crossref

Harvard University - DASH

espace@Curtin

PERFORMANCE ENHANCEMENT OF CUDA APPLICATIONS BY OVERLAPPING DATA TRANSFER AND KERNEL EXECUTION

Author: N Chiplunkar Niranjan
Raju K.
Publication venue: Lublin University of Technology
Publication date: 01/01/2021

The CPU-GPU combination is a widely used heterogeneous computing system in which the CPU and GPU have different address spaces. Since the GPU cannot directly access the CPU memory, the input data must be available in GPU memory before the GPU function is invoked. On completion of the GPU function, the results of the computation are transferred to CPU memory. The CPU-GPU data transfer happens through the PCI-Express bus. The PCI-E bandwidth is much lower than that of GPU memory, so the speed at which the data is transferred is limited by the PCI-E bandwidth. Hence, the PCI-E bus acts as a performance bottleneck. In this paper two approaches are discussed to minimize the overhead of data transfer, namely, performing the data transfer while the GPU function is being executed and reducing the amount of data to be transferred to the GPU. The effectiveness of these approaches on the execution time of a set of CUDA applications is realized using CUDA streams. The results of our experiments show that the execution time of applications can be minimized with the proposed approaches.
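The benefit of overlapping transfers with kernel execution can be estimated with a simple timing model. The sketch below is an idealized assumption, not the model or the measurements from the paper: the work is split into equal chunks, each stage time scales linearly with chunk size, host-to-device copy, kernel and device-to-host copy run on independent hardware units, and per-chunk launch overhead is ignored. The function names are hypothetical.

```python
def serial_time(h2d, kernel, d2h):
    """Total time when copy-in, kernel and copy-out run strictly one after another."""
    return h2d + kernel + d2h

def pipelined_time(h2d, kernel, d2h, n_chunks):
    """Idealized time when the work is split into n_chunks and the three stages overlap."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Time each stage needs for one chunk (assumes linear scaling with size).
    stages = [h2d / n_chunks, kernel / n_chunks, d2h / n_chunks]
    # The first chunk passes through all stages; every later chunk finishes
    # one bottleneck-stage interval after the previous one.
    return sum(stages) + (n_chunks - 1) * max(stages)

if __name__ == "__main__":
    h2d, kernel, d2h = 4.0, 8.0, 4.0  # illustrative times in milliseconds
    print("serial:", serial_time(h2d, kernel, d2h))
    for n in (1, 2, 4, 8):
        print(n, "chunks:", pipelined_time(h2d, kernel, d2h, n))
```

In this model the pipelined time approaches the duration of the slowest stage as the number of chunks grows, so overlap hides the transfers completely only when the kernel takes at least as long as each copy direction. Real devices add per-transfer and per-launch overheads that limit how small chunks can usefully be.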

Biblioteka Nauki - repozytorium artykułów

Lublin University of Technology Journals

Directory of Open Access Journals