Analysing Astronomy Algorithms for GPUs and Beyond
Astronomy depends on ever increasing computing power. Processor clock-rates
have plateaued, and increased performance is now appearing in the form of
additional processor cores on a single chip. This poses significant challenges
to the astronomy software community. Graphics Processing Units (GPUs), now
capable of general-purpose computation, exemplify both the difficult
learning-curve and the significant speedups exhibited by massively-parallel
hardware architectures. We present a generalised approach to tackling this
paradigm shift, based on the analysis of algorithms. We describe a small
collection of foundation algorithms relevant to astronomy and explain how they
may be used to ease the transition to massively-parallel computing
architectures. We demonstrate the effectiveness of our approach by applying it
to four well-known astronomy problems: Hogbom CLEAN, inverse ray-shooting for
gravitational lensing, pulsar dedispersion and volume rendering. Algorithms
with well-defined memory access patterns and high arithmetic intensity stand to
receive the greatest performance boost from massively-parallel architectures,
while those that involve a significant amount of decision-making may struggle
to take advantage of the available processing power.
Comment: 10 pages, 3 figures, accepted for publication in MNRA
Using graphics processors to accelerate the computation of the matrix inverse
We study the use of massively parallel architectures for computing a matrix
inverse. Two different algorithms are reviewed, the traditional approach based on
Gaussian elimination and the Gauss-Jordan elimination alternative, and several high
performance implementations are presented and evaluated. The target architecture is a
current general-purpose multi-core processor (CPU) connected to a graphics processor
(GPU). Numerical experiments show the efficiency attained by the proposed implementations
and how the computation of large-scale inverses, which only a few years
ago would have required a distributed-memory cluster, takes only a few minutes on a
hybrid architecture formed by a multi-core CPU and a GPU.
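The Gauss-Jordan alternative named in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' high-performance implementation: the function name `gauss_jordan_inverse`, the `tol` parameter, and the use of partial pivoting are assumptions made here for illustration, and a real CPU/GPU version would work on blocked matrices with tuned linear-algebra kernels.

```python
def gauss_jordan_inverse(a, tol=1e-12):
    """Invert a square matrix (list of lists) by Gauss-Jordan elimination.

    Illustrative sketch only: dense, unblocked, with partial pivoting.
    """
    n = len(a)
    if any(len(row) != n for row in a):
        raise ValueError("matrix must be square")

    # Build the augmented matrix [A | I].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(a)
    ]

    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < tol:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]

        # Scale the pivot row so the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]

        # Eliminate this column from every other row (above and below).
        for r in range(n):
            if r != col:
                f = aug[r][col]
                if f != 0.0:
                    aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]

    # The right half of [I | A^-1] now holds the inverse.
    return [row[n:] for row in aug]


if __name__ == "__main__":
    print(gauss_jordan_inverse([[4, 7], [2, 6]]))
```

Unlike Gaussian elimination followed by triangular solves, this sweep clears each column both above and below the pivot in a single pass, which gives every step the same uniform row-update shape; that regularity is one plausible reason the variant is attractive on massively parallel hardware.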
Implementing the conjugate gradient algorithm on multi-core systems
In linear solvers such as the conjugate gradient algorithm, sparse matrix-vector multiplication is an important kernel. Due to the sparseness of the matrices, the solver runs relatively slowly. For digital optical tomography (DOT), a large set of linear equations has to be solved, which currently takes on the order of hours on desktop computers. Our goal was to speed up the conjugate gradient solver. In this paper we present the results of applying multiple optimization techniques and exploiting multi-core solutions offered by two recently introduced architectures: Intel’s Woodcrest general-purpose processor and NVIDIA’s G80 graphics processing unit. Using these techniques for these architectures, a speedup of a factor of three has been achieved.
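To show where the sparse matrix-vector product sits inside the solver, here is a minimal pure-Python sketch of the conjugate gradient method on a matrix stored in compressed sparse row (CSR) form. The CSR layout, the function names `csr_matvec` and `conjugate_gradient`, and the `tol` and `max_iter` parameters are assumptions for illustration; the paper's optimized Woodcrest and G80 kernels are not reproduced here. The sketch assumes a symmetric positive-definite matrix.

```python
def csr_matvec(data, indices, indptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Only the stored non-zeros of this row are visited; the indirect
        # access x[indices[k]] is what makes this kernel memory-bound.
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(data, indices, indptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A (CSR), starting at x = 0."""
    x = [0.0] * len(b)
    r = [float(v) for v in b]  # residual b - A x, with x = 0
    p = list(r)                # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = csr_matvec(data, indices, indptr, p)  # one SpMV per iteration
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    # A = [[4, 1], [1, 3]] in CSR form, b = [1, 2].
    print(conjugate_gradient([4.0, 1.0, 1.0, 3.0], [0, 1, 0, 1], [0, 2, 4], [1.0, 2.0]))
```

Each iteration performs exactly one call to `csr_matvec` plus a handful of vector updates, so the cost of the sparse product largely sets the solver's running time.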
A Modeling Approach based on UML/MARTE for GPU Architecture
High-performance computing is now part of the embedded systems context.
Graphics Processing Units (GPUs) are increasingly used to accelerate a large
share of algorithms and applications. In recent years, little effort has been
made to describe abstractions of applications in relation to their target
architectures. Thus, when developers need to map applications onto GPUs, for
example, they encounter difficulties and prefer to use the APIs for these
architectures. This paper presents a metamodel extension for the MARTE profile
and a model for GPU architectures. The main goal is to specify the task and
data allocation in the memory hierarchy of these architectures. The results
show that this approach will help to generate code for GPUs based on model
transformations using Model Driven Engineering (MDE).
Comment: Symposium en Architectures nouvelles de machines (SympA'14) (2011
Architecture-Aware Optimization on a 1600-core Graphics Processor
The graphics processing unit (GPU) continues to
make significant strides as an accelerator in commodity cluster
computing for high-performance computing (HPC). For example,
three of the top five fastest supercomputers in the world, as
ranked by the TOP500, employ GPUs as accelerators. Despite this
increasing interest in GPUs, however, optimizing the performance
of a GPU-accelerated compute node requires deep technical
knowledge of the underlying architecture. Although significant
literature exists on how to optimize GPU performance on the
more mature NVIDIA CUDA architecture, the converse is true
for OpenCL on the AMD GPU.
Consequently, we present and evaluate architecture-aware optimizations
for the AMD GPU. The most prominent optimizations
include (i) explicit use of registers, (ii) use of vector types, (iii)
removal of branches, and (iv) use of image memory for global data.
We demonstrate the efficacy of our AMD GPU optimizations by
applying each optimization in isolation as well as in concert to
a large-scale, molecular modeling application called GEM. Via
these AMD-specific GPU optimizations, the AMD Radeon HD
5870 GPU delivers 65% better performance than with the well-known
NVIDIA-specific optimizations.