340 research outputs found
Heterogeneous Highly Parallel Implementation of Matrix Exponentiation Using GPU
The vision of a supercomputer on every desk can be realized by powerful,
highly parallel CPUs, GPUs, or APUs. Graphics processors, once specialized for
graphics applications only, are now used for computationally intensive
general-purpose applications. GFLOP- and TFLOP-level performance, once very
expensive, has become cheap with GPGPUs. The current work focuses on a highly
parallel implementation of matrix exponentiation. Matrix exponentiation is
widely used across the scientific community, ranging from highly critical
flight and CAD simulations to financial and statistical applications. The
proposed solution uses OpenCL to exploit the massive parallelism offered by
many-core GPGPUs. It employs many general GPU optimizations as well as
architecture-specific optimizations. The experiments cover optimizations
targeted specifically at scientific graphics cards (Tesla C2050). The
heterogeneous, highly parallel matrix exponentiation method has been tested on
matrices of different sizes and with different powers. The devised kernel has
shown a 1000X speedup, and a 44-fold speedup over the naive GPU kernel.
Comment: 15 pages, 12 figures, International Journal of Distributed and
Parallel Systems (IJDPS), ISSN: 0976-9757 [Online]; 2229-3957 [Print]
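The abstract does not spell out the algorithm, but matrix exponentiation is commonly computed by repeated squaring, which reduces A^n to O(log n) matrix multiplies, each of which is exactly the kind of data-parallel kernel a GPU implementation would offload. A minimal NumPy sketch of that idea (the function name and the use of NumPy as a stand-in for the GPU kernel are illustrative, not from the paper):

```python
import numpy as np

def mat_pow(A, n):
    """Compute A**n by repeated squaring: O(log n) matrix multiplies.

    Each multiply is the data-parallel kernel a GPU implementation
    would offload; here NumPy's @ operator stands in for that kernel.
    """
    result = np.eye(A.shape[0], dtype=A.dtype)
    base = A.copy()
    while n > 0:
        if n & 1:               # current bit of n is set: fold base in
            result = result @ base
        base = base @ base      # square for the next bit
        n >>= 1
    return result

# Example: powers of the Fibonacci matrix
A = np.array([[1.0, 1.0], [1.0, 0.0]])
print(mat_pow(A, 10))  # top-left entry is F(11) = 89
```

Ten matrix multiplies become four here (squarings plus folds), which is why large powers on large matrices benefit so much from a fast multiply kernel.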
Graphics supercomputing applied to brain image analysis with NiftyReg
Abstract: Medical image processing in general, and brain image processing in
particular, are computationally intensive tasks. Fortunately, they can be
accelerated by means of techniques such as GPU programming. In this article we
study NiftyReg, a brain image processing library with a GPU implementation
using CUDA, and analyse different possible ways of further optimising the
existing code. We focus on fully using the memory hierarchy and on exploiting
the computational power of the GPU. The ideas that lead us to the different
attempts to change and optimize the code are presented as hypotheses, which we
then test empirically using the results obtained from running the application.
Finally, for each set of related optimizations we study the validity of the
obtained results in terms of both performance and the accuracy of the
resulting images.
Architecture-Aware Optimization on a 1600-core Graphics Processor
The graphics processing unit (GPU) continues to
make significant strides as an accelerator in commodity cluster
computing for high-performance computing (HPC). For example,
three of the top five fastest supercomputers in the world, as
ranked by the TOP500, employ GPUs as accelerators. Despite this
increasing interest in GPUs, however, optimizing the performance
of a GPU-accelerated compute node requires deep technical
knowledge of the underlying architecture. Although significant
literature exists on how to optimize GPU performance on the
more mature NVIDIA CUDA architecture, comparatively little exists
for OpenCL on the AMD GPU.
Consequently, we present and evaluate architecture-aware optimizations
for the AMD GPU. The most prominent optimizations
include (i) explicit use of registers, (ii) use of vector types, (iii)
removal of branches, and (iv) use of image memory for global data.
We demonstrate the efficacy of our AMD GPU optimizations by
applying each optimization in isolation as well as in concert to
a large-scale molecular modeling application called GEM. Via
these AMD-specific GPU optimizations, the AMD Radeon HD
5870 GPU delivers 65% better performance than with the well-known
NVIDIA-specific optimizations.
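Of the four optimizations listed, branch removal is the easiest to illustrate outside of OpenCL: on SIMD hardware, lanes that take different sides of a branch diverge and serialize, so the branch is replaced with arithmetic or a select that every lane executes. A minimal sketch in NumPy (the clamp example and function names are invented for illustration, not taken from the paper):

```python
import numpy as np

# Branchy form: a per-element if/else chain, which on SIMD hardware
# would cause divergence whenever lanes take different paths.
def clamp_branchy(x, lo, hi):
    out = np.empty_like(x)
    for i, v in enumerate(x):
        if v < lo:
            out[i] = lo
        elif v > hi:
            out[i] = hi
        else:
            out[i] = v
    return out

# Branchless form: every lane performs the same min/max arithmetic;
# the "branch" becomes a select (a conditional move on GPUs).
def clamp_branchless(x, lo, hi):
    return np.minimum(np.maximum(x, lo), hi)

x = np.array([-2.0, 0.5, 3.0])
print(clamp_branchless(x, 0.0, 1.0))  # elementwise clamp to [0, 1]
```

Both forms compute the same result; only the branchless one keeps all SIMD lanes doing identical work.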
Spectral turning bands for efficient Gaussian random fields generation on GPUs and accelerators
A random field (RF) is a set of correlated random variables associated with different spatial locations. RF generation algorithms are of crucial importance for many scientific areas, such as astrophysics, geostatistics, computer graphics, and many others. Current approaches commonly make use of 3D fast Fourier transform (FFT), which does not scale well for RF bigger than the available memory; they are also limited to regular rectilinear meshes.
We introduce random field generation with the turning band method (RAFT), an RF generation algorithm based on the turning band method that is optimized for massively parallel hardware such as GPUs and accelerators. Our algorithm replaces the 3D FFT with a lower-order, one-dimensional FFT followed by a projection step and is further optimized with loop unrolling and blocking. RAFT can easily generate RF on non-regular (non-uniform) meshes and efficiently produce fields with mesh sizes bigger than the available device memory by using a streaming, out-of-core approach. Our algorithm generates RF with the correct statistical behavior and is tested on a variety of modern hardware, such as NVIDIA Tesla, AMD FirePro and Intel Phi. RAFT is faster than the traditional methods on regular meshes and has been successfully applied to two real case scenarios: planetary nebulae and cosmological simulations.
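The turning band idea, replacing a 3D simulation with 1-D processes generated along random line directions and summed at each point's projection, can be sketched as follows. This is a heavily simplified illustration, not RAFT itself: the Gaussian spectral density, grid sizes, line count, and normalization are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def line_process(n, dx):
    """1-D stationary Gaussian process via the spectral (FFT) method.
    The Gaussian-shaped spectral density is an assumed, illustrative choice."""
    freqs = np.fft.rfftfreq(n, d=dx)
    spectrum = np.exp(-(freqs / 2.0) ** 2)            # assumed spectral density
    phases = rng.uniform(0.0, 2.0 * np.pi, freqs.size)
    coeffs = np.sqrt(spectrum) * np.exp(1j * phases)  # random-phase spectrum
    y = np.fft.irfft(coeffs, n=n)
    return (y - y.mean()) / y.std()                   # zero-mean, unit variance

def turning_bands(points, n_lines=64):
    """Sum 1-D processes over random directions (the turning band idea):
    each 3-D point samples every line process at its projection."""
    z = np.zeros(len(points))
    grid = np.linspace(-2.0, 2.0, 1024)               # 1-D sampling grid
    dx = grid[1] - grid[0]
    for _ in range(n_lines):
        u = rng.normal(size=3)
        u /= np.linalg.norm(u)                        # random unit direction
        proj = points @ u                             # project points onto line
        z += np.interp(proj, grid, line_process(1024, dx))
    return z / np.sqrt(n_lines)

# Works directly on scattered (non-regular) points, unlike a 3D FFT on a grid.
pts = rng.uniform(-1.0, 1.0, size=(500, 3))
field = turning_bands(pts)
```

Because the expensive FFT is only one-dimensional and each point is processed independently, this structure maps naturally onto GPUs and streams past device-memory limits, which is the property the paper exploits.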
Master of Science thesis
The advent of the era of cheap and pervasive many-core and multicore parallel systems has highlighted the disparity in performance achieved between novice and expert developers targeting parallel architectures. This disparity is most noticeable in software for running general-purpose computations on graphics processing units (GPGPU programs). Current methods for implementing GPGPU programs require an expert-level understanding of the memory hierarchy and execution model of the hardware to reach peak performance. Even for experts, rewriting a program to exploit these hardware features can be tedious and error prone. Compilers and their ability to make code transformations can assist in the implementation of GPGPU programs, handling many of the target-specific details. This thesis presents CUDA-CHiLL, a source-to-source compiler transformation and code generation framework for the parallelization and optimization of computations expressed as sequential loop nests for running on many-core GPUs. This system uniquely uses a complete scripting language to describe composable compiler transformations that can be written, shared and reused by non-expert application and library developers. CUDA-CHiLL is built on the polyhedral program transformation and code generation framework CHiLL, which is capable of robust composition of transformations while preserving the correctness of the program at each step. Through its use of powerful abstractions and a scripting interface, CUDA-CHiLL allows a developer to focus on optimization strategies and ignore the error-prone details and low-level constructs of GPGPU programming. The high-level framework can be used inside an orthogonal auto-tuning system that can quickly evaluate the space of possible implementations. Although specific to CUDA at the moment, many of the abstractions would hold for any GPGPU framework, particularly OpenCL.
The contributions of this thesis include a programming-language approach to providing transformation abstraction and composition, a unifying framework for general and GPU-specific transformations, and a demonstration of the framework on standard benchmarks that shows it capable of matching or outperforming hand-tuned GPU kernels.
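As a flavor of the loop-nest transformations such a framework composes, here is loop tiling applied by hand to a matrix-multiply nest. CUDA-CHiLL generates this kind of restructured code from a transformation script rather than requiring the programmer to write it; the Python rendering below is purely illustrative.

```python
import numpy as np

def matmul_naive(A, B):
    """Reference triple loop: the sequential loop nest a compiler sees."""
    n, m, p = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i, j] += A[i, k] * B[k, j]
    return C

def matmul_tiled(A, B, T=4):
    """The same computation after a tiling transformation: the i and j
    loops are split into tile loops and intra-tile loops. On a GPU the
    tile loops would map to thread blocks and the inner loops to threads."""
    n, m, p = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((n, p))
    for ii in range(0, n, T):                       # tile loops
        for jj in range(0, p, T):
            for i in range(ii, min(ii + T, n)):     # intra-tile loops
                for j in range(jj, min(jj + T, p)):
                    for k in range(m):
                        C[i, j] += A[i, k] * B[k, j]
    return C
```

The transformation preserves the result exactly while exposing the block structure that GPU code generation and shared-memory staging build on, which is the correctness-preserving composition the thesis emphasizes.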
Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability
Heterogeneous computing, which combines devices with different architectures,
is rising in popularity and promises increased performance combined with
reduced energy consumption. OpenCL has been proposed as a standard for
programming such systems, and offers functional portability. It does, however,
suffer from poor performance portability: code tuned for one device must be
re-tuned to achieve good performance on another device. In this paper, we use
machine learning-based auto-tuning to address this problem. Benchmarks are run
on a random subset of the entire tuning parameter configuration space, and the
results are used to build an artificial neural network based model. The model
can then be used to find interesting parts of the parameter space for further
search. We evaluate our method with different benchmarks, on several devices,
including an Intel i7 3770 CPU, an Nvidia K40 GPU and an AMD Radeon HD 7970
GPU. Our model achieves a mean relative error as low as 6.1%, and is able to
find configurations as little as 1.3% worse than the global minimum.
Comment: This is a pre-print version of an article to be published in the
Proceedings of the 2015 IEEE International Parallel and Distributed
Processing Symposium Workshops (IPDPSW). For personal use only.
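The overall workflow, benchmark a random subset, fit a model on the measurements, then use the model to pick promising configurations for further search, can be sketched as follows. Note the substitutions: a simple k-nearest-neighbour regressor stands in for the paper's artificial neural network, and the two-parameter tuning space and synthetic runtime function are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D tuning space: work-group size exponent x unroll factor.
wg = np.arange(3, 9)          # work-group sizes 2**3 .. 2**8
unroll = np.arange(1, 9)      # unroll factors 1 .. 8
space = np.array([(w, u) for w in wg for u in unroll], dtype=float)

def measure(cfg):
    """Stand-in for an actual benchmark run: a synthetic runtime with its
    minimum at work-group 2**6, unroll 4, plus measurement noise."""
    w, u = cfg
    return (w - 6) ** 2 + 0.5 * (u - 4) ** 2 + rng.normal(0.0, 0.1)

# 1) Benchmark a random subset of the configuration space.
idx = rng.choice(len(space), size=12, replace=False)
times = np.array([measure(space[i]) for i in idx])

# 2) Fit a surrogate model on the measurements (k-NN regression here,
#    standing in for the paper's neural network).
def predict(cfg, k=3):
    d = np.linalg.norm(space[idx] - cfg, axis=1)
    return times[np.argsort(d)[:k]].mean()

# 3) Rank the whole space with the model; re-benchmark only the
#    most promising candidates instead of the full space.
scores = np.array([predict(c) for c in space])
best = space[np.argsort(scores)[:5]]
```

The saving is the same one the paper targets: only a small sample of the space is ever timed on hardware, while the model extrapolates to the rest.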