19,900 research outputs found
Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration
International audienceHeterogeneous computing system increases the performance of parallel computing in many domain of general purpose computing with CPU, GPU and other accelerators. With Hardware developments, the software developments like Compute Unified Device Architecture(CUDA) and Open Computing Language (OpenCL) try to offer a simple and visualized tool for parallel computing. But it turn out to be more difficult than programming on CPU platform for optimization of performance. For one kind of parallel computing application, there are different configuration and parameters for various hardware platforms. In this paper, we apply the Hybrid Multi-cores Parallel Programming(HMPP) to automatic-generates tunable code for GPU platform and show the result of implementation of Stereo Matching with detailed comparison with C code version and manual CUDA version. The experimental results show that the default and optimized HMPP have the approximative 1 compared with CUDA implementation. And the HMPP workbench can greatly reduce the time of application development using parallel computing device
GPGPU Processing in CUDA Architecture
The future of computation is the Graphical Processing Unit, i.e. the GPU. The
promise that the graphics cards have shown in the field of image processing and
accelerated rendering of 3D scenes, and the computational capability that these
GPUs possess, they are developing into great parallel computing units. It is
quite simple to program a graphics processor to perform general parallel tasks.
But after understanding the various architectural aspects of the graphics
processor, it can be used to perform other taxing tasks as well. In this paper,
we will show how CUDA can fully utilize the tremendous power of these GPUs.
CUDA is NVIDIA's parallel computing architecture. It enables dramatic increases
in computing performance, by harnessing the power of the GPU. This paper talks
about CUDA and its architecture. It takes us through a comparison of CUDA C/C++
with other parallel programming languages like OpenCL and DirectCompute. The
paper also lists out the common myths about CUDA and how the future seems to be
promising for CUDA.Comment: 16 pages, 5 figures, Advanced Computing: an International Journal
(ACIJ) 201
CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures
The use of graphics processing units (GPUs) in
high-performance parallel computing continues to become more
prevalent, often as part of a heterogeneous system. For years,
CUDA has been the de facto programming environment for
nearly all general-purpose GPU (GPGPU) applications. In spite
of this, the framework is available only on NVIDIA GPUs,
traditionally requiring reimplementation in other frameworks
in order to utilize additional multi- or many-core devices.
On the other hand, OpenCL provides an open and vendorneutral
programming environment and runtime system. With
implementations available for CPUs, GPUs, and other types of
accelerators, OpenCL therefore holds the promise of a “write
once, run anywhere” ecosystem for heterogeneous computing.
Given the many similarities between CUDA and OpenCL,
manually porting a CUDA application to OpenCL is typically
straightforward, albeit tedious and error-prone. In response
to this issue, we created CU2CL, an automated CUDA-to-
OpenCL source-to-source translator that possesses a novel design
and clever reuse of the Clang compiler framework. Currently,
the CU2CL translator covers the primary constructs found in
CUDA runtime API, and we have successfully translated many
applications from the CUDA SDK and Rodinia benchmark suite.
The performance of our automatically translated applications via
CU2CL is on par with their manually ported countparts
Air pollution modelling using a graphics processing unit with CUDA
The Graphics Processing Unit (GPU) is a powerful tool for parallel computing.
In the past years the performance and capabilities of GPUs have increased, and
the Compute Unified Device Architecture (CUDA) - a parallel computing
architecture - has been developed by NVIDIA to utilize this performance in
general purpose computations. Here we show for the first time a possible
application of GPU for environmental studies serving as a basement for decision
making strategies. A stochastic Lagrangian particle model has been developed on
CUDA to estimate the transport and the transformation of the radionuclides from
a single point source during an accidental release. Our results show that
parallel implementation achieves typical acceleration values in the order of
80-120 times compared to CPU using a single-threaded implementation on a 2.33
GHz desktop computer. Only very small differences have been found between the
results obtained from GPU and CPU simulations, which are comparable with the
effect of stochastic transport phenomena in atmosphere. The relatively high
speedup with no additional costs to maintain this parallel architecture could
result in a wide usage of GPU for diversified environmental applications in the
near future.Comment: 5 figure
cudaBayesreg: Parallel Implementation of a Bayesian Multilevel Model for fMRI Data Analysis
Graphic processing units (GPUs) are rapidly gaining maturity as powerful general parallel computing devices. A key feature in the development of modern GPUs has been the advancement of the programming model and programming tools. Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on Nvidia many-core GPUs. In functional magnetic resonance imaging (fMRI), the volume of the data to be processed, and the type of statistical analysis to perform call for high-performance computing strategies. In this work, we present the main features of the R-CUDA package cudaBayesreg which implements in CUDA the core of a Bayesian multilevel model for the analysis of brain fMRI data. The statistical model implements a Gibbs sampler for multilevel/hierarchical linear models with a normal prior. The main contribution for the increased performance comes from the use of separate threads for fitting the linear regression model at each voxel in parallel. The R-CUDA implementation of the Bayesian model proposed here has been able to reduce significantly the run-time processing of Markov chain Monte Carlo (MCMC) simulations used in Bayesian fMRI data analyses. Presently, cudaBayesreg is only configured for Linux systems with Nvidia CUDA support
- …