Air pollution modelling using a graphics processing unit with CUDA
The Graphics Processing Unit (GPU) is a powerful tool for parallel computing.
In the past years the performance and capabilities of GPUs have increased, and
the Compute Unified Device Architecture (CUDA) - a parallel computing
architecture - has been developed by NVIDIA to utilize this performance in
general-purpose computations. Here we show, for the first time, a possible
application of GPUs to environmental studies, serving as a basis for
decision-making strategies. A stochastic Lagrangian particle model has been
developed in CUDA to estimate the transport and transformation of
radionuclides from a single point source during an accidental release. Our results show that
the parallel implementation achieves typical speedups of 80-120 times over a
single-threaded CPU implementation on a 2.33 GHz desktop computer. Only very
small differences were found between the GPU and CPU results, comparable with
the effect of stochastic transport phenomena in the atmosphere. The high
speedup, with no additional cost to maintain this parallel architecture, could
lead to wide use of GPUs in diverse environmental applications in the near future.
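The dispersion model itself is not reproduced in the abstract, but the core of a stochastic Lagrangian particle scheme can be sketched as follows (a minimal serial sketch in Python; the wind vector, time step, and the single diffusion parameter `sigma` are illustrative assumptions, not the authors' parameterisation):

```python
import random

def advect_particles(particles, wind, dt, sigma, rng=random.Random(0)):
    """One time step of a toy stochastic Lagrangian dispersion model.

    Each particle moves with the mean wind plus a Gaussian random
    displacement representing turbulent diffusion. All names and the
    single constant diffusion parameter `sigma` are illustrative.
    """
    out = []
    for (x, y, z) in particles:
        out.append((
            x + wind[0] * dt + rng.gauss(0.0, sigma),
            y + wind[1] * dt + rng.gauss(0.0, sigma),
            z + wind[2] * dt + rng.gauss(0.0, sigma),
        ))
    return out

# Release 1000 particles from a single point source and step them forward.
particles = [(0.0, 0.0, 50.0)] * 1000
for _ in range(10):
    particles = advect_particles(particles, wind=(5.0, 1.0, 0.0),
                                 dt=1.0, sigma=0.5)
```

On a GPU, the per-particle loop maps naturally onto one CUDA thread per particle, which is what makes this class of model a good fit for the speedups the abstract reports.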
A GPU-based Implementation for Improved Online Rebinning Performance in Clinical 3-D PET
Online rebinning is an important and well-established technique for reducing the time required to process Positron Emission Tomography (PET) data. However, the need for efficient data processing in a clinical setting is growing rapidly and is beginning to exceed the capability of traditional online processing methods. High-count-rate applications such as Rubidium 3-D PET studies can easily saturate current online rebinning technology. Real-time processing at these high count rates is essential to avoid significant data loss. In addition, the emergence of time-of-flight (TOF) scanners is producing very large data sets for processing. TOF applications require efficient online rebinning methods to maintain high patient throughput. New hardware architectures such as Graphics Processing Units (GPUs) are now available to speed up data-parallel and compute-intensive algorithms. Compared to the usual parallel systems, such as multiprocessor or clustered machines, GPU hardware can be much faster and, above all, significantly cheaper. GPUs were originally designed for video-game graphics but are now used for high-performance computing across many domains. The goal of this thesis is to investigate the suitability of the GPU for PET rebinning algorithms.
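As a concrete illustration of what rebinning does, below is a serial sketch of single-slice rebinning (SSRB), the simplest classical rebinning algorithm (the event tuple layout and function names are assumptions made for illustration; the thesis's GPU implementation is not reproduced here):

```python
def ssrb_rebin(events, num_rings):
    """Single-slice rebinning (SSRB) sketch: each oblique coincidence
    event between detector rings (r1, r2) is assigned to the sinogram
    at the axial midpoint of the two rings. The event format
    (r1, r2, radial_bin, angle_bin) is illustrative."""
    num_slices = 2 * num_rings - 1       # direct + cross planes
    sinograms = [dict() for _ in range(num_slices)]
    for r1, r2, radial, angle in events:
        slice_idx = r1 + r2              # midpoint index in half-slice units
        bins = sinograms[slice_idx]
        bins[(radial, angle)] = bins.get((radial, angle), 0) + 1
    return sinograms

# A direct event on ring 0 and an oblique event between rings 1 and 3:
sinos = ssrb_rebin([(0, 0, 5, 3), (1, 3, 5, 3)], num_rings=4)
```

Each event is processed independently, so the histogramming step parallelises well onto GPU threads with atomic increments, which is the property the thesis exploits.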
On the testing of special memories in GPGPUs
Nowadays, data-intensive applications such as multimedia, high-performance computing, and safety-critical ones (e.g., in automotive) employ General Purpose Graphics Processing Units (GPGPUs) due to their parallel processing capabilities and high performance. In these devices, multiple levels of memory are employed to hide latency and increase performance during kernel execution. Moreover, modern GPGPU architectures are implemented in cutting-edge semiconductor technologies, reducing their size and power consumption. However, studies have shown that these technologies are prone to faults during the operational life of a device, compromising reliability. In this work, we develop functional test techniques based on parallel Software-Based Self-Test (SBST) routines to test memory structures in the memory hierarchy of a GPGPU (FlexGripPlus) implementing the Nvidia G80 architecture.
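SBST routines for memories typically apply a March algorithm through ordinary load/store instructions. Below is a host-side sketch of the classic March C- element sequence (the `read`/`write` callbacks abstract the memory under test; this illustrates the algorithm family only, not the parallel CUDA routines developed for FlexGripPlus):

```python
def march_c_minus(mem_size, read, write):
    """March C- memory test sketch: ascending/descending read-write
    passes that detect stuck-at, transition, and many coupling faults.
    Returns the first faulty address found, or None if the test passes."""
    # M0: up, w0
    for a in range(mem_size):
        write(a, 0)
    # M1: up, r0 w1
    for a in range(mem_size):
        if read(a) != 0: return a
        write(a, 1)
    # M2: up, r1 w0
    for a in range(mem_size):
        if read(a) != 1: return a
        write(a, 0)
    # M3: down, r0 w1
    for a in reversed(range(mem_size)):
        if read(a) != 0: return a
        write(a, 1)
    # M4: down, r1 w0
    for a in reversed(range(mem_size)):
        if read(a) != 1: return a
        write(a, 0)
    # M5: up, r0
    for a in range(mem_size):
        if read(a) != 0: return a
    return None

# Fault-free memory model passes; a stuck-at-1 cell is detected.
mem = {}
assert march_c_minus(8, mem.get, mem.__setitem__) is None
stuck_write = lambda a, v: mem.__setitem__(a, 1 if a == 3 else v)
faulty = march_c_minus(8, mem.get, stuck_write)
```

In the GPGPU setting, many such sequences run in parallel over disjoint address ranges, with each thread reporting detected faults back to the host.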
Inter-workgroup barrier synchronisation on graphics processing units
GPUs are parallel devices that are able to run thousands of
independent threads concurrently. Traditional GPU programs are
data-parallel, requiring little to no communication,
i.e. synchronisation, between threads. However, classical concurrency
in the context of CPUs often exploits synchronisation idioms that are
not supported on GPUs. Studying such idioms on GPUs, with the aim of
supporting them in a portable way, opens up a wider and more general
space of GPU applications.
While the breadth of this thesis extends to many aspects of GPU
systems, the common thread throughout is the global barrier: an
execution barrier that synchronises all threads executing a GPU
application. The idea of such a barrier might seem straightforward;
however, this investigation reveals many challenges and insights. In
particular, this thesis includes the following studies:
Execution models: while a general global barrier can deadlock due to
starvation on GPUs, it is shown that the scheduling guarantees of
current GPUs can be used to dynamically create an execution
environment that allows for a safe and portable global barrier
across a subset of the GPU threads.
Application optimisations: a set of GPU optimisations is examined,
tailored for graph applications, including one optimisation
enabled by the global barrier. It is shown that these optimisations
can provide substantial performance improvements, e.g. the barrier
optimisation achieves over a 10X speedup on AMD and Intel GPUs. The
performance portability of these optimisations is investigated, as
their utility varies across input, application, and architecture.
Multitasking: because many GPUs do not support preemption,
long-running GPU compute tasks (e.g. applications that use the
global barrier) may block other GPU functions, including graphics. A
simple cooperative multitasking scheme is proposed that allows
graphics tasks to meet their deadlines with reasonable overheads.
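The global barrier at the heart of the thesis can be illustrated with a centralised sense-reversing barrier, sketched here in Python with a lock standing in for GPU atomic operations (the names are illustrative; on a GPU each participant would be a workgroup representative, and safety depends on all participants being co-resident, i.e. the occupancy-bound execution environment discussed above):

```python
import threading

class SenseReversingBarrier:
    """Centralised sense-reversing barrier: the scheme underlying many
    inter-workgroup GPU barriers. A lock stands in for global atomics."""
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.sense = False
        self.lock = threading.Lock()

    def wait(self, local_sense):
        local_sense = not local_sense      # flip sense for this episode
        with self.lock:
            self.count += 1
            last = (self.count == self.n)
            if last:                       # last arrival resets and releases
                self.count = 0
                self.sense = local_sense
        if not last:
            while self.sense != local_sense:   # spin until released
                pass
        return local_sense

# Demo: four threads run three barrier-separated phases.
results = []
barrier = SenseReversingBarrier(4)

def worker(i):
    sense = False
    for phase in range(3):
        results.append((phase, i))
        sense = barrier.wait(sense)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```

The sense flip is what makes consecutive barrier episodes safe to reuse without a second counter; the deadlock risk the thesis analyses arises when, unlike CPU threads, some GPU workgroups are never scheduled and the counter never reaches `n`.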
DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives
We present a new parallel algorithm for probabilistic graphical model
optimization. The algorithm relies on data-parallel primitives (DPPs), which
provide portable performance over hardware architecture. We evaluate results on
CPUs and GPUs for an image segmentation problem. Compared to a serial baseline,
we observe runtime speedups of up to 13X (CPU) and 44X (GPU). We also compare
our performance to a reference, OpenMP-based algorithm, and find speedups of up
to 7X (CPU). (LDAV 2018)
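For readers unfamiliar with DPPs, the style can be illustrated with two primitives, map and reduce-by-key, composed here to compute per-segment sums and means, a typical building block in DPP formulations of segmentation-style computations (a serial Python sketch with illustrative names; this is not the DPP-PMRF algorithm itself):

```python
def dpp_map(f, xs):
    """Map primitive: apply f independently to every element."""
    return [f(x) for x in xs]

def dpp_reduce_by_key(keys, values):
    """Reduce-by-key primitive: sum all values that share a key."""
    out = {}
    for k, v in zip(keys, values):
        out[k] = out.get(k, 0) + v
    return out

# Per-segment mean intensity, expressed purely with DPPs:
labels = [0, 0, 1, 1, 1, 2]
pixels = [3, 5, 2, 4, 6, 9]
sums = dpp_reduce_by_key(labels, pixels)
counts = dpp_reduce_by_key(labels, dpp_map(lambda _: 1, pixels))
means = {k: sums[k] / counts[k] for k in sums}
```

Because each primitive has efficient implementations on both CPUs and GPUs, an algorithm expressed only in terms of such primitives gets the portable performance the abstract describes.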