4,723 research outputs found
Contract-Based General-Purpose GPU Programming
Using GPUs as general-purpose processors has revolutionized parallel
computing by offering, for a large and growing set of algorithms, massive
data-parallelization on desktop machines. An obstacle to widespread adoption,
however, is the difficulty of programming them and the low-level control of the
hardware required to achieve good performance. This paper suggests a
programming library, SafeGPU, that aims at striking a balance between
programmer productivity and performance, by making GPU data-parallel operations
accessible from within a classical object-oriented programming language. The
solution is integrated with the design-by-contract approach, which increases
confidence in functional program correctness by embedding executable program
specifications into the program text. We show that our library leads to modular
and maintainable code that is accessible to GPGPU non-experts, while providing
performance that is comparable with hand-written CUDA code. Furthermore,
runtime contract checking turns out to be feasible, as the contracts can be
executed on the GPU
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some applications paradigms. Software languages and systems such as NVIDIA's
CUDA and Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica-
tions using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications
level programmer and o er some suggested areas for language development and integration
between coarse-grained and ne-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas including: partial di erential equations; graph cluster
metric calculations and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scienti c applications developers
PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation
High-performance computing has recently seen a surge of interest in
heterogeneous systems, with an emphasis on modern Graphics Processing Units
(GPUs). These devices offer tremendous potential for performance and efficiency
in important large-scale applications of computational science. However,
exploiting this potential can be challenging, as one must adapt to the
specialized and rapidly evolving computing environment currently exhibited by
GPUs. One way of addressing this challenge is to embrace better techniques and
develop tools tailored to their needs. This article presents one simple
technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL,
two open-source toolkits that support this technique.
In introducing PyCUDA and PyOpenCL, this article proposes the combination of
a dynamic, high-level scripting language with the massive performance of a GPU
as a compelling two-tiered computing platform, potentially offering significant
performance and productivity advantages over conventional single-tier, static
systems. The concept of RTCG is simple and easily implemented using existing,
robust infrastructure. Nonetheless it is powerful enough to support (and
encourage) the creation of custom application-specific tools by its users. The
premise of the paper is illustrated by a wide range of examples where the
technique has been applied with considerable success.Comment: Submitted to Parallel Computing, Elsevie
Empowering parallel computing with field programmable gate arrays
After more than 30 years, reconļ¬gurable computing has grown from a concept to a mature ļ¬eld of science and technology. The cornerstone of this evolution is the ļ¬eld programmable gate array, a building block enabling the conļ¬guration of a custom hardware architecture. The departure from static von Neumannlike architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural reļ¬nements
Vector-Processing for Mobile Devices: Benchmark and Analysis
Vector processing has become commonplace in today's CPU microarchitectures.
Vector instructions improve performance and energy which is crucial for
resource-constraint mobile devices. The research community currently lacks a
comprehensive benchmark suite to study the benefits of vector processing for
mobile devices. This paper presents Swan-an extensive vector processing
benchmark suite for mobile applications. Swan consists of a diverse set of
data-parallel workloads from four commonly used mobile applications: operating
system, web browser, audio/video messaging application, and PDF rendering
engine. Using Swan benchmark suite, we conduct a detailed analysis of the
performance, power, and energy consumption of vectorized workloads, and show
that: (a) Vectorized kernels increase the pressure on cache hierarchy due to
the higher rate of memory requests. (b) Vector processing is more beneficial
for workloads with lower precision operations and higher cache hit rates. (c)
Limited Instruction-Level Parallelism and strided memory accesses to
multi-dimensional data structures prevent vector processing benefits from
scaling with more SIMD functional units and wider registers. (d) Despite lower
computation throughput than domain-specific accelerators, such as GPU, vector
processing outperforms these accelerators for kernels with lower operation
counts. Finally, we show five common computation patterns in mobile
data-parallel workloads that dominate the execution time.Comment: 2023 IEEE International Symposium on Workload Characterization
(IISWC
State-of-the-Art in Parallel Computing with R
R is a mature open-source programming language for statistical computing and graphics. Many areas of statistical research are experiencing rapid growth in the size of data sets. Methodological advances drive increased use of simulations. A common approach is to use parallel computing. This paper presents an overview of techniques for parallel computing with R on computer clusters, on multi-core systems, and in grid computing. It reviews sixteen different packages, comparing them on their state of development, the parallel technology used, as well as on usability, acceptance, and performance. Two packages (snow, Rmpi) stand out as particularly useful for general use on computer clusters. Packages for grid computing are still in development, with only one package currently available to the end user. For multi-core systems four different packages exist, but a number of issues pose challenges to early adopters. The paper concludes with ideas for further developments in high performance computing with R. Example code is available in the appendix
- ā¦