Programming GPUs with CUDA
This document contains the material of a tutorial given at the conference; it is not a scientific article in the traditional format. We analyze the performance and characteristics of the successive generations of graphics processors developed by Nvidia for general-purpose application programming under CUDA, a new paradigm that contributes on the hardware and software fronts simultaneously. Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech
QCD simulations with staggered fermions on GPUs
We report on our implementation of the RHMC algorithm for the simulation of
lattice QCD with two staggered flavors on Graphics Processing Units, using the
NVIDIA CUDA programming language. The main feature of our code is that the GPU
is not used just as an accelerator, but instead the whole Molecular Dynamics
trajectory is performed on it. After pointing out the main bottlenecks and how
to circumvent them, we discuss the performance obtained. We present some
preliminary results regarding OpenCL and multi-GPU extensions of our code and
discuss future perspectives.
Comment: 22 pages, 14 eps figures, final version to be published in Computer
Physics Communications
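For readers unfamiliar with the Molecular Dynamics trajectory mentioned above, a minimal leapfrog integrator for a single scalar degree of freedom conveys the core idea. This is an illustrative sketch only: lattice QCD evolves gauge links under pseudofermion forces rather than a scalar coordinate, and the function names here are hypothetical.

```python
def leapfrog(q, p, force, dt, n_steps):
    """Leapfrog (kick-drift-kick) integrator: reversible and symplectic,
    the properties an HMC/RHMC trajectory relies on."""
    p += 0.5 * dt * force(q)          # initial half-step momentum kick
    for _ in range(n_steps - 1):
        q += dt * p                   # full-step position drift
        p += dt * force(q)            # full-step momentum kick
    q += dt * p                       # final drift
    p += 0.5 * dt * force(q)          # final half-step kick
    return q, p

# Harmonic oscillator, force = -q: the energy q^2 + p^2 should be
# conserved up to O(dt^2) over the whole trajectory.
q1, p1 = leapfrog(1.0, 0.0, lambda x: -x, 0.01, 1000)
print(q1 * q1 + p1 * p1)  # close to the initial energy 1.0
```

On a GPU implementation such as the one described above, the force evaluation inside this loop is the expensive, highly parallel part; keeping the whole loop on the device avoids host-device transfers every step.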
pocl: A Performance-Portable OpenCL Implementation
OpenCL is a standard for parallel programming of heterogeneous systems. The
benefits of a common programming standard are clear; multiple vendors can
provide support for application descriptions written according to the standard,
thus reducing the program porting effort. While the standard brings the obvious
benefits of platform portability, the performance portability aspects are
largely left to the programmer. The situation is made worse due to multiple
proprietary vendor implementations with different characteristics, and, thus,
required optimization strategies.
In this paper, we propose an OpenCL implementation that is both portable and
performance portable. At its core is a kernel compiler that can be used to
exploit the data parallelism of OpenCL programs on multiple platforms with
different parallel hardware styles. The kernel compiler is modularized to
perform target-independent parallel region formation separately from the
target-specific parallel mapping of the regions to enable support for various
styles of fine-grained parallel resources such as subword SIMD extensions, SIMD
datapaths and static multi-issue. Unlike previous similar techniques that work
on the source level, the parallel region formation retains the information of
the data parallelism using the LLVM IR and its metadata infrastructure. This
data can be exploited by the later generic compiler passes for efficient
parallelization.
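The work-item loop idea behind parallel region formation can be illustrated with a toy interpreter: an OpenCL-style kernel body runs once per work item, and the implementation wraps it in an explicit serial loop over the index space, which later compiler passes can then vectorize or map onto SIMD lanes. This is a Python sketch of the concept, not pocl's actual compiler machinery; all names are illustrative.

```python
def vec_add_kernel(gid, a, b, out):
    # Body of an OpenCL-style kernel: one work item adds one element.
    out[gid] = a[gid] + b[gid]

def run_ndrange(kernel, global_size, *args):
    # "Work-item loop": execute every work item of the NDRange serially.
    # A kernel compiler forms this loop at the IR level, preserving the
    # data-parallel structure so generic passes can exploit it.
    for gid in range(global_size):
        kernel(gid, *args)

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [0] * 4
run_ndrange(vec_add_kernel, 4, a, b, out)
print(out)  # [11, 22, 33, 44]
```

The point of doing this at the IR level rather than on source text, as the paper argues, is that the loop's data-parallel origin stays visible to the rest of the compiler through metadata instead of being lost in a source-to-source rewrite.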
The proposed open source implementation of OpenCL is also platform portable,
enabling OpenCL on a wide range of architectures, both already commercialized
and on those that are still under research. The paper describes how the
portability of the implementation is achieved. Our results show that most of
the benchmarked applications when compiled using pocl were faster or close to
as fast as the best proprietary OpenCL implementation for the platform at hand.
Comment: This article was published in 2015; it is now openly accessible via
arXiv
I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference
Vision Transformers (ViTs) have achieved state-of-the-art performance on
various computer vision applications. These models, however, have considerable
storage and computational overheads, making their deployment and efficient
inference on edge devices challenging. Quantization is a promising approach to
reducing model complexity; unfortunately, existing efforts to quantize ViTs use
simulated quantization (aka fake quantization), which retains floating-point
arithmetic during inference and thus contributes little to model acceleration.
In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs,
to enable ViTs to perform the entire computational graph of inference with
integer operations and bit-shifting and no floating-point operations. In I-ViT,
linear operations (e.g., MatMul and Dense) follow the integer-only pipeline
with dyadic arithmetic, and non-linear operations (e.g., Softmax, GELU, and
LayerNorm) are approximated by the proposed light-weight integer-only
arithmetic methods. In particular, I-ViT applies the proposed Shiftmax and
ShiftGELU, which are designed to use integer bit-shifting to approximate the
corresponding floating-point operations. We evaluate I-ViT on various benchmark
models and the results show that integer-only INT8 quantization achieves
comparable (or even higher) accuracy to the full-precision (FP) baseline.
Furthermore, we utilize TVM for practical hardware deployment on the GPU's
integer arithmetic units, achieving a 3.72× to 4.11× inference speedup
compared to the FP model.
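The dyadic-arithmetic idea behind the integer-only pipeline can be sketched as follows: a floating-point rescaling factor is frozen into an integer multiplier and a bit-shift, so requantization needs no floating-point operations at inference time. This is a simplified illustration, not the paper's exact Shiftmax/ShiftGELU kernels; `dyadic_approx` and `requantize` are hypothetical helper names.

```python
def dyadic_approx(scale: float, n: int = 16):
    """Approximate a real-valued scale factor as m / 2**n with integer m.

    Illustrative helper: the floating-point scale is converted offline
    into an integer multiplier m and a shift amount n.
    """
    m = round(scale * (1 << n))
    return m, n

def requantize(acc: int, scale: float) -> int:
    """Integer-only rescaling: acc * scale is computed as (acc * m) >> n."""
    m, n = dyadic_approx(scale)
    return (acc * m) >> n

# Rescaling an INT32 accumulator by 0.0123 with only an integer
# multiply and a right shift; the result is close to 100000 * 0.0123.
print(requantize(100000, 0.0123))
```

In an integer-only linear layer, the accumulator of an INT8 MatMul is rescaled this way to the next layer's integer domain, which is exactly the kind of operation a GPU's integer arithmetic units execute natively.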
Gaucho Banking Redux
Argentina's economic crisis has strong similarities with previous crises stretching back to the nineteenth century. A common thread runs through all these crises: the interaction of a weak, undisciplined, or corruptible banking sector and some other group of conspirators from the public or private sector that hastens its collapse. This pampean propensity for crony finance was dubbed 'gaucho banking' more than one hundred years ago. What happens when such a rotten structure interacts with a convertibility plan? We compare the 1929 and 2001 crises, the two instances where rigid convertibility plans failed, and reach two main conclusions. First, a seemingly robust currency board can be devastated by an ill-conceived approach to the problems of internal and external convertibility (or, to rephrase Gresham, 'bad inside money drives out good outside money'). Second, when modern economic orthodoxy collides with caudillo-style institutional backwardness, a desperate regime with its hands tied in both monetary and fiscal domains will be sorely tempted by a 'capital levy' on the financial sector (for, as Willie Sutton said when asked why he robbed banks, 'because that's where the money is').
Architecture-Aware Optimization on a 1600-core Graphics Processor
The graphics processing unit (GPU) continues to
make significant strides as an accelerator in commodity cluster
computing for high-performance computing (HPC). For example,
three of the top five fastest supercomputers in the world, as
ranked by the TOP500, employ GPUs as accelerators. Despite this
increasing interest in GPUs, however, optimizing the performance
of a GPU-accelerated compute node requires deep technical
knowledge of the underlying architecture. Although significant
literature exists on how to optimize GPU performance on the
more mature NVIDIA CUDA architecture, the converse is true
for OpenCL on the AMD GPU.
Consequently, we present and evaluate architecture-aware optimizations
for the AMD GPU. The most prominent optimizations
include (i) explicit use of registers, (ii) use of vector types, (iii)
removal of branches, and (iv) use of image memory for global data.
We demonstrate the efficacy of our AMD GPU optimizations by
applying each optimization in isolation as well as in concert to
a large-scale, molecular modeling application called GEM. Via
these AMD-specific GPU optimizations, the AMD Radeon HD
5870 GPU delivers 65% better performance than with the well-known
NVIDIA-specific optimizations.
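Of the four optimizations listed, branch removal is the easiest to illustrate in isolation: a divergent if/else is replaced by an arithmetic select so that all work items in a wavefront execute the same instructions. The Python below is a sketch of the idea only; on the GPU this avoids wavefront divergence, and the function names are illustrative, not from the paper.

```python
def select_branchy(cond: bool, a: int, b: int) -> int:
    # Divergent form: work items taking different sides of the branch
    # would serialize within a wavefront.
    if cond:
        return a
    return b

def select_branchless(cond: int, a: int, b: int) -> int:
    # Predicated form: cond is 0 or 1, both operands are computed,
    # and the result is blended arithmetically with no branch.
    return cond * a + (1 - cond) * b

# Both forms agree; only the control flow differs.
print(select_branchless(1, 5, 9), select_branchy(True, 5, 9))
```

The same blend can be written with bit masks for integer data, which is how branch removal is typically expressed in OpenCL kernels alongside vector types.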