A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels
This paper proposes a versatile high-performance execution model, inspired by
systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs.
We formulate a systolic model that shifts partial sums across threads using CUDA
warp primitives. We also employ the register file as a cache so that the entire
model operates efficiently. We demonstrate the effectiveness and
versatility of the proposed model for a wide variety of stencil kernels that
appear commonly in HPC, and also convolution kernels (increasingly important in
deep learning workloads). Our algorithm outperforms the top reported
state-of-the-art stencil implementations, including implementations with
sophisticated temporal and spatial blocking techniques, on the two latest
Nvidia GPUs: Tesla V100 and P100. For 2D convolution with general filter
sizes and shapes, our algorithm is on average 2.5x faster than Nvidia's NPP on
V100 and P100 GPUs.
Comment: ACM/IEEE Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis (SC'19).
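To make the shifting concrete, here is a minimal sketch, assuming a 1D 3-point
sum stencil with hypothetical names, of how partial sums can travel across a
warp via shuffle primitives while inputs stay cached in registers; it
illustrates the general technique, not the paper's actual kernel.

```cuda
// Each lane caches one input value in a register, and the *partial sum*
// itself travels up the warp via shuffles, gathering one contribution per
// step. After two shifts, lane i holds in[i-2]+in[i-1]+in[i], i.e. the
// stencil centered at i-1. Warp-edge halos are handled crudely here by
// recomputing from memory. Assumes blockDim.x is a multiple of 32 so the
// full-warp shuffle mask is valid.
__global__ void stencil3_systolic(const float* in, float* out, int n) {
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;

    // All 32 lanes must participate in the shuffles, so no early returns;
    // out-of-range lanes simply contribute zeros.
    float v = (i < n) ? in[i] : 0.0f;   // input value cached in a register
    float p = v;                        // partial sum starts with own input

    // Systolic step 1: partial sums shift one lane up, then absorb v.
    p = __shfl_up_sync(0xffffffffu, p, 1);
    p = (lane > 0) ? p + v : v;

    // Systolic step 2: shift again and absorb v; p now spans three inputs.
    p = __shfl_up_sync(0xffffffffu, p, 1);
    p = (lane > 1) ? p + v : v;

    int j = i - 1;                      // output index trails the flow by one
    if (j >= 1 && j + 1 < n) {
        if (lane > 1)
            out[j] = p;                 // interior: fully shifted partial sum
        else
            out[j] = in[j - 1] + in[j] + in[j + 1];  // warp edge: recompute
    }
}
```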
Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUs
Domain-specific languages that execute image processing pipelines on GPUs,
such as Halide and Forma, operate by 1) dividing the image into overlapped
tiles, and 2) fusing loops to improve memory locality. However, current
approaches have limitations: 1) they require intra-thread-block
synchronization, which has a non-trivial cost, 2) they must choose between
small tiles that require more overlapped computation or large tiles that
increase shared memory usage (and lower occupancy), and 3) their
autoscheduling algorithms use simplified GPU models that can result in
inefficient global memory accesses. We present a new approach for executing
image processing pipelines on GPUs that addresses these limitations as follows.
1) We fuse loops to form overlapped tiles that fit in a single warp, which
allows us to use lightweight warp synchronization. 2) We introduce hybrid
tiling, which stores overlapped regions in a combination of thread-local
registers and shared memory. Hybrid tiling thus either increases occupancy by
decreasing shared memory usage or reduces overlapped computation by enabling
larger tiles. 3) We present an automatic loop fusion algorithm that considers
several factors that affect the performance of GPU kernels. We implement these
techniques in PolyMage-GPU, which is a new GPU backend for PolyMage. Our
approach produces code that is faster than Halide's manual schedules: 1.65x
faster on an NVIDIA GTX 1080Ti and 1.33x faster on an NVIDIA Tesla V100.
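To illustrate points 1) and 2), the sketch below processes a warp-sized tile
with __syncwarp() and splits a two-row overlapped region between a register and
shared memory. Tile and halo sizes, names, and the toy producer/consumer stages
are assumptions for illustration, not PolyMage-GPU's actual generated code.

```cuda
// A tile is sized to one warp, so the cheap __syncwarp() replaces the
// block-wide __syncthreads(). The overlapped (halo) region is split between
// registers and shared memory: "hybrid tiling". Assumes one warp per block
// (blockDim.x == 32), `in` row-major with row stride `width`, and a grid
// covering exactly width/32 tiles.
#define TILE_W       32   // one column per lane: a tile maps to one warp
#define HALO         2    // rows of overlap required by the fused stages
#define HALO_IN_REGS 1    // rows of overlap kept in registers, not smem

__global__ void hybrid_tile_stage(const float* in, float* out, int width) {
    __shared__ float smem_halo[HALO - HALO_IN_REGS][TILE_W];

    int lane = threadIdx.x & 31;
    int col  = blockIdx.x * TILE_W + lane;

    // Part of the overlapped region lives in thread-local registers...
    float reg_halo = in[col];                 // halo row 0, register-resident
    // ...and the remainder is staged through shared memory.
    smem_halo[0][lane] = in[width + col];     // halo row 1, smem-resident

    // Lightweight warp-level synchronization instead of __syncthreads():
    // only the 32 lanes of this warp must agree before smem_halo is read
    // across lanes below.
    __syncwarp();

    // A fused consumer stage reads its own register and a neighboring
    // lane's shared-memory slot, which is why the warp synchronizes above.
    out[col] = 0.5f * (reg_halo + smem_halo[0][(lane + 1) & 31]);
}
```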
Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUs
Although binarized neural networks (BNNs) promise tremendous speedups over
conventional deep neural networks, their performance advantage has rarely been
demonstrated on general-purpose processors such as CPUs and GPUs. In fact,
because a word-based architecture cannot exploit bit-level parallelism, GPUs
have been criticized for extremely low utilization (1%) when executing BNNs.
In response, the latest tensor cores in NVIDIA Turing GPUs have begun to
experimentally support bit computation. In this work, we look into
this brand new bit computation capability and characterize its unique features.
We show that the memory-access stride can significantly affect performance,
and that a data-format co-design is needed for the tensor cores to outperform
existing software solutions that do not use them. We realize a
tensor-core-accelerated BNN design, in particular the major functions for
fully-connected and convolution layers --
bit matrix multiplication and bit convolution. Evaluations on two NVIDIA Turing
GPUs show that, with ResNet-18, our BTC-BNN design can process ImageNet at a
rate of 5.6K images per second, 77% faster than the state of the art. Our BNN
approach is released at https://github.com/pnnl/TCBNN.
Comment: This work has been accepted by the IEEE Transactions on Parallel
and Distributed Systems (TPDS) Special Section on Parallel and Distributed
Computing Techniques for AI/ML/DL.
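The bit-computation capability the paper characterizes is exposed in CUDA
through the experimental WMMA API in <mma.h> (CUDA 10+, compiled for sm_75):
an 8x8x128 bit-matrix multiply that XORs packed bit rows against bit columns
and accumulates popcounts. The sketch below shows that generic API rather than
the paper's tuned BTC-BNN kernels; buffer names are illustrative.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes an 8x8 int tile from 128-bit dot products: A is 8x128
// bits (row-major), B is 128x8 bits (col-major), both packed 32 bits per
// unsigned word. bmma_sync XORs the operands and accumulates popcounts.
// Generic WMMA usage for Turing (sm_75+), not the paper's optimized kernels.
__global__ void bmma_8x8x128(const unsigned* A, const unsigned* B, int* C) {
    using namespace wmma::experimental;   // precision::b1 lives here

    wmma::fragment<wmma::matrix_a, 8, 8, 128, precision::b1, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 8, 8, 128, precision::b1, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 8, 8, 128, int> acc;

    wmma::fill_fragment(acc, 0);
    wmma::load_matrix_sync(a, A, 128);    // leading dimension given in bits
    wmma::load_matrix_sync(b, B, 128);
    wmma::bmma_sync(acc, a, b, acc);      // XOR + popcount accumulate
    wmma::store_matrix_sync(C, acc, 8, wmma::mem_row_major);
    // With a +1/-1 bit encoding, each dot product is recovered from the
    // popcount as 128 - 2*C[i][j].
}
```

Launching one such warp per 8x8 output tile is how a full bit-GEMM would tile
the problem; the data-format co-design the abstract mentions concerns how
those packed bit rows and columns are laid out in memory.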