Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration
Research interest in specialized hardware accelerators for deep neural
networks (DNNs) has spiked recently owing to their superior performance and
efficiency. However, today's DNN accelerators primarily focus on accelerating
specific "kernels" such as convolution and matrix multiplication, which are
vital but only part of an end-to-end DNN-enabled application. Meaningful
speedups over the entire application often require supporting computations
that, while massively parallel, are ill-suited to DNN accelerators. Integrating a
general-purpose processor such as a CPU or a GPU incurs significant data
movement overhead and leads to resource under-utilization on the DNN
accelerators.
We propose Simultaneous Multi-mode Architecture (SMA), a novel architecture
design and execution model that offers general-purpose programmability on DNN
accelerators in order to accelerate end-to-end applications. The key to SMA is
the temporal integration of the systolic execution model with the GPU-like SIMD
execution model. SMA exploits the common components shared between the
systolic-array accelerator and the GPU, and provides a lightweight
reconfiguration capability to switch between the two modes in situ. SMA
achieves up to a 63% performance improvement while consuming 23% less energy
than the baseline Volta architecture with TensorCore.
Comment: 6 pages, 9 figures, DAC 2020
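To illustrate the temporal execution model described in this abstract, here is a minimal Python sketch of a scheduler that time-multiplexes one piece of hardware between a "systolic" mode for GEMM-like layers and a "SIMD" mode for the remaining parallel work. The op names, cycle counts, and switch penalty are hypothetical placeholders for illustration; they are not SMA's actual design or numbers.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    kind: str     # "gemm" for dense matrix work; anything else runs as SIMD work
    cycles: int   # hypothetical execution cost


SWITCH_CYCLES = 10  # assumed cost of the lightweight in-situ reconfiguration


def run_pipeline(ops):
    """Toy temporal integration: one device reconfigures between two modes."""
    mode, total = None, 0
    for op in ops:
        wanted = "systolic" if op.kind == "gemm" else "simd"
        if wanted != mode:
            total += SWITCH_CYCLES  # switch modes in place, rather than moving data
            mode = wanted
        total += op.cycles
        print(f"{op.name:<12} -> {mode} mode")
    return total


if __name__ == "__main__":
    pipeline = [
        Op("conv1", "gemm", 500),
        Op("roi_align", "gather", 120),  # massively parallel but non-GEMM
        Op("fc", "gemm", 300),
    ]
    print("total cycles:", run_pipeline(pipeline))
```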
Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity
Network pruning can reduce the high computation cost of deep neural network
(DNN) models. However, to maintain accuracy, sparse models often carry
randomly distributed weights, leading to irregular computations. Consequently,
sparse models cannot achieve meaningful speedups on commodity hardware (e.g.,
GPUs) built for dense matrix computations. As such, prior works usually modify
existing architectures or design entirely new sparsity-optimized ones to
exploit sparsity. We propose an algorithm-software co-designed pruning method that
achieves latency speedups on existing dense architectures. Our work builds upon
the insight that matrix multiplication is generally performed by breaking the
large matrix into multiple smaller tiles for parallel execution. We propose a
tiling-friendly "tile-wise" sparsity pattern, which maintains a regular pattern
at the tile level for efficient execution but allows for irregular, arbitrary
pruning at the global scale to maintain high accuracy. We implement and
evaluate the sparsity pattern on GPU tensor cores, achieving a 1.95x speedup
over the dense model.
Comment: 12 pages, ACM/IEEE Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis (SC20)
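To make the tiling idea concrete, below is a minimal NumPy sketch of one plausible tile-wise pruning pass: within each column tile of a weight matrix, whole columns are pruned by magnitude, so every tile keeps a regular dense layout while the set of pruned columns differs from tile to tile. The tile size, keep ratio, and column-granularity pruning are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np


def tile_wise_prune(W, tile_cols=8, keep_ratio=0.5):
    """Illustrative tile-wise pruning (assumed scheme, not the paper's).

    Splits W into column tiles; inside each tile, the lowest-magnitude
    columns are zeroed out, leaving a regular pattern per tile but an
    irregular pattern across the whole matrix.
    """
    W = W.copy()
    _, n_cols = W.shape
    for start in range(0, n_cols, tile_cols):
        tile = W[:, start:start + tile_cols]          # view into W
        keep = max(1, int(round(keep_ratio * tile.shape[1])))
        norms = np.linalg.norm(tile, axis=0)          # per-column L2 magnitude
        pruned = np.argsort(norms)[:tile.shape[1] - keep]
        tile[:, pruned] = 0.0                         # zero the weakest columns
    return W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((16, 32))
    Wp = tile_wise_prune(W, tile_cols=8, keep_ratio=0.5)
    # Each 8-column tile keeps exactly 4 dense columns; which columns survive
    # varies per tile, so global sparsity remains irregular.
    print((np.abs(Wp) > 0).sum(axis=0))
```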