Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering
The resurgence of machine learning has increased the demand for
high-performance basic linear algebra subroutines (BLAS), which have long
depended on libraries to achieve peak performance on commodity hardware.
High-performance BLAS implementations rely on a layered approach that consists
of tiling and packing layers, for data (re)organization, and micro kernels that
perform the actual computations. The creation of high-performance micro kernels
requires significant development effort to write tailored assembly code for
each architecture. This hand-optimization task is further complicated by the recent
introduction of matrix engines, such as IBM's POWER10 MMA, Intel AMX, and Arm ME, which
deliver high-performance matrix operations. This paper presents a compiler-only
alternative to the use of high-performance libraries by incorporating, to the
best of our knowledge and for the first time, the automatic generation of the
layered approach into LLVM, a production compiler. The algorithm's modular design,
such as the use of LLVM's matrix-multiply intrinsic as a clear interface between the
tiling and packing layers and the micro kernel, makes it easy to retarget the code
generation to multiple accelerators. The use of
intrinsics enables a comprehensive performance study. On processors without
hardware matrix engines, the tiling and packing layers deliver performance up to 22x
faster than PLuTo, a widely used polyhedral optimizer, for small matrices (Intel) and
more than 6x faster for large matrices (POWER9). The performance also approaches that
of high-performance libraries: for large matrices it is only 34% slower than OpenBLAS
and on par with Eigen. With MMA on POWER10 this solution is, for large matrices, over
2.6x faster than the vector-extension solution, matches Eigen's performance, and
achieves up to 96% of BLAS peak performance.
Accelerating Reduction and Scan Using Tensor Core Units
Driven by deep learning, there has been a surge of specialized processors for
matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are
capable of performing matrix multiplications on small matrices (usually 4x4 or
16x16) to accelerate convolutional and recurrent neural networks in deep
learning workloads. In this paper we leverage NVIDIA's TCU to express both
reduction and scan with matrix multiplication and show the benefits -- in terms
of program simplicity, efficiency, and performance. Our algorithm exercises the
NVIDIA TCUs which would otherwise be idle, achieves 89%-98% of peak memory copy
bandwidth, and is orders of magnitude faster (up to 100x for reduction and 3x
for scan) than state-of-the-art methods for small segment sizes -- common in
machine learning and scientific applications. Our algorithm achieves this while
decreasing the power consumption by up to 22% for reduction and 16% for scan.
Comment: In Proceedings of the ACM International Conference on Supercomputing (ICS '19)
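The core trick, expressing a reduction as a matrix multiplication that the otherwise idle TCUs can execute, can be sketched with CUDA's WMMA API. The kernel below is a minimal illustration under assumed names and a fixed 256-element segment size (one 16x16 tile per warp), not the paper's optimized algorithm: multiplying the data tile A by an all-ones matrix J places A's row sums in every column of the product, so adding the sixteen row sums yields the segment total.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp reduces one 256-element segment, viewed as a 16x16 half-precision
// matrix A, by computing A * J with a single tensor-core mma operation.
// Launch with 128 threads per block (4 warps), e.g.:
//   segment_reduce_wmma<<<grid, 128>>>(in, out, num_segments);
__global__ void segment_reduce_wmma(const half *in, float *out, int num_segments) {
    int seg = (blockIdx.x * blockDim.x + threadIdx.x) / 32;   // one warp per segment
    int warp_in_block = threadIdx.x / 32;
    if (seg >= num_segments) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> ones_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(ones_frag, __float2half(1.0f));       // J: all-ones matrix
    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, in + seg * 256, 16);       // 16x16 tile, ld = 16
    wmma::mma_sync(acc_frag, a_frag, ones_frag, acc_frag);    // acc = A * J

    __shared__ float row_sums[4][16 * 16];                    // one slot per warp
    wmma::store_matrix_sync(row_sums[warp_in_block], acc_frag, 16, wmma::mem_row_major);
    __syncwarp();

    if (threadIdx.x % 32 == 0) {                              // lane 0 finishes the sum
        float total = 0.0f;
        for (int i = 0; i < 16; ++i)
            total += row_sums[warp_in_block][i * 16];         // column 0 = row sums of A
        out[seg] = total;
    }
}

The same idea extends to scan: replacing the all-ones operand with triangular matrices of ones produces partial (prefix) sums instead of a single total.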
Compiler-centric across-stack deep learning acceleration
Optimizing the deployment of Deep Neural Networks (DNNs) is hard. Despite deep learning approaches increasingly providing state-of-the-art solutions to a variety of difficult problems, such as computer vision and natural language processing, DNNs can be prohibitively expensive, for example, in terms of inference time or memory usage. Effective exploration of the design space requires a holistic approach, spanning topics from machine learning, systems, and hardware. The rapid proliferation of deep learning applications has raised demand for efficient exploration and acceleration of deep-learning-based solutions. However, managing the range of optimization techniques, as well as how they interact with each other across the stack, is a non-trivial task. A family of emerging specialized compilers for deep learning, tensor compilers, appears to be a strong candidate to help manage the complexity of across-stack optimization choices and enable new approaches.
This thesis presents new techniques and explorations of the Deep Learning Acceleration Stack (DLAS), with the perspective that the tensor compiler will increasingly be the center of this stack. First, we motivate the challenges in exploring DLAS by describing the experience of running a perturbation study that varies parameters at every layer of the stack. The core of the study is implemented using a tensor compiler, which reduces the complexity of evaluating the wide range of variants, although it still requires significant engineering effort to realize. Next, we develop a new algorithm for grouped convolution, a model optimization technique for which existing solutions provided poor inference-time scaling. We implement and optimize our algorithm using a tensor compiler, outperforming existing approaches by 5.1× on average (arithmetic mean). Finally, we propose transfer-tuning, a technique that reduces the search time required for automatic tensor compiler code optimization by 6.5× on average.
The techniques and contributions of this thesis across these interconnected domains demonstrate the exciting potential of tensor compilers to simplify and improve design space exploration for DNNs and their deployment. The outcomes of this thesis open new lines of research that enable machine learning developers to keep up with the rapidly evolving landscape of neural architectures and hardware.
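For readers less familiar with the model optimization mentioned above, grouped convolution splits the input channels into G groups and convolves each group only with its own slice of the filters, reducing work and parameter count by roughly a factor of G. The reference loop nest below merely pins down these semantics under assumed shapes and names; it is not the thesis's optimized algorithm, which restructures this computation to recover good inference-time scaling.

#include <vector>

// Grouped convolution, NCHW layout, unit stride, no padding.
// in: [C_in, H, W]   w: [C_out, C_in/G, K, K]   out: [C_out, H-K+1, W-K+1]
void grouped_conv2d(const std::vector<float> &in, const std::vector<float> &w,
                    std::vector<float> &out,
                    int C_in, int C_out, int H, int W, int K, int G) {
    int Ho = H - K + 1, Wo = W - K + 1;
    int Cig = C_in / G, Cog = C_out / G;                   // channels per group
    for (int g = 0; g < G; ++g)
        for (int oc = 0; oc < Cog; ++oc)                   // output channels of group g
            for (int y = 0; y < Ho; ++y)
                for (int x = 0; x < Wo; ++x) {
                    float acc = 0.0f;
                    for (int ic = 0; ic < Cig; ++ic)       // only this group's inputs
                        for (int ky = 0; ky < K; ++ky)
                            for (int kx = 0; kx < K; ++kx)
                                acc += in[((g * Cig + ic) * H + y + ky) * W + (x + kx)] *
                                       w[(((g * Cog + oc) * Cig + ic) * K + ky) * K + kx];
                    out[((g * Cog + oc) * Ho + y) * Wo + x] = acc;
                }
}

Setting G = 1 recovers standard convolution; at the other extreme, G = C_in (with C_out = C_in) yields a depthwise convolution.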
gSuite: A Flexible and Framework Independent Benchmark Suite for Graph Neural Network Inference on GPUs
As interest in Graph Neural Networks (GNNs) grows, so does the importance of
benchmarking and performance characterization studies of GNNs. Many studies have
investigated and presented the performance and computational efficiency of GNNs;
however, this work has so far been carried out using a few high-level GNN frameworks.
Although these frameworks provide ease of use, they carry many dependencies on other
libraries. These layers of implementation details and dependencies complicate the
performance analysis of GNN models built on top of these frameworks, especially when
using architectural simulators. Furthermore, prior characterization studies generally
overlook the different approaches to GNN computation and evaluate only one of the
common computational models. Based on these shortcomings and needs, we developed a
benchmark suite that is framework independent, supports multiple computational models,
is easily configurable, and can be used with architectural simulators without
additional effort.
Our benchmark suite, which we call gSuite, uses only the hardware vendor's libraries
and is therefore independent of any other framework. gSuite enables detailed
performance characterization studies of GNN inference using both contemporary GPU
profilers and architectural GPU simulators. To illustrate the benefits of our new
benchmark suite, we perform a detailed characterization study with a set of well-known
GNN models and various datasets, running gSuite both on a real GPU card and on a
timing-detailed GPU simulator. We also examine the effect of computational models on
performance. We use several evaluation metrics to rigorously measure the performance
of GNN computation.
Comment: IEEE International Symposium on Workload Characterization (IISWC) 202
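The computational core that such a framework-independent suite exercises directly is compact: a sparse aggregation of neighbour features over the graph's adjacency (an SpMM) followed by a dense feature transform (a GEMM). The reference loops below sketch only that computational model, with assumed names and shapes; in gSuite these operations would be carried out by the GPU vendor's sparse and dense libraries rather than plain loops.

#include <vector>

// One GNN layer on a CSR graph: row_ptr has n+1 entries, col_idx one entry per edge.
// X is the [n, f_in] node-feature matrix, W the [f_in, f_out] weights,
// Y the [n, f_out] output features.
void gnn_layer(const std::vector<int> &row_ptr, const std::vector<int> &col_idx,
               const std::vector<float> &X, const std::vector<float> &W,
               std::vector<float> &Y, int n, int f_in, int f_out) {
    // Aggregation (SpMM): agg[v] = sum of X[u] over neighbours u of v.
    std::vector<float> agg(static_cast<size_t>(n) * f_in, 0.0f);
    for (int v = 0; v < n; ++v)
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
            for (int k = 0; k < f_in; ++k)
                agg[v * f_in + k] += X[col_idx[e] * f_in + k];
    // Transform (GEMM) plus ReLU: Y = max(0, agg * W).
    for (int v = 0; v < n; ++v)
        for (int j = 0; j < f_out; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < f_in; ++k)
                acc += agg[v * f_in + k] * W[k * f_out + j];
            Y[v * f_out + j] = acc > 0.0f ? acc : 0.0f;
        }
}

Whether the aggregation runs as an explicit SpMM or as per-edge gather/scatter is exactly the kind of computational-model choice whose performance effect the benchmark is meant to expose.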