Domain-Specific Computing Architectures and Paradigms
We live in an exciting era where artificial intelligence (AI) is fundamentally shifting the dynamics of industries and businesses around the world. AI algorithms such as deep learning (DL) have drastically advanced state-of-the-art cognition and learning capabilities. However, the power of modern AI algorithms can only be realized if the underlying domain-specific computing hardware delivers orders of magnitude more performance and energy efficiency. This work focuses on this goal and explores three parts of the domain-specific computing acceleration problem, encompassing specialized hardware and software architectures and paradigms that support the ever-growing processing demand of modern AI applications from the edge to the cloud.
The first part of this work investigates the optimization of a sparse spatio-temporal (ST) cognitive system-on-a-chip (SoC). This design extracts ST features from videos and leverages sparse inference and kernel compression to efficiently perform action classification and motion tracking.
The second part of this work explores the significance of dataflows and reduction mechanisms for sparse deep neural network (DNN) acceleration. This design features a dynamic, look-ahead index matching unit in hardware to efficiently discover fine-grained parallelism, achieving high energy efficiency and low control complexity for a wide variety of DNN layers.
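As a rough software analogy for the index matching mentioned above, the sketch below computes a dot product of two sparse vectors stored as sorted index/value lists, multiplying only where the indices match. It is an illustrative assumption, not the thesis's hardware unit: the function name `sparse_dot` and the compressed layout are hypothetical, and the dynamic look-ahead behaviour is not modelled.

```python
def sparse_dot(a_idx, a_val, b_idx, b_val):
    """Dot product of two sparse vectors given as sorted index/value lists.

    Hypothetical helper for illustration only. A two-pointer walk finds the
    positions where both operands are nonzero, so no multiply is spent on zeros.
    Returns (dot_product, number_of_index_matches).
    """
    i = j = 0
    total = 0.0
    matches = 0
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            # Indices match: both operands are nonzero at this position.
            total += a_val[i] * b_val[j]
            matches += 1
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1  # No partner for this nonzero of a; skip it.
        else:
            j += 1  # No partner for this nonzero of b; skip it.
    return total, matches


if __name__ == "__main__":
    print(sparse_dot([0, 3, 5, 9], [1.0, 2.0, 3.0, 4.0],
                     [3, 4, 9], [10.0, 20.0, 30.0]))
```

Only matched pairs reach the multiplier, which is the property a hardware matching unit exploits; how far ahead such a unit searches is a design choice this sketch leaves out.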
Lastly, this work expands the scope to real-time machine learning (RTML) acceleration. A new high-level architecture modeling framework is proposed. Specifically, this framework consists of a set of high-performance RTML-specific architecture design templates, and a Python-based high-level modeling and compiler tool chain for efficient cross-stack architecture design and exploration.
PhD, Electrical and Computer Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/162870/1/lchingen_1.pd
A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures
In recent years, the field of Deep Learning has seen many disruptive and
impactful advancements. Given the increasing complexity of deep neural
networks, the need for efficient hardware accelerators has become increasingly
pressing in the design of heterogeneous HPC platforms. The design of Deep Learning
accelerators requires a multidisciplinary approach, combining expertise from
several areas, spanning from computer architecture to approximate computing,
computational models, and machine learning algorithms. Several methodologies
and tools have been proposed to design accelerators for Deep Learning,
including hardware-software co-design approaches, high-level synthesis methods,
specific customized compilers, and methodologies for design space exploration,
modeling, and simulation. These methodologies aim to maximize the exploitable
parallelism and minimize data movement to achieve high performance and energy
efficiency. This survey provides a holistic review of the most influential
design methodologies and EDA tools proposed in recent years to implement Deep
Learning accelerators, offering the reader a wide perspective in this rapidly
evolving field. In particular, this work complements the previous survey
by the same authors [203], which focuses on Deep Learning hardware
accelerators for heterogeneous HPC platforms.
Full Stack Optimization of Transformer Inference: a Survey
Recent advances in state-of-the-art DNN architecture design have been moving
toward Transformer models. These models achieve superior accuracy across a wide
range of applications. This trend has been consistent over the past several
years since Transformer models were originally introduced. However, the amount
of compute and bandwidth required for inference of recent Transformer models is
growing at a significant rate, and this has made their deployment in
latency-sensitive applications challenging. As such, there has been an
increased focus on making Transformer models more efficient, with methods that
range from changing the architecture design, all the way to developing
dedicated domain-specific accelerators. In this work, we survey different
approaches for efficient Transformer inference, including: (i) analysis and
profiling of the bottlenecks in existing Transformer architectures and their
similarities and differences with previous convolutional models; (ii)
implications of Transformer architecture on hardware, including the impact of
non-linear operations such as Layer Normalization, Softmax, and GELU, as well
as linear operations, on hardware design; (iii) approaches for optimizing a
fixed Transformer architecture; (iv) challenges in finding the right mapping
and scheduling of operations for Transformer models; and (v) approaches for
optimizing Transformer models by adapting the architecture using neural
architecture search. Finally, we perform a case study by applying the surveyed
optimizations on Gemmini, the open-source, full-stack DNN accelerator
generator, and we show how each of these approaches can yield improvements,
compared to previous benchmark results on Gemmini. Among other things, we find
that a full-stack co-design approach with the aforementioned methods can result
in up to 88.7x speedup with minimal performance degradation for Transformer
inference.
Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration
DNN accelerators are often developed and evaluated in isolation without
considering the cross-stack, system-level effects in real-world environments.
This makes it difficult to appreciate the impact of System-on-Chip (SoC)
resource contention, OS overheads, and programming-stack inefficiencies on
overall performance/energy-efficiency. To address this challenge, we present
Gemmini, an open-source*, full-stack DNN accelerator generator. Gemmini
generates a wide design-space of efficient ASIC accelerators from a flexible
architectural template, together with flexible programming stacks and full SoCs
with shared resources that capture system-level effects. Gemmini-generated
accelerators have also been fabricated, delivering up to three
orders-of-magnitude speedups over high-performance CPUs on various DNN
benchmarks.
* https://github.com/ucb-bar/gemmini
Comment: To appear at the 58th IEEE/ACM Design Automation Conference (DAC),
December 2021, San Francisco, CA, US
Performance optimization of convolution calculation by blocking and sparsity on GPU
Convolutional neural networks (CNNs) play a paramount role in machine learning
and have made significant contributions to medical image classification,
natural language processing, recommender systems, and other areas. A successful
convolutional neural network achieves excellent performance with fast
execution time. The convolution operation dominates the total running time of
a convolutional neural network. Therefore, in this paper, we propose a novel
convolution method for Graphics Processing Units (GPUs) that reduces the
convolution operation time and improves execution speed by approximately 2x
over the state-of-the-art convolution algorithm. Our work is based on the
observation that the input feature maps of the convolution operation are often
highly sparse, and that zero values in the feature map contribute nothing to
the convolution result. We therefore skip calculations on zero values and
improve speed by compressing the feature map. In addition, feature maps in
deep networks are small, and the number of threads is limited. With a limited
number of threads, it is necessary to reduce the amount of calculation in
order to increase speed. Our algorithm works well for convolution on the
feature maps of deep networks, which are highly sparse and small.
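The zero-skipping idea in the abstract above can be sketched on the CPU in plain Python. This is a minimal illustration under stated assumptions, not the paper's GPU algorithm: the function names, the coordinate-list compression format, the single-channel input, and the square kernel are all hypothetical choices made for brevity.

```python
def dense_conv2d(fmap, kernel):
    """Reference 'valid' 2D cross-correlation that visits every input value."""
    h, w, k = len(fmap), len(fmap[0]), len(kernel)
    out = [[0.0] * (w - k + 1) for _ in range(h - k + 1)]
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            for u in range(k):
                for v in range(k):
                    out[i][j] += fmap[i + u][j + v] * kernel[u][v]
    return out


def compress(fmap):
    """Keep only nonzero entries as (row, col, value) triples (assumed format)."""
    return [(r, c, x) for r, row in enumerate(fmap)
            for c, x in enumerate(row) if x != 0]


def sparse_conv2d(nonzeros, h, w, kernel):
    """Same result as dense_conv2d, but work scales with the nonzero count."""
    k = len(kernel)
    out_h, out_w = h - k + 1, w - k + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for r, c, x in nonzeros:
        # Scatter this nonzero into every output position whose window covers it.
        for u in range(k):
            i = r - u
            if not 0 <= i < out_h:
                continue
            for v in range(k):
                j = c - v
                if 0 <= j < out_w:
                    out[i][j] += x * kernel[u][v]
    return out


if __name__ == "__main__":
    fmap = [[0, 2, 0, 0], [0, 0, 0, 1], [3, 0, 0, 0], [0, 0, 4, 0]]
    kernel = [[1, 2], [3, 4]]
    print(sparse_conv2d(compress(fmap), 4, 4, kernel) == dense_conv2d(fmap, kernel))
```

In this sketch the inner loops run once per nonzero rather than once per input element, so the saving grows with the sparsity of the feature map. A GPU version would additionally have to decide how to distribute the nonzeros across threads, which is not modelled here.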
Compiler-centric across-stack deep learning acceleration
Optimizing the deployment of Deep Neural Networks (DNNs) is hard. Despite deep learning approaches increasingly providing state-of-the-art solutions to a variety of difficult problems, such as computer vision and natural language processing, DNNs can be prohibitively expensive, for example, in terms of inference time or memory usage. Effective exploration of the design space requires a holistic approach, including a range of topics from machine learning, systems, and hardware. The rapid proliferation of deep learning applications has raised demand for efficient exploration and acceleration of deep learning based solutions. However, managing the range of optimization techniques, as well as how they interact with each other across the stack, is a non-trivial task. A family of emerging specialized compilers for deep learning, tensor compilers, appears to be a strong candidate to help manage the complexity of across-stack optimization choices, and enable new approaches.
This thesis presents new techniques and explorations of the Deep Learning Acceleration Stack (DLAS), with the perspective that the tensor compiler will increasingly be the center of this stack. First, we motivate the challenges in exploring DLAS by describing the experience of running a perturbation study varying parameters at every layer of the stack. The core of the study is implemented using a tensor compiler, which reduces the complexity of evaluating the wide range of variants, although it still requires a significant engineering effort to realize. Next, we develop a new algorithm for grouped convolution, a model optimization technique for which existing solutions provided poor inference time scaling. We implement and optimize our algorithm using a tensor compiler, outperforming existing approaches by 5.1× on average (arithmetic mean). Finally, we propose a technique, transfer-tuning, to reduce the search time required for automatic tensor compiler code optimization, reducing the search time required by 6.5× on average.
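For readers unfamiliar with grouped convolution, the sketch below shows the standard definition in one dimension: channels are split into groups, and each output channel reads only the input channels of its own group. It is a naive reference written under assumptions (the function name `grouped_conv1d`, the nested-list tensor layout, and the 1D setting are hypothetical); it is not the optimized algorithm developed in the thesis.

```python
def grouped_conv1d(x, w, groups):
    """Naive grouped 1D 'valid' convolution (cross-correlation).

    x: input,   shape [c_in][length]
    w: weights, shape [c_out][c_in // groups][k]
    Each output channel only sees the input channels of its own group.
    """
    c_in, length = len(x), len(x[0])
    c_out, c_in_per_group, k = len(w), len(w[0]), len(w[0][0])
    assert c_in % groups == 0 and c_out % groups == 0
    assert c_in_per_group == c_in // groups
    c_out_per_group = c_out // groups
    out_len = length - k + 1
    y = [[0.0] * out_len for _ in range(c_out)]
    for oc in range(c_out):
        g = oc // c_out_per_group  # group this output channel belongs to
        for ic_local in range(c_in_per_group):
            ic = g * c_in_per_group + ic_local  # global input channel index
            for t in range(out_len):
                for j in range(k):
                    y[oc][t] += x[ic][t + j] * w[oc][ic_local][j]
    return y


if __name__ == "__main__":
    x = [[1, 2, 3], [4, 5, 6]]
    # groups=2: each output channel reads exactly one input channel.
    print(grouped_conv1d(x, [[[1, 1]], [[1, -1]]], groups=2))
    # groups=1: ordinary convolution, every input channel is read.
    print(grouped_conv1d(x, [[[1, 1], [1, -1]]], groups=1))
```

With `groups` equal to 1 this reduces to an ordinary convolution, and the weight count shrinks by a factor of `groups` as the grouping increases, which is why the technique is used as a model optimization.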
The techniques and contributions of this thesis across these interconnected domains demonstrate the exciting potential of tensor compilers to simplify and improve design space exploration for DNNs, and their deployment. The outcomes of this thesis open new lines of research that enable machine learning developers to keep up with the rapidly evolving landscape of neural architectures and hardware.
TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators
Over the past few years, the explosion in sparse tensor algebra workloads has
led to a corresponding rise in domain-specific accelerators to service them.
Due to the irregularity present in sparse tensors, these accelerators employ a
wide variety of novel solutions to achieve good performance. At the same time,
prior work on design-flexible sparse accelerator modeling does not express this
full range of design features, making it difficult to understand the impact of
each design choice and compare or extend the state-of-the-art.
To address this, we propose TeAAL: a language and compiler for the concise
and precise specification and evaluation of sparse tensor algebra
architectures. We use TeAAL to represent and evaluate four disparate
state-of-the-art accelerators--ExTensor, Gamma, OuterSPACE, and SIGMA--and
verify that it reproduces their performance with high accuracy. Finally, we
demonstrate the potential of TeAAL as a tool for designing new accelerators by
showing how it can be used to speed up Graphicionado--by on BFS and
on SSSP.
Comment: 14 pages, 12 figures