Domain-aware Genetic Algorithms for Hardware and Mapping Optimization for Efficient DNN Acceleration
The proliferation of AI across a variety of domains (vision, language, speech, recommendations, games) has led to the rise of domain-specific accelerators for deep learning. At design-time, these accelerators carefully architect the on-chip dataflow to maximize data reuse (over space and time) and size the hardware resources (PEs and buffers) to maximize performance and energy-efficiency, while meeting the chip’s area and power targets. At compile-time, the target Deep Neural Network (DNN) model is mapped over the accelerator. The mapping refers to tiling the computation and data (i.e., tensors) and scheduling them over the PEs and scratchpad buffers respectively, while honoring the microarchitectural constraints (number of PEs, buffer sizes, and dataflow).
The design space of valid hardware resource assignments for a given dataflow, and of valid mappings for a given hardware configuration, is extremely large (~O(10^24) per layer) for state-of-the-art DNN models today. This makes exhaustive search infeasible. Unfortunately, there can be orders-of-magnitude differences in performance and energy-efficiency between an optimal and a sub-optimal choice, making these decisions a crucial part of the entire design process. Moreover, manual tuning by domain experts becomes increasingly challenging due to the growing irregularity (driven by neural architecture search) and sparsity of DNN models. This necessitates Map Space Exploration (MSE). In this thesis, our goal is to deliver a deep analysis of MSE for DNN accelerators, propose techniques to improve MSE, and generalize the MSE framework to a wider landscape (from mapping to hardware-mapping co-exploration, and from single-accelerator to multi-accelerator scheduling). As part of this, we discuss the correlation between hardware flexibility and the resulting map space, and formalize the map-space representation along four mapping axes: tile, order, parallelism, and shape. Next, we develop dedicated exploration operators for these axes and use a genetic algorithm framework to converge on a solution. We then develop a "sparsity-aware" technique to account for sparsity in MSE, and a "warm-start" technique to address the search-speed challenge common to learning-based search algorithms. Finally, we extend our MSE framework to support hardware/map-space co-exploration and multi-accelerator scheduling. (Ph.D. dissertation)
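As a loose illustration of the genetic-algorithm-driven MSE described above, the sketch below evolves mappings over toy tile, order, and parallelism axes. The dimension names, cost model, and operator choices are all illustrative assumptions, not the thesis's actual framework (which also covers the shape axis and real microarchitectural constraints):

```python
import random

LOOP_DIMS = ["N", "C", "K", "X", "Y"]  # hypothetical DNN loop dimensions

def random_mapping():
    # A "genome" spanning three of the mapping axes named above.
    return {
        "tile": {d: random.choice([1, 2, 4, 8]) for d in LOOP_DIMS},
        "order": random.sample(LOOP_DIMS, len(LOOP_DIMS)),
        "parallel": random.choice(LOOP_DIMS),
    }

def cost(mapping, problem):
    # Toy cost: penalize tiles that overflow the buffer, reward
    # parallelizing a large dimension. A real MSE would call an
    # analytical or simulated accelerator cost model instead.
    buf = 1
    for d in LOOP_DIMS:
        buf *= mapping["tile"][d]
    overflow = max(0, buf - problem["buffer_size"])
    par_gain = problem["dims"][mapping["parallel"]]
    return overflow * 10 + sum(problem["dims"].values()) / par_gain

def mutate(mapping):
    # One dedicated operator per mapping axis.
    m = {"tile": dict(mapping["tile"]),
         "order": list(mapping["order"]),
         "parallel": mapping["parallel"]}
    axis = random.choice(["tile", "order", "parallel"])
    if axis == "tile":        # tile operator: resize one dimension's tile
        d = random.choice(LOOP_DIMS)
        m["tile"][d] = random.choice([1, 2, 4, 8])
    elif axis == "order":     # order operator: swap two loop levels
        i, j = random.sample(range(len(LOOP_DIMS)), 2)
        m["order"][i], m["order"][j] = m["order"][j], m["order"][i]
    else:                     # parallelism operator: re-bind spatial dim
        m["parallel"] = random.choice(LOOP_DIMS)
    return m

def search(problem, pop_size=16, generations=30, seed=0):
    random.seed(seed)
    pop = [random_mapping() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda mp: cost(mp, problem))
        survivors = pop[: pop_size // 2]           # elitist selection
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=lambda mp: cost(mp, problem))

problem = {"dims": {"N": 1, "C": 64, "K": 128, "X": 56, "Y": 56},
           "buffer_size": 256}
best = search(problem)
```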
Demystifying Map Space Exploration for NPUs
Map Space Exploration is the problem of finding optimized mappings of a Deep
Neural Network (DNN) model on an accelerator. It is known to be extremely
computationally expensive, and there has been active research looking at both
heuristics and learning-based methods to make the problem computationally
tractable. However, while there are dozens of mappers out there (all
empirically claiming to find better mappings than others), the research
community lacks systematic insights on how different search techniques navigate
the map-space and how different mapping axes contribute to the accelerator's
performance and efficiency. Such insights are crucial to developing mapping
frameworks for emerging DNNs that are increasingly irregular (due to neural
architecture search) and sparse, making the corresponding map spaces much more
complex. In this work, rather than proposing yet another mapper, we do a
first-of-its-kind apples-to-apples comparison of search techniques leveraged by
different mappers. Next, we extract the learnings from our study and propose
two new techniques that can augment existing mappers -- warm-start and
sparsity-aware -- that demonstrate speedups, scalability, and robustness across
diverse DNN models.
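One of the two proposed techniques, warm-start, can be sketched roughly as follows: rather than searching each layer's map space from scratch, the mapper seeds its search with the best mapping found for a previously optimized, similar layer. Everything below (the one-axis "mapping", the toy cost model, the local search) is a hypothetical stand-in, not the paper's actual mapper:

```python
import random

def evaluate(mapping, layer):
    # Toy cost model: distance of the chosen tile size from the layer's
    # "ideal" tile. A real mapper would query an accelerator cost model.
    return abs(mapping["tile"] - layer["ideal_tile"])

def local_search(layer, start, steps=50, seed=0):
    # Simple hill climb from `start`; only improving moves are accepted.
    rng = random.Random(seed)
    best = dict(start)
    for _ in range(steps):
        cand = {"tile": max(1, best["tile"] + rng.choice([-2, -1, 1, 2]))}
        if evaluate(cand, layer) < evaluate(best, layer):
            best = cand
    return best

def warm_start_search(layers, cold_start):
    results, prev_best = [], None
    for layer in layers:
        # Warm-start: reuse the previous layer's best mapping as the seed.
        start = prev_best if prev_best is not None else cold_start
        best = local_search(layer, start)
        results.append(best)
        prev_best = best
    return results

# Consecutive layers with similar shapes, so seeds transfer well.
layers = [{"ideal_tile": 16}, {"ideal_tile": 18}, {"ideal_tile": 20}]
maps = warm_start_search(layers, cold_start={"tile": 1})
```

Because consecutive DNN layers often have similar shapes, the warm seed starts the search near a good region and far fewer steps are spent recovering ground a previous search already covered.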
FLAT: An Optimized Dataflow for Mitigating Attention Performance Bottlenecks
Attention mechanisms form the backbone of state-of-the-art machine learning
models for a variety of tasks. Deploying them on deep neural network (DNN)
accelerators, however, is prohibitively challenging especially under long
sequences, as this work identifies. This is due to operators in attention
layers exhibiting limited reuse opportunities and quadratic growth in memory
footprint, leading to severe memory-boundedness. To address this, we introduce
a new attention-tailored dataflow, termed FLAT, which identifies fusion
opportunities within the attention layer, and implements an on-chip
memory-aware interleaved execution and tiling mechanism. FLAT increases the
effective memory bandwidth by efficiently utilizing the high-bandwidth,
low-capacity on-chip buffer and thus achieves better run time and compute
resource utilization. In our evaluation, FLAT achieves 1.94x and 1.76x speedup
and 49% and 42% energy reduction compared to baseline execution on
state-of-the-art edge and cloud accelerators.
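The memory argument can be illustrated with a row-tiled attention sketch: by fusing the softmax and the V product per tile of query rows, only a tile × L slice of the score matrix is ever live, instead of the full quadratic L × L footprint. This NumPy sketch shows the general fusion-and-tiling idea only; it is not FLAT's actual interleaved on-chip dataflow:

```python
import numpy as np

def attention_reference(Q, K, V):
    # Unfused baseline: materializes the full L x L score matrix.
    S = Q @ K.T
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ V

def attention_tiled(Q, K, V, tile=32):
    # Fused/tiled variant: per tile of query rows, scores -> softmax -> V
    # product are computed together, so only a (tile x L) slice of the
    # score matrix is live at any time.
    L = Q.shape[0]
    out = np.empty((L, V.shape[1]))
    for i in range(0, L, tile):
        S = Q[i:i + tile] @ K.T                       # tile x L scores only
        P = np.exp(S - S.max(axis=1, keepdims=True))  # numerically stable softmax
        P /= P.sum(axis=1, keepdims=True)
        out[i:i + tile] = P @ V                       # consumed immediately
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(64, 16)) for _ in range(3))
out = attention_tiled(Q, K, V, tile=16)
```

For a sequence length L with tile size t, live score storage drops from O(L^2) to O(tL), which is the kind of footprint reduction that lets the working set fit in a high-bandwidth, low-capacity on-chip buffer.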
Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask
Sparsity has become one of the promising methods to compress and accelerate
Deep Neural Networks (DNNs). Among different categories of sparsity, structured
sparsity has gained more attention due to its efficient execution on modern
accelerators. Particularly, N:M sparsity is attractive because there are
already hardware accelerator architectures that can leverage certain forms of
N:M structured sparsity to yield higher compute-efficiency. In this work, we
focus on N:M sparsity and extensively study and evaluate various training
recipes for N:M sparsity in terms of the trade-off between model accuracy and
compute cost (FLOPs). Building upon this study, we propose two new decay-based
pruning methods, namely "pruning mask decay" and "sparse structure decay". Our
evaluations indicate that these proposed methods consistently deliver
state-of-the-art (SOTA) model accuracy, comparable to unstructured sparsity, on
a Transformer-based model for a translation task. The increase in the accuracy
of the sparse model using the new training recipes comes at the cost of
marginal increase in the total training compute (FLOPs).
Comment: 11 pages, 2 figures, and 9 tables. Published at the ICML Workshop on
Sparsity in Neural Networks: Advancing Understanding and Practice, 2022. First
two authors contributed equally.
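A minimal sketch of what N:M structured sparsity looks like, plus a mask-decay-style softening in the spirit of the decay-based methods named above. The linear schedule and group layout are illustrative assumptions, not the paper's exact recipes:

```python
import numpy as np

def nm_mask(w, n=2, m=4):
    # N:M structure: within every group of m consecutive weights,
    # keep the n largest by magnitude.
    groups = w.reshape(-1, m)
    idx = np.argsort(np.abs(groups), axis=1)[:, -n:]  # top-n per group
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, idx, 1.0, axis=1)
    return mask.reshape(w.shape)

def apply_decaying_mask(w, step, total_steps, n=2, m=4):
    # Instead of hard-zeroing pruned weights at once, scale them by a
    # factor that decays from 1 to 0 over training, so the pruned
    # structure is imposed gradually.
    mask = nm_mask(w, n, m)
    decay = max(0.0, 1.0 - step / total_steps)  # 1 -> 0 over training
    soft_mask = mask + (1.0 - mask) * decay     # pruned weights fade out
    return w * soft_mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
w_final = apply_decaying_mask(w, step=100, total_steps=100)  # fully 2:4 sparse
```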
The Liquid Sensor Using Thin Film Bulk Acoustic Resonator with C-Axis Tilted AlN Films
Dual-mode thin film bulk acoustic resonator (TFBAR) devices are fabricated with c-axis tilted AlN films. To fabricate dual-mode TFBAR devices, the off-axis RF magnetron sputtering method is adopted for the growth of tilted piezoelectric AlN thin films. In this report, the AlN thin films are deposited with tilting angles of 15° and 23°. The frequency response of the TFBAR device with the 23° tilted AlN thin film is measured to reveal its ability to provide dual-mode resonance. The sensitivities of the longitudinal and shear modes to mass loading are calculated to be 2295 Hz·cm²/ng and 1363 Hz·cm²/ng, with mechanical quality factors of 480 and 287, respectively. For liquid loading, the sensitivities of the longitudinal and shear modes are calculated to be 0 and 15 Hz·cm²/μg.
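For intuition about the quoted figures: a mass-loading sensitivity S in Hz·cm²/ng maps an areal mass density σ (in ng/cm²) to a resonance-frequency shift Δf = S·σ. The sensitivities below come from the abstract, but the example loading is a hypothetical value chosen for illustration:

```python
def frequency_shift(sensitivity_hz_cm2_per_ng, areal_mass_ng_per_cm2):
    # Delta-f = S * sigma: sensitivity times areal mass density.
    return sensitivity_hz_cm2_per_ng * areal_mass_ng_per_cm2

S_LONG, S_SHEAR = 2295.0, 1363.0   # Hz*cm^2/ng, from the abstract
sigma = 10.0                        # ng/cm^2, hypothetical mass loading

df_long = frequency_shift(S_LONG, sigma)    # longitudinal-mode shift, Hz
df_shear = frequency_shift(S_SHEAR, sigma)  # shear-mode shift, Hz
```

A 10 ng/cm² loading would thus shift the longitudinal mode by roughly 23 kHz and the shear mode by roughly 14 kHz, which is why the shear mode (with its nonzero liquid-loading sensitivity) is the one used for sensing in liquid.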
Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers
N:M structured sparsity has garnered significant interest owing to its
relatively modest overhead and improved efficiency. Additionally, this form of
sparsity holds considerable appeal for reducing the memory footprint thanks to
its modest representation overhead. While there have been efforts to develop
training recipes for N:M structured sparsity, they primarily focus on
low-sparsity regions (~50%). Nonetheless, the performance of models trained
using these approaches tends to decline when confronted with high-sparsity
regions (>80%). In this work, we study the effectiveness of existing sparse
training recipes at \textit{high-sparsity regions} and argue that these methods
fail to sustain the model quality on par with low-sparsity regions. We
demonstrate that the significant factor contributing to this disparity is the
presence of elevated levels of induced noise in the gradient magnitudes. To
mitigate this undesirable effect, we employ decay mechanisms to progressively
restrict the flow of gradients towards pruned elements. Our approach improves
the model quality by up to 2% and 5% in vision and language models,
respectively, in the high-sparsity regime. We also evaluate the trade-off between
model accuracy and training compute cost in terms of FLOPs. At iso-training
FLOPs, our method yields better performance compared to conventional sparse
training recipes, exhibiting an accuracy improvement of up to 2%. The source
code is available at
https://github.com/abhibambhaniya/progressive_gradient_flow_nm_sparsity.
Comment: 18 pages, 8 figures, 17 tables.
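The decay mechanism described in the abstract can be sketched as a gate on the gradient reaching pruned weights: early in training the gradient flows freely, and the flow toward pruned elements is progressively restricted. The linear schedule and names here are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def gated_gradient(grad, mask, step, total_steps):
    # mask == 1 for kept weights, 0 for pruned ones.
    # Early in training (gate ~ 1) gradients flow to all weights;
    # late in training (gate ~ 0) pruned weights receive no gradient,
    # suppressing the induced gradient noise the abstract describes.
    gate = max(0.0, 1.0 - step / total_steps)   # 1 -> 0 over training
    return grad * (mask + (1.0 - mask) * gate)

mask = np.array([1.0, 0.0, 1.0, 0.0])   # toy 2:4-style keep/prune pattern
grad = np.array([0.5, 0.5, -0.5, -0.5])

g_early = gated_gradient(grad, mask, step=0, total_steps=100)   # full flow
g_late = gated_gradient(grad, mask, step=100, total_steps=100)  # pruned blocked
```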