Recent advances in state-of-the-art DNN architecture design have been moving
toward Transformer models, which achieve superior accuracy across a wide range
of applications. This trend has been consistent in the years since Transformer
models were originally introduced. However, the amount
of compute and bandwidth required for inference of recent Transformer models is
growing at a significant rate, and this has made their deployment in
latency-sensitive applications challenging. As such, there has been an
increased focus on making Transformer models more efficient, with methods
ranging from changes to the architecture design all the way to dedicated
domain-specific accelerators. In this work, we survey different
approaches for efficient Transformer inference, including: (i) analysis and
profiling of the bottlenecks in existing Transformer architectures and their
similarities to and differences from previous convolutional models; (ii)
implications of the Transformer architecture for hardware, including the impact
of non-linear operations such as Layer Normalization, Softmax, and GELU, as
well as of linear operations, on hardware design; (iii) approaches for optimizing a
fixed Transformer architecture; (iv) challenges in finding the right mapping
and scheduling of operations for Transformer models; and (v) approaches for
optimizing Transformer models by adapting the architecture using neural
architecture search. Finally, we perform a case study by applying the surveyed
optimizations to Gemmini, the open-source, full-stack DNN accelerator
generator, and we show how each of these approaches can yield improvements
over previous benchmark results on Gemmini. Among other things, we find
that a full-stack co-design approach with the aforementioned methods can result
in up to 88.7x speedup with minimal performance degradation for Transformer
inference.
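
To make concrete why the non-linear operations mentioned in (ii) complicate hardware design relative to the multiply-accumulate work of linear layers, the following minimal NumPy sketch spells out what Softmax, Layer Normalization, and GELU compute. The function names, tensor shapes, and the tanh-based GELU approximation are illustrative assumptions and are not taken from the survey or from Gemmini.

```python
# Illustrative sketch (assumed names/shapes) of the non-linear operations
# discussed in (ii): Softmax, Layer Normalization, and GELU.
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability, exponentiate,
    # then normalize so each row sums to 1.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance,
    # then apply a learned scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def gelu(x):
    # Common tanh approximation of GELU used in Transformer FFN blocks.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Toy activations: batch of 2 sequences, 4 tokens, 8 hidden dimensions.
x = np.random.randn(2, 4, 8).astype(np.float32)
print(softmax(x).sum(axis=-1))                                # rows sum to ~1
print(layer_norm(x, np.ones(8), np.zeros(8)).mean(axis=-1))   # means are ~0
print(gelu(x).shape)                                          # shape preserved
```

Unlike the matrix multiplications in linear layers, these operations involve exponentials, divisions, and square roots, and Softmax and Layer Normalization additionally reduce over an entire row of activations before any output element can be produced, which is what makes them awkward to map onto matmul-centric accelerators.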