FLAT: An Optimized Dataflow for Mitigating Attention Performance Bottlenecks

Abstract

Attention mechanisms form the backbone of state-of-the-art machine learning models for a variety of tasks. Deploying them on deep neural network (DNN) accelerators, however, is prohibitively challenging, especially under long sequences, as this work identifies. This is because operators in attention layers exhibit limited reuse opportunities and quadratic growth in memory footprint, leading to severe memory-boundedness. To address this, we introduce a new attention-tailored dataflow, termed FLAT, which identifies fusion opportunities within the attention layer and implements an on-chip memory-aware interleaved execution and tiling mechanism. FLAT increases the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer and thus achieves better runtime and compute resource utilization. In our evaluation, FLAT achieves speedups of 1.94x and 1.76x and energy reductions of 49% and 42%, respectively, compared to baseline execution on state-of-the-art edge and cloud accelerators.
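
The memory-boundedness described above can be illustrated with a minimal sketch. The code below is not the paper's FLAT dataflow or its accelerator mapping; it is a generic tiled-attention computation in NumPy, assuming a single head and an illustrative tile size, that shows how interleaving the score and value operators per key/value tile avoids materializing the quadratic N x N score matrix.

```python
import numpy as np

def tiled_attention(Q, K, V, tile=64):
    """Compute softmax(Q K^T / sqrt(d)) V one key/value tile at a time,
    so the full N x N score matrix is never materialized off-chip."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    # Running statistics for a numerically stable online softmax.
    row_max = np.full(N, -np.inf)
    row_sum = np.zeros(N)
    for start in range(0, N, tile):
        k_blk = K[start:start + tile]           # (t, d) tile kept "on-chip"
        v_blk = V[start:start + tile]
        scores = (Q @ k_blk.T) * scale          # (N, t) partial scores only
        blk_max = scores.max(axis=1)
        new_max = np.maximum(row_max, blk_max)
        # Rescale previously accumulated output and sum to the new max.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + p @ v_blk
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

# Quick check against the naive, fully materialized computation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
ref = np.exp((Q @ K.T) / np.sqrt(32))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```

The per-tile working set here is O(N * tile) instead of O(N^2), which is the kind of footprint reduction that lets a small, high-bandwidth on-chip buffer serve the fused operators.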
