Easy and Efficient Transformer: Scalable Inference Solution For large NLP model

Ding, Jingzhen; Fan, Changjie; Li, Gongzheng; Liu, Bai; Mao, Xiaoxi; Wang, Duan; Xi, Yadong; Zhao, Zeng

Authors: Jingzhen Ding, Changjie Fan, Gongzheng Li, Bai Liu, Xiaoxi Mao, Duan Wang, Yadong Xi, Zeng Zhao
Publication date: 23 November 2021

Abstract

Large-scale transformer-based models have recently proven effective on a variety of tasks across many domains. Deploying them in production, however, is very expensive and requires comprehensive optimization to reduce inference costs. This paper introduces a series of transformer inference optimization techniques at both the algorithm level and the hardware level. These include a pre-padding decoding mechanism that improves token parallelism for text generation, and highly optimized kernels designed for very long input lengths and large hidden sizes. Building on these techniques, we propose a transformer inference acceleration library, Easy and Efficient Transformer (EET), which significantly outperforms existing libraries. Compared with Faster Transformer v4.0's implementation of a GPT-2 layer on an A100 GPU, EET achieves a state-of-the-art speedup of 1.5x to 4.5x, depending on the context length. EET is available at https://github.com/NetEase-FuXi/EET. A demo video is available at https://youtu.be/22UPcNGcErg

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2104.12470

Last time updated on 03/06/2021