Masking as an Efficient Alternative to Finetuning for Pretrained Language Models
We present an efficient method of utilizing pretrained language models, where
we learn selective binary masks for pretrained weights in lieu of modifying
them through finetuning. Extensive evaluations of masking BERT and RoBERTa on a
series of NLP tasks show that our masking scheme yields performance comparable
to finetuning, yet has a much smaller memory footprint when inference must be
served for several tasks simultaneously. Through intrinsic evaluations, we show that
representations computed by masked language models encode information necessary
for solving downstream tasks. Analyzing the loss landscape, we show that
masking and finetuning produce models that reside in minima that can be
connected by a line segment with nearly constant test accuracy. This confirms
that masking can be utilized as an efficient alternative to finetuning.
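To make the idea concrete, the sketch below shows one common way to learn a binary mask over frozen pretrained weights using a straight-through estimator. It is a minimal illustration, not the paper's actual implementation; the module and parameter names (MaskedLinear, mask_scores, threshold) are assumptions.

```python
# Minimal sketch: learn a binary mask over frozen pretrained weights.
# The straight-through estimator and all names here are illustrative
# assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    """Wraps a pretrained nn.Linear: the weights stay frozen, and only
    real-valued mask scores are trained. The forward pass applies a hard
    binary mask derived from those scores."""

    def __init__(self, pretrained: nn.Linear, threshold: float = 0.0):
        super().__init__()
        # Pretrained weights are frozen; they are never updated.
        self.weight = nn.Parameter(pretrained.weight.detach(), requires_grad=False)
        self.bias = None
        if pretrained.bias is not None:
            self.bias = nn.Parameter(pretrained.bias.detach(), requires_grad=False)
        # Real-valued scores from which the binary mask is derived,
        # initialized slightly above the threshold so the mask starts all-ones.
        self.mask_scores = nn.Parameter(torch.full_like(self.weight, 0.01))
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Hard 0/1 mask in the forward pass.
        hard_mask = (self.mask_scores > self.threshold).float()
        # Straight-through estimator: the forward value equals hard_mask,
        # but gradients flow to mask_scores as if the mask were identity.
        mask = hard_mask + self.mask_scores - self.mask_scores.detach()
        return F.linear(x, self.weight * mask, self.bias)
```

Under a scheme like this, each downstream task stores only a one-bit mask per weight on top of the shared pretrained model, which is where the memory savings over per-task finetuned copies come from.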
Learned Token Pruning for Transformers
A major challenge in deploying transformer models is their prohibitive
inference cost, which scales quadratically with the input sequence length. This
makes it especially difficult to use transformers for processing long
sequences. To address this, we present a novel Learned Token Pruning (LTP)
method that reduces redundant tokens as the data passes through the different
layers of the transformer. In particular, LTP prunes tokens with an attention
score below a threshold value, which is learned during training. Importantly,
our threshold-based method avoids algorithmically expensive operations such as
top-k token selection, which are used in prior token pruning methods, and it also
leads to structured pruning. We extensively test the performance of our
approach on multiple GLUE tasks and show that our learned threshold-based
method consistently outperforms the prior state-of-the-art top-k-based
method by up to ~2% higher accuracy at the same number of FLOPs. Furthermore,
our preliminary results show up to 1.4x and 1.9x throughput improvements on a
Tesla T4 GPU and an Intel Haswell CPU, respectively, with less than 1% accuracy
drop (and up to 2.1x FLOPs reduction). Our code is developed in PyTorch
and has been open-sourced.
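A rough sketch of the thresholding step is below. It assumes the per-token importance score is the attention a token receives, averaged over heads and query positions, and shows only the hard-threshold inference path (the paper learns the threshold during training, via a soft differentiable mask). Function and variable names are illustrative.

```python
# Sketch of threshold-based token pruning at inference time. The scoring
# rule (mean attention received) follows the abstract; names and shapes
# are assumptions for illustration.
import torch


def prune_tokens(hidden: torch.Tensor,
                 attn_probs: torch.Tensor,
                 threshold: float):
    """
    hidden:     (batch, seq_len, dim) token representations after a layer
    attn_probs: (batch, heads, seq_len, seq_len) attention probabilities
    threshold:  pruning threshold for this layer (learned during training)

    Returns the hidden states with pruned tokens zeroed out, plus the
    boolean keep-mask. A structured implementation would instead gather
    the surviving tokens so that later layers process shorter sequences.
    """
    # Importance of token j: attention it receives, averaged over
    # heads (dim 1) and query positions (dim 2).
    scores = attn_probs.mean(dim=(1, 2))   # (batch, seq_len)
    keep = scores >= threshold             # (batch, seq_len), bool
    pruned = hidden * keep.unsqueeze(-1)   # zero out pruned tokens
    return pruned, keep
```

Because every token in a layer is compared against the same scalar threshold, the pruning decision is linear in sequence length and avoids the sorting/selection cost that top-k methods incur.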