Masking as an Efficient Alternative to Finetuning for Pretrained Language Models
We present an efficient method of utilizing pretrained language models, where
we learn selective binary masks for pretrained weights in lieu of modifying
them through finetuning. Extensive evaluations of masking BERT and RoBERTa on a
series of NLP tasks show that our masking scheme yields performance comparable
to finetuning, yet has a much smaller memory footprint when inference must be
served for several tasks simultaneously. Through intrinsic evaluations, we show that
representations computed by masked language models encode information necessary
for solving downstream tasks. Analyzing the loss landscape, we show that
masking and finetuning produce models that reside in minima that can be
connected by a line segment with nearly constant test accuracy. This confirms
that masking can be utilized as an efficient alternative to finetuning.
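To make the idea concrete, the sketch below shows one common way to learn a binary mask over frozen pretrained weights using a straight-through estimator. It is a minimal illustration, not the paper's actual implementation; the module and parameter names (MaskedLinear, mask_scores, threshold) are assumptions.

```python
# Minimal sketch: learn a binary mask over frozen pretrained weights.
# The straight-through estimator and all names here are illustrative
# assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    """Wraps a pretrained nn.Linear: the weights stay frozen, and only
    real-valued mask scores are trained. The forward pass applies a hard
    binary mask derived from those scores."""

    def __init__(self, pretrained: nn.Linear, threshold: float = 0.0):
        super().__init__()
        # Pretrained weights are frozen; they are never updated.
        self.weight = nn.Parameter(pretrained.weight.detach(), requires_grad=False)
        self.bias = None
        if pretrained.bias is not None:
            self.bias = nn.Parameter(pretrained.bias.detach(), requires_grad=False)
        # Real-valued scores from which the binary mask is derived,
        # initialized slightly above the threshold so the mask starts all-ones.
        self.mask_scores = nn.Parameter(torch.full_like(self.weight, 0.01))
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Hard 0/1 mask in the forward pass.
        hard_mask = (self.mask_scores > self.threshold).float()
        # Straight-through estimator: the forward value equals hard_mask,
        # but gradients flow to mask_scores as if the mask were identity.
        mask = hard_mask + self.mask_scores - self.mask_scores.detach()
        return F.linear(x, self.weight * mask, self.bias)
```

Under a scheme like this, each downstream task stores only a one-bit mask per weight on top of the shared pretrained model, which is where the memory savings over per-task finetuned copies come from.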
Learned Token Pruning for Transformers
A major challenge in deploying transformer models is their prohibitive
inference cost, which scales quadratically with the input sequence length. This
makes it especially difficult to use transformers for processing long
sequences. To address this, we present a novel Learned Token Pruning (LTP)
method that reduces redundant tokens as the data passes through the different
layers of the transformer. In particular, LTP prunes tokens with an attention
score below a threshold value, which is learned during training. Importantly,
our threshold-based method avoids algorithmically expensive operations such as
top-k token selection, which are used in prior token pruning methods, and it also
leads to structured pruning. We extensively test the performance of our
approach on multiple GLUE tasks and show that our learned threshold-based
method consistently outperforms the prior state-of-the-art top-k-based
method by up to ~2% higher accuracy at the same number of FLOPs. Furthermore,
our preliminary results show up to 1.4x and 1.9x throughput improvements on a
Tesla T4 GPU and an Intel Haswell CPU, respectively, with less than 1% accuracy
drop (and up to 2.1x FLOPs reduction). Our code is developed in PyTorch
and has been open-sourced.
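A rough sketch of the thresholding step is below. It assumes the per-token importance score is the attention a token receives, averaged over heads and query positions, and shows only the hard-threshold inference path (the paper learns the threshold during training, via a soft differentiable mask). Function and variable names are illustrative.

```python
# Sketch of threshold-based token pruning at inference time. The scoring
# rule (mean attention received) follows the abstract; names and shapes
# are assumptions for illustration.
import torch


def prune_tokens(hidden: torch.Tensor,
                 attn_probs: torch.Tensor,
                 threshold: float):
    """
    hidden:     (batch, seq_len, dim) token representations after a layer
    attn_probs: (batch, heads, seq_len, seq_len) attention probabilities
    threshold:  pruning threshold for this layer (learned during training)

    Returns the hidden states with pruned tokens zeroed out, plus the
    boolean keep-mask. A structured implementation would instead gather
    the surviving tokens so that later layers process shorter sequences.
    """
    # Importance of token j: attention it receives, averaged over
    # heads (dim 1) and query positions (dim 2).
    scores = attn_probs.mean(dim=(1, 2))   # (batch, seq_len)
    keep = scores >= threshold             # (batch, seq_len), bool
    pruned = hidden * keep.unsqueeze(-1)   # zero out pruned tokens
    return pruned, keep
```

Because every token in a layer is compared against the same scalar threshold, the pruning decision is linear in sequence length and avoids the sorting/selection cost that top-k methods incur.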