Depth-adaptive Transformer
State-of-the-art sequence-to-sequence models for large-scale tasks perform a fixed number of computations for each input sequence, regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network, and we investigate different ways to predict how much computation is required for a particular sequence. Unlike dynamic computation in Universal Transformers, which applies the same set of layers iteratively, we apply different layers at every step to adjust both the amount of computation and the model capacity. On IWSLT German-English translation, our approach matches the accuracy of a well-tuned baseline Transformer while using less than a quarter of the decoder layers.
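As a rough illustration of one of the prediction strategies mentioned above, the sketch below (PyTorch, not the authors' code) predicts a per-sequence decoder depth from pooled encoder states; the mean-pooling, the linear classifier, and the way the predicted depth is consumed are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SequenceDepthPredictor(nn.Module):
    """Illustrative sketch: guess from the encoder output how many decoder
    layers a sequence needs, then run only that many layers.

    The mean-pooling and the linear classifier are assumptions; the paper
    studies several ways of predicting the required computation."""

    def __init__(self, d_model=512, n_decoder_layers=6):
        super().__init__()
        self.depth_classifier = nn.Linear(d_model, n_decoder_layers)

    def forward(self, encoder_states):                      # (batch, src_len, d_model)
        pooled = encoder_states.mean(dim=1)                 # average-pool the source sequence
        return self.depth_classifier(pooled).argmax(-1) + 1 # predicted depth in [1, n_layers]

# illustrative usage: run only `depth` decoder layers for this sequence and
# read the output from the exit classifier attached to that layer.
# depth = SequenceDepthPredictor()(torch.randn(1, 20, 512)).item()
```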
Faster Depth-Adaptive Transformers
Depth-adaptive neural networks can dynamically adjust depths according to the
hardness of input words, and thus improve efficiency. The main challenge is how to measure such hardness and decide the required depth (i.e., the number of layers) to use. Previous works generally build a halting unit to decide whether the
computation should continue or stop at each layer. As there is no specific
supervision of depth selection, the halting unit may be under-optimized and
inaccurate, which results in suboptimal and unstable performance when modeling
sentences. In this paper, we get rid of the halting unit and estimate the
required depths in advance, which yields a faster depth-adaptive model.
Specifically, two approaches are proposed to explicitly measure the hardness of
input words and estimate the corresponding adaptive depths, namely 1) mutual
information (MI) based estimation and 2) reconstruction loss based estimation.
We conduct experiments on the text classification task with 24 datasets of
various sizes and domains. Results confirm that our approaches can speed up the
vanilla Transformer (up to 7x) while preserving high accuracy. Moreover,
efficiency and robustness are significantly improved when compared with other
depth-adaptive approaches.
Comment: AAAI-2021. Code will appear at: https://github.com/Adaxry/Adaptive-Transforme
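A minimal sketch of the "estimate depth in advance" idea, assuming per-token hardness scores in [0, 1] (stand-ins for the MI- or reconstruction-loss-based estimates) and a simple linear mapping to a layer budget; the real model can skip computation for finished tokens, whereas this illustration merely freezes their states.

```python
import torch
import torch.nn as nn

def hardness_to_depth(scores, n_layers):
    """Map per-token hardness scores in [0, 1] to a layer budget in [1, n_layers].
    The linear bucketing is an assumed placeholder for the paper's estimators."""
    return (scores * (n_layers - 1)).round().long() + 1

class PreassignedDepthEncoder(nn.Module):
    def __init__(self, d_model=256, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.n_layers = n_layers

    def forward(self, x, hardness):                         # x: (batch, seq, d_model)
        depth = hardness_to_depth(hardness, self.n_layers)  # (batch, seq) layer budgets
        h = x
        for i, layer in enumerate(self.layers):
            out = layer(h)
            active = (depth > i).unsqueeze(-1).type_as(h)   # 1 where a token still needs depth
            h = active * out + (1 - active) * h             # exhausted tokens keep their states
        return h

# illustrative usage:
# enc = PreassignedDepthEncoder()
# y = enc(torch.randn(2, 10, 256), torch.rand(2, 10))
```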
DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models
Encoder-decoder transformer models have achieved great success on various
vision-language (VL) tasks, but they suffer from high inference latency.
Typically, the decoder takes up most of the latency because of the
auto-regressive decoding. To accelerate the inference, we propose an approach
of performing Dynamic Early Exit on Decoder (DEED). We build a multi-exit
encoder-decoder transformer model which is trained with deep supervision so
that each of its decoder layers is capable of generating plausible predictions.
In addition, we leverage simple yet practical techniques, including a shared generation head and adaptation modules, to preserve accuracy when exiting at
shallow decoder layers. Based on the multi-exit model, we perform step-level
dynamic early exit during inference, where the model may decide to use fewer
decoder layers based on its confidence at the current layer at each individual decoding step. Since different numbers of decoder layers may be used at
different decoding steps, we compute deeper-layer decoder features of previous
decoding steps just-in-time, which ensures the features from different decoding
steps are semantically aligned. We evaluate our approach with two
state-of-the-art encoder-decoder transformer models on various VL tasks. We
show our approach can reduce overall inference latency by 30%-60% with
comparable or even higher accuracy compared to the baselines.
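A condensed sketch of the step-level exit (hedged, not the released implementation): at each decoding step the prefix climbs through the decoder layers and exits once a shared generation head is confident. For clarity it recomputes the whole prefix every step instead of the just-in-time caching of deeper-layer features described above; the threshold and batch-size-1 decoding are assumptions.

```python
import torch
import torch.nn as nn

class StepwiseEarlyExitDecoder(nn.Module):
    """Sketch of step-level dynamic early exit with a shared generation head."""

    def __init__(self, d_model=256, n_layers=4, vocab=1000, threshold=0.85):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, vocab)   # shared head used at every exit
        self.threshold = threshold              # assumed confidence threshold

    @torch.no_grad()
    def decode_step(self, prefix_embeds, memory):
        """One decoding step for a single sequence (batch size 1 assumed)."""
        h = prefix_embeds
        for depth, layer in enumerate(self.layers, start=1):
            h = layer(h, memory)
            probs = self.head(h[:, -1]).softmax(-1)  # next-token distribution at this exit
            conf, token = probs.max(-1)
            if conf.item() >= self.threshold:        # confident enough: exit early
                return token, depth
        return token, len(self.layers)
```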
Towards More Efficient Insertion Transformer with Fractional Positional Encoding
Auto-regressive neural sequence models have been shown to be effective across
text generation tasks. However, their left-to-right decoding order prevents
generation from being parallelized. Insertion Transformer (Stern et al., 2019)
is an attractive alternative that allows outputting multiple tokens in a single
generation step. Nevertheless, due to the incompatibility between absolute
positional encoding and insertion-based generation schemes, it needs to refresh
the encoding of every token in the generated partial hypothesis at each step,
which could be costly. We design a novel reusable positional encoding scheme
for insertion transformers called Fractional Positional Encoding (FPE), which
allows reusing representations calculated in previous steps. Empirical studies
on various text generation tasks demonstrate the effectiveness of FPE, which reduces floating-point operations and improves latency in batched decoding.
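To make the reuse property concrete, here is a toy sketch of how fractional positions could be assigned (the midpoint rule is an assumption, not necessarily the exact FPE scheme): a newly inserted token takes a position between its neighbours, so existing tokens keep their positions and their cached representations.

```python
def fractional_position(left_pos, right_pos):
    """Give a newly inserted token the midpoint of its neighbours' positions,
    so no existing token's positional encoding needs to be refreshed.
    (Midpoint assignment is an illustrative assumption.)"""
    return (left_pos + right_pos) / 2.0

# example: partial hypothesis with positions [0.0, 1.0, 2.0]
positions = [0.0, 1.0, 2.0]
new_pos = fractional_position(positions[0], positions[1])  # -> 0.5
positions.insert(1, new_pos)
print(positions)  # [0.0, 0.5, 1.0, 2.0]; earlier encodings are untouched and reusable
```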
Plugin Speech Enhancement: A Universal Speech Enhancement Framework Inspired by Dynamic Neural Network
Deploying a universal neural network for speech enhancement, with the aim of improving noise robustness across diverse speech processing tasks, is difficult because static speech enhancement frameworks are unaware of the speech expected by downstream modules. This limitation prevents static speech enhancement approaches from achieving optimal performance across a range of speech processing tasks, challenging the notion of universal applicability.
The fundamental issue in achieving universal speech enhancement lies in
effectively informing the speech enhancement module about the features of
downstream modules. In this study, we present a novel weighting prediction
approach, which explicitly learns the task relationships from downstream
training information to address the core challenge of universal speech
enhancement. We found the role of deciding whether to employ data augmentation
techniques as crucial downstream training information. This decision
significantly impacts the expected speech and the performance of the speech
enhancement module. Moreover, we introduce a novel speech enhancement network,
the Plugin Speech Enhancement (Plugin-SE). The Plugin-SE is a dynamic neural
network that includes the speech enhancement module, gate module, and weight
prediction module. Experimental results demonstrate that the proposed Plugin-SE
approach is competitive or superior to other joint training methods across
various downstream tasks.
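A very rough sketch of the gating idea (the module names, the task embedding, and the sigmoid mixing are all assumptions about the architecture, not the paper's specification): a weight-prediction module decides, per downstream task, how strongly the enhanced signal should replace the noisy input.

```python
import torch
import torch.nn as nn

class PluginSEGate(nn.Module):
    """Illustrative gate: mix enhanced and noisy features per downstream task."""

    def __init__(self, n_tasks=4, feat_dim=80):
        super().__init__()
        self.enhancer = nn.Sequential(                  # stand-in speech enhancement module
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )
        self.task_embed = nn.Embedding(n_tasks, feat_dim)
        self.weight_predictor = nn.Linear(feat_dim, 1)  # assumed weight-prediction module

    def forward(self, noisy_feats, task_id):            # noisy_feats: (batch, frames, feat_dim)
        enhanced = self.enhancer(noisy_feats)
        w = torch.sigmoid(self.weight_predictor(self.task_embed(task_id)))  # (batch, 1)
        w = w.unsqueeze(1)                               # broadcast over frames
        return w * enhanced + (1 - w) * noisy_feats      # task-dependent mixture
```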
Understanding the Effect of Model Compression on Social Bias in Large Language Models
Large Language Models (LLMs) trained with self-supervision on vast corpora of
web text fit to the social biases of that text. Without intervention, these
social biases persist in the model's predictions in downstream tasks, leading
to representational harm. Many strategies have been proposed to mitigate the
effects of inappropriate social biases learned during pretraining.
Simultaneously, methods for model compression have become increasingly popular
to reduce the computational burden of LLMs. Despite the popularity and need for
both approaches, little work has been done to explore the interplay between
these two. We perform a carefully controlled study of the impact of model
compression via quantization and knowledge distillation on measures of social
bias in LLMs. Longer pretraining and larger models led to higher social bias,
and quantization showed a regularizer effect with its best trade-off around 20%
of the original pretraining time.
Comment: EMNLP 2023 Mai
Predicting Token Impact Towards Efficient Vision Transformer
Token filtering to reduce irrelevant tokens prior to self-attention is a straightforward way to enable an efficient vision Transformer. This is the first
work to view token filtering from a feature selection perspective, where we
weigh the importance of a token according to how much it can change the loss
once masked. If the loss changes greatly after masking a token of interest, it
means that such a token has a significant impact on the final decision and is
thus relevant. Otherwise, the token is less important for the final decision,
so it can be filtered out. After applying the token filtering module, which is learned from the whole training data, the number of tokens fed to the self-attention module is substantially reduced at inference time, leading to far fewer computations in all subsequent self-attention layers. The token filter can be realized with a very simple network, for which we use a multi-layer perceptron. Beyond performing token filtering only once, at the very beginning before self-attention, the other core feature distinguishing our method from other token filters is the predictability of token impact from a feature selection point of view. The experiments show that the proposed method provides an efficient way to obtain a lightweight model by fine-tuning an optimized backbone, which is easier to deploy than existing methods trained from scratch.
Comment: 10 page
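A small sketch of an MLP token filter applied once before self-attention (the scoring MLP and the fixed keep ratio are illustrative stand-ins; in the paper the scores are trained to predict how much the loss changes when a token is masked):

```python
import torch
import torch.nn as nn

class TokenFilter(nn.Module):
    """Keep only the top-scoring tokens before they enter the transformer blocks."""

    def __init__(self, dim=384, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))
        self.keep_ratio = keep_ratio                   # assumed fixed keep ratio

    def forward(self, tokens):                         # tokens: (batch, n_tokens, dim)
        scores = self.scorer(tokens).squeeze(-1)       # predicted impact per token
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        keep = scores.topk(k, dim=1).indices           # most impactful tokens
        keep, _ = keep.sort(dim=1)                     # preserve original token order
        batch_idx = torch.arange(tokens.shape[0]).unsqueeze(-1)
        return tokens[batch_idx, keep]                 # filtered tokens go to self-attention
```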