Learned Token Pruning for Transformers
A major challenge in deploying transformer models is their prohibitive
inference cost, which scales quadratically with the input sequence length. This
makes it especially difficult to use transformers for processing long
sequences. To address this, we present a novel Learned Token Pruning (LTP)
method that reduces redundant tokens as the data passes through the different
layers of the transformer. In particular, LTP prunes tokens with an attention
score below a threshold value, which is learned during training. Importantly,
our threshold based method avoids algorithmically expensive operations such as
top-k token selection which are used in prior token pruning methods, and also
leads to structured pruning. We extensively test the performance of our
approach on multiple GLUE tasks and show that our learned threshold based
method consistently outperforms the prior state-of-the-art top-k token based
method, with up to ~2% higher accuracy at the same number of FLOPs. Furthermore,
our preliminary results show up to 1.4x and 1.9x throughput improvement on
Tesla T4 GPU and Intel Haswell CPU, respectively, with less than a 1% accuracy
drop (and up to 2.1x FLOPs reduction). Our code has been developed in PyTorch
and has been open-sourced.
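The contrast between threshold-based and top-k token selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the importance score is assumed to be the mean attention a token receives across heads and queries, and the threshold is a fixed number here, whereas the abstract states it is learned during training.

```python
def token_importance(attn):
    """Mean attention probability each key token receives.

    attn[h][i][j] is the attention from query token i to key token j
    in head h (an assumed layout for this sketch).
    """
    num_heads = len(attn)
    num_queries = len(attn[0])
    num_keys = len(attn[0][0])
    scores = [0.0] * num_keys
    for head in attn:
        for row in head:
            for j, p in enumerate(row):
                scores[j] += p
    return [s / (num_heads * num_queries) for s in scores]


def prune_by_threshold(scores, threshold):
    """Keep token indices whose score reaches the threshold.

    A single elementwise comparison; no sorting is needed.
    """
    return [i for i, s in enumerate(scores) if s >= threshold]


def prune_top_k(scores, k):
    """Keep the k highest-scoring token indices (requires a ranking step)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # One head, three tokens; each row is a softmax output summing to 1.
    attn = [[[0.7, 0.2, 0.1],
             [0.6, 0.3, 0.1],
             [0.5, 0.4, 0.1]]]
    scores = token_importance(attn)
    print(prune_by_threshold(scores, 0.2))  # [0, 1]
    print(prune_top_k(scores, 2))           # [0, 1]
```

With a threshold, the number of surviving tokens varies with the input, while top-k always keeps exactly k tokens and pays for a ranking step.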
Automated Pruning for Deep Neural Network Compression
In this work we present a method to improve the pruning step of the current
state-of-the-art methodology to compress neural networks. The novelty of the
proposed pruning technique is in its differentiability, which allows pruning to
be performed during the backpropagation phase of the network training. This
enables end-to-end learning and strongly reduces the training time. The
technique is based on a family of differentiable pruning functions and a new
regularizer specifically designed to enforce pruning. The experimental results
show that the joint optimization of both the thresholds and the network weights
achieves a higher compression rate, reducing the number of weights of
the pruned network by a further 14% to 33% compared to the current
state-of-the-art. Furthermore, we believe that this is the first study where
the generalization capabilities in transfer learning tasks of the features
extracted by a pruned network are analyzed. To achieve this goal, we show that
the representations learned using the proposed pruning methodology maintain the
same effectiveness and generality as those learned by the corresponding
non-compressed network on a set of different recognition tasks.
Comment: 8 pages, 5 figures. Published as a conference paper at ICPR 201
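To make the idea of a differentiable pruning function concrete, here is a small sketch under stated assumptions. The abstract does not give the form of the authors' function family or regularizer, so the sigmoid gate below, the names `soft_prune` and `grad_wrt_threshold`, and the sharpness parameter `beta` are all hypothetical stand-ins. The point is only that a smooth gate yields a nonzero gradient with respect to the threshold, so the threshold can be updated by backpropagation alongside the weights.

```python
import math


def _gate(w, t, beta):
    """Smooth stand-in for the hard test |w| > t (assumed form)."""
    return 1.0 / (1.0 + math.exp(-beta * (abs(w) - t)))


def soft_prune(w, t, beta=50.0):
    """Scale weight w by a gate near 0 when |w| < t and near 1 when |w| > t."""
    return w * _gate(w, t, beta)


def grad_wrt_threshold(w, t, beta=50.0):
    """Analytic derivative of soft_prune with respect to the threshold t."""
    g = _gate(w, t, beta)
    return -w * beta * g * (1.0 - g)


if __name__ == "__main__":
    t = 0.1
    for w in (0.01, 0.12, 1.0):
        print(w, soft_prune(w, t), grad_wrt_threshold(w, t))
```

A hard threshold would have zero gradient almost everywhere; the smooth gate trades exact zeros for trainability, and a larger `beta` brings it closer to the hard cut.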
Asymmetric Pruning for Learning Cascade Detectors
Cascade classifiers are one of the most important contributions to real-time
object detection. Nonetheless, there are many challenging problems arising in
training cascade detectors. One common issue is that the node classifier is
trained with a symmetric classifier. Having a low misclassification error rate
does not guarantee an optimal node learning goal in cascade classifiers, i.e.,
an extremely high detection rate with a moderate false positive rate. In this
work, we present a new approach to train an effective node classifier in a
cascade detector. The algorithm is based on two key observations: 1) Redundant
weak classifiers can be safely discarded; 2) The final detector should satisfy
the asymmetric learning objective of the cascade architecture. To achieve this,
we separate the classifier training into two steps: finding a pool of
discriminative weak classifiers/features and training the final classifier by
pruning weak classifiers which contribute little to the asymmetric learning
criterion (asymmetric classifier construction). Our model reduction approach
helps accelerate the learning time while achieving the pre-determined learning
objective. Experimental results on both face and car data sets verify the
effectiveness of the proposed algorithm. On the FDDB face data set, our
approach achieves state-of-the-art performance, which demonstrates the
advantage of our approach.
Comment: 14 pages
Implicit Filter Sparsification In Convolutional Neural Networks
We show implicit filter level sparsity manifests in convolutional neural
networks (CNNs) which employ Batch Normalization and ReLU activation, and are
trained with adaptive gradient descent techniques and L2 regularization or
weight decay. Through an extensive empirical study (Mehta et al., 2019) we
hypothesize the mechanism behind the sparsification process, and find
surprising links to certain filter sparsification heuristics proposed in
literature. Emergence of, and the subsequent pruning of, selective features is
observed to be one of the contributing mechanisms, leading to feature sparsity
on par with or better than certain explicit sparsification / pruning approaches. In
this workshop article we summarize our findings, and point out corollaries of
selective-feature penalization which could also be employed as heuristics for
filter pruning.
Comment: ODML-CDNNR 2019 (ICML'19 workshop) extended abstract of the CVPR 2019
paper "On Implicit Filter Level Sparsity in Convolutional Neural Networks,
Mehta et al." (arXiv:1811.12495)
Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework
Speech recognition systems for irregularly-spelled languages like English
normally require hand-written pronunciations. In this paper, we describe a
system for automatically obtaining pronunciations of words for which
pronunciations are not available, but for which transcribed data exists. Our
method integrates information from the letter sequence and from the acoustic
evidence. The novel aspect of the problem that we address is how
to prune entries from such a lexicon (since, empirically, lexicons with too
many entries do not tend to be good for ASR performance). Experiments on
various ASR tasks show that, with the proposed framework, starting with an
initial lexicon of several thousand words, we are able to learn a lexicon which
performs close to a full expert lexicon in terms of WER performance on test
data, and is better than lexicons built using G2P alone or with a pruning
criterion based on pronunciation probability.