TAPAS: Two-pass Approximate Adaptive Sampling for Softmax
TAPAS is a novel adaptive sampling method for the softmax model. It uses a
two pass sampling strategy where the examples used to approximate the gradient
of the partition function are first sampled according to a squashed population
distribution and then resampled adaptively using the context and current model.
We describe an efficient distributed implementation of TAPAS. We show, on both
synthetic data and a large real dataset, that TAPAS has low computational
overhead and works well for minimizing the rank loss for multi-class
classification problems with a very large label space.
Accelerated Training for Massive Classification via Dynamic Class Selection
Massive classification, a classification task defined over a vast number of
classes (hundreds of thousands or even millions), has become an essential part
of many real-world systems, such as face recognition. Existing methods,
including the deep networks that achieved remarkable success in recent years,
were mostly devised for problems with a moderate number of classes. They would
meet with substantial difficulties, e.g. excessive memory demand and
computational cost, when applied to massive problems. We present a new method
to tackle this problem. This method can efficiently and accurately identify a
small number of "active classes" for each mini-batch, based on a set of dynamic
class hierarchies constructed on the fly. We also develop an adaptive
allocation scheme thereon, which leads to a better tradeoff between performance
and cost. On several large-scale benchmarks, our method significantly reduces
the training cost and memory demand, while maintaining competitive performance.
Comment: 8 pages, 6 figures, AAAI 201
Bayesian Neural Networks at Scale: A Performance Analysis and Pruning Study
Bayesian neural networks (BNNs) are a promising method of obtaining
statistical uncertainties for neural network predictions but with a higher
computational overhead which can limit their practical usage. This work
explores the use of high performance computing with distributed training to
address the challenges of training BNNs at scale. We present a performance and
scalability comparison of training the VGG-16 and Resnet-18 models on a
Cray-XC40 cluster. We demonstrate that network pruning can speed up inference
without accuracy loss and provide an open source software package,
BPrune, to automate this pruning. For certain models we find that pruning
up to 80% of the network results in only a 7.0% loss in accuracy. With the
development of new hardware accelerators for Deep Learning, BNNs are of
considerable interest for benchmarking performance. This analysis of training a
BNN at scale outlines the limitations and benefits compared to a conventional
neural network.
Accelerated Reinforcement Learning for Sentence Generation by Vocabulary Prediction
A major obstacle in reinforcement learning-based sentence generation is the
large action space whose size is equal to the vocabulary size of the
target-side language. To improve the efficiency of reinforcement learning, we
present a novel approach for reducing the action space based on dynamic
vocabulary prediction. Our method first predicts a fixed-size small vocabulary
for each input to generate its target sentence. The input-specific vocabularies
are then used at supervised and reinforcement learning steps, and also at test
time. In our experiments on six machine translation and two image captioning
datasets, our method achieves faster reinforcement learning (2.7x faster)
with less GPU memory (2.3x less) than the full-vocabulary counterpart.
The reinforcement learning with our method consistently leads to significant
improvement of BLEU scores, and the scores are equal to or better than those of
baselines using the full vocabularies, with faster decoding time (3x
faster) on CPUs.
Comment: NAACL2019 camera ready (mini-batch splitting is added)
Navigating with Graph Representations for Fast and Scalable Decoding of Neural Language Models
Neural language models (NLMs) have recently gained a renewed interest by
achieving state-of-the-art performance across many natural language processing
(NLP) tasks. However, NLMs are very computationally demanding largely due to
the computational cost of the softmax layer over a large vocabulary. We observe
that, in decoding of many NLP tasks, only the probabilities of the top-K
hypotheses need to be calculated precisely and K is often much smaller than
the vocabulary size. This paper proposes a novel softmax layer approximation
algorithm, called Fast Graph Decoder (FGD), which quickly identifies, for a
given context, a set of K words that are most likely to occur according to an
NLM. We demonstrate that FGD reduces the decoding time by an order of magnitude
while attaining close to the full softmax baseline accuracy on neural machine
translation and language modeling tasks. We also prove a theoretical
guarantee on the softmax approximation quality.
Learning to Screen for Fast Softmax Inference on Large Vocabulary Neural Networks
Neural language models have been widely used in various NLP tasks, including
machine translation, next word prediction and conversational agents. However,
it is challenging to deploy these models on mobile devices due to their slow
prediction speed, where the bottleneck is to compute top candidates in the
softmax layer. In this paper, we introduce a novel softmax layer approximation
algorithm by exploiting the clustering structure of context vectors. Our
algorithm uses a light-weight screening model to predict a much smaller set of
candidate words based on the given context, and then conducts an exact softmax
only within that subset. Training such a procedure end-to-end is challenging as
traditional clustering methods are discrete and non-differentiable, and thus
unable to be used with back-propagation in the training process. Using the
Gumbel softmax, we are able to train the screening model end-to-end on the
training set to exploit data distribution. The algorithm achieves an order of
magnitude faster inference than the original softmax layer for predicting
top-k words in various tasks such as beam search in machine translation or
next-word prediction. For example, on a German-to-English machine translation
dataset with a vocabulary of around 25K, we achieve a 20.4x speedup with 98.9%
precision@1 and 99.3% precision@5 relative to the original softmax layer
prediction, while the state of the art [MSRprediction] achieves only a 6.7x
speedup with 98.7% precision@1 and 98.1% precision@5 on the same task.
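The screen-then-softmax idea described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' method: the learned screening model is replaced here by a nearest-centroid lookup, and the names `screened_top_k`, `centroids`, `candidate_sets` and `word_vectors` are hypothetical.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def screened_top_k(context, centroids, candidate_sets, word_vectors, k):
    """Return the top-k (word_id, probability) pairs within a screened subset.

    Assumption: screening is a nearest-centroid lookup (largest dot product),
    standing in for a learned screening model.
    """
    # Step 1: screening. Pick the cluster that best matches the context.
    cluster = max(range(len(centroids)), key=lambda c: dot(centroids[c], context))
    candidates = candidate_sets[cluster]

    # Step 2: exact softmax, but only over the candidate words.
    logits = {w: dot(word_vectors[w], context) for w in candidates}
    peak = max(logits.values())  # subtract the max for numerical stability
    norm = sum(math.exp(v - peak) for v in logits.values())
    probs = {w: math.exp(v - peak) / norm for w, v in logits.items()}

    # Step 3: keep the k most probable candidates.
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

The cost of the softmax step scales with the size of the candidate set rather than the full vocabulary, which is where the speedup in such schemes would come from.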
Toward Computation and Memory Efficient Neural Network Acoustic Models with Binary Weights and Activations
Neural network acoustic models have significantly advanced state-of-the-art
speech recognition over the past few years. However, they are usually
computationally expensive due to the large number of matrix-vector
multiplications and nonlinearity operations. Neural network models also require
significant amounts of memory for inference because of the large model size.
For these two reasons, it is challenging to deploy neural network based speech
recognizers on resource-constrained platforms such as embedded devices. This
paper investigates the use of binary weights and activations for computation
and memory efficient neural network acoustic models. Compared to real-valued
weight matrices, binary weights require far fewer bits for storage, thereby
cutting down the memory footprint. Furthermore, with binary weights or
activations, the matrix-vector multiplications are turned into addition and
subtraction operations, which are computationally much faster and more energy
efficient for hardware platforms. In this paper, we study the applications of
binary weights and activations for neural network acoustic modeling, reporting
encouraging results on the WSJ and AMI corpora.
Comment: 7 pages, 3 figures
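To make concrete the claim that binary weights turn matrix-vector multiplication into additions and subtractions, here is a minimal sketch. It assumes weights are binarized to +1/-1 by sign, with zero mapped to +1; that convention and the helper names `binarize` and `binary_matvec` are illustrative assumptions, not taken from the paper.

```python
def binarize(weights):
    """Map each real-valued weight to +1 or -1 by its sign (0 maps to +1 here)."""
    return [[1 if w >= 0 else -1 for w in row] for row in weights]


def binary_matvec(sign_rows, x):
    """Multiply a {-1, +1} matrix by vector x using only add and subtract."""
    out = []
    for signs in sign_rows:
        acc = 0.0
        for s, xi in zip(signs, x):
            # Because s is +1 or -1, s * xi is just +xi or -xi: no multiply needed.
            if s > 0:
                acc += xi
            else:
                acc -= xi
        out.append(acc)
    return out
```

For example, `binary_matvec(binarize([[0.3, -1.2, 0.5]]), [1.0, 2.0, 3.0])` computes 1.0 - 2.0 + 3.0. Each weight also needs only one bit of storage instead of a full floating-point value.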
A^2-Nets: Double Attention Networks
Learning to capture long-range relations is fundamental to image/video
recognition. Existing CNN models generally rely on increasing depth to model
such relations which is highly inefficient. In this work, we propose the
"double attention block", a novel component that aggregates and propagates
informative global features from the entire spatio-temporal space of input
images/videos, enabling subsequent convolution layers to access features from
the entire space efficiently. The component is designed with a double attention
mechanism in two steps, where the first step gathers features from the entire
space into a compact set through second-order attention pooling and the second
step adaptively selects and distributes features to each location via another
attention. The proposed double attention block is easy to adopt and can be
plugged into existing deep neural networks conveniently. We conduct extensive
ablation studies and experiments on both image and video recognition tasks for
evaluating its performance. On the image recognition task, a ResNet-50 equipped
with our double attention blocks outperforms a much larger ResNet-152
architecture on the ImageNet-1k dataset with over 40% fewer parameters
and fewer FLOPs. On the action recognition task, our proposed model achieves
state-of-the-art results on the Kinetics and UCF-101 datasets with
significantly higher efficiency than recent works.
Comment: Accepted at NIPS 201
Adaptive Sampled Softmax with Kernel Based Sampling
Softmax is the most commonly used output function for multiclass problems and
is widely used in areas such as vision, natural language processing, and
recommendation. A softmax model has linear costs in the number of classes which
makes it too expensive for many real-world problems. A common approach to speed
up training involves sampling only some of the classes at each training step.
It is known that this method is biased and that the bias increases the more the
sampling distribution deviates from the output distribution. Nevertheless,
almost all recent work uses simple sampling distributions that require a large
sample size to mitigate the bias. In this work, we propose a new class of
kernel based sampling methods and develop an efficient sampling algorithm.
Kernel based sampling adapts to the model as it is trained, thus resulting in
low bias. Kernel based sampling can be easily applied to many models because it
relies only on the model's last hidden layer. We empirically study the
trade-off of bias, sampling distribution and sample size and show that kernel
based sampling results in low bias with few samples.
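As a rough illustration of the sampled softmax that this abstract builds on, the sketch below compares the full loss with a sampled loss over the target plus m sampled classes. It is a generic sampled softmax, not the kernel based sampler proposed in the paper. It assumes one common correction, in which each sampled logit is reduced by log(m * q_i) for sampling probability q_i and the target logit is left unchanged; the function names are hypothetical.

```python
import math


def full_softmax_loss(logits, target):
    """Cross-entropy over all classes: log-sum-exp(logits) - logits[target]."""
    peak = max(logits)  # subtract the max for numerical stability
    lse = peak + math.log(sum(math.exp(o - peak) for o in logits))
    return lse - logits[target]


def sampled_softmax_loss(logits, target, sampled, q):
    """Cross-entropy over the target plus the sampled classes only.

    Assumption: sampled logits are corrected by log(m * q[i]), where m is the
    number of samples and q[i] is the probability of sampling class i.
    """
    m = len(sampled)
    adjusted = [logits[target]]  # position 0 holds the target's logit
    for i in sampled:
        adjusted.append(logits[i] - math.log(m * q[i]))
    return full_softmax_loss(adjusted, 0)
```

Only `len(sampled) + 1` logits enter the normalizer, so the per-step cost no longer grows linearly with the number of classes. How close the sampled loss stays to the full loss depends on how well q matches the model's output distribution, which is the bias the abstract refers to.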
Adaptive Input Representations for Neural Language Modeling
We introduce adaptive input representations for neural language modeling
which extend the adaptive softmax of Grave et al. (2017) to input
representations of variable capacity. There are several choices on how to
factorize the input and output layers, and whether to model words, characters
or sub-word units. We perform a systematic comparison of popular choices for a
self-attentional architecture. Our experiments show that models equipped with
adaptive embeddings are more than twice as fast to train than the popular
character input CNN while having a lower number of parameters. On the
WikiText-103 benchmark we achieve 18.7 perplexity, an improvement of 10.5
perplexity compared to the previously best published result and on the Billion
Word benchmark, we achieve 23.02 perplexity.
Comment: 12 pages