1,153 research outputs found
Excitation Dropout: Encouraging Plasticity in Deep Neural Networks
We propose a guided dropout regularizer for deep networks based on the
evidence of a network prediction defined as the firing of neurons in specific
paths. In this work, we utilize the evidence at each neuron to determine the
probability of dropout, rather than dropping out neurons uniformly at random as
in standard dropout. In essence, we dropout with higher probability those
neurons which contribute more to decision making at training time. This
approach penalizes high saliency neurons that are most relevant for model
prediction, i.e. those having stronger evidence. By dropping such high-saliency
neurons, the network is forced to learn alternative paths in order to maintain
loss minimization, resulting in a plasticity-like behavior, a characteristic of
human brains too. We demonstrate better generalization ability, an increased
utilization of network neurons, and a higher resilience to network compression
using several metrics over four image/video recognition benchmarks
Excitation Backprop for RNNs
Deep models are state-of-the-art for many vision tasks including video action
recognition and video captioning. Models are trained to caption or classify
activity in videos, but little is known about the evidence used to make such
decisions. Grounding decisions made by deep networks has been studied in
spatial visual content, giving more insight into model predictions for images.
However, such studies are relatively lacking for models of spatiotemporal
visual content - videos. In this work, we devise a formulation that
simultaneously grounds evidence in space and time, in a single pass, using
top-down saliency. We visualize the spatiotemporal cues that contribute to a
deep model's classification/captioning output using the model's internal
representation. Based on these spatiotemporal cues, we are able to localize
segments within a video that correspond with a specific action, or phrase from
a caption, without explicitly optimizing/training for these tasks.Comment: CVPR 2018 Camera Ready Versio
Grounding deep models of visual data
Deep models are state-of-the-art for many computer vision tasks including object classification, action recognition, and captioning. As Artificial Intelligence systems that utilize deep models are becoming ubiquitous, it is also becoming crucial to explain why they make certain decisions: Grounding model decisions. In this thesis, we study: 1) Improving Model Classification. We show that by utilizing web action images along with videos in training for action recognition, significant performance boosts of convolutional models can be achieved. Without explicit grounding, labeled web action images tend to contain discriminative action poses, which highlight discriminative portions of a video’s temporal progression. 2) Spatial Grounding. We visualize spatial evidence of deep model predictions using a discriminative top-down attention mechanism, called Excitation Backprop. We show how such visualizations are equally informative for correct and incorrect model predictions, and highlight the shift of focus when different training strategies are adopted. 3) Spatial Grounding for Improving Model Classification at Training Time. We propose a guided dropout regularizer for deep networks based on the evidence of a network prediction. This approach penalizes neurons that are most relevant for model prediction. By dropping such high-saliency neurons, the network is forced to learn alternative paths in order to maintain loss minimization. We demonstrate better generalization ability, an increased utilization of network neurons, and a higher resilience to network compression. 4) Spatial Grounding for Improving Model Classification at Test Time. We propose Guided Zoom, an approach that utilizes spatial grounding to make more informed predictions at test time. Guided Zoom compares the evidence used to make a preliminary decision with the evidence of correctly classified training examples to ensure evidenceprediction consistency, otherwise refines the prediction. We demonstrate accuracy gains for fine-grained classification. 5) Spatiotemporal Grounding. We devise a formulation that simultaneously grounds evidence in space and time, in a single pass, using top-down saliency. We visualize the spatiotemporal cues that contribute to a deep recurrent neural network’s classification/captioning output. Based on these spatiotemporal cues, we are able to localize segments within a video that correspond with a specific action, or phrase from a caption, without explicitly optimizing/training for these tasks
BlockDrop: Dynamic Inference Paths in Residual Networks
Very deep convolutional neural networks offer excellent recognition results,
yet their computational expense limits their impact for many real-world
applications. We introduce BlockDrop, an approach that learns to dynamically
choose which layers of a deep network to execute during inference so as to best
reduce total computation without degrading prediction accuracy. Exploiting the
robustness of Residual Networks (ResNets) to layer dropping, our framework
selects on-the-fly which residual blocks to evaluate for a given novel image.
In particular, given a pretrained ResNet, we train a policy network in an
associative reinforcement learning setting for the dual reward of utilizing a
minimal number of blocks while preserving recognition accuracy. We conduct
extensive experiments on CIFAR and ImageNet. The results provide strong
quantitative and qualitative evidence that these learned policies not only
accelerate inference but also encode meaningful visual information. Built upon
a ResNet-101 model, our method achieves a speedup of 20\% on average, going as
high as 36\% for some images, while maintaining the same 76.4\% top-1 accuracy
on ImageNet.Comment: CVPR 201
Deep learning model combination and regularization using convolutional neural networks
Convolutional neural networks (CNNs) were inspired by biology. They are hierarchical neural
networks whose convolutional layers alternate with subsampling layers, reminiscent of simple
and complex cells in the primary visual cortex [Fuk86a]. In the last years, CNNs have emerged
as a powerful machine learning model and achieved the best results in many object recognition
benchmarks [ZF13, HSK+12, LCY14, CMMS12].
In this dissertation, we introduce two new proposals for convolutional neural networks. The
first, is a method to combine the output probabilities of CNNs which we call Weighted
Convolutional Neural Network Ensemble. Each network has an associated weight that makes
networks with better performance have a greater influence at the time to classify a pattern
when compared to networks that performed worse. This new approach produces better results
than the common method that combines the networks doing just the average of the output
probabilities to make the predictions. The second, which we call DropAll, is a generalization
of two well-known methods for regularization of fully-connected layers within convolutional
neural networks, DropOut [HSK+12] and DropConnect [WZZ+13]. Applying these methods
amounts to sub-sampling a neural network by dropping units. When training with DropOut, a
randomly selected subset of the output layer’s activations are dropped, when training with
DropConnect we drop a randomly subsets of weights. With DropAll we can perform both methods
simultaneously. We show the validity of our proposals by improving the classification error on a
common image classification benchmark
Video Time: Properties, Encoders and Evaluation
Time-aware encoding of frame sequences in a video is a fundamental problem in
video understanding. While many attempted to model time in videos, an explicit
study on quantifying video time is missing. To fill this lacuna, we aim to
evaluate video time explicitly. We describe three properties of video time,
namely a) temporal asymmetry, b)temporal continuity and c) temporal causality.
Based on each we formulate a task able to quantify the associated property.
This allows assessing the effectiveness of modern video encoders, like C3D and
LSTM, in their ability to model time. Our analysis provides insights about
existing encoders while also leading us to propose a new video time encoder,
which is better suited for the video time recognition tasks than C3D and LSTM.
We believe the proposed meta-analysis can provide a reasonable baseline to
assess video time encoders on equal grounds on a set of temporal-aware tasks.Comment: 14 pages, BMVC 201
Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks
While the use of bottom-up local operators in convolutional neural networks
(CNNs) matches well some of the statistics of natural images, it may also
prevent such models from capturing contextual long-range feature interactions.
In this work, we propose a simple, lightweight approach for better context
exploitation in CNNs. We do so by introducing a pair of operators: gather,
which efficiently aggregates feature responses from a large spatial extent, and
excite, which redistributes the pooled information to local features. The
operators are cheap, both in terms of number of added parameters and
computational complexity, and can be integrated directly in existing
architectures to improve their performance. Experiments on several datasets
show that gather-excite can bring benefits comparable to increasing the depth
of a CNN at a fraction of the cost. For example, we find ResNet-50 with
gather-excite operators is able to outperform its 101-layer counterpart on
ImageNet with no additional learnable parameters. We also propose a parametric
gather-excite operator pair which yields further performance gains, relate it
to the recently-introduced Squeeze-and-Excitation Networks, and analyse the
effects of these changes to the CNN feature activation statistics.Comment: NeurIPS 201
- …