Real Time Image Saliency for Black Box Classifiers
In this work we develop a fast saliency detection method that can be applied
to any differentiable image classifier. We train a masking model to manipulate
the scores of the classifier by masking salient parts of the input image. Our
model generalises well to unseen images and requires a single forward pass to
perform saliency detection, making it suitable for use in real-time systems. We
test our approach on the CIFAR-10 and ImageNet datasets and show that the produced
saliency maps are easily interpretable, sharp, and free of artifacts. We
suggest a new metric for saliency and test our method on the ImageNet object
localisation task, achieving results that outperform other weakly supervised
methods.
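As a rough sketch of the masking idea this abstract describes, the following PyTorch fragment trains a small masker network against a frozen classifier so that one forward pass yields a saliency mask. All names here (Masker, saliency_loss, lam_area) are hypothetical, and the loss is a simplified stand-in for the paper's full objective, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Masker(nn.Module):
    """Toy encoder-decoder mapping an image to a single-channel mask."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))

    def forward(self, x):
        return torch.sigmoid(self.dec(self.enc(x)))  # mask in [0, 1]

def saliency_loss(clf, masker, x, target, lam_area=1.0):
    m = masker(x)                                         # 1 = salient pixel
    log_p_keep = F.log_softmax(clf(x * m), dim=1)         # salient pixels kept
    p_drop = F.softmax(clf(x * (1 - m)), dim=1)           # salient pixels removed
    keep = -log_p_keep.gather(1, target[:, None]).mean()  # stay confident
    drop = p_drop.gather(1, target[:, None]).mean()       # lose confidence
    return keep + drop + lam_area * m.mean()              # prefer small masks
```

Minimizing saliency_loss rewards masks that keep the classifier confident on the preserved pixels, destroy its confidence once those pixels are removed, and stay small; at test time a single call to masker(x) produces the map.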
From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning
Video captioning is in essence a complex natural process, affected by various
uncertainties stemming from video content, subjective judgment, etc. In this
paper we build on the recent progress in using the encoder-decoder framework
for video captioning and address what we find to be a critical deficiency of
existing methods: most of the decoders propagate deterministic hidden states,
and such complex uncertainty cannot be modeled efficiently by deterministic
models. We therefore propose a generative approach, referred to as the
multi-modal stochastic RNN (MS-RNN), which models the uncertainty observed in
the data using latent stochastic variables. MS-RNN can thereby improve the
performance of video captioning and generate multiple sentences to describe a
video under different random factors.
Specifically, a multi-modal LSTM (M-LSTM) is first proposed to interact with
both visual and textual features to capture a high-level representation. Then,
a backward stochastic LSTM (S-LSTM) is proposed to support uncertainty
propagation by introducing latent variables. Experimental results on the
challenging MSVD and MSR-VTT datasets show that our proposed MS-RNN approach
outperforms state-of-the-art video captioning methods.
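The two components named above, an LSTM fusing visual and textual inputs and a stochastic layer injecting latent variables, could be sketched roughly as below. This is an illustrative guess at the structure, not the authors' MS-RNN; StochasticStep and its dimensions are invented, and the latent variable follows the standard VAE reparameterization trick.

```python
import torch
import torch.nn as nn

class StochasticStep(nn.Module):
    """One decoding step: fuse visual + word features with a sampled latent."""
    def __init__(self, vis_dim, word_dim, hid_dim, z_dim):
        super().__init__()
        self.cell = nn.LSTMCell(vis_dim + word_dim + z_dim, hid_dim)
        self.to_mu = nn.Linear(hid_dim, z_dim)
        self.to_logvar = nn.Linear(hid_dim, z_dim)

    def forward(self, vis, word, state):
        h, c = state
        # z ~ N(mu(h), sigma(h)^2) via reparameterization; the randomness
        # in z is what allows different captions for the same video.
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        h, c = self.cell(torch.cat([vis, word, z], dim=-1), (h, c))
        return (h, c), mu, logvar  # mu/logvar feed the KL term in the loss
```

Training such a step would pair the usual word-level cross-entropy with a KL term pulling (mu, logvar) toward a standard-normal prior; re-sampling z at inference is what yields multiple distinct sentences for one video.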
Integrating a Non-Uniformly Sampled Software Retina with a Deep CNN Model
We present a biologically inspired method for pre-processing the images fed to CNNs,
which reduces their memory requirements while increasing their invariance to scale and rotation
changes. Our method is based on the mammalian retino-cortical transform: a
mapping between a pseudo-randomly tessellated retina model (used to sample an input
image) and a CNN. The aim of this first pilot study is to demonstrate a functional
retina-integrated CNN implementation, and this produced the following results: a network using
the full retino-cortical transform yielded an F1 score of 0.80 on a test set during a 4-way
classification task, while an identical network not using the proposed method yielded an
F1 score of 0.86 on the same task. The method reduced the visual data by a factor of 7,
the input data to the CNN by 40%, and the number of CNN training epochs by 64%. These
results demonstrate the viability of our method and hint at the potential of exploiting
functional traits of natural vision systems in CNNs.
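The retino-cortical sampling the abstract describes can be pictured with a minimal NumPy sketch: node density is highest at fixation and falls off geometrically with eccentricity. The jittered log-polar layout and nearest-neighbour lookup below are simplifications; the paper's retina is a pseudo-randomly tessellated model (around 40K nodes in the follow-up work) with proper receptive fields, and the function names here are invented for illustration.

```python
import numpy as np

def make_retina(n_rings=64, n_wedges=128, r_min=2.0, r_max=112.0, seed=0):
    """Jittered log-polar node layout: dense at fixation, sparse outward."""
    rng = np.random.default_rng(seed)
    radii = np.geomspace(r_min, r_max, n_rings)   # geometric ring spacing
    thetas = np.linspace(0.0, 2 * np.pi, n_wedges, endpoint=False)
    r, t = np.meshgrid(radii, thetas)             # (n_wedges, n_rings)
    t = t + rng.normal(scale=0.01, size=t.shape)  # pseudo-random jitter
    return np.stack([r * np.cos(t), r * np.sin(t)], axis=-1)

def sample_image(img, nodes, center):
    """Nearest-neighbour lookup of each node; output feeds the CNN."""
    xy = np.rint(nodes + center).astype(int)
    xy = np.clip(xy, 0, np.array(img.shape[:2])[::-1] - 1)
    return img[xy[..., 1], xy[..., 0]]            # (n_wedges, n_rings, C)
```

The (wedges × rings) output acts as the cortical image a standard CNN consumes, which is where the data reduction and the tolerance to rotation and scale changes around fixation come from.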
Egocentric Perception using a Biologically Inspired Software Retina Integrated with a Deep CNN
We presented the concept of a software retina, capable
of significant visual data reduction in combination with
scale and rotation invariance, for applications in egocentric
and robot vision at the first EPIC workshop in Amsterdam
[9]. Our method is based on the mammalian retino-cortical
transform: a mapping between a pseudo-randomly tessellated
retina model (used to sample an input image) and a
CNN. The aim of this first pilot study is to demonstrate a
functional retina-integrated CNN implementation and this
produced the following results: a network using the full
retino-cortical transform yielded an F1 score of 0.80 on a
test set during a 4-way classification task, while an identical
network not using the proposed method yielded an F1
score of 0.86 on the same task. On a 40K node retina the
method reduced the visual data by a factor of 7, the input data to the
CNN by 40%, and the number of CNN training epochs by
36%. These results demonstrate the viability of our method
and hint at the potential of exploiting functional traits of
natural vision systems in CNNs. In addition to the above
study, we present further recent developments: a port of
the retina to an Apple iPhone, an implementation in CUDA
C for NVIDIA GPU platforms, and extensions of the retina
model we have adopted.
Co-Regularized Deep Representations for Video Summarization
Compact keyframe-based video summaries are a popular way of generating
viewership on video sharing platforms. Yet, creating relevant and compelling
summaries for arbitrarily long videos with a small number of keyframes is a
challenging task. We propose a comprehensive keyframe-based summarization
framework combining deep convolutional neural networks and restricted Boltzmann
machines. An original co-regularization scheme is used to discover meaningful
subject-scene associations. The resulting multimodal representations are then
used to select highly-relevant keyframes. A comprehensive user study is
conducted comparing our proposed method to a variety of schemes, including the
summarization currently in use by one of the most popular video sharing
websites. The results show that our method consistently outperforms the
baseline schemes for any given number of keyframes, both in terms of
attractiveness and informativeness. The lead is even more significant for
smaller summaries.
Comment: Video summarization, deep convolutional neural networks,
co-regularized restricted Boltzmann machine
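As a loose illustration of the co-regularization idea, the sketch below projects two CNN feature views of a frame, one for the subject and one for the scene, into a shared code and penalizes their disagreement. The paper couples restricted Boltzmann machines; plain sigmoid encoders stand in here, and every name (CoRegularized, co_reg_loss, lam) is hypothetical.

```python
import torch
import torch.nn as nn

class CoRegularized(nn.Module):
    """Two feature views projected into one shared code space."""
    def __init__(self, d_subj, d_scene, d_hidden):
        super().__init__()
        self.f_subj = nn.Linear(d_subj, d_hidden)
        self.f_scene = nn.Linear(d_scene, d_hidden)

    def forward(self, subj, scene):
        h_subj = torch.sigmoid(self.f_subj(subj))
        h_scene = torch.sigmoid(self.f_scene(scene))
        return h_subj, h_scene

def co_reg_loss(h_subj, h_scene, lam=0.1):
    # Penalize disagreement between the two views so that the shared
    # code reflects joint subject-scene structure, not either view alone.
    return lam * (h_subj - h_scene).pow(2).mean()
```

Frames where the two views agree strongly carry the meaningful subject-scene associations, and their shared codes can then be scored to pick the most relevant keyframes.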