Multi-scale Orderless Pooling of Deep Convolutional Activation Features
Deep convolutional neural networks (CNN) have shown their promise as a
universal representation for recognition. However, global CNN activations lack
geometric invariance, which limits their robustness for classification and
matching of highly variable scenes. To improve the invariance of CNN
activations without degrading their discriminative power, this paper presents a
simple but effective scheme called multi-scale orderless pooling (MOP-CNN).
This scheme extracts CNN activations for local patches at multiple scale
levels, performs orderless VLAD pooling of these activations at each level
separately, and concatenates the result. The resulting MOP-CNN representation
can be used as a generic feature for either supervised or unsupervised
recognition tasks, from image classification to instance-level retrieval; it
consistently outperforms global CNN activations without requiring any joint
training of prediction layers for a particular target dataset. In absolute
terms, it achieves state-of-the-art results on the challenging SUN397 and MIT
Indoor Scenes classification datasets, and competitive results on
ILSVRC2012/2013 classification and INRIA Holidays retrieval datasets.
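The pooling scheme described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the patch activations and a per-level codebook of centers are already available as plain lists of floats, it skips any dimensionality reduction, and the names `vlad_pool` and `mop_pool` are hypothetical.

```python
import math


def vlad_pool(descriptors, centers):
    """Orderless VLAD pooling (toy version).

    Each descriptor is hard-assigned to its nearest center, the residuals
    (descriptor minus center) are summed per center, and the concatenated
    result is L2-normalized. The codebook `centers` is assumed to be given.
    """
    k, d = len(centers), len(centers[0])
    residuals = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Index of the nearest center under squared Euclidean distance.
        nearest = min(
            range(k),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])),
        )
        for i in range(d):
            residuals[nearest][i] += x[i] - centers[nearest][i]
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


def mop_pool(levels):
    """Pool each scale level separately and concatenate the results.

    `levels` is a list of (descriptors, centers) pairs, one per scale level
    (an assumed input layout, chosen for this sketch).
    """
    pooled = []
    for descriptors, centers in levels:
        pooled.extend(vlad_pool(descriptors, centers))
    return pooled
```

Because the residuals are summed, the pooled vector does not depend on the order in which patches are visited, which is the "orderless" property the abstract refers to.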
Learning the Roots of Visual Domain Shift
In this paper we focus on the spatial nature of visual domain shift,
attempting to learn where domain adaptation originates in each given image of
the source and target set. We borrow concepts and techniques from the CNN
visualization literature, and learn domainness maps able to localize the degree
of domain specificity in images. We derive from these maps features related to
different domainness levels, and we show that by considering them as a
preprocessing step for a domain adaptation algorithm, the final classification
performance is strongly improved. Combined with the whole image representation,
these features provide state-of-the-art results on the Office dataset.
Comment: Extended Abstract
Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation
We present our submission to the Microsoft Video to Language Challenge of
generating short captions describing videos in the challenge dataset. Our model
is based on the encoder-decoder pipeline, popular in image and video
captioning systems. We propose to utilize two different kinds of video
features, one to capture the video content in terms of objects and attributes,
and the other to capture the motion and action information. Using these diverse
features we train models specializing in two separate input sub-domains. We
then train an evaluator model which is used to pick the best caption from the
pool of candidates generated by these domain expert models. We argue that this
approach is better suited for the current video captioning task, compared to
using a single model, due to the diversity in the dataset.
The efficacy of our method is demonstrated by the fact that it was rated best in
the MSR Video to Language Challenge under human evaluation. Additionally, we
ranked second in the table based on automatic evaluation metrics.
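The candidate-pool idea in this abstract can be sketched in a few lines. This is only an illustration of the selection step: the expert caption generators and the evaluator are passed in as plain callables, and their signatures, along with the name `pick_best_caption`, are assumptions rather than the authors' interfaces.

```python
def pick_best_caption(video_features, generators, evaluator):
    """Collect one candidate per expert model and return the top-scoring one.

    `generators` is a list of callables mapping features to a caption string;
    `evaluator` maps (features, caption) to a numeric score. Both are
    hypothetical stand-ins for trained models.
    """
    pool = [generate(video_features) for generate in generators]
    scores = [evaluator(video_features, caption) for caption in pool]
    best = max(range(len(pool)), key=scores.__getitem__)
    return pool[best], scores[best]
```

Keeping the evaluator separate from the generators means further expert models can be added to the pool without retraining the ones already there.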
Compression of Deep Neural Networks on the Fly
Thanks to their state-of-the-art performance, deep neural networks are
increasingly used for object recognition. To achieve these results, they rely on
millions of trainable parameters. However, when targeting embedded
applications the size of these models becomes problematic. As a consequence,
their use on smartphones or other resource-limited devices is impractical. In
this paper we introduce a novel compression method for deep neural networks
that is performed during the learning phase. It consists of adding an extra
regularization term to the cost function of fully-connected layers. We combine
this method with Product Quantization (PQ) of the trained weights for higher
savings in storage consumption. We evaluate our method on two data sets (MNIST
and CIFAR10), on which we achieve significantly larger compression rates than
state-of-the-art methods.
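The Product Quantization step mentioned in this abstract can be illustrated with a small sketch. It covers only generic PQ applied to the rows of a weight matrix; the regularization term the abstract adds during training is not reproduced. The function names, the deterministic k-means initialization, and the list-of-lists layout are all assumptions made for this example.

```python
def _sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _kmeans(points, k, iters=10):
    """Plain Lloyd's k-means, initialized from the first k points."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda c: _sqdist(p, centers[c]))
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers


def pq_encode(rows, n_sub, k):
    """Split every row into n_sub chunks and quantize each chunk position
    with its own k-entry codebook. Returns (codebooks, codes)."""
    d = len(rows[0])
    assert d % n_sub == 0, "row length must be divisible by n_sub"
    step = d // n_sub
    codebooks, codes = [], [[] for _ in rows]
    for s in range(n_sub):
        chunks = [r[s * step:(s + 1) * step] for r in rows]
        cb = _kmeans(chunks, k)
        codebooks.append(cb)
        for i, chunk in enumerate(chunks):
            codes[i].append(min(range(len(cb)), key=lambda j: _sqdist(chunk, cb[j])))
    return codebooks, codes


def pq_decode(codebooks, codes):
    """Rebuild approximate rows by concatenating the selected codewords."""
    return [[v for s, j in enumerate(row) for v in codebooks[s][j]] for row in codes]
```

The storage saving comes from keeping only the small integer codes per row plus the shared codebooks, instead of the full floating-point weights.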
- …