Im2Flow: Motion Hallucination from Static Images for Action Recognition
Existing methods to recognize actions in static images take the images at
their face value, learning the appearances---objects, scenes, and body
poses---that distinguish each action class. However, such models are deprived
of the rich dynamic structure and motions that also define human activity. We
propose an approach that hallucinates the unobserved future motion implied by a
single snapshot to help static-image action recognition. The key idea is to
learn a prior over short-term dynamics from thousands of unlabeled videos,
infer the anticipated optical flow on novel static images, and then train
discriminative models that exploit both streams of information. Our main
contributions are twofold. First, we devise an encoder-decoder convolutional
neural network and a novel optical flow encoding that can translate a static
image into an accurate flow map. Second, we show the power of hallucinated flow
for recognition, successfully transferring the learned motion into a standard
two-stream network for activity recognition. On seven datasets, we demonstrate
the power of the approach. It not only achieves state-of-the-art accuracy for
dense optical flow prediction, but also consistently enhances recognition of
actions and dynamic scenes.
Comment: Published in CVPR 2018, project page:
http://vision.cs.utexas.edu/projects/im2flow
Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction
Visual media are powerful means of expressing emotions and sentiments. The
constant generation of new content in social networks highlights the need for
automated visual sentiment analysis tools. While Convolutional Neural Networks
(CNNs) have established a new state-of-the-art in several vision problems,
their application to the task of sentiment analysis is mostly unexplored and
there are few studies regarding how to design CNNs for this purpose. In this
work, we study the suitability of fine-tuning a CNN for visual sentiment
prediction as well as explore performance boosting techniques within this deep
learning setting. Finally, we provide a deep-dive analysis into a benchmark,
state-of-the-art network architecture to gain insight about how to design
patterns for CNNs on the task of visual sentiment prediction.
Comment: Preprint of the paper accepted at the 1st Workshop on Affect and
Sentiment in Multimedia (ASM), in ACM MultiMedia 2015. Brisbane, Australi
Borrowing Treasures from the Wealthy: Deep Transfer Learning through Selective Joint Fine-tuning
Deep neural networks require a large amount of labeled training data during
supervised learning. However, collecting and labeling so much data might be
infeasible in many cases. In this paper, we introduce a source-target selective
joint fine-tuning scheme for improving the performance of deep learning tasks
with insufficient training data. In this scheme, a target learning task with
insufficient training data is carried out simultaneously with another source
learning task with abundant training data. However, the source learning task
does not use all existing training data. Our core idea is to identify and use a
subset of training images from the original source learning task whose
low-level characteristics are similar to those from the target learning task,
and jointly fine-tune shared convolutional layers for both tasks. Specifically,
we compute descriptors from linear or nonlinear filter bank responses on
training images from both tasks, and use such descriptors to search for a
desired subset of training samples for the source learning task.
Experiments demonstrate that our selective joint fine-tuning scheme achieves
state-of-the-art performance on multiple visual classification tasks with
insufficient training data for deep learning. Such tasks include Caltech 256,
MIT Indoor 67, Oxford Flowers 102 and Stanford Dogs 120. In comparison to
fine-tuning without a source domain, the proposed method can improve the
classification accuracy by 2% - 10% using a single model.
Comment: To appear in 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2017
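The subset-selection step described in this abstract can be illustrated with a minimal sketch. It assumes each image has already been reduced to a fixed-length descriptor and uses plain Euclidean nearest-neighbour search with a fixed k; the function name `select_source_subset`, the distance measure, and the value of k are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def select_source_subset(target_desc, source_desc, k=2):
    """Pick source samples that are among the k nearest neighbours
    (Euclidean distance, an assumption) of at least one target sample.

    target_desc: array of shape (n_target, d)
    source_desc: array of shape (n_source, d)
    Returns a sorted array of unique source indices.
    """
    # Pairwise squared distances, shape (n_target, n_source).
    diff = target_desc[:, None, :] - source_desc[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    # Indices of the k closest source samples for every target sample.
    nearest = np.argsort(d2, axis=1)[:, :k]
    # A source sample is kept once, however many targets selected it.
    return np.unique(nearest)

# Toy descriptors: two target samples, five source samples.
target = np.array([[0.0, 0.0], [10.0, 10.0]])
source = np.array([[0.0, 1.0], [0.0, 2.0], [10.0, 9.0],
                   [50.0, 50.0], [8.0, 10.0]])
subset_k1 = select_source_subset(target, source, k=1)
subset_k2 = select_source_subset(target, source, k=2)
```

In this toy run the far-away source sample at index 3 is never selected, which is the intended effect: only source images that resemble the target data would take part in joint fine-tuning.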
Exploiting Local Features from Deep Networks for Image Retrieval
Deep convolutional neural networks have been successfully applied to image
classification tasks. When these same networks have been applied to image
retrieval, the assumption has been made that the last layers would give the
best performance, as they do in classification. We show that for instance-level
image retrieval, lower layers often perform better than the last layers in
convolutional neural networks. We present an approach for extracting
convolutional features from different layers of the networks, and adopt VLAD
encoding to encode features into a single vector for each image. We investigate
the effect of different layers and scales of input images on the performance of
convolutional features using the recent deep networks OxfordNet and GoogLeNet.
Experiments demonstrate that intermediate layers or higher layers with finer
scales produce better results for image retrieval, compared to the last layer.
When using compressed 128-D VLAD descriptors, our method obtains
state-of-the-art results and outperforms other VLAD and CNN based approaches on
two out of three test datasets. Our work provides guidance for transferring
deep networks trained on image classification to image retrieval tasks.
Comment: CVPR DeepVision Workshop 201
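For readers unfamiliar with the VLAD encoding mentioned in this abstract, the following is a minimal sketch of the standard formulation: each local descriptor is assigned to its nearest codebook centroid, and the residuals are summed per centroid and concatenated. The codebook is assumed to be given (for example from k-means), the signed square-root and L2 normalisation are common choices rather than details confirmed by the abstract, and the compression to 128 dimensions is omitted.

```python
import numpy as np

def vlad_encode(descriptors, centroids):
    """Aggregate local descriptors (n, d) into one VLAD vector of length k*d.

    centroids: codebook of shape (k, d), assumed to be precomputed.
    """
    k, d = centroids.shape
    # Assign every descriptor to its nearest centroid.
    diff = descriptors[:, None, :] - centroids[None, :, :]
    nearest = (diff ** 2).sum(axis=2).argmin(axis=1)
    # Sum the residuals (descriptor minus centroid) per centroid.
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[nearest == i]
        if len(members):
            vlad[i] = (members - centroids[i]).sum(axis=0)
    vlad = vlad.ravel()
    # Signed square-root then L2 normalisation (assumed, common practice).
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Toy example: 50 random 8-D "convolutional features", 4 centroids.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))
codebook = rng.normal(size=(4, 8))
image_vector = vlad_encode(features, codebook)
```

In the setting the abstract describes, `features` would hold the activations at each spatial location of a chosen convolutional layer, so one image yields one fixed-length vector regardless of its size.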