Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification
Videos are inherently multimodal. This paper studies the problem of how to
fully exploit the abundant multimodal clues for improved video categorization.
We introduce a hybrid deep learning framework that integrates useful clues from
multiple modalities, including static spatial appearance information, motion
patterns within a short time window, audio information as well as long-range
temporal dynamics. More specifically, we utilize three Convolutional Neural
Networks (CNNs) operating on appearance, motion and audio signals to extract
their corresponding features. We then employ a feature fusion network to derive
a unified representation with an aim to capture the relationships among
features. Furthermore, to exploit the long-range temporal dynamics in videos,
we apply two Long Short Term Memory networks with extracted appearance and
motion features as inputs. Finally, we also propose to refine the prediction
scores by leveraging contextual relationships among video semantics. The hybrid
deep learning framework is able to exploit a comprehensive set of multimodal
features for video classification. Through an extensive set of experiments, we
demonstrate that (1) LSTM networks which model sequences in an explicitly
recurrent manner are highly complementary with CNN models; (2) the feature
fusion network which produces a fused representation through modeling feature
relationships outperforms alternative fusion strategies; (3) the semantic
context of video classes can help further refine the predictions for improved
performance. Experimental results on two challenging benchmarks, the UCF-101
and the Columbia Consumer Videos (CCV), provide strong quantitative evidence
that our framework achieves promising results on both benchmarks,
outperforming competing methods with clear margins.
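To make the fusion idea concrete, here is a minimal PyTorch sketch of this kind of multimodal pipeline, assuming pre-extracted per-modality features; the module name `FusionNet`, the feature dimensions, and the score averaging are illustrative choices, not the paper's exact design:

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Hypothetical stand-in for the paper's hybrid framework: a fusion MLP
    over concatenated appearance/motion/audio features, plus LSTMs over the
    per-frame appearance and motion sequences for long-range dynamics."""
    def __init__(self, dims=(2048, 2048, 128), fused_dim=512, num_classes=101):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(sum(dims), fused_dim), nn.ReLU(),
            nn.Linear(fused_dim, fused_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(fused_dim, num_classes)
        # LSTMs model temporal dynamics the clip-level CNN features miss.
        self.app_lstm = nn.LSTM(dims[0], 256, batch_first=True)
        self.mot_lstm = nn.LSTM(dims[1], 256, batch_first=True)
        self.temporal_classifier = nn.Linear(512, num_classes)

    def forward(self, app_seq, mot_seq, audio_feat):
        # Clip-level stream: average each sequence over time, then fuse.
        fused = self.fuse(torch.cat(
            [app_seq.mean(1), mot_seq.mean(1), audio_feat], dim=-1))
        fusion_logits = self.classifier(fused)
        # Sequence-level stream: last hidden state of each LSTM.
        _, (h_app, _) = self.app_lstm(app_seq)
        _, (h_mot, _) = self.mot_lstm(mot_seq)
        temporal_logits = self.temporal_classifier(
            torch.cat([h_app[-1], h_mot[-1]], dim=-1))
        # Average the two complementary prediction streams.
        return (fusion_logits + temporal_logits) / 2

model = FusionNet()
scores = model(torch.randn(4, 16, 2048),   # appearance features, 16 frames
               torch.randn(4, 16, 2048),   # motion features
               torch.randn(4, 128))        # audio feature
print(scores.shape)                        # torch.Size([4, 101])
```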
An end-to-end generative framework for video segmentation and recognition
We describe an end-to-end generative approach for the segmentation and
recognition of human activities. In this approach, a visual representation
based on reduced Fisher Vectors is combined with a structured temporal model
for recognition. We show that the statistical properties of Fisher Vectors make
them an especially suitable front-end for generative models such as Gaussian
mixtures. The system is evaluated both for the recognition of complex
activities and for their parsing into action units. Using a variety of video
datasets ranging from human cooking activities to animal behaviors, our
experiments demonstrate that the resulting architecture outperforms
state-of-the-art approaches for larger datasets, i.e., when a sufficient
amount of data is available for training structured generative models.
Comment: Proc. of IEEE Winter Conference on Applications of Computer Vision
(WACV), 201
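As a rough illustration of why Fisher Vectors suit generative back-ends, the following scikit-learn sketch encodes clips as PCA-reduced first-order Fisher Vectors and fits a Gaussian observation model per action unit; the descriptor dimensions, component counts, and random stand-in data are all illustrative, not the paper's setup:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(5000, 64))        # stand-in local features

# Background GMM over local descriptors, used to compute Fisher vectors.
bg_gmm = GaussianMixture(n_components=8, covariance_type='diag').fit(descriptors)

def fisher_vector(x, gmm):
    """First-order Fisher vector: posterior-weighted mean deviations."""
    q = gmm.predict_proba(x)                     # (N, K) responsibilities
    diff = x[:, None, :] - gmm.means_[None]      # (N, K, D)
    fv = (q[:, :, None] * diff / np.sqrt(gmm.covariances_[None])).sum(0)
    return (fv / len(x)).ravel()                 # (K*D,)

frames = [rng.normal(size=(30, 64)) for _ in range(200)]   # 200 toy clips
fvs = np.stack([fisher_vector(f, bg_gmm) for f in frames])

reduced = PCA(n_components=32).fit_transform(fvs)  # "reduced Fisher vectors"

# Reduced FVs are approximately Gaussian-distributed, so a single Gaussian
# per action unit is a reasonable generative observation model.
labels = rng.integers(0, 5, size=200)
unit_models = {c: GaussianMixture(n_components=1, covariance_type='diag')
                  .fit(reduced[labels == c]) for c in range(5)}
```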
Detect2Rank : Combining Object Detectors Using Learning to Rank
Object detection is an important research area in the field of computer
vision. Many detection algorithms have been proposed. However, each object
detector relies on specific assumptions of the object appearance and imaging
conditions. As a consequence, no algorithm can be considered as universal. With
the large variety of object detectors, the subsequent question is how to select
and combine them.
In this paper, we propose a framework to learn how to combine object
detectors. The proposed method uses (single) detectors like DPM, CN and EES,
and exploits their correlation by high level contextual features to yield a
combined detection list.
Experiments on the PASCAL VOC07 and VOC10 datasets show that the proposed
method significantly outperforms single object detectors, DPM (8.4%), CN (6.8%)
and EES (17.0%) on VOC07 and DPM (6.5%), CN (5.5%) and EES (16.2%) on VOC10.
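A RankSVM-style sketch of the core idea, learning to re-score a pooled detection list from pairwise preferences; the feature layout and pairwise sampling below are hypothetical stand-ins for the paper's high-level contextual features:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 400
# Each row: one detection from some detector, e.g.
# [detector score, detector-id one-hot..., contextual features].
features = rng.normal(size=(n, 10))
is_correct = rng.integers(0, 2, size=n).astype(bool)  # overlaps ground truth

# Pairwise training data: a correct detection should outrank an incorrect one.
pos, neg = features[is_correct], features[~is_correct]
pairs, targets = [], []
for p in pos[:100]:
    for q in neg[:100:10]:
        pairs.append(p - q); targets.append(1)
        pairs.append(q - p); targets.append(0)

ranker = LinearSVC().fit(np.array(pairs), np.array(targets))

# At test time, every detection from every detector is scored with the
# learned weights, and sorting yields the combined detection list.
combined_scores = features @ ranker.coef_.ravel()
order = np.argsort(-combined_scores)
```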
Multi-scale 3D Convolution Network for Video Based Person Re-Identification
This paper proposes a two-stream convolution network to extract spatial and
temporal cues for video based person Re-Identification (ReID). A temporal
stream in this network is constructed by inserting several Multi-scale 3D (M3D)
convolution layers into a 2D CNN. The resulting M3D convolution network
adds only a small fraction of extra parameters to the 2D CNN, but gains the
ability to learn multi-scale temporal features. With this compact
architecture, the M3D convolution network is also more efficient and easier
to optimize than existing 3D convolution networks. The temporal stream
further involves Residual
Attention Layers (RAL) to refine the temporal features. By jointly learning
spatial-temporal attention masks in a residual manner, RAL identifies the
discriminative spatial regions and temporal cues. The other stream in our
network is implemented with a 2D CNN for spatial feature extraction. The
spatial and temporal features from two streams are finally fused for the video
based person ReID. Evaluations on three widely used benchmark datasets, i.e.,
MARS, PRID2011, and iLIDS-VID, demonstrate the substantial advantages of our
method over existing 3D convolution networks and state-of-the-art methods.
Comment: AAAI, 201
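A minimal PyTorch sketch of what a multi-scale temporal layer of this flavor could look like; the kernel shapes, dilation rates, and residual combination are guesses for illustration, not the paper's exact M3D design:

```python
import torch
import torch.nn as nn

class M3DLayer(nn.Module):
    """A 2D (spatial) convolution augmented with parallel temporal
    convolutions at several dilation rates, added as a residual branch."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))
        # (3,1,1) temporal kernels are cheap compared to full 3x3x3 kernels,
        # which is why the layer adds only a small fraction of parameters.
        self.temporal = nn.ModuleList([
            nn.Conv3d(channels, channels, (3, 1, 1),
                      padding=(d, 0, 0), dilation=(d, 1, 1))
            for d in dilations
        ])

    def forward(self, x):              # x: (B, C, T, H, W)
        out = self.spatial(x)
        # Multi-scale temporal features enter as a residual branch, so the
        # layer falls back to a 2D convolution if the branch outputs zero.
        out = out + sum(conv(x) for conv in self.temporal)
        return torch.relu(out)

feat = torch.randn(2, 64, 8, 28, 28)   # clip features: 8 frames
print(M3DLayer(64)(feat).shape)        # torch.Size([2, 64, 8, 28, 28])
```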
GRAINS: Generative Recursive Autoencoders for INdoor Scenes
We present a generative neural network which enables us to generate plausible
3D indoor scenes in large quantities and varieties, easily and highly
efficiently. Our key observation is that indoor scene structures are inherently
hierarchical. Hence, our network is not convolutional; it is a recursive neural
network or RvNN. Using a dataset of annotated scene hierarchies, we train a
variational recursive autoencoder, or RvNN-VAE, which performs scene object
grouping during its encoding phase and scene generation during decoding.
Specifically, a set of encoders are recursively applied to group 3D objects
based on support, surround, and co-occurrence relations in a scene, encoding
information about object spatial properties, semantics, and their relative
positioning with respect to other objects in the hierarchy. By training a
variational autoencoder (VAE), the resulting fixed-length codes roughly follow
a Gaussian distribution. A novel 3D scene can be generated hierarchically by
the decoder from a randomly sampled code from the learned distribution. We coin
our method GRAINS, for Generative Recursive Autoencoders for INdoor Scenes. We
demonstrate the capability of GRAINS to generate plausible and diverse 3D
indoor scenes and compare with existing methods for 3D scene synthesis. We show
applications of GRAINS including 3D scene modeling from 2D layouts, scene
editing, and semantic scene segmentation via PointNet whose performance is
boosted by the large quantity and variety of 3D scenes generated by our method.
Comment: 21 pages, 26 figures
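A toy sketch of the recursive encode/decode mechanism behind an RvNN-VAE, assuming binary grouping; the dimensions and merge order are illustrative and omit GRAINS' actual grouping by support, surround, and co-occurrence relations:

```python
import torch
import torch.nn as nn

D = 128
merge = nn.Sequential(nn.Linear(2 * D, D), nn.Tanh())   # recursive encoder cell
split = nn.Sequential(nn.Linear(D, 2 * D), nn.Tanh())   # recursive decoder cell
to_mu, to_logvar = nn.Linear(D, D), nn.Linear(D, D)

def encode(node):
    """node is either a leaf object code (tensor) or a pair of child nodes;
    children are merged bottom-up into a single fixed-length root code."""
    if isinstance(node, torch.Tensor):
        return node
    left, right = node
    return merge(torch.cat([encode(left), encode(right)], dim=-1))

def decode(code, depth):
    """Recursively split a code back into a binary hierarchy of leaf codes."""
    if depth == 0:
        return code
    left, right = split(code).chunk(2, dim=-1)
    return (decode(left, depth - 1), decode(right, depth - 1))

# Encode a toy hierarchy of three objects, reparameterize as in a VAE,
# then generate a new scene hierarchy from a randomly sampled code.
leaves = [torch.randn(D) for _ in range(3)]
root = encode(((leaves[0], leaves[1]), leaves[2]))
mu, logvar = to_mu(root), to_logvar(root)
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
new_scene = decode(torch.randn(D), depth=2)
```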
Learning to Hash-tag Videos with Tag2Vec
User-given tags or labels are valuable resources for semantic understanding
of visual media such as images and videos. Recently, a new type of labeling
mechanism known as hash-tags has become increasingly popular on social media
sites. In this paper, we study the problem of generating relevant and useful
hash-tags for short video clips. Traditional data-driven approaches for tag
enrichment and recommendation use direct visual similarity for label transfer
and propagation. We attempt to learn a direct low-cost mapping from video to
hash-tags using a two-step training process. We first employ a natural language
processing (NLP) technique, skip-gram models with neural network training, to
learn a low-dimensional vector representation of hash-tags (Tag2Vec) using a
corpus of 10 million hash-tags. We then train an embedding function to map
video features to the low-dimensional Tag2Vec space. We learn this embedding
for 29 categories of short video clips with hash-tags. A query video without
any tag-information can then be directly mapped to the vector space of tags
using the learned embedding and relevant tags can be found by performing a
simple nearest-neighbor retrieval in the Tag2Vec space. We validate the
relevance of the tags suggested by our system qualitatively and quantitatively
with a user study.
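A small sketch of the two-step pipeline, assuming gensim for the skip-gram step; the toy hash-tag corpus and the 2048-d video feature are placeholders for the 10-million-tag corpus and real CNN features:

```python
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# Step 1: learn the Tag2Vec space with a skip-gram model (sg=1) over
# hash-tags that co-occur on the same clip.
tag_corpus = [["#surfing", "#beach", "#waves"],
              ["#skate", "#kickflip", "#street"],
              ["#beach", "#sunset", "#waves"]]
t2v = Word2Vec(tag_corpus, vector_size=32, window=5, min_count=1, sg=1)

# Step 2: train an embedding function from video features into that space.
embed = nn.Linear(2048, 32)
opt = torch.optim.Adam(embed.parameters(), lr=1e-3)
video_feat = torch.randn(1, 2048)                       # stand-in CNN feature
target = torch.tensor(t2v.wv["#surfing"]).unsqueeze(0)  # paired tag vector
for _ in range(100):
    loss = nn.functional.mse_loss(embed(video_feat), target)
    opt.zero_grad(); loss.backward(); opt.step()

# Tagging an untagged query video: map into the Tag2Vec space and retrieve
# the nearest hash-tags.
query = embed(video_feat).detach().numpy().ravel()
print(t2v.wv.similar_by_vector(query, topn=3))
```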
Action Classification and Highlighting in Videos
Inspired by recent advances in neural machine translation, that jointly align
and translate using encoder-decoder networks equipped with attention, we
propose an attention-based LSTM model for human activity recognition. Our model
jointly learns to classify actions and highlight frames associated with the
action, by attending to salient visual information through a jointly learned
soft-attention network. We explore attention informed by various forms of
visual semantic features, including those encoding actions, objects and scenes.
We qualitatively show that soft-attention can learn to effectively attend to
important objects and scene information correlated with specific human actions.
Further, we show that, quantitatively, our attention-based LSTM outperforms the
vanilla LSTM and CNN models used by state-of-the-art methods. On a large-scale
YouTube video dataset, ActivityNet, our model outperforms competing methods in
action classification.
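A compact PyTorch sketch of a soft-attention LSTM of this kind; the dimensions are illustrative, and the returned attention weights play the role of the frame highlights described above:

```python
import torch
import torch.nn as nn

class AttentionLSTM(nn.Module):
    """The model attends over per-frame CNN features: the attention weights
    both inform classification and highlight action-relevant frames."""
    def __init__(self, feat_dim=1024, hidden=256, num_classes=200):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frames):                  # frames: (B, T, feat_dim)
        h, _ = self.lstm(frames)                # (B, T, hidden)
        alpha = torch.softmax(self.attn(h).squeeze(-1), dim=1)  # (B, T)
        context = (alpha.unsqueeze(-1) * h).sum(1)  # attention-weighted sum
        return self.classifier(context), alpha      # logits + highlights

model = AttentionLSTM()
logits, alpha = model(torch.randn(2, 30, 1024))  # 30 frames per video
print(logits.shape, alpha.shape)                 # [2, 200] [2, 30]
```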
PM-GANs: Discriminative Representation Learning for Action Recognition Using Partial-modalities
Data of different modalities generally convey complementary but heterogeneous
information, and a more discriminative representation is often preferred by
combining multiple data modalities like the RGB and infrared features. However
in reality, obtaining both data channels is challenging due to many
limitations. For example, the RGB surveillance cameras are often restricted
from private spaces, which conflicts with the need for abnormal activity
detection for personal security. As a result, using partial data channels to
build a full representation of multi-modalities is clearly desired. In this
paper, we propose novel Partial-modal Generative Adversarial Networks
(PM-GANs) that learn a full-modal representation using data from only partial
modalities. The full representation is achieved by a generated representation
in place of the missing data channel. Extensive experiments are conducted to
verify the performance of our proposed method on action recognition, compared
with four state-of-the-art methods. Meanwhile, a new Infrared-Visible Dataset
for action recognition is introduced, and will be the first publicly available
action dataset that contains paired infrared and visible spectrum.
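A bare-bones GAN sketch of the partial-modality idea, in which a generator synthesizes a stand-in for the missing visible-spectrum representation from infrared features; the losses, dimensions, and final concatenation are illustrative, not the paper's PM-GANs architecture:

```python
import torch
import torch.nn as nn

D = 512
G = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))     # generator
Disc = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(Disc.parameters(), lr=1e-4)

ir_feat = torch.randn(8, D)    # available modality (infrared)
vis_feat = torch.randn(8, D)   # paired visible features (training only)

for _ in range(5):
    # Discriminator step: real visible representations vs generated ones.
    fake = G(ir_feat).detach()
    d_loss = bce(Disc(vis_feat), torch.ones(8, 1)) + \
             bce(Disc(fake), torch.zeros(8, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: fool the discriminator.
    g_loss = bce(Disc(G(ir_feat)), torch.ones(8, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# At test time the full-modal representation is the infrared feature plus
# the generated stand-in for the missing visible channel.
full_repr = torch.cat([ir_feat, G(ir_feat)], dim=-1)
```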
Visual Affordance and Function Understanding: A Survey
Nowadays, robots are dominating the manufacturing, entertainment and
healthcare industries. Robot vision aims to equip robots with the ability to
discover information, understand it and interact with the environment. These
capabilities require an agent to effectively understand object affordances and
functionalities in complex visual domains. In this literature survey, we first
focus on visual affordances and summarize the state of the art as well as open
problems and research gaps. Specifically, we discuss sub-problems such as
affordance detection, categorization, segmentation and high-level reasoning.
Furthermore, we cover functional scene understanding and the prevalent
functional descriptors used in the literature. The survey also provides
necessary background to the problem, sheds light on its significance and
highlights the existing challenges for affordance and functionality learning.
Comment: 26 pages, 22 images
Rethinking Full Connectivity in Recurrent Neural Networks
Recurrent neural networks (RNNs) are omnipresent in sequence modeling tasks.
Practical models usually consist of several layers of hundreds or thousands of
neurons which are fully connected. This places a heavy computational and memory
burden on hardware, restricting adoption in practical low-cost and low-power
devices. Compared to fully convolutional models, the costly sequential
operation of RNNs severely hinders performance on parallel hardware. This paper
challenges the convention of full connectivity in RNNs. We study structurally
sparse RNNs, showing that they are well suited for acceleration on parallel
hardware, with a greatly reduced cost of the recurrent operations as well as
orders of magnitude less recurrent weights. Extensive experiments on
challenging tasks ranging from language modeling and speech recognition to
video action recognition reveal that structurally sparse RNNs achieve
competitive performance as compared to fully-connected networks. This allows
for using large sparse RNNs for a wide range of real-world tasks that
previously were too costly with fully connected networks.
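A minimal sketch of one way to realize structured sparsity, masking the hidden-to-hidden matrix with fixed (block x block) blocks; the block size and density are illustrative choices:

```python
import torch
import torch.nn as nn

class SparseRNNCell(nn.Module):
    """A fixed block-sparse mask zeroes most of the hidden-to-hidden matrix,
    so the recurrent matmul touches far fewer weights; structured blocks
    map well onto parallel hardware."""
    def __init__(self, input_size, hidden_size, block=16, density=0.25):
        super().__init__()
        self.w_ih = nn.Linear(input_size, hidden_size)
        self.w_hh = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.01)
        # Keep a random fraction of (block x block) blocks in the mask.
        nb = hidden_size // block
        keep = (torch.rand(nb, nb) < density).float()
        mask = keep.repeat_interleave(block, 0).repeat_interleave(block, 1)
        self.register_buffer("mask", mask)

    def forward(self, x, h):
        # Only masked-in recurrent weights contribute to the update.
        return torch.tanh(self.w_ih(x) + h @ (self.w_hh * self.mask).t())

cell = SparseRNNCell(64, 128)
h = torch.zeros(4, 128)
for t in range(10):                       # unroll over a toy sequence
    h = cell(torch.randn(4, 64), h)
print(h.shape, cell.mask.mean().item())   # hidden state, realized density
```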