Feature and Region Selection for Visual Learning
Visual learning problems such as object classification and action recognition
are typically approached using extensions of the popular bag-of-words (BoW)
model. Despite its great success, it is unclear what visual features the BoW
model is learning: Which regions in the image or video are used to discriminate
among classes? Which are the most discriminative visual words? Answering these
questions is fundamental for understanding existing BoW models and inspiring
better models for visual recognition.
To answer these questions, this paper presents a method for feature selection
and region selection in the visual BoW model. This allows for an intermediate
visualization of the features and regions that are important for visual
learning. The main idea is to assign latent weights to the features or regions,
and jointly optimize these latent variables with the parameters of a classifier
(e.g., support vector machine). There are four main benefits of our approach:
(1) Our approach accommodates non-linear additive kernels such as the popular
χ² and intersection kernels; (2) our approach is able to handle both
regions in images and spatio-temporal regions in videos in a unified way; (3)
the feature selection problem is convex, and both problems can be solved using
a scalable reduced gradient method; (4) we point out strong connections with
multiple kernel learning and multiple instance learning approaches.
Experimental results on PASCAL VOC 2007, MSR Action Dataset II, and YouTube
illustrate the benefits of our approach.
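As a rough illustration of how an additive kernel can carry per-feature weights, the sketch below computes the intersection and χ² kernels on bag-of-words histograms, plus a weighted variant. The function names, the per-bin χ² form 2xy/(x+y), and the non-negative weight vector are assumptions made for illustration and are not taken from the paper. The joint optimization of the weights with the classifier is not shown.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical helper names; the paper's own formulation may differ.

def intersection_terms(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Per-bin terms of the histogram intersection kernel: min(x_i, y_i)."""
    return [min(a, b) for a, b in zip(x, y)]


def chi2_terms(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Per-bin terms of an additive chi-square kernel: 2*x_i*y_i / (x_i + y_i)."""
    return [2.0 * a * b / (a + b) if (a + b) > 0 else 0.0 for a, b in zip(x, y)]


def additive_kernel(
    x: Sequence[float],
    y: Sequence[float],
    terms: Callable[[Sequence[float], Sequence[float]], List[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum the per-bin terms, optionally scaled by one weight per visual word.

    Because the kernel is a sum over bins, a weight of 0 removes a visual
    word from the comparison and a weight of 1 keeps it unchanged.
    """
    per_bin = terms(x, y)
    if weights is None:
        weights = [1.0] * len(per_bin)
    return sum(w * t for w, t in zip(weights, per_bin))


if __name__ == "__main__":
    h1 = [0.5, 0.5]    # toy two-word histograms, L1-normalized
    h2 = [0.25, 0.75]
    print(additive_kernel(h1, h2, intersection_terms))
    print(additive_kernel(h1, h2, chi2_terms))
    # Keep only the first visual word (assumed weight vector).
    print(additive_kernel(h1, h2, intersection_terms, weights=[1.0, 0.0]))
```

Writing the kernel as a weighted sum over bins is one way to see the link to multiple kernel learning mentioned in the abstract: each bin acts like a small base kernel with its own coefficient.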
3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks
Human activity understanding with 3D/depth sensors has received increasing
attention in multimedia processing and interactions. This work aims to
develop a novel deep model for automatic activity recognition from RGB-D
videos. We represent each human activity as an ensemble of cubic-like video
segments, and learn to discover the temporal structures for a category of
activities, i.e., how the activities are to be decomposed for
classification. Our model can be regarded as a structured deep architecture, as
it extends the convolutional neural networks (CNNs) by incorporating structure
alternatives. Specifically, we build the network consisting of 3D convolutions
and max-pooling operators over the video segments, and introduce the latent
variables in each convolutional layer manipulating the activation of neurons.
Our model thus advances existing approaches in two aspects: (i) it acts
directly on the raw inputs (grayscale-depth data) to conduct recognition
instead of relying on hand-crafted features, and (ii) the model structure can
be dynamically adjusted accounting for the temporal variations of human
activities, i.e. the network configuration is allowed to be partially activated
during inference. For model training, we propose an EM-type optimization method
that iteratively (i) discovers the latent structure by determining the
decomposed actions for each training example, and (ii) learns the network
parameters by using the back-propagation algorithm. Our approach is validated
in challenging scenarios, and outperforms state-of-the-art methods. A large
human activity database of RGB-D videos is also presented.
Comment: This manuscript has 10 pages with 9 figures, and a preliminary
version was published at the ACM MM'14 conference.
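The alternating, EM-type training loop described in this abstract can be illustrated on a toy problem far simpler than a CNN. In the sketch below, each training example offers several candidate values (standing in for latent decompositions), and the model is a single scalar. The name `alternate_fit`, the squared-error criterion, and the closed-form mean update are hypothetical stand-ins; in particular, step (ii) here replaces back-propagation with a simple average.

```python
from typing import List, Sequence, Tuple


def alternate_fit(
    examples: Sequence[Sequence[float]],
    mu: float = 0.0,
    max_iters: int = 20,
) -> Tuple[float, List[int]]:
    """Alternate between choosing latent candidates and refitting the model.

    examples[i] lists the candidate values for training example i; the
    latent variable is the index of the candidate that gets used.
    """
    chosen: List[int] = []
    for _ in range(max_iters):
        # Step (i): with the model fixed, pick for each example the
        # candidate that the current model explains best.
        new_chosen = [
            min(range(len(c)), key=lambda j: (c[j] - mu) ** 2) for c in examples
        ]
        if new_chosen == chosen:
            break  # latent assignments are stable, so the model is too
        chosen = new_chosen
        # Step (ii): with the latent choices fixed, refit the model
        # (here a mean; a network would take gradient steps instead).
        mu = sum(c[j] for c, j in zip(examples, chosen)) / len(examples)
    return mu, chosen


if __name__ == "__main__":
    data = [[1.0, 10.0], [2.0, -5.0], [3.0, 9.0]]
    print(alternate_fit(data))
```

The loop stops once the latent assignments no longer change, which is a common stopping rule for hard-assignment alternation; the stopping criterion used in the paper is not stated in the abstract.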
On inferring intentions in shared tasks for industrial collaborative robots
Inferring human operators' actions in shared collaborative tasks plays a crucial role in enhancing the cognitive capabilities of industrial robots. In these incipient collaborative robotic applications, humans and robots should share not only space but also forces and the execution of a task. In this article, we present a robotic system that is able to identify different human intentions and adapt its behavior accordingly, using force data alone. To this end, three major contributions are presented: (a) force-based recognition of the operator's intent, (b) a force-based dataset of physical human-robot interaction, and (c) validation of the whole system in a scenario inspired by a realistic industrial application. This work is an important step towards a more natural and user-friendly form of physical human-robot interaction in scenarios where humans and robots collaborate to accomplish a task.
Peer Reviewed. Postprint (published version)
An Expressive Deep Model for Human Action Parsing from A Single Image
This paper addresses a newly emerging task in vision and multimedia research:
recognizing human actions from still images. Its main challenges lie in the
large variations in human poses and appearances, as well as the lack of
temporal motion information. Addressing these problems, we propose to develop
an expressive deep model to naturally integrate human layout and surrounding
contexts for higher level action understanding from still images. In
particular, a Deep Belief Net is trained to fuse information from different
noisy sources such as body part detection and object detection. To bridge the
semantic gap, we used manually labeled data to greatly improve the
effectiveness and efficiency of the pre-training and fine-tuning stages of the
DBN training. The resulting framework is shown to be robust to sometimes
unreliable inputs (e.g., imprecise detections of human parts and objects), and
outperforms the state-of-the-art approaches.
Comment: 6 pages, 8 figures, ICME 201
Modeling Latent Variable Uncertainty for Loss-based Learning
We consider the problem of parameter estimation using weakly supervised
datasets, where a training sample consists of the input and a partially
specified annotation, which we refer to as the output. The missing information
in the annotation is modeled using latent variables. Previous methods
overburden a single distribution with two separate tasks: (i) modeling the
uncertainty in the latent variables during training; and (ii) making accurate
predictions for the output and the latent variables during testing. We propose
a novel framework that separates the demands of the two tasks using two
distributions: (i) a conditional distribution to model the uncertainty of the
latent variables for a given input-output pair; and (ii) a delta distribution
to predict the output and the latent variables for a given input. During
learning, we encourage agreement between the two distributions by minimizing a
loss-based dissimilarity coefficient. Our approach generalizes latent SVM in
two important ways: (i) it models the uncertainty over latent variables instead
of relying on a pointwise estimate; and (ii) it allows the use of loss
functions that depend on latent variables, which greatly increases its
applicability. We demonstrate the efficacy of our approach on two challenging
problems---object detection and action detection---using publicly available
datasets.
Comment: ICML201
Discriminatively Trained Latent Ordinal Model for Video Classification
We study the problem of video classification for facial analysis and human
action recognition. We propose a novel weakly supervised learning method that
models the video as a sequence of automatically mined, discriminative
sub-events (e.g., onset and offset phase for "smile", running and jumping for
"highjump"). The proposed model is inspired by the recent works on Multiple
Instance Learning and latent SVM/HCRF -- it extends such frameworks to model
the ordinal aspect in the videos, approximately. We obtain consistent
improvements over relevant competitive baselines on four challenging and
publicly available video-based facial analysis datasets for prediction of
expression, clinical pain, and intent in dyadic conversations, and on three
challenging human action datasets. We also validate the method with qualitative
results and show that they largely support the intuitions behind the method.
Comment: Paper accepted in IEEE TPAMI. arXiv admin note: substantial text
overlap with arXiv:1604.0150