In recent years, deep neural networks have been very successful
in computer vision, speech recognition, and artificial
intelligence systems. The rapid growth of data and the rapid
improvement of computational tools provide a solid foundation
for applications that rely on training large-scale deep neural
networks with millions of parameters. Deep learning approaches
have been shown to learn powerful representations of their
inputs in various tasks, such as image classification, object
recognition, and scene understanding. This
thesis demonstrates the generality and capacity of deep learning
approaches through a series of case studies including image
matching and human activity understanding. In these studies, I
explore combinations of neural network models with
existing machine learning techniques and extend the deep learning
approach for each task. Four related tasks are investigated: 1)
image matching through similarity learning; 2) human action
prediction; 3) finger force estimation in manipulation actions;
and 4) bimodal learning for human action understanding.

Deep neural networks have been shown to be very effective in
supervised learning. In some tasks, however, one would also like
to group the features of samples from the same category close to
each other, in addition to learning a discriminative
representation. This property is desirable in a number of
applications, such as semantic retrieval, image quality
measurement, and social network analysis. My first study develops
a similarity learning method based on deep neural networks for
image matching between sketch images and 3D models. In this task,
I propose using a Siamese network to learn similarities of
sketches and develop a novel method for sketch-based 3D shape
retrieval. The proposed method successfully learns the
representations of sketch images as well as their similarities,
so the 3D shape retrieval problem can then be solved with
off-the-shelf nearest-neighbor methods.
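
As an illustration only, the following minimal sketch shows how a
similarity objective and nearest-neighbor retrieval fit together.
It assumes a contrastive loss with a margin, which is a common
choice for Siamese networks but is not specified in the text
above; the function names and the toy embeddings are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Assumed contrastive loss: pull same-class pairs together,
    push different-class pairs at least `margin` apart."""
    d = euclidean(emb_a, emb_b)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

def retrieve(query_emb, shape_embs, k=1):
    """Rank 3D shapes by distance to the sketch embedding and
    return the names of the k nearest ones."""
    ranked = sorted(shape_embs,
                    key=lambda name: euclidean(query_emb, shape_embs[name]))
    return ranked[:k]

# Toy embeddings (made up for the example, not learned).
shapes = {"chair": [0.1, 0.0], "table": [2.0, 2.0]}
nearest = retrieve([0.0, 0.0], shapes, k=1)
```

Once the embeddings are learned, retrieval reduces to sorting by
distance, which is why an off-the-shelf nearest-neighbor search
is sufficient at query time.
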

After studying representation learning methods for static
inputs, my focus turns to learning representations of
sequential data. Specifically, I focus on manipulation actions,
because they are ubiquitous in daily life and play an important
part in human-robot collaboration systems. Deep neural
networks have been shown to represent short video clips
effectively [Donahue et al., 2015]. However, most existing methods
treat action recognition as a classification task: they assume
that the inputs are pre-segmented videos and the outputs are
category labels. In scenarios such as human-robot collaboration,
the ability to predict an ongoing human action at an early stage
is highly important. I
first attempt to address this issue with a fast manipulation
action prediction method. Then I build the action prediction
model based on the Long Short-Term Memory (LSTM) architecture. The
proposed approach processes the sequential inputs as continuous
signals and keeps updating the prediction of the intended action
based on the learned action representations.
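
The idea of updating the prediction after every frame can be
sketched as follows. This is a minimal illustration under stated
assumptions, not the model used in this thesis: the class name
TinyLSTM, the layer sizes, and the random untrained weights are
hypothetical, and bias terms are omitted for brevity.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class TinyLSTM:
    """Hypothetical single-layer LSTM with a softmax read-out."""

    def __init__(self, n_in, n_hid, n_cls, seed=0):
        rng = random.Random(seed)
        # One weight matrix per gate, acting on [input; previous hidden state].
        self.W = {g: rand_matrix(n_hid, n_in + n_hid, rng) for g in "ifog"}
        self.W_out = rand_matrix(n_cls, n_hid, rng)
        self.n_hid = n_hid

    def step(self, x, h, c):
        z = x + h  # list concatenation of input and hidden state
        i = [sigmoid(v) for v in matvec(self.W["i"], z)]    # input gate
        f = [sigmoid(v) for v in matvec(self.W["f"], z)]    # forget gate
        o = [sigmoid(v) for v in matvec(self.W["o"], z)]    # output gate
        g = [math.tanh(v) for v in matvec(self.W["g"], z)]  # candidate cell
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c

    def predict_stream(self, frames):
        """Yield a class distribution after each incoming frame."""
        h = [0.0] * self.n_hid
        c = [0.0] * self.n_hid
        for x in frames:
            h, c = self.step(x, h, c)
            yield softmax(matvec(self.W_out, h))

# Five made-up 3-dimensional frame features, two action classes.
model = TinyLSTM(n_in=3, n_hid=4, n_cls=2)
preds = list(model.predict_stream([[0.1, 0.2, 0.3]] * 5))
```

Because a distribution is emitted at every step, a downstream
system can act on the current estimate without waiting for the
action to finish.
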

Further, I study the relationship between visual inputs and the
physical information, such as finger forces, that is involved in
manipulation actions. This is motivated by recent studies in
cognitive science which show that a subject’s intention is
strongly related to the hand movements during the execution of
an action. Human observers can interpret others’ actions in
terms of movements and forces, which can be used to repeat the
observed actions. If a robot system has the ability to estimate
the force feedback, it can learn how to manipulate an object by
watching human demonstrations. In this work, the finger forces
are estimated solely from watching the movement of the hands. A modified
LSTM model is used to regress the finger forces from video
frames. To facilitate this study, a specially designed sensor
glove was used to record finger forces, and a new dataset was
collected that provides synchronized streams of video and finger
forces.

Last, I investigate the usefulness of physical information in
human action recognition. This is an application of bimodal
learning, in which both the visual inputs and the additional
information are used to learn the action representation. My study
demonstrates that combining the additional information with the
visual inputs consistently improves the accuracy of human action
recognition. I extend the LSTM architecture to accept both
video frames and sensor data as bimodal inputs to predict the
action. A hallucination network is jointly trained to approximate
the representations of the additional inputs. During the testing
stage, the hallucination network generates approximated
representations that are used for classification. In this way, the
proposed method does not rely on the additional inputs at test
time.
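
The hallucination idea can be illustrated with a deliberately
simplified sketch. It assumes a linear hallucination map fitted
with a squared-error loss on fixed features, whereas the network
described above is trained jointly with the recognition model;
all names and the synthetic data are hypothetical.

```python
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def train_hallucination(pairs, n_vis, n_sens, lr=0.1, epochs=200):
    """Fit a linear map W by stochastic gradient descent so that
    W applied to a visual feature approximates the sensor feature."""
    W = [[0.0] * n_vis for _ in range(n_sens)]
    for _ in range(epochs):
        for vis, sens in pairs:
            pred = matvec(W, vis)
            for r in range(n_sens):
                err = pred[r] - sens[r]
                for c in range(n_vis):
                    W[r][c] -= lr * err * vis[c]
    return W

def fused_features(vis, W):
    """Test-time input: visual features plus hallucinated sensor
    features, so no real sensor reading is required."""
    return vis + matvec(W, vis)

# Synthetic training pairs (visual feature, sensor feature).
rng = random.Random(0)
true_map = [[1.0, -2.0]]  # made-up relation, for the demo only
pairs = []
for _ in range(20):
    vis = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    pairs.append((vis, matvec(true_map, vis)))

W = train_hallucination(pairs, n_vis=2, n_sens=1)
fused = fused_features([0.5, 0.25], W)
```

The sensor stream is needed only while fitting the map; at test
time the classifier receives the visual features together with
their hallucinated counterpart.
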