Spatial-temporal Transformer-guided Diffusion based Data Augmentation for Efficient Skeleton-based Action Recognition
Recently, skeleton-based human action recognition has become a hot research topic because
the compact representation of human skeletons brings fresh momentum to this research
domain. As a result, researchers have begun to recognize the value of using RGB or
other sensors to analyze human actions by extracting skeleton information.
Leveraging the rapid development of deep learning (DL), a significant number of
skeleton-based human action recognition approaches with carefully designed DL
architectures have been presented recently. However, a well-trained DL model demands
high-quality and sufficient data, which is hard to obtain without considerable
expense and human labor. In this paper, we introduce a novel data augmentation
method for skeleton-based action recognition tasks, which can effectively
generate high-quality and diverse sequential actions. In order to obtain
natural and realistic action sequences, we propose denoising diffusion
probabilistic models (DDPMs) that can generate a series of synthetic action
sequences, and their generation process is precisely guided by a
spatial-temporal transformer (ST-Trans). Experimental results show that our
method outperforms state-of-the-art (SOTA) motion generation approaches on
different naturalness and diversity metrics. Moreover, its high-quality
synthetic data can be effectively deployed in existing action recognition
models, yielding significant performance improvements.
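As context for how such a pipeline fits together, the following is a minimal, illustrative sketch (in PyTorch) of DDPM ancestral sampling in which the noise predictor is a stand-in spatial-temporal transformer. The module names, the (frames, joints x 3) sequence layout, and all hyperparameters are assumptions for the example, not the paper's actual ST-Trans.

```python
# Minimal sketch: DDPM-style generation of skeleton sequences with a
# transformer-based noise predictor. Everything here is illustrative.
import torch
import torch.nn as nn

class STTransDenoiser(nn.Module):
    """Toy stand-in for an ST-Trans denoiser: predicts the noise added to a
    batch of skeleton sequences shaped (batch, frames, joints * 3)."""
    def __init__(self, feat_dim=75, model_dim=128, layers=4, heads=4):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, model_dim)
        self.time_embed = nn.Embedding(1000, model_dim)   # diffusion timestep embedding
        enc_layer = nn.TransformerEncoderLayer(model_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.out_proj = nn.Linear(model_dim, feat_dim)

    def forward(self, x_t, t):
        h = self.in_proj(x_t) + self.time_embed(t).unsqueeze(1)
        return self.out_proj(self.encoder(h))

@torch.no_grad()
def sample(model, steps=1000, frames=64, feat_dim=75, device="cpu"):
    """Standard DDPM ancestral sampling loop with a linear beta schedule."""
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, frames, feat_dim, device=device)   # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((1,), i, dtype=torch.long, device=device)
        eps = model(x, t)                                  # predicted noise
        mean = (x - betas[i] / torch.sqrt(1 - alpha_bar[i]) * eps) / torch.sqrt(alphas[i])
        noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[i]) * noise
    return x                                               # synthetic skeleton sequence

model = STTransDenoiser()
synthetic = sample(model, steps=50)    # few steps just to run the sketch quickly
print(synthetic.shape)                 # torch.Size([1, 64, 75])
```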
From Line Drawings to Human Actions: Deep Neural Networks for Visual Data Representation
In recent years, deep neural networks have been very successful
in computer vision, speech recognition, and artificial
intelligence systems. The rapid growth of data and the fast-increasing
availability of computational resources provide solid foundations for
applications that rely on learning large-scale deep
neural networks with millions of parameters. Deep learning
approaches have proved able to learn powerful
representations of the inputs in various tasks, such as image
classification, object recognition, and scene understanding. This
thesis demonstrates the generality and capacity of deep learning
approaches through a series of case studies including image
matching and human activity understanding. In these studies, I
explore combinations of neural network models with
existing machine learning techniques and extend the deep learning
approach for each task. Four related tasks are investigated: 1)
image matching through similarity learning; 2) human action
prediction; 3) finger force estimation in manipulation actions;
and 4) bimodal learning for human action understanding.
Deep neural networks have been shown to be very effective in
supervised learning. Further, in some tasks, one would like to
group the features of samples in the same category close to
each other, in addition to obtaining a discriminative representation.
Such properties are desirable in a number of applications,
such as semantic retrieval, image quality measurement, and social
network analysis. My first study develops a similarity
learning method based on deep neural networks for image matching
between sketch images and 3D models. In this task, I propose to
use a Siamese network to learn similarities of sketches and develop
a novel method for sketch-based 3D shape retrieval. The proposed
method successfully learns the representations of sketch
images as well as their similarities, so the 3D shape retrieval
problem can be solved with off-the-shelf nearest neighbor
methods.
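A minimal sketch of this similarity-learning setup, assuming a shared CNN branch, a contrastive loss, and nearest-neighbour retrieval in the embedded space; the encoder, margin, and feature sizes are illustrative, not the thesis' actual architecture:

```python
# Minimal sketch: Siamese similarity learning + nearest-neighbour retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Small CNN branch shared (Siamese) between sketches and rendered views."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, x):
        return F.normalize(self.fc(self.conv(x).flatten(1)), dim=1)

def contrastive_loss(za, zb, same, margin=0.5):
    """Pull matching pairs together, push non-matching pairs apart."""
    d = (za - zb).pow(2).sum(dim=1).sqrt()
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

encoder = Encoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# One toy training step on random "sketch" / "view" pairs.
sketches = torch.randn(8, 1, 64, 64)
views = torch.randn(8, 1, 64, 64)
same = torch.randint(0, 2, (8,)).float()    # 1 = same shape, 0 = different
loss = contrastive_loss(encoder(sketches), encoder(views), same)
loss.backward()
optimizer.step()

# Retrieval: embed the gallery of 3D-model views once, then answer a sketch
# query with off-the-shelf nearest neighbours in the learned space.
with torch.no_grad():
    gallery = encoder(torch.randn(100, 1, 64, 64))
    query = encoder(torch.randn(1, 1, 64, 64))
    ranking = torch.cdist(query, gallery).argsort(dim=1)
print(ranking[0, :5])    # indices of the five closest models
```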
After studying the representation learning methods for static
inputs, my focus turns to learning the representations of
sequential data. To be specific, I focus on manipulation actions,
because they are common in daily life and play an important
part in human-robot collaboration systems. Deep neural
networks have been shown to be powerful at representing short video
clips [Donahue et al., 2015]. However, most existing methods
consider the action recognition problem as a classification task.
These methods assume the inputs are pre-segmented videos and the
outputs are category labels. In scenarios such as
human-robot collaboration, the ability to predict
ongoing human actions at an early stage is highly important. I
first attempt to address this issue with a fast manipulation
action prediction method. Then I build the action prediction
model based on the Long Short-Term Memory (LSTM) architecture. The
proposed approach processes the sequential inputs as continuous
signals and keeps updating the prediction of the intended action
based on the learned action representations.
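The following is a minimal sketch of that streaming-prediction idea: an LSTM cell consumes one frame feature at a time and re-emits an updated class distribution after every step. Feature dimensionality and the number of action classes are assumptions for the example.

```python
# Minimal sketch: online action prediction that updates after every frame.
import torch
import torch.nn as nn

class OnlinePredictor(nn.Module):
    def __init__(self, feat_dim=128, hidden=256, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim, hidden)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, frames):
        """frames: (time, feat_dim) features of one streaming video."""
        h = frames.new_zeros(1, self.lstm.hidden_size)
        c = frames.new_zeros(1, self.lstm.hidden_size)
        predictions = []
        for f in frames:                        # one frame at a time
            h, c = self.lstm(f.unsqueeze(0), (h, c))
            predictions.append(self.head(h).softmax(dim=1))
        return torch.stack(predictions)         # (time, 1, num_classes)

model = OnlinePredictor()
stream = torch.randn(30, 128)                   # 30 frame features
per_frame = model(stream)
print(per_frame[5].argmax().item())             # best guess after only 6 frames
```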
Further, I study the relationships between visual inputs and the
physical information, such as finger forces, involved in
manipulation actions. This is motivated by recent studies in
cognitive science which show that the subject’s intention is
strongly related to the hand movements during an action
execution. Human observers can interpret others' actions in
terms of movements and forces, which can be used to repeat the
observed actions. If a robot system has the ability to estimate
force feedback, it can learn how to manipulate an object by
watching human demonstrations. In this work, the finger forces
are estimated by watching only the movement of the hands. A modified
LSTM model is used to regress the finger forces from video
frames. To facilitate this study, a specially designed sensor
glove has been used to collect finger force data, and a new
dataset has been collected to provide synchronized streams of
videos and finger forces.
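A minimal sketch of the regression setup, assuming per-frame visual features are already extracted and paired with synchronized glove forces; the feature extractor, sequence length, and number of force channels are illustrative assumptions:

```python
# Minimal sketch: LSTM regression from frame features to finger forces.
import torch
import torch.nn as nn

class ForceRegressor(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_fingers=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_fingers)

    def forward(self, frame_feats):
        """frame_feats: (batch, time, feat_dim) -> (batch, time, num_fingers)."""
        out, _ = self.lstm(frame_feats)
        return self.head(out)

model = ForceRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

# One toy step on synchronized streams of frame features and glove forces.
frame_feats = torch.randn(2, 60, 512)     # 2 clips, 60 frames each
glove_forces = torch.randn(2, 60, 4)      # measured force per instrumented fingertip
loss = criterion(model(frame_feats), glove_forces)
loss.backward()
optimizer.step()
print(loss.item())
```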
Last, I investigate the usefulness of physical information in
human action recognition, which is an application of bimodal
learning, where both the vision inputs and the additional
information are used to learn the action representation. My study
demonstrates that, by combining additional information with the
vision inputs, the accuracy of human action recognition can be
consistently improved. I extend the LSTM architecture to accept both
video frames and sensor data as bimodal inputs to predict the
action. A hallucination network is jointly trained to approximate
the representations of the additional inputs. During the testing
stage, the hallucination network generates approximated
representations that are used for classification. In this way, the
proposed method does not rely on the additional inputs at
test time.
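A minimal sketch of the hallucination scheme, assuming simple linear branches: a vision branch and a sensor branch feed a joint classifier during training, while a hallucination branch learns to mimic the sensor representation from vision alone so the sensor stream can be dropped at test time. All dimensions and loss weights are assumptions for the example.

```python
# Minimal sketch: bimodal training with a hallucination branch.
import torch
import torch.nn as nn

vision_net = nn.Sequential(nn.Linear(512, 128), nn.ReLU())
sensor_net = nn.Sequential(nn.Linear(32, 128), nn.ReLU())
halluc_net = nn.Sequential(nn.Linear(512, 128), nn.ReLU())
classifier = nn.Linear(256, 10)      # consumes [vision | sensor-like] features

params = (list(vision_net.parameters()) + list(sensor_net.parameters())
          + list(halluc_net.parameters()) + list(classifier.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

# Training step: both modalities are available.
video_feat = torch.randn(16, 512)
sensor_feat = torch.randn(16, 32)
labels = torch.randint(0, 10, (16,))

v, s, h = vision_net(video_feat), sensor_net(sensor_feat), halluc_net(video_feat)
logits = classifier(torch.cat([v, s], dim=1))
loss = ce(logits, labels) + mse(h, s.detach())   # classify + mimic the sensor branch
loss.backward()
optimizer.step()

# Test step: no sensor data; the hallucinated features stand in for it.
with torch.no_grad():
    v, h = vision_net(video_feat), halluc_net(video_feat)
    test_logits = classifier(torch.cat([v, h], dim=1))
print(test_logits.argmax(dim=1)[:5])
```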
Understanding of Object Manipulation Actions Using Human Multi-Modal Sensory Data
Object manipulation actions represent an important share of the Activities of
Daily Living (ADLs). In this work, we study how to enable service robots to use
human multi-modal data to understand object manipulation actions, and how they
can recognize such actions when humans perform them during human-robot
collaboration tasks. The multi-modal data in this study consists of videos,
hand motion data, applied forces as represented by the pressure patterns on the
hand, and measurements of the bending of the fingers, collected as human
subjects performed manipulation actions. We investigate two different
approaches. In the first one, we show that the multi-modal signal (motion, finger
bending, and hand pressure) generated by the action can be decomposed into a set
of primitives that can be seen as its building blocks. These primitives are
used to define 24 multi-modal primitive features. The primitive features can in
turn be used as an abstract representation of the multi-modal signal and
employed for action recognition. In the second approach, the visual features
are extracted from the data using a pre-trained image classification deep
convolutional neural network. The visual features are subsequently used to
train the classifier. We also investigate whether adding data from other
modalities produces a statistically significant improvement in the classifier
performance. We show that both approaches produce comparable performance.
This implies that image-based methods can successfully recognize human actions
during human-robot collaboration. On the other hand, in order to provide
training data for the robot so it can learn how to perform object manipulation
actions, multi-modal data provides a better alternative.
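A minimal sketch of the second, image-based approach, assuming a frozen ImageNet-pretrained backbone as the feature extractor and a linear classifier on top; the backbone choice, feature sizes, and class count are illustrative assumptions, not the study's exact setup.

```python
# Minimal sketch: frozen pre-trained CNN features + a light classifier,
# optionally concatenated with features from the other modalities.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # weights=None also runs offline
backbone.fc = nn.Identity()            # drop the ImageNet head, keep 512-d features
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

classifier = nn.Linear(512 + 64, 10)   # visual + sensor features -> (assumed) 10 action classes
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

frames = torch.randn(8, 3, 224, 224)       # key frames from manipulation videos
sensor_feats = torch.randn(8, 64)          # summarized motion/pressure/bending signal
labels = torch.randint(0, 10, (8,))

with torch.no_grad():
    visual_feats = backbone(frames)        # (8, 512)

logits = classifier(torch.cat([visual_feats, sensor_feats], dim=1))
loss = criterion(logits, labels)
loss.backward()
optimizer.step()

# To test whether the extra modalities help, the same classifier can be
# retrained on visual_feats alone and the two accuracies compared.
```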