Action-Agnostic Human Pose Forecasting
Predicting and forecasting human dynamics is an interesting yet challenging task with several prospective applications in robotics, health care, and beyond. Recently, several methods have been developed for human pose forecasting; however, their settings often carry notable limitations. For instance, previous work focused on either short-term or long-term predictions, sacrificing one for the other. Furthermore, these methods included activity labels in the training process and required them at test time. Such limitations confine the use of pose forecasting models in real-world applications, as activity-related annotations are often unavailable for testing scenarios. In this paper, we propose a new action-agnostic method for short- and long-term human pose forecasting. To this end, we propose a new
recurrent neural network for modeling the hierarchical and multi-scale
characteristics of the human dynamics, denoted by triangular-prism RNN
(TP-RNN). Our model captures the latent hierarchical structure embedded in
temporal human pose sequences by encoding the temporal dependencies with
different time-scales. For evaluation, we run an extensive set of experiments
on Human 3.6M and Penn Action datasets and show that our method outperforms
baseline and state-of-the-art methods both quantitatively and qualitatively. Code is available at https://github.com/eddyhkchiu/pose_forecast_wacv/
Comment: Accepted for publication in WACV 201
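The multi-scale idea can be illustrated with a minimal two-scale recurrence, a sketch under assumed dimensions rather than the authors' exact TP-RNN architecture: a slow cell updates every k steps and conditions a fast cell that updates at every step.

```python
import numpy as np

def rnn_cell(x, h, W_x, W_h):
    """Vanilla tanh RNN cell: h' = tanh(W_x x + W_h h)."""
    return np.tanh(W_x @ x + W_h @ h)

def multiscale_forward(seq, params, k=2):
    """Two-scale recurrence: the slow cell updates every k steps and
    conditions the fast cell, which updates at every step."""
    W_xf, W_hf, W_sf, W_xs, W_hs = params
    d = W_hf.shape[0]
    h_fast = np.zeros(d)
    h_slow = np.zeros(d)
    outputs = []
    for t, x in enumerate(seq):
        if t % k == 0:                      # coarse time scale
            h_slow = rnn_cell(x, h_slow, W_xs, W_hs)
        # the fast cell also sees the slow state (hierarchical coupling)
        h_fast = np.tanh(W_xf @ x + W_hf @ h_fast + W_sf @ h_slow)
        outputs.append(h_fast)
    return np.stack(outputs)

rng = np.random.default_rng(0)
d_in, d = 4, 8
params = (rng.normal(size=(d, d_in)) * 0.1,
          rng.normal(size=(d, d)) * 0.1,
          rng.normal(size=(d, d)) * 0.1,
          rng.normal(size=(d, d_in)) * 0.1,
          rng.normal(size=(d, d)) * 0.1)
seq = rng.normal(size=(6, d_in))
out = multiscale_forward(seq, params, k=2)
print(out.shape)  # one hidden vector per input frame
```

Stacking more such levels, each with a longer update period, yields the kind of hierarchy of time scales the abstract describes.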
Memory Attention Networks for Skeleton-based Action Recognition
The skeleton-based action recognition task is entangled with complex spatio-temporal variations of skeleton joints and remains challenging for Recurrent Neural Networks (RNNs). In this work, we propose a
temporal-then-spatial recalibration scheme to alleviate such complex
variations, resulting in an end-to-end Memory Attention Networks (MANs) which
consist of a Temporal Attention Recalibration Module (TARM) and a
Spatio-Temporal Convolution Module (STCM). Specifically, the TARM is deployed
in a residual learning module that employs a novel attention learning network
to recalibrate the temporal attention of frames in a skeleton sequence. The
STCM treats the attention-calibrated skeleton joint sequences as images and leverages Convolutional Neural Networks (CNNs) to further model the spatial
and temporal information of skeleton data. These two modules (TARM and STCM)
seamlessly form a single network architecture that can be trained in an
end-to-end fashion. MANs significantly boost the performance of skeleton-based
action recognition and achieve the best results on four challenging benchmark
datasets: NTU RGB+D, HDM05, SYSU-3D and UT-Kinect.
Comment: Accepted by IJCAI 201
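The temporal recalibration step can be sketched as attention-weighted residual reweighting of frames; the scoring function below is a hypothetical single linear projection standing in for the paper's attention learning network.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def temporal_recalibrate(X, w):
    """Residual temporal attention: score each frame, softmax over time,
    and reweight frames. X has shape (T, D); w is a (D,) scoring vector."""
    scores = X @ w                 # one scalar score per frame
    alpha = softmax(scores)        # attention distribution over T frames
    # residual form: original frames plus their attention-scaled version
    return X + alpha[:, None] * X, alpha

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))        # 5 frames of 3-D skeleton features
w = rng.normal(size=3)
Y, alpha = temporal_recalibrate(X, w)
print(Y.shape, round(float(alpha.sum()), 6))
```

The recalibrated sequence Y keeps the input shape, so a convolutional module can consume it downstream exactly as the STCM does with image-like inputs.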
Cell-aware Stacked LSTMs for Modeling Sentences
We propose a method of stacking multiple long short-term memory (LSTM) layers
for modeling sentences. In contrast to the conventional stacked LSTMs where
only hidden states are fed as input to the next layer, the suggested
architecture accepts both hidden and memory cell states of the preceding layer
and fuses information from the left and the lower context using the soft gating
mechanism of LSTMs. Thus the architecture modulates the amount of information
to be delivered not only in horizontal recurrence but also in vertical
connections, from which useful features extracted from lower layers are
effectively conveyed to upper layers. We dub this architecture Cell-aware
Stacked LSTM (CAS-LSTM) and show experimentally that our models bring significant performance gains over standard LSTMs on benchmark datasets for
natural language inference, paraphrase detection, sentiment classification, and
machine translation. We also conduct extensive qualitative analysis to
understand the internal behavior of the suggested approach.
Comment: ACML 201
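A minimal sketch of the cell-aware stacking idea, with simplified gating rather than the exact CAS-LSTM equations: the layer receives both the hidden and cell states of the layer below, and an extra gate controls how much of the lower cell state flows into the current cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cas_lstm_cell(h_below, c_below, h_prev, c_prev, W):
    """One step of a cell-aware stacked LSTM layer (simplified sketch).
    Beyond the usual LSTM gates, a vertical gate g admits the memory cell
    of the layer below into this layer's cell state."""
    z = np.concatenate([h_below, h_prev])
    i = sigmoid(W["i"] @ z)        # input gate
    f = sigmoid(W["f"] @ z)        # forget gate (left / horizontal context)
    g = sigmoid(W["g"] @ z)        # vertical gate (lower context)
    o = sigmoid(W["o"] @ z)        # output gate
    c_hat = np.tanh(W["c"] @ z)    # candidate cell content
    c = f * c_prev + g * c_below + i * c_hat
    return o * np.tanh(c), c

rng = np.random.default_rng(2)
d = 4
W = {k: rng.normal(size=(d, 2 * d)) * 0.1 for k in "ifgoc"}
h_below, c_below = rng.normal(size=d), rng.normal(size=d)
h, c = cas_lstm_cell(h_below, c_below, np.zeros(d), np.zeros(d), W)
print(h.shape, c.shape)
```

A conventional stacked LSTM is recovered by fixing g to zero, which makes the gated vertical cell path the only structural difference.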
TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition
Recent two-stream deep Convolutional Neural Networks (ConvNets) have made
significant progress in recognizing human actions in videos. Despite their
success, methods extending the basic two-stream ConvNet have not systematically
explored possible network architectures to further exploit spatiotemporal
dynamics within video sequences. Further, such networks often use different
baseline two-stream networks. Therefore, the differences and the distinguishing
factors between various methods using Recurrent Neural Networks (RNN) or
convolutional networks on temporally-constructed feature vectors
(Temporal-ConvNet) are unclear. In this work, we first demonstrate a strong
baseline two-stream ConvNet using ResNet-101. We use this baseline to
thoroughly examine the use of both RNNs and Temporal-ConvNets for extracting
spatiotemporal information. Building upon our experimental results, we then
propose and investigate two different networks to further integrate
spatiotemporal information: 1) temporal segment RNN and 2) Inception-style
Temporal-ConvNet. We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets applied to spatiotemporal feature matrices are able to exploit spatiotemporal dynamics and improve the overall performance. However, each of these methods requires proper care to achieve state-of-the-art performance; for example, LSTMs require pre-segmented data or else they cannot fully exploit
temporal information. Our analysis identifies specific limitations for each
method that could form the basis of future work. Our experimental results on
UCF101 and HMDB51 datasets achieve state-of-the-art performance: 94.1% and 69.0%, respectively, without requiring extensive temporal augmentation.
Comment: 16 pages, 11 figures
UTS submission to Google YouTube-8M Challenge 2017
In this paper, we present our solution to Google YouTube-8M Video
Classification Challenge 2017. We leveraged both video-level and frame-level
features in the submission. For video-level classification, we simply used a
200-mixture Mixture of Experts (MoE) layer, which achieves GAP 0.802 on the
validation set with a single model. For frame-level classification, we utilized
several variants of recurrent neural networks, sequence aggregation with
attention mechanism and 1D convolutional models. We achieved GAP 0.8408 on the
private testing set with the ensemble model.
The source code of our models can be found in
\url{https://github.com/ffmpbgrnn/yt8m}.
Comment: CVPR'17 Workshop on YouTube-8M
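A 200-mixture MoE head is too large to show here, but the mechanism can be sketched with a tiny hypothetical version: a softmax gate over experts weighs per-expert sigmoid class probabilities.

```python
import numpy as np

def moe_predict(x, W_gate, W_experts):
    """Mixture-of-Experts classifier head (sketch): a softmax gate over K
    experts weighs each expert's sigmoid class probabilities."""
    g = W_gate @ x                          # (K,) gating logits
    g = np.exp(g - g.max()); g /= g.sum()   # softmax over experts
    # per-expert, per-class probabilities: shape (K, C)
    p = 1.0 / (1.0 + np.exp(-np.einsum("kcd,d->kc", W_experts, x)))
    return g @ p                            # (C,) mixture probabilities

rng = np.random.default_rng(3)
D, K, C = 16, 4, 5                          # feature dim, experts, classes
x = rng.normal(size=D)                      # a video-level feature vector
probs = moe_predict(x, rng.normal(size=(K, D)), rng.normal(size=(K, C, D)))
print(probs.shape)
```

Because YouTube-8M is multi-label, the per-class outputs are independent sigmoids rather than a softmax over classes, and the gate only normalizes across experts.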
Deep Residual Bidir-LSTM for Human Activity Recognition Using Wearable Sensors
Human activity recognition (HAR) has become a popular research topic because of its wide range of applications. With the development of deep learning, new ideas have emerged to address HAR problems. Here, a deep network architecture using residual bidirectional long short-term memory (LSTM) cells is proposed. The new network has two advantages. First, a bidirectional connection concatenates the positive time direction (forward state) and the negative time direction (backward state). Second, residual connections between stacked cells act as highways for gradients, passing underlying information directly to upper layers and effectively mitigating the vanishing-gradient problem.
Generally, the proposed network shows improvements on both the temporal (using
bidirectional cells) and the spatial (residual connections stacked deeply)
dimensions, aiming to enhance the recognition rate. When tested on the Opportunity dataset and the public-domain UCI dataset, accuracy increased by 4.78% and 3.68%, respectively, compared with previously reported results. Finally, the confusion matrix for the public-domain UCI dataset was analyzed.
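The two ingredients, bidirectional concatenation plus a residual shortcut, can be sketched with a plain tanh RNN standing in for the LSTM cells (a simplification, not the paper's implementation):

```python
import numpy as np

def rnn_pass(X, W_x, W_h):
    """Simple tanh RNN over time; X is (T, D_in), returns (T, D_out)."""
    h = np.zeros(W_h.shape[0]); out = []
    for x in X:
        h = np.tanh(W_x @ x + W_h @ h)
        out.append(h)
    return np.stack(out)

def res_bidir_layer(X, Wf, Wb):
    """Residual bidirectional layer: concatenate a forward and a backward
    pass, then add the layer input as a shortcut (dimensions kept equal
    so layers can be stacked)."""
    fwd = rnn_pass(X, *Wf)
    bwd = rnn_pass(X[::-1], *Wb)[::-1]     # backward state, re-reversed
    return X + np.concatenate([fwd, bwd], axis=1)   # residual shortcut

rng = np.random.default_rng(4)
T, D = 6, 8                                # each direction outputs D // 2
Wf = (rng.normal(size=(D // 2, D)) * 0.1, rng.normal(size=(D // 2, D // 2)) * 0.1)
Wb = (rng.normal(size=(D // 2, D)) * 0.1, rng.normal(size=(D // 2, D // 2)) * 0.1)
X = rng.normal(size=(T, D))                # T sensor frames of dim D
Y = res_bidir_layer(X, Wf, Wb)             # same shape as input
print(Y.shape)
```

Keeping the output shape equal to the input shape is what lets the shortcut be a plain addition and the layers stack deeply.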
Lattice Long Short-Term Memory for Human Action Recognition
Human actions captured in video sequences are three-dimensional signals
characterizing visual appearance and motion dynamics. To learn action patterns,
existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and
RNNs). CNN based methods are effective in learning spatial appearances, but are
limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term
Memory (LSTM), are able to learn temporal motion dynamics. However, naively
applying RNNs to video sequences in a convolutional manner implicitly assumes
that motions in videos are stationary across different spatial locations. This
assumption is valid for short-term motions but invalid when the duration of the
motion is long.
In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning
independent hidden state transitions of memory cells for individual spatial
locations. This method effectively enhances the ability to model dynamics
across time and addresses the non-stationary issue of long-term motion dynamics
without significantly increasing the model complexity. Additionally, we
introduce a novel multi-modal training procedure for training our network.
Unlike traditional two-stream architectures which use RGB and optical flow
information as input, our two-stream model leverages both modalities to jointly
train both input gates and both forget gates in the network rather than
treating the two streams as separate entities with no information about the
other. We apply this end-to-end system to benchmark datasets (UCF-101 and
HMDB-51) of human action recognition. Experiments show that on both datasets,
our proposed method outperforms all existing ones based on LSTMs and/or CNNs of similar model complexity.
Comment: ICCV201
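The core idea of learning independent hidden-state transitions per spatial location can be sketched with a vanilla recurrence in place of the full LSTM gating (dimensions are illustrative, not those of L2STM):

```python
import numpy as np

def lattice_step(X, H, W_x, U):
    """One recurrent step with location-dependent transitions (sketch).
    X, H are (Hh, Ww, D) feature maps; U holds a separate D x D recurrence
    matrix per spatial location, so each site models its own dynamics."""
    inp = np.einsum("ed,hwd->hwe", W_x, X)   # shared input transform
    rec = np.einsum("hwed,hwd->hwe", U, H)   # per-location recurrence
    return np.tanh(inp + rec)

rng = np.random.default_rng(5)
Hh, Ww, D = 3, 3, 4                          # tiny 3x3 lattice, 4-D features
W_x = rng.normal(size=(D, D)) * 0.1
U = rng.normal(size=(Hh, Ww, D, D)) * 0.1    # independent transition per site
H = np.zeros((Hh, Ww, D))
for _ in range(4):                           # unroll over a few frames
    H = lattice_step(rng.normal(size=(Hh, Ww, D)), H, W_x, U)
print(H.shape)
```

A conventional convolutional RNN would share one transition matrix across all lattice sites; giving each site its own matrix is precisely what relaxes the stationary-motion assumption the abstract criticizes.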
Recent Advances in Deep Learning: An Overview
Deep Learning is one of the newest trends in Machine Learning and Artificial
Intelligence research. It is also one of the most popular scientific research
trends now-a-days. Deep learning methods have brought revolutionary advances in
computer vision and machine learning. Every now and then, new and new deep
learning techniques are being born, outperforming state-of-the-art machine
learning and even existing deep learning techniques. In recent years, the world
has seen many major breakthroughs in this field. Since deep learning is
evolving at a huge speed, its kind of hard to keep track of the regular
advances especially for new researchers. In this paper, we are going to briefly
discuss about recent advances in Deep Learning for past few years.Comment: 31 pages including bibliograph
Tensor-Train Recurrent Neural Networks for Video Classification
Recurrent Neural Networks and their variants have shown promising performance in sequence modeling tasks such as Natural Language Processing.
These models, however, turn out to be impractical and difficult to train when
exposed to very high-dimensional inputs due to the large input-to-hidden weight
matrix. This may have prevented RNNs' large-scale application in tasks that
involve very high input dimensions such as video modeling; current approaches
reduce the input dimensions using various feature extractors. To address this
challenge, we propose a new, more general and efficient approach by factorizing
the input-to-hidden weight matrix using Tensor-Train decomposition which is
trained simultaneously with the weights themselves. We test our model on
classification tasks using multiple real-world video datasets and achieve performance competitive with state-of-the-art models, even though our model architecture is orders of magnitude less complex. We believe that the proposed
approach provides a novel and fundamental building block for modeling
high-dimensional sequential data with RNN architectures and opens up many
possibilities to transfer the expressive and advanced architectures from other
domains such as NLP to modeling high-dimensional sequential data.
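A Tensor-Train factorization of an input-to-hidden matrix can be illustrated with two cores (sizes are illustrative); the matrix-vector product is computed directly from the cores, never materializing the full matrix, and is checked here against the dense reconstruction.

```python
import numpy as np

rng = np.random.default_rng(6)
m1, m2, n1, n2, r = 3, 4, 5, 6, 2     # output/input mode sizes, TT-rank

# Two TT-cores replace the full (m1*m2) x (n1*n2) weight matrix:
#   W[(i,k),(j,l)] = sum_a G1[i,j,a] * G2[a,k,l]
G1 = rng.normal(size=(m1, n1, r))
G2 = rng.normal(size=(r, m2, n2))

x = rng.normal(size=n1 * n2)          # high-dimensional input vector
X = x.reshape(n1, n2)                 # reshaped into the two input modes

# Matrix-vector product directly in TT form (full matrix never formed)
y_tt = np.einsum("ija,akl,jl->ik", G1, G2, X).reshape(m1 * m2)

# Sanity check against the explicitly reconstructed dense matrix
W_full = np.einsum("ija,akl->ikjl", G1, G2).reshape(m1 * m2, n1 * n2)
assert np.allclose(y_tt, W_full @ x)
print(y_tt.shape)
```

The parameter count drops from m1*m2*n1*n2 = 360 to m1*n1*r + r*m2*n2 = 78 in this toy case; for video-frame inputs with thousands of dimensions, the savings grow by orders of magnitude, which is the abstract's central point.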
Multi-Level Recurrent Residual Networks for Action Recognition
Most existing Convolutional Neural Networks (CNNs) used for action recognition are either difficult to optimize or underuse crucial temporal information.
Inspired by the consistent breakthroughs of recurrent models on sequence-related tasks, we propose novel Multi-Level Recurrent Residual Networks (MRRN) which incorporate three recognition streams. Each stream consists of a Residual Network (ResNet) and a recurrent model. The
proposed model captures spatiotemporal information by employing both
alternative ResNets to learn spatial representations from static frames and
stacked Simple Recurrent Units (SRUs) to model temporal dynamics. Three streams, which independently learn low-, mid-, and high-level representations, are fused by computing a weighted average of their softmax scores to obtain complementary representations of the video. Unlike previous
models which boost performance at the cost of time complexity and space
complexity, our models have lower complexity by employing shortcut connections and are trained end-to-end with greater efficiency. MRRN shows significant performance improvements over CNN-RNN framework baselines and achieves performance comparable to the state of the art: 51.3% on the HMDB-51 dataset and 81.9% on the UCF-101 dataset without using additional training data.
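The three-stream fusion step, a weighted average of per-stream softmax scores, can be sketched as follows (the fusion weights are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_streams(logits_list, weights):
    """Late fusion (sketch): weighted average of per-stream softmax scores,
    as used to combine the low-, mid-, and high-level streams."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalize the fusion weights
    scores = np.stack([softmax(z) for z in logits_list])  # (S, C)
    return w @ scores                      # (C,) fused class scores

rng = np.random.default_rng(7)
streams = [rng.normal(size=10) for _ in range(3)]   # 3 streams, 10 classes
fused = fuse_streams(streams, [0.2, 0.3, 0.5])
print(fused.argmax(), round(float(fused.sum()), 6))
```

Since each stream's scores already sum to one and the weights are normalized, the fused vector remains a valid probability distribution over classes.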