Lattice Long Short-Term Memory for Human Action Recognition
Human actions captured in video sequences are three-dimensional signals
characterizing visual appearance and motion dynamics. To learn action patterns,
existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and
RNNs). CNN based methods are effective in learning spatial appearances, but are
limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term
Memory (LSTM), are able to learn temporal motion dynamics. However, naively
applying RNNs to video sequences in a convolutional manner implicitly assumes
that motions in videos are stationary across different spatial locations. This
assumption is valid for short-term motions but invalid when the duration of the
motion is long.
In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning
independent hidden state transitions of memory cells for individual spatial
locations. This method effectively enhances the ability to model dynamics
across time and addresses the non-stationary issue of long-term motion dynamics
without significantly increasing the model complexity. Additionally, we
introduce a novel multi-modal training procedure for training our network.
Unlike traditional two-stream architectures which use RGB and optical flow
information as input, our two-stream model leverages both modalities to jointly
train both input gates and both forget gates in the network rather than
treating the two streams as separate entities with no information about the
other. We apply this end-to-end system to benchmark datasets (UCF-101 and
HMDB-51) of human action recognition. Experiments show that on both datasets,
our proposed method outperforms all existing ones that are based on LSTM and/or
CNNs of similar model complexities.
Comment: ICCV201
Spatially Encoding Temporal Correlations to Classify Temporal Data Using Convolutional Neural Networks
We propose an off-line approach to explicitly encode temporal patterns
spatially as different types of images, namely, Gramian Angular Fields and
Markov Transition Fields. This enables the use of techniques from computer
vision for feature learning and classification. We used Tiled Convolutional
Neural Networks to learn high-level features from individual GAF, MTF, and
GAF-MTF images on 12 benchmark time series datasets and two real
spatial-temporal trajectory datasets. The classification results of our
approach are competitive with state-of-the-art approaches on both types of
data. An analysis of the features and weights learned by the CNNs explains why
the approach works.
Comment: Submitted to JCSS. Preliminary versions appeared in the AAAI 2015
workshop and IJCAI 2016 [arXiv:1506.00327
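As a rough illustration of the spatial encoding described in the abstract above, the sketch below turns a one-dimensional series into a Gramian Angular Field. It assumes the summation variant (rescale to [-1, 1], map each value to a polar angle, take the cosine of pairwise angle sums); the function name is hypothetical, and the Markov Transition Field and the tiled CNN are not shown.

```python
import numpy as np


def gramian_angular_summation_field(series):
    """Encode a 1-D series as a 2-D image (assumed summation variant of GAF)."""
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("series must not be constant")
    # Rescale to [-1, 1] so every value has a valid arccos.
    scaled = (2.0 * x - hi - lo) / (hi - lo)
    scaled = np.clip(scaled, -1.0, 1.0)  # guard against rounding drift
    # Polar-angle encoding of each time step.
    phi = np.arccos(scaled)
    # Entry (i, j) is cos(phi_i + phi_j), computed by broadcasting.
    return np.cos(phi[:, None] + phi[None, :])
```

The result is a symmetric square image whose side equals the series length, so it can be fed to an ordinary image classifier after any resizing the model requires.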
Spatiotemporal Patterns in Arrays of Coupled Nonlinear Oscillators
Nonlinear reaction-diffusion systems admit a wide variety of spatiotemporal
patterns or structures. In this lecture, we point out that there are certain
advantages in studying discrete arrays, namely cellular neural/nonlinear
networks (CNNs), over continuous systems. Then, to illustrate these ideas, the
dynamics of diffusively coupled one- and two-dimensional cellular nonlinear
networks (CNNs), involving the Murali-Lakshmanan-Chua circuit as the basic element,
is considered. Propagation failure in the case of uniform diffusion and
propagation blocking in the case of defects are pointed out. The mechanism
behind these phenomena in terms of loss of stability is explained. Various
spatiotemporal patterns arising from diffusion driven instability such as
hexagons, rhombs and rolls are considered when external forces are absent.
The existence of penta-hepta defects and their removal by external forcing are
discussed. The transition from hexagonal to roll structure and breathing
oscillations in the presence of external forcing are also demonstrated. Further,
spatiotemporal chaos, synchronization and size instability in the coupled
chaotic systems are elucidated.
Comment: 25 pages, LaTeX2e, 14 EPS figures, to appear in Proc. Indian National
Science Academy Vol. 66A (2000); Based on the Dr. Biren Roy Memorial Lecture
delivered at Jawaharlal Nehru University, New Delhi by M. Lakshmanan on 27
October 199
Learning spectro-temporal features with 3D CNNs for speech emotion recognition
In this paper, we propose to use deep 3-dimensional convolutional networks
(3D CNNs) in order to address the challenge of modelling spectro-temporal
dynamics for speech emotion recognition (SER). Compared to a hybrid of a
Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM), our
proposed 3D CNNs simultaneously extract short-term and long-term spectral
features with a moderate number of parameters. We evaluated our proposed and
other state-of-the-art methods in a speaker-independent manner using aggregated
corpora that give a large and diverse set of speakers. We found that 1) shallow
temporal and moderately deep spectral kernels of a homogeneous architecture are
optimal for the task; and 2) our 3D CNNs are more effective for
spectro-temporal feature learning compared to other methods. Finally, we
visualised the feature space obtained with our proposed method using
t-distributed stochastic neighbour embedding (t-SNE) and could observe distinct
clusters of emotions.
Comment: ACII, 2017, San Antoni
Deep CNNs along the Time Axis with Intermap Pooling for Robustness to Spectral Variations
Convolutional neural networks (CNNs) with convolutional and pooling
operations along the frequency axis have been proposed to attain invariance to
frequency shifts of features. However, this is inappropriate because
acoustic features vary in frequency. In this paper, we contend that
convolution along the time axis is more effective. We also propose the addition
of an intermap pooling (IMP) layer to deep CNNs. In this layer, filters in each
group extract common but spectrally variant features, then the layer pools the
feature maps of each group. As a result, the proposed IMP CNN can achieve
insensitivity to spectral variations characteristic of different speakers and
utterances. The effectiveness of the IMP CNN architecture is demonstrated on
several LVCSR tasks. Even without speaker adaptation techniques, the
architecture achieved a WER of 12.7% on the SWB part of the Hub5'2000
evaluation test set, which is competitive with other state-of-the-art methods.
Comment: Submitted to IEEE Signal Processing Letter
Deep learning methods based on cross-section images for predicting effective thermal conductivity of composites
Effective thermal conductivity is an important property of composites for
different thermal management applications. Although physics-based methods, such
as effective medium theory and solving partial differential equations, dominate
the relevant research, there is significant interest in establishing the
structure-property linkage through machine learning methods. The performance
of general machine learning methods is highly dependent on features selected to
represent the microstructures. 3D convolutional neural networks (CNNs) can
directly extract geometric features of composites, which have been demonstrated
to establish structure-property linkages with high accuracy. However, obtaining
the 3D microstructure of a composite is generally challenging in practice. In this
work, we attempt to use 2D cross-section images, which are easier to obtain
in real applications, as input to 2D CNNs to predict the effective thermal
conductivity of 3D composites. The results show that by using multiple
cross-section images along or perpendicular to the preferred directionality of
the fillers, the prediction accuracy of 2D CNNs can be as good as 3D CNNs. Such
a result is demonstrated with a particle-filled composite and a stochastic
complex composite. The prediction accuracy is dependent on the
representativeness of cross-section images used. Multiple cross-section images
can fully determine the shape and distribution of fillers. The average over
multiple images and the use of large-size images can reduce the uncertainty and
increase the prediction accuracy. Besides, since cross-section images along the
heat flow direction can distinguish between serial structures and parallel
structures, they are more representative than cross-section images
perpendicular to the heat flow direction.
Memory-Augmented Temporal Dynamic Learning for Action Recognition
Human actions captured in video sequences contain two crucial factors for
action recognition, i.e., visual appearance and motion dynamics. To model these
two aspects, Convolutional and Recurrent Neural Networks (CNNs and RNNs) are
adopted in most existing successful methods for recognizing actions. However,
CNN based methods are limited in modeling long-term motion dynamics. RNNs are
able to learn temporal motion dynamics but lack effective ways to tackle
unsteady dynamics in long-duration motion. In this work, we propose a
memory-augmented temporal dynamic learning network, which learns to write the
most evident information into an external memory module and ignore irrelevant
information. In particular, we present a differential memory controller to make a
discrete decision on whether the external memory module should be updated with
the current feature. The discrete memory controller takes in the memory history,
context embedding, and current feature as inputs and controls information flow
into the external memory module. Additionally, we train this discrete memory
controller using a straight-through estimator. We evaluate this end-to-end system
on benchmark datasets (UCF101 and HMDB51) of human action recognition. The
experimental results show consistent improvements on both datasets over prior
works and our baselines.
Comment: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19
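The discrete write decision and straight-through training mentioned in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' controller: it assumes a sigmoid gate thresholded at 0.5, and the names `gate_forward` and `gate_backward` are hypothetical. The backward pass treats the non-differentiable threshold as the identity and chains only through the sigmoid.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gate_forward(logits):
    """Hard write decision per step: 1.0 = update memory, 0.0 = skip."""
    probs = sigmoid(np.asarray(logits, dtype=float))
    decisions = (probs >= 0.5).astype(float)  # non-differentiable threshold
    return decisions, probs


def gate_backward(grad_decisions, probs):
    """Straight-through estimate of the gradient w.r.t. the logits.

    The threshold is treated as if d(decision)/d(prob) were 1, so only the
    sigmoid derivative probs * (1 - probs) is applied.
    """
    grad = np.asarray(grad_decisions, dtype=float)
    return grad * probs * (1.0 - probs)
```

The forward pass keeps the decision strictly binary, while the backward pass still delivers a nonzero gradient to the controller's logits.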
End-to-End Training of Deep Visuomotor Policies
Policy search methods can allow robots to learn control policies for a wide
range of tasks, but practical applications of policy search often require
hand-engineered components for perception, state estimation, and low-level
control. In this paper, we aim to answer the following question: does training
the perception and control systems jointly end-to-end provide better
performance than training each component separately? To this end, we develop a
method that can be used to learn policies that map raw image observations
directly to torques at the robot's motors. The policies are represented by deep
convolutional neural networks (CNNs) with 92,000 parameters, and are trained
using a partially observed guided policy search method, which transforms policy
search into supervised learning, with supervision provided by a simple
trajectory-centric reinforcement learning method. We evaluate our method on a
range of real-world manipulation tasks that require close coordination between
vision and control, such as screwing a cap onto a bottle, and present simulated
comparisons to a range of prior policy search methods.
Comment: updating with revisions for JMLR final versio
Temporal Bilinear Networks for Video Action Recognition
Temporal modeling in videos is a fundamental yet challenging problem in
computer vision. In this paper, we propose a novel Temporal Bilinear (TB) model
to capture the temporal pairwise feature interactions between adjacent frames.
Compared with some existing temporal methods which are limited to linear
transformations, our TB model considers explicit quadratic bilinear
transformations in the temporal domain for motion evolution and sequential
relation modeling. We further leverage the factorized bilinear model in linear
complexity and a bottleneck network design to build our TB blocks, which also
constrain the parameter count and computation cost. We consider two schemes in
terms of the incorporation of TB blocks and the original 2D spatial
convolutions, namely wide and deep Temporal Bilinear Networks (TBN). Finally,
we perform experiments on several widely adopted datasets including Kinetics,
UCF101 and HMDB51. The effectiveness of our TBNs is validated by comprehensive
ablation analyses and comparisons with various state-of-the-art methods.
Comment: Accepted by AAAI 201
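To make the idea of a factorized bilinear interaction between adjacent frames concrete, here is a minimal NumPy sketch. It assumes per-frame feature vectors and a rank-R factorization of each bilinear form into two sets of vectors; the function name, the argument shapes, and the factor names `U` and `V` are illustrative assumptions, and the bottleneck design and the wide/deep network variants from the abstract are not modelled.

```python
import numpy as np


def factorized_temporal_bilinear(frames, U, V):
    """Pairwise interaction between each frame and the next one.

    frames: (T, D) per-frame features.
    U, V:   (K, R, D) factors; output channel k uses the rank-R matrix
            W_k = sum_r outer(U[k, r], V[k, r]) without ever forming it.
    Returns an array of shape (T - 1, K).
    """
    frames = np.asarray(frames, dtype=float)
    # Project frame t with U and frame t + 1 with V.
    left = np.einsum("krd,td->tkr", U, frames[:-1])
    right = np.einsum("krd,td->tkr", V, frames[1:])
    # Multiply the projections and sum over the rank dimension.
    return (left * right).sum(axis=-1)
```

With this factorization the cost per output channel grows linearly in the feature dimension D (times the rank R) instead of quadratically, which is the motivation for factorizing the bilinear form.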
Comparison of Deep Neural Networks and Deep Hierarchical Models for Spatio-Temporal Data
Spatio-temporal data are ubiquitous in the agricultural, ecological, and
environmental sciences, and their study is important for understanding and
predicting a wide variety of processes. One of the difficulties with modeling
spatial processes that change in time is the complexity of the dependence
structures that must describe how such a process varies, and the presence of
high-dimensional complex data sets and large prediction domains. It is
particularly challenging to specify parameterizations for nonlinear dynamic
spatio-temporal models (DSTMs) that are simultaneously useful scientifically
and efficient computationally. Statisticians have developed deep hierarchical
models that can accommodate process complexity as well as the uncertainties in
the predictions and inference. However, these models can be expensive and are
typically application specific. On the other hand, the machine learning
community has developed alternative "deep learning" approaches for nonlinear
spatio-temporal modeling. These models are flexible yet are typically not
implemented in a probabilistic framework. The two paradigms have many things in
common and suggest hybrid approaches that can benefit from elements of each
framework. This overview paper presents a brief introduction to the deep
hierarchical DSTM (DH-DSTM) framework, and deep models in machine learning,
culminating with the deep neural DSTM (DN-DSTM). Recent approaches that combine
elements from DH-DSTMs and echo state network DN-DSTMs are presented as
illustrations.
Comment: 26 pages, including 6 figures and reference