
    Lattice Long Short-Term Memory for Human Action Recognition

    Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN-based methods are effective in learning spatial appearances but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM) networks, are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationarity of long-term motion dynamics without significantly increasing the model complexity. Additionally, we introduce a novel multi-modal training procedure for our network. Unlike traditional two-stream architectures, which treat the RGB and optical-flow streams as separate entities with no information about each other, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network. We apply this end-to-end system to benchmark datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones based on LSTM and/or CNNs of similar model complexity. Comment: ICCV 2017.
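
    The core idea admits a compact sketch: an LSTM-style cell whose gate transitions are shared convolutions but whose candidate memory update is modulated by untied, per-location weights, so dynamics can differ across space. This is a minimal illustration under assumed shapes and layer choices, not the authors' exact L2STM design.

```python
import torch
import torch.nn as nn

class LatticeLSTMCellSketch(nn.Module):
    """Sketch of a lattice-style ConvLSTM cell: shared convolutional
    gates, but a location-dependent weight on the candidate update."""

    def __init__(self, channels, height, width):
        super().__init__()
        # Shared convolutional input/forget/output gates, as in ConvLSTM.
        self.gates = nn.Conv2d(2 * channels, 3 * channels, 3, padding=1)
        self.conv_g = nn.Conv2d(2 * channels, channels, 3, padding=1)
        # Untied, per-location weights (hypothetical parameterization).
        self.w_loc = nn.Parameter(0.01 * torch.randn(1, channels, height, width))

    def forward(self, x, h, c):
        xh = torch.cat([x, h], dim=1)
        i, f, o = torch.chunk(torch.sigmoid(self.gates(xh)), 3, dim=1)
        g = torch.tanh(self.conv_g(xh) * self.w_loc)  # location-dependent update
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

cell = LatticeLSTMCellSketch(channels=32, height=14, width=14)
h = c = torch.zeros(2, 32, 14, 14)
h, c = cell(torch.randn(2, 32, 14, 14), h, c)
```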

    Spatially Encoding Temporal Correlations to Classify Temporal Data Using Convolutional Neural Networks

    We propose an off-line approach to explicitly encode temporal patterns spatially as different types of images, namely Gramian Angular Fields (GAF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for feature learning and classification. We used Tiled Convolutional Neural Networks to learn high-level features from individual GAF, MTF, and GAF-MTF images on 12 benchmark time series datasets and two real spatio-temporal trajectory datasets. The classification results of our approach are competitive with state-of-the-art approaches on both types of data. An analysis of the features and weights learned by the CNNs explains why the approach works. Comment: Submitted to JCSS. Preliminary versions appeared in the AAAI 2015 workshop and IJCAI 2016 [arXiv:1506.00327].
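
    The GAF construction is simple enough to sketch directly. A minimal NumPy version, assuming min-max rescaling and the summation variant (GASF); the MTF would instead quantize the series and build a Markov transition matrix:

```python
import numpy as np

def gramian_angular_field(series):
    """Encode a 1-D series as a GASF image: rescale to [-1, 1], map each
    value to an angle, and take pairwise cosines of angle sums."""
    x = np.asarray(series, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1   # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))            # polar angle
    return np.cos(phi[:, None] + phi[None, :])        # GASF[i, j]

image = gramian_angular_field(np.sin(np.linspace(0, 2 * np.pi, 64)))
print(image.shape)  # (64, 64), ready to be fed to a CNN
```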

    Spatiotemporal Patterns in Arrays of Coupled Nonlinear Oscillators

    Nonlinear reaction-diffusion systems admit a wide variety of spatiotemporal patterns or structures. In this lecture, we point out that there is a certain advantage in studying discrete arrays, namely cellular neural/nonlinear networks (CNNs), over continuous systems. To illustrate these ideas, the dynamics of diffusively coupled one- and two-dimensional cellular nonlinear networks (CNNs), with the Murali-Lakshmanan-Chua circuit as the basic element, is considered. Propagation failure in the case of uniform diffusion and propagation blocking in the case of defects are pointed out, and the mechanism behind these phenomena, in terms of loss of stability, is explained. Various spatiotemporal patterns arising from diffusion-driven instability, such as hexagons, rhombi, and rolls, are considered in the absence of external forces. The existence of penta-hepta defects and their removal by external forcing are discussed. The transition from hexagonal to roll structures and breathing oscillations in the presence of external forcing are also demonstrated. Further, spatiotemporal chaos, synchronization, and size instability in coupled chaotic systems are elucidated. Comment: 25 pages, LaTeX2e, 14 EPS figures, to appear in Proc. Indian National Science Academy Vol. 66A (2000); based on the Dr. Biren Roy Memorial Lecture delivered at Jawaharlal Nehru University, New Delhi by M. Lakshmanan on 27 October 1999.
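
    A diffusively coupled one-dimensional lattice of driven oscillators can be simulated in a few lines. The sketch below uses a generic Chua-type piecewise-linear element and illustrative parameter values; these are assumptions for demonstration, not the lecture's exact circuit equations or settings.

```python
import numpy as np

def h(x, a=-1.02, b=-0.55):
    # Piecewise-linear (Chua-type) nonlinearity; parameters illustrative.
    return b * x + 0.5 * (a - b) * (np.abs(x + 1) - np.abs(x - 1))

def step(x, y, t, D=1.0, sigma=1.0, beta=1.0, F=0.1, omega=0.75, dt=0.01):
    lap = np.roll(x, 1) + np.roll(x, -1) - 2 * x  # discrete Laplacian
    lap[0], lap[-1] = x[1] - x[0], x[-2] - x[-1]  # zero-flux boundaries
    dx = y - h(x) + D * lap                        # diffusive coupling
    dy = -sigma * y - beta * x + F * np.sin(omega * t)
    return x + dt * dx, y + dt * dy                # explicit Euler step

x, y = 0.1 * np.random.randn(100), np.zeros(100)
for n in range(20000):
    x, y = step(x, y, t=n * 0.01)
```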

    Learning spectro-temporal features with 3D CNNs for speech emotion recognition

    In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of a Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and long-term spectral features with a moderate number of parameters. We evaluated our proposed and other state-of-the-art methods in a speaker-independent manner using aggregated corpora that give a large and diverse set of speakers. We found that 1) shallow temporal and moderately deep spectral kernels of a homogeneous architecture are optimal for the task, and 2) our 3D CNNs are more effective for spectro-temporal feature learning than the other methods. Finally, we visualised the feature space obtained with our proposed method using t-distributed stochastic neighbour embedding (t-SNE) and could observe distinct clusters of emotions. Comment: ACII 2017, San Antonio.
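
    To make the architecture concrete, here is a minimal 3D CNN over stacked spectrogram segments; the kernel shapes loosely follow the abstract's finding (shallow temporal, deeper spectral extents), but all sizes and the four-class output are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Input: (batch, 1, segments, freq_bins, frames).
model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=(3, 9, 3), padding=(1, 4, 1)),
    nn.ReLU(),
    nn.MaxPool3d((1, 2, 2)),
    nn.Conv3d(16, 32, kernel_size=(3, 9, 3), padding=(1, 4, 1)),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(32, 4),  # e.g. four emotion classes
)
logits = model(torch.randn(2, 1, 8, 64, 32))
print(logits.shape)  # torch.Size([2, 4])
```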

    Deep CNNs along the Time Axis with Intermap Pooling for Robustness to Spectral Variations

    Convolutional neural networks (CNNs) with convolutional and pooling operations along the frequency axis have been proposed to attain invariance to frequency shifts of features. However, this is at odds with the fact that acoustic features vary in frequency. In this paper, we contend that convolution along the time axis is more effective. We also propose the addition of an intermap pooling (IMP) layer to deep CNNs. In this layer, the filters in each group extract common but spectrally variant features, and the layer then pools the feature maps of each group. As a result, the proposed IMP CNN achieves insensitivity to the spectral variations characteristic of different speakers and utterances. The effectiveness of the IMP CNN architecture is demonstrated on several LVCSR tasks. Even without speaker adaptation techniques, the architecture achieved a WER of 12.7% on the SWB part of the Hub5'2000 evaluation test set, which is competitive with other state-of-the-art methods. Comment: Submitted to IEEE Signal Processing Letters.
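
    The IMP layer itself reduces to a max over the maps of each channel group. A minimal sketch, with the group size and tensor shapes assumed for illustration:

```python
import torch
import torch.nn as nn

class IntermapPooling(nn.Module):
    """Sketch of an IMP layer: split channels into groups and max-pool
    across the feature maps within each group, so the output map is
    insensitive to which spectral variant of a feature fired."""

    def __init__(self, group_size):
        super().__init__()
        self.group_size = group_size

    def forward(self, x):                      # x: (batch, channels, freq, time)
        b, c, f, t = x.shape
        x = x.view(b, c // self.group_size, self.group_size, f, t)
        return x.max(dim=2).values             # pool within each group

pooled = IntermapPooling(group_size=4)(torch.randn(2, 64, 40, 100))
print(pooled.shape)  # torch.Size([2, 16, 40, 100])
```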

    Deep learning methods based on cross-section images for predicting effective thermal conductivity of composites

    Effective thermal conductivity is an important property of composites in a variety of thermal management applications. Although physics-based methods, such as effective medium theory and solving partial differential equations, dominate the relevant research, there is significant interest in establishing structure-property linkages through machine learning. The performance of general machine learning methods is highly dependent on the features selected to represent the microstructure. 3D convolutional neural networks (CNNs) can directly extract geometric features of composites and have been demonstrated to establish structure-property linkages with high accuracy. However, obtaining the 3D microstructure of a composite is generally challenging in practice. In this work, we use 2D cross-section images, which are much easier to obtain in real applications, as the input of 2D CNNs to predict the effective thermal conductivity of 3D composites. The results show that by using multiple cross-section images taken along or perpendicular to the preferred directionality of the fillers, the prediction accuracy of 2D CNNs can be as good as that of 3D CNNs. This result is demonstrated on a particle-filled composite and a stochastic complex composite. The prediction accuracy depends on the representativeness of the cross-section images used: multiple cross-section images can fully determine the shape and distribution of the fillers, and averaging over multiple images and using large-size images reduce the uncertainty and increase the prediction accuracy. Moreover, since cross-section images along the heat flow direction can distinguish serial structures from parallel structures, they are more representative than cross-section images perpendicular to the heat flow direction.
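
    The prediction pipeline can be sketched as a small 2D CNN regressor whose per-slice outputs are averaged; the architecture, image size, and slice count below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 1),                 # scalar: effective thermal conductivity
)

slices = torch.randn(8, 1, 128, 128)  # 8 cross-section images of one sample
k_eff = cnn(slices).mean()            # averaging over slices reduces variance
```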

    Memory-Augmented Temporal Dynamic Learning for Action Recognition

    Human actions captured in video sequences contain two crucial factors for action recognition, namely visual appearance and motion dynamics. To model these two aspects, Convolutional and Recurrent Neural Networks (CNNs and RNNs) are adopted in most existing successful methods for recognizing actions. However, CNN-based methods are limited in modeling long-term motion dynamics, while RNNs are able to learn temporal motion dynamics but lack effective ways to tackle unsteady dynamics in long-duration motion. In this work, we propose a memory-augmented temporal dynamic learning network, which learns to write the most evident information into an external memory module and to ignore irrelevant information. In particular, we present a differentiable memory controller that makes a discrete decision on whether the external memory module should be updated with the current feature. The discrete memory controller takes the memory history, the context embedding, and the current feature as inputs and controls the information flow into the external memory module. We train this discrete memory controller using the straight-through estimator. We evaluate this end-to-end system on benchmark datasets (UCF101 and HMDB51) of human action recognition. The experimental results show consistent improvements on both datasets over prior works and our baselines. Comment: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19).
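
    The straight-through estimator at the heart of such a controller is easy to illustrate: the forward pass takes a hard 0/1 decision, while the backward pass uses the gradient of the soft sigmoid surrogate. A minimal sketch:

```python
import torch

def binary_gate_st(logits):
    """Hard write/skip decision in the forward pass; gradients flow
    through the sigmoid as if the decision were soft."""
    probs = torch.sigmoid(logits)
    hard = (probs > 0.5).float()
    return hard + probs - probs.detach()   # value: hard; gradient: soft

logits = torch.tensor([0.8, -1.2], requires_grad=True)
gate = binary_gate_st(logits)              # tensor([1., 0.]) in the forward pass
gate.sum().backward()                      # but d(gate)/d(logits) is nonzero
print(logits.grad)
```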

    End-to-End Training of Deep Visuomotor Policies

    Policy search methods can allow robots to learn control policies for a wide range of tasks, but practical applications of policy search often require hand-engineered components for perception, state estimation, and low-level control. In this paper, we aim to answer the following question: does training the perception and control systems jointly end-to-end provide better performance than training each component separately? To this end, we develop a method that can be used to learn policies that map raw image observations directly to torques at the robot's motors. The policies are represented by deep convolutional neural networks (CNNs) with 92,000 parameters, and are trained using a partially observed guided policy search method, which transforms policy search into supervised learning, with supervision provided by a simple trajectory-centric reinforcement learning method. We evaluate our method on a range of real-world manipulation tasks that require close coordination between vision and control, such as screwing a cap onto a bottle, and present simulated comparisons to a range of prior policy search methods. Comment: updated with revisions for the JMLR final version.
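
    Structurally, a visuomotor policy of this kind is a CNN over the camera image whose features are concatenated with the robot state and mapped to joint torques. The sketch below uses assumed layer sizes and state dimensions, not the paper's exact 92,000-parameter network.

```python
import torch
import torch.nn as nn

class PolicySketch(nn.Module):
    def __init__(self, n_joints=7):
        super().__init__()
        self.vision = nn.Sequential(          # raw image -> visual features
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 16, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.control = nn.Sequential(         # features + state -> torques
            nn.Linear(16 * 16 + 2 * n_joints, 64), nn.ReLU(),
            nn.Linear(64, n_joints),
        )

    def forward(self, image, joint_state):    # joint_state: positions + velocities
        return self.control(torch.cat([self.vision(image), joint_state], dim=1))

torques = PolicySketch()(torch.randn(1, 3, 120, 120), torch.randn(1, 14))
```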

    Temporal Bilinear Networks for Video Action Recognition

    Temporal modeling in videos is a fundamental yet challenging problem in computer vision. In this paper, we propose a novel Temporal Bilinear (TB) model to capture pairwise feature interactions between adjacent frames. Compared with existing temporal methods that are limited to linear transformations, our TB model applies explicit quadratic bilinear transformations in the temporal domain to model motion evolution and sequential relations. We further leverage a factorized bilinear model of linear complexity and a bottleneck network design to build our TB blocks, which also constrains the parameter count and computation cost. We consider two schemes for combining TB blocks with the original 2D spatial convolutions, namely wide and deep Temporal Bilinear Networks (TBN). Finally, we perform experiments on several widely adopted datasets, including Kinetics, UCF101, and HMDB51. The effectiveness of our TBNs is validated by comprehensive ablation analyses and comparisons with various state-of-the-art methods. Comment: Accepted by AAAI 2019.
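
    A factorized bilinear interaction can be sketched in a few lines: the quadratic form x_t^T W x_{t+1} is replaced by low-rank factors so the cost stays linear in the feature dimension. The rank and sizes below are assumptions for illustration, not the paper's TB block.

```python
import torch
import torch.nn as nn

class TemporalBilinearSketch(nn.Module):
    """Factorized bilinear interaction between adjacent frames:
    out_t = P((U x_t) * (V x_{t+1})), a low-rank stand-in for the
    full quadratic form x_t^T W x_{t+1}."""

    def __init__(self, dim, rank=32):
        super().__init__()
        self.U = nn.Linear(dim, rank, bias=False)
        self.V = nn.Linear(dim, rank, bias=False)
        self.P = nn.Linear(rank, dim, bias=False)

    def forward(self, x):                      # x: (batch, time, dim)
        a, b = x[:, :-1], x[:, 1:]             # adjacent frame pairs
        return self.P(self.U(a) * self.V(b))   # (batch, time-1, dim)

out = TemporalBilinearSketch(dim=256)(torch.randn(4, 16, 256))
print(out.shape)  # torch.Size([4, 15, 256])
```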

    Comparison of Deep Neural Networks and Deep Hierarchical Models for Spatio-Temporal Data

    Spatio-temporal data are ubiquitous in the agricultural, ecological, and environmental sciences, and their study is important for understanding and predicting a wide variety of processes. One of the difficulties with modeling spatial processes that change in time is the complexity of the dependence structures that must describe how such a process varies, together with the presence of high-dimensional complex data sets and large prediction domains. It is particularly challenging to specify parameterizations for nonlinear dynamic spatio-temporal models (DSTMs) that are simultaneously useful scientifically and efficient computationally. Statisticians have developed deep hierarchical models that can accommodate process complexity as well as the uncertainties in predictions and inference; however, these models can be expensive to fit and are typically application specific. On the other hand, the machine learning community has developed alternative "deep learning" approaches for nonlinear spatio-temporal modeling. These models are flexible yet are typically not implemented in a probabilistic framework. The two paradigms have much in common and suggest hybrid approaches that can benefit from elements of each framework. This overview paper presents a brief introduction to the deep hierarchical DSTM (DH-DSTM) framework and to deep models in machine learning, culminating with the deep neural DSTM (DN-DSTM). Recent approaches that combine elements from DH-DSTMs and echo state network DN-DSTMs are presented as illustrations. Comment: 26 pages, including 6 figures and references.
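
    For readers unfamiliar with the echo state networks mentioned as the machine-learning ingredient, a minimal reservoir sketch follows: a fixed random recurrent state update with only the linear readout trained, here by ridge regression on synthetic data. All sizes and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 3, 200
W_in = rng.uniform(-0.1, 0.1, (n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1

def run_reservoir(X):                  # X: (T, n_in) -> states (T, n_res)
    h, states = np.zeros(n_res), []
    for x in X:
        h = np.tanh(W @ h + W_in @ x)  # fixed, untrained recurrence
        states.append(h)
    return np.stack(states)

X, y = rng.normal(size=(500, n_in)), rng.normal(size=500)
H = run_reservoir(X)
w_out = np.linalg.solve(H.T @ H + 1e-2 * np.eye(n_res), H.T @ y)  # ridge readout
```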