A Closer Look at Spatiotemporal Convolutions for Action Recognition
In this paper we discuss several forms of spatiotemporal convolutions for
video analysis and study their effects on action recognition. Our motivation
stems from the observation that 2D CNNs applied to individual frames of the
video have remained solid performers in action recognition. In this work we
empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within
the framework of residual learning. Furthermore, we show that factorizing the
3D convolutional filters into separate spatial and temporal components yields
significant advantages in accuracy. Our empirical study leads to the design
of a new spatiotemporal convolutional block "R(2+1)D" which gives rise to CNNs
that achieve results comparable or superior to the state-of-the-art on
Sports-1M, Kinetics, UCF101, and HMDB51.
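The factorization can be made concrete with a quick parameter count. The sketch below is a back-of-envelope check, not the authors' code; it assumes the paper's rule of choosing the intermediate channel width M so that the factorized "(2+1)D" block matches the parameter budget of the full 3D convolution it replaces.

```python
def conv3d_params(c_in, c_out, t, d):
    """Parameters of a full t x d x d 3D convolution."""
    return c_out * c_in * t * d * d

def r2plus1d_params(c_in, c_out, t, d):
    """Parameters of the factorized block: a 1 x d x d spatial
    convolution into m channels followed by a t x 1 x 1 temporal
    convolution, with m chosen to roughly match the 3D budget."""
    m = (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)
    spatial = m * c_in * d * d
    temporal = c_out * m * t
    return spatial + temporal, m

full = conv3d_params(64, 64, 3, 3)       # a 3x3x3 kernel, 64 -> 64 channels
fact, m = r2plus1d_params(64, 64, 3, 3)
print(full, fact, m)                     # same budget, but two layers
```

With the budgets matched, any accuracy gain comes from the extra nonlinearity and the easier optimization of the two-stage block rather than from added capacity.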
Spatio-Temporal FAST 3D Convolutions for Human Action Recognition
Effective processing of video input is essential for the recognition of
temporally varying events such as human actions. Motivated by the often
distinctive temporal characteristics of actions in either horizontal or
vertical direction, we introduce a novel convolution block for CNN
architectures with video input. Our proposed Fractioned Adjacent Spatial and
Temporal (FAST) 3D convolutions are a natural decomposition of a regular 3D
convolution. Each convolution block consists of three sequential convolution
operations: a 2D spatial convolution followed by spatio-temporal convolutions
in the horizontal and vertical direction, respectively. Additionally, we
introduce a FAST variant that treats horizontal and vertical motion in
parallel. Experiments on benchmark action recognition datasets UCF-101 and
HMDB-51 with ResNet architectures demonstrate consistently improved performance
of FAST 3D convolution blocks over traditional 3D convolutions. The lower
validation loss indicates better generalization, especially for deeper
networks. We also evaluate the performance of CNN architectures with similar
memory requirements, based either on Two-stream networks or with 3D convolution
blocks. DenseNet-121 with FAST 3D convolutions was shown to perform best,
giving further evidence of the merits of the decoupled spatio-temporal
convolutions.
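A rough way to compare the two designs is to count weights per block. The sketch below assumes every stage maps C channels to C channels, which the actual FAST block need not do; it only illustrates how the three thin kernels relate to a full 3D kernel.

```python
def full3d_weights(c, t, h, w):
    """Weights of a full t x h x w 3D convolution, C -> C channels."""
    return c * c * t * h * w

def fast3d_weights(c, t, h, w):
    """Weights of the three FAST stages (assumed C -> C channels each)."""
    spatial    = c * c * 1 * h * w   # 1 x h x w spatial stage
    horizontal = c * c * t * 1 * w   # t x 1 x w temporal-horizontal stage
    vertical   = c * c * t * h * 1   # t x h x 1 temporal-vertical stage
    return spatial + horizontal + vertical

print(full3d_weights(64, 3, 3, 3), fast3d_weights(64, 3, 3, 3))
print(full3d_weights(64, 5, 5, 5), fast3d_weights(64, 5, 5, 5))
```

For 3x3x3 kernels the two budgets coincide (9 + 9 + 9 = 27 weights per channel pair), while for larger kernels the decomposition is strictly cheaper, so the reported gains are not simply a capacity effect.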
Enhancing representation learning with tensor decompositions for knowledge graphs and high dimensional sequence modeling
The capability of processing and digesting raw data is one of the key features of a human-like artificial intelligence system. For instance, real-time machine translation should be able to process and understand spoken natural language, and autonomous driving relies on the comprehension of visual inputs. Representation learning is a class of machine learning techniques that autonomously learn to derive latent features from raw data. These new features are expected to represent the data instances in a vector space that facilitates the machine learning task. This thesis studies two specific data situations that require efficient representation learning: knowledge graph data and high dimensional sequences.
In the first part of this thesis, we review multiple relational learning models based on tensor decomposition for knowledge graphs. We point out that relational learning is in fact a means of learning representations through a one-hot mapping of entities. Furthermore, we generalize this mapping function to consume a feature vector that encodes all known facts about each entity. This enables the relational model to derive the latent representation of a new entity instantly, without re-training the tensor decomposition.
In the second part, we focus on learning representations from high dimensional sequential data. Sequential data often pose the challenge of variable length. Electronic health records, for instance, consist of clinical events recorded at successive time steps, but each patient may have a medical history of a different length. We apply recurrent neural networks to produce fixed-size latent representations from raw feature sequences of varying lengths. By exposing a prediction model to these learned representations instead of the raw features, we can predict therapy prescriptions more accurately as a means of clinical decision support. We further propose Tensor-Train recurrent neural networks. We give a detailed introduction to the technique of tensorizing and decomposing large weight matrices into a few smaller tensors, and present the algorithms for performing the forward pass and back-propagation in this setting.
Then we apply this approach to the input-to-hidden weight matrix of recurrent neural networks. This novel architecture can process extremely high dimensional sequential features such as video data. The model also offers a promising solution for sequential features with high sparsity, as is the case with electronic health records, which are often categorical and must be binary-coded. We incorporate a statistical survival model with this representation learning model, which shows superior prediction quality.
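The tensorization idea can be illustrated with numpy. Below, a weight matrix is represented by two Tensor-Train cores (all mode sizes and the TT-rank are illustrative, not taken from the thesis); the forward pass contracts the cores with the reshaped input directly, so the full matrix never needs to be materialized.

```python
import numpy as np

rng = np.random.default_rng(1)
m1, m2, n1, n2, r = 4, 8, 5, 6, 3   # output/input mode sizes, TT-rank

# Two TT-cores jointly representing a (m1*m2) x (n1*n2) weight matrix:
#   W[(i,k),(j,l)] = sum_b G1[0,i,j,b] * G2[b,k,l,0]
G1 = rng.normal(size=(1, m1, n1, r))
G2 = rng.normal(size=(r, m2, n2, 1))

# Reconstruct W only to verify the format.
W = np.einsum('aijb,bklc->ikjl', G1, G2).reshape(m1 * m2, n1 * n2)

x = rng.normal(size=n1 * n2)
y_full = W @ x

# TT forward pass: contract the cores with the reshaped input vector.
y_tt = np.einsum('aijb,bklc,jl->ik', G1, G2, x.reshape(n1, n2)).reshape(-1)
assert np.allclose(y_full, y_tt)

print(G1.size + G2.size, W.size)   # far fewer parameters than the full matrix
```

The savings grow with the dimensions: the cores scale with the mode sizes and the TT-rank, not with the product of the full matrix dimensions, which is what makes extremely high dimensional inputs such as video frames tractable.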