Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition
In this paper we address the problem of human action recognition from video
sequences. Inspired by the exemplary results obtained via automatic feature
learning and deep learning approaches in computer vision, we focus on
learning salient spatial features via a convolutional neural
network (CNN) and then mapping their temporal relationships with
Long Short-Term Memory (LSTM) networks. Our contribution in this paper is a
deep fusion framework that more effectively combines spatial features from CNNs
with temporal features from LSTM models. We also extensively evaluate their
strengths and weaknesses. We find that, by combining both sets of features,
the fully connected features effectively act as an attention mechanism to
direct the LSTM to interesting parts of the convolutional feature sequence. The
significance of our fusion method is its simplicity and effectiveness compared
to other state-of-the-art methods. The evaluation results demonstrate that this
hierarchical multi-stream fusion method outperforms single-stream mapping
methods, achieving high accuracy and outperforming current state-of-the-art
methods on three widely used databases: UCF11, UCFSports, and jHMDB.
Comment: Published as a conference paper at WACV 201
Integrated Inference and Learning of Neural Factors in Structural Support Vector Machines
Tackling pattern recognition problems in areas such as computer vision,
bioinformatics, speech or text recognition is often done best by taking into
account task-specific statistical relations between output variables. In
structured prediction, this internal structure is used to predict multiple
outputs simultaneously, leading to more accurate and coherent predictions.
Structural support vector machines (SSVMs) are nonprobabilistic models that
optimize a joint input-output function through margin-based learning. Because
SSVMs generally disregard the interplay between unary and interaction factors
during the training phase, the final parameters are suboptimal. Moreover, their
factors are often restricted to linear combinations of input features, limiting
their generalization power. To improve prediction accuracy, this paper proposes
(i) joint inference and learning by integrating back-propagation and
loss-augmented inference in SSVM subgradient descent, and (ii) extending SSVM
factors to neural networks that form highly nonlinear functions of input
features. Image segmentation benchmark results demonstrate improvements in
accuracy over conventional SSVM training methods, highlighting the
feasibility of end-to-end SSVM training with neural factors.
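
The loss-augmented inference step mentioned in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the chain-structured model, the Hamming task loss, the brute-force search over labelings, and all function names are assumptions chosen to keep the example short. The subgradient with respect to the unary scores is the signal that would be back-propagated into a neural factor producing those scores.

```python
from itertools import product

def score(unary, pairwise, y):
    """Joint score of labeling y: unary factors plus chain pairwise factors."""
    s = sum(unary[i][y[i]] for i in range(len(y)))
    s += sum(pairwise[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return s

def hamming(y, y_true):
    """Task loss (assumed here): number of positions labeled incorrectly."""
    return sum(a != b for a, b in zip(y, y_true))

def loss_augmented_inference(unary, pairwise, y_true):
    """Find the labeling maximizing score + task loss (brute force, tiny inputs only)."""
    n_labels = len(unary[0])
    candidates = product(range(n_labels), repeat=len(y_true))
    return max(candidates,
               key=lambda y: score(unary, pairwise, y) + hamming(y, y_true))

def structured_hinge(unary, pairwise, y_true):
    """Structured hinge loss and the most violating labeling."""
    y_hat = loss_augmented_inference(unary, pairwise, y_true)
    margin = (score(unary, pairwise, y_hat) + hamming(y_hat, y_true)
              - score(unary, pairwise, y_true))
    return max(0.0, margin), y_hat

def unary_subgradient(y_hat, y_true, n_labels):
    """Subgradient of the hinge loss w.r.t. each unary score:
    +1 where y_hat selects a label, -1 where y_true selects it."""
    grad = [[0.0] * n_labels for _ in y_true]
    for i, (a, b) in enumerate(zip(y_hat, y_true)):
        grad[i][a] += 1.0
        grad[i][b] -= 1.0
    return grad

# Two positions, two labels; hypothetical scores for illustration.
unary = [[0.0, 0.0], [0.0, 0.0]]
pairwise = [[0.0, 0.0], [0.0, 0.0]]
loss, y_hat = structured_hinge(unary, pairwise, (0, 1))
grad = unary_subgradient(y_hat, (0, 1), 2)
```

In a realistic setting the brute-force search would be replaced by an inference routine suited to the model's structure, and the subgradient would be passed to an automatic differentiation framework instead of being used directly.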