Minimum-risk temporal alignment of videos
© 2017, Springer Science+Business Media, LLC. Temporal alignment of videos is an important requirement of tasks such as video comparison, analysis and classification. Most of the approaches proposed to date for video alignment rely on dynamic programming algorithms whose parameters are manually tuned. Conversely, this paper proposes a model that can learn its parameters automatically by minimizing a meaningful loss function over a given training set of videos and alignments. For learning, we exploit the effective framework of structural SVM and extend it with an original scoring function that scores the alignment of two given videos, and a loss function that quantifies the accuracy of a predicted alignment. The experimental results from four video action datasets show that the proposed model outperforms a baseline and a state-of-the-art algorithm by a large margin in terms of alignment accuracy.
Minimum-risk sequence alignment for the alignment and recognition of action videos
University of Technology Sydney. Faculty of Engineering and Information Technology. Temporal alignment of videos is an important requirement of tasks such as video comparison, analysis and classification. In the context of action analysis and action recognition, the temporal alignment is guided mainly by the human actions depicted in the videos. While well-established alignment algorithms such as dynamic time warping are available, they still rely heavily on basic linear cost models and heuristic parameter tuning. Inspired by the success of the hidden Markov support vector machine for pairwise alignment of protein sequences, in this thesis we present a novel framework which combines the flexibility of a pair hidden Markov model (PHMM) with the effective parameter training of the structural support vector machine (SSVM). The framework extends the scoring function of SSVM to capture the similarity of two input frame sequences and introduces suitable feature and loss functions. During learning, we leverage these loss functions for regularised empirical risk minimisation and effective parameter selection.
We have carried out extensive experiments comparing the proposed technique (nicknamed EHMM-SSVM) against state-of-the-art algorithms such as dynamic time warping (DTW) and generalized canonical time warping (GCTW) on pairs of human actions from four well-known datasets. The results show that the proposed model outperforms the compared algorithms by a large margin in terms of alignment accuracy.
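For readers unfamiliar with the DTW baseline named above, the following is a minimal sketch of classic dynamic time warping between two frame-feature sequences. It is not the EHMM-SSVM model; the function name `dtw_align` and the Euclidean frame cost are assumptions chosen for illustration.

```python
import numpy as np

def dtw_align(x, y):
    """Classic DTW between two sequences of frame features.

    x has shape (n, d) and y has shape (m, d).
    Returns (total_cost, path), where path lists aligned (i, j) frame pairs.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # Assumed local cost: Euclidean distance between every pair of frames.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    # acc[i, j] = cheapest cost of aligning the first i frames of x
    # with the first j frames of y.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the last cell to the first to recover the warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return float(acc[n, m]), path

# Toy usage: the second sequence repeats its middle frame.
total, path = dtw_align([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])
```

The fixed local cost and the equal weighting of the three moves are the kind of hand-set choices that the abstract contrasts with learned parameters.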
In the second part of this thesis we employ our alignment approach to tackle the task of human action recognition in video. This task is highly challenging due to the substantial variations in motion performance, recording settings and inter-personal differences. Most current research focuses on the extraction of effective features and the design of suitable classifiers. Conversely, in this thesis we tackle this problem with a dissimilarity-based approach where classification is performed in terms of minimum distance from templates and where the distance is based on the score of our alignment model, the EHMM-SSVM. In turn, the templates are chosen by means of prototype selection techniques from the available samples of each class. Experimental results over two popular human action datasets have shown that the proposed approach achieves an accuracy higher than many existing methods and comparable to a state-of-the-art action classification algorithm.
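The minimum-distance-from-templates rule described above can be sketched as follows. The thesis derives its distance from the EHMM-SSVM alignment score, which is not reproduced here; `classify_by_templates` and the `distance` callable are hypothetical names, and any dissimilarity function can be plugged in.

```python
def classify_by_templates(query, templates, distance):
    """Assign the label of the nearest template.

    templates maps each class label to a list of template sequences.
    distance is any callable returning a dissimilarity (lower is closer).
    """
    best_label, best_dist = None, float("inf")
    for label, examples in templates.items():
        for template in examples:
            d = distance(query, template)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label, best_dist

# Toy usage with a stand-in distance (difference of sequence means).
def mean_gap(a, b):
    return abs(sum(a) / len(a) - sum(b) / len(b))

templates = {"wave": [[0, 1, 0, 1]], "jump": [[5, 5, 5]]}
label, dist = classify_by_templates([0, 1, 1, 0], templates, mean_gap)
```

Prototype selection, as mentioned in the abstract, would decide which samples populate `templates`; this sketch takes them as given.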
Weakly Supervised Action Learning with RNN based Fine-to-coarse Modeling
We present an approach for weakly supervised learning of human actions. Given
a set of videos and an ordered list of the occurring actions, the goal is to
infer start and end frames of the related action classes within the video and
to train the respective action classifiers without any need for hand labeled
frame boundaries. To address this task, we propose a combination of a
discriminative representation of subactions, modeled by a recurrent neural
network, and a coarse probabilistic model to allow for a temporal alignment and
inference over long sequences. While this system alone already generates good
results, we show that the performance can be further improved by approximating
the number of subactions to the characteristics of the different action
classes. To this end, we adapt the number of subaction classes by iterating
realignment and reestimation during training. The proposed system is evaluated
on two benchmark datasets, the Breakfast and the Hollywood extended dataset,
showing a competitive performance on various weak learning tasks such as
temporal action segmentation and action alignment.
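The realignment step mentioned above, inferring frame boundaries from an ordered action list, can be illustrated with a plain Viterbi-style dynamic program. This is a simplified sketch rather than the paper's RNN and coarse probabilistic model: it assumes per-frame class log-probabilities are already available, and `align_to_transcript` is a hypothetical name.

```python
import numpy as np

def align_to_transcript(log_probs, transcript):
    """Best monotonic assignment of frames to an ordered action list.

    log_probs has shape (T, C): per-frame log-probability of each class.
    transcript is the ordered list of class indices occurring in the video.
    Every transcript entry receives at least one frame.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    T, K = log_probs.shape[0], len(transcript)
    if K == 0 or T < K:
        raise ValueError("need at least one frame per transcript entry")
    score = np.full((T, K), -np.inf)
    stay = np.zeros((T, K), dtype=bool)  # True: frame t-1 had the same entry
    score[0, 0] = log_probs[0, transcript[0]]
    for t in range(1, T):
        for k in range(min(t + 1, K)):
            keep = score[t - 1, k]
            move = score[t - 1, k - 1] if k > 0 else -np.inf
            stay[t, k] = keep >= move
            score[t, k] = max(keep, move) + log_probs[t, transcript[k]]
    # Backtrack from the last frame, which must carry the last action.
    labels = [0] * T
    k = K - 1
    for t in range(T - 1, -1, -1):
        labels[t] = transcript[k]
        if t > 0 and not stay[t, k]:
            k -= 1
    return labels

# Toy usage: two classes, transcript says class 0 is followed by class 1.
frames = np.log([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])
labels = align_to_transcript(frames, [0, 1])
```

In an iterative scheme of the kind the abstract describes, labels like these would be used to re-estimate the classifier before aligning again.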
Connectionist Temporal Modeling for Weakly Supervised Action Labeling
We propose a weakly-supervised framework for action labeling in video, where
only the order of occurring actions is required during training time. The key
challenge is that the per-frame alignments between the input (video) and label
(action) sequences are unknown during training. We address this by introducing
the Extended Connectionist Temporal Classification (ECTC) framework to
efficiently evaluate all possible alignments via dynamic programming and
explicitly enforce their consistency with frame-to-frame visual similarities.
This protects the model from distractions of visually inconsistent or
degenerated alignments without the need of temporal supervision. We further
extend our framework to the semi-supervised case when a few frames are sparsely
annotated in a video. With less than 1% of labeled frames per video, our method
is able to outperform existing semi-supervised approaches and achieve
comparable performance to that of fully supervised approaches.
Comment: To appear in ECCV 201
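As background for the idea of evaluating all possible alignments via dynamic programming, here is a sketch of the standard CTC forward recursion in log space. It is plain CTC, not the ECTC extension of the abstract, so the frame-to-frame visual similarity term is deliberately absent; `ctc_log_likelihood` and the choice of index 0 for the blank symbol are assumptions.

```python
import numpy as np

def ctc_log_likelihood(log_probs, labels, blank=0):
    """Log-probability of a label sequence, summed over all CTC alignments.

    log_probs has shape (T, C): per-frame log-probabilities over C symbols.
    labels is the target sequence without blanks.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    T = log_probs.shape[0]
    # Interleave blanks: [a, b] becomes [blank, a, blank, b, blank].
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]
            if s >= 1:
                terms.append(alpha[t - 1, s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]
    tail = [alpha[T - 1, S - 1]]
    if S > 1:
        tail.append(alpha[T - 1, S - 2])
    return float(np.logaddexp.reduce(tail))

# Toy usage: 2 frames, symbols {blank, a}, uniform frame probabilities.
uniform = np.log(np.full((2, 2), 0.5))
ll = ctc_log_likelihood(uniform, [1])
```

In the toy case the three valid alignments ("aa", "a-", "-a") each have probability 0.25, so the summed likelihood is 0.75.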
Automatic Analysis of Facial Expressions Based on Deep Covariance Trajectories
In this paper, we propose a new approach for facial expression recognition
using deep covariance descriptors. The solution is based on the idea of
encoding local and global Deep Convolutional Neural Network (DCNN) features
extracted from still images, in compact local and global covariance
descriptors. The space geometry of the covariance matrices is that of Symmetric
Positive Definite (SPD) matrices. By conducting the classification of static
facial expressions using Support Vector Machine (SVM) with a valid Gaussian
kernel on the SPD manifold, we show that deep covariance descriptors are more
effective than the standard classification with fully connected layers and
softmax. Besides, we propose a completely new and original solution to model
the temporal dynamics of facial expressions as deep trajectories on the SPD
manifold. As an extension of the classification pipeline of covariance
descriptors, we apply SVM with valid positive definite kernels derived from
global alignment for deep covariance trajectories classification. By performing
extensive experiments on the Oulu-CASIA, CK+, and SFEW datasets, we show that
both the proposed static and dynamic approaches achieve state-of-the-art
performance for facial expression recognition outperforming many recent
approaches.
Comment: A preliminary version of this work appeared in "Otberdout N, Kacem A,
Daoudi M, Ballihi L, Berretti S. Deep Covariance Descriptors for Facial
Expression Recognition, in British Machine Vision Conference 2018, BMVC 2018,
Northumbria University, Newcastle, UK, September 3-6, 2018. ; 2018 :159."
arXiv admin note: substantial text overlap with arXiv:1805.0386
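To make the covariance-descriptor idea concrete, the sketch below builds an SPD covariance matrix from a set of local feature vectors and compares two such matrices with a Gaussian kernel. The abstract does not state which kernel is used; the log-Euclidean Gaussian kernel here is an assumption, picked because it is a known positive definite kernel on SPD matrices. All function names and the `eps` regulariser are illustrative.

```python
import numpy as np

def covariance_descriptor(features, eps=1e-6):
    """Covariance of local features, shape (n, d) with d >= 2, made strictly SPD."""
    features = np.asarray(features, dtype=float)
    cov = np.cov(features, rowvar=False)
    # A small ridge keeps every eigenvalue positive so the log is defined.
    return cov + eps * np.eye(cov.shape[0])

def spd_log(mat):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * np.log(eigvals)) @ eigvecs.T

def log_euclidean_gaussian_kernel(a, b, gamma=1.0):
    """exp(-gamma * ||log(a) - log(b)||_F^2) for SPD matrices a and b."""
    diff = spd_log(a) - spd_log(b)
    return float(np.exp(-gamma * np.sum(diff ** 2)))

# Toy usage with random stand-ins for DCNN feature vectors.
rng = np.random.default_rng(0)
desc_a = covariance_descriptor(rng.normal(size=(50, 3)))
desc_b = covariance_descriptor(rng.normal(size=(50, 3)))
k_ab = log_euclidean_gaussian_kernel(desc_a, desc_b)
```

A kernel matrix built this way could be passed to an SVM that accepts precomputed kernels; the trajectory-level kernel from global alignment described in the abstract is not covered by this sketch.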