Exploiting temporal information for 3D pose estimation
In this work, we address the problem of 3D human pose estimation from a
sequence of 2D human poses. Although the recent success of deep networks has
led many state-of-the-art methods for 3D pose estimation to train deep networks
end-to-end to predict from images directly, the top-performing approaches have
shown the effectiveness of dividing the task of 3D pose estimation into two
steps: using a state-of-the-art 2D pose estimator to estimate 2D poses from
images and then mapping them into 3D space. They also showed that a
low-dimensional representation like 2D locations of a set of joints can be
discriminative enough to estimate 3D pose with high accuracy. However,
estimation of 3D pose for individual frames leads to temporally incoherent
estimates due to independent error in each frame causing jitter. Therefore, in
this work we utilize the temporal information across a sequence of 2D joint
locations to estimate a sequence of 3D poses. We designed a
sequence-to-sequence network composed of layer-normalized LSTM units with
shortcut connections connecting the input to the output on the decoder side and
imposed a temporal smoothness constraint during training. We found that
exploiting temporal consistency improves the best reported result on the
Human3.6M dataset and helps our network recover temporally consistent 3D
poses over a sequence of images even when the 2D pose detector fails.
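The temporal smoothness idea above can be sketched as a training penalty on frame-to-frame jitter. This is a minimal illustration, not the paper's published loss: the function name, the squared-difference form of the penalty, and the weight `lambda_t` are all assumptions for exposition.

```python
import numpy as np

def pose_sequence_loss(pred, target, lambda_t=0.1):
    """Joint-position MSE plus a temporal smoothness penalty.

    pred, target: (T, J, 3) arrays of 3D joint positions over T frames.
    lambda_t: illustrative weight on the smoothness term (an assumption).
    """
    mse = np.mean((pred - target) ** 2)
    # Penalize large changes between consecutive frames, which suppresses
    # the per-frame jitter that independent single-frame estimates produce.
    smooth = np.mean((pred[1:] - pred[:-1]) ** 2)
    return mse + lambda_t * smooth
```

In this toy form, a sequence whose predictions oscillate frame to frame incurs a higher loss than one with the same average error spread smoothly over time.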
Encouraging LSTMs to Anticipate Actions Very Early
In contrast to the widely studied problem of recognizing an action given a
complete sequence, action anticipation aims to identify the action from only
partially available videos. It is therefore key to the success of computer
vision applications that must react as early as possible, such as
autonomous navigation. In this paper, we propose a new action anticipation
method that achieves high prediction accuracy even in the presence of a very
small percentage of a video sequence. To this end, we develop a multi-stage
LSTM architecture that leverages context-aware and action-aware features, and
introduce a novel loss function that encourages the model to predict the
correct class as early as possible. Our experiments on standard benchmark
datasets evidence the benefits of our approach: we outperform the
state-of-the-art action anticipation methods for early prediction by a relative
increase in accuracy of 22.0% on JHMDB-21, 14.0% on UT-Interaction and 49.9% on
UCF-101.

Comment: 13 pages, 7 figures, 11 tables. Accepted at ICCV 2017. arXiv admin
note: text overlap with arXiv:1611.0552
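A loss that "encourages the model to predict the correct class as early as possible" can be sketched as a per-frame cross-entropy whose weight grows over time, so a model that is already confident early pays little extra, while one that stays wrong late pays heavily. This is one plausible form under stated assumptions, not the exact loss from the paper; the linear weight schedule and function name are illustrative.

```python
import numpy as np

def anticipation_loss(probs, label, eps=1e-9):
    """Time-weighted cross-entropy over a partially observed video.

    probs: (T, C) array of per-frame class probabilities.
    label: index of the ground-truth action class.
    """
    T = probs.shape[0]
    weights = np.arange(1, T + 1) / T       # later frames weigh more
    ce = -np.log(probs[:, label] + eps)     # per-frame cross-entropy
    return float(np.mean(weights * ce))
```

Because every frame contributes, the cheapest way to lower the loss is to become confident in the correct class as early in the sequence as possible.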
Objects2action: Classifying and localizing actions without any video example
The goal of this paper is to recognize actions in video without the need for
examples. Different from traditional zero-shot approaches, we do not demand the
design and specification of attribute classifiers and class-to-attribute
mappings to allow for transfer from seen classes to unseen classes. Our key
contribution is objects2action, a semantic word embedding that is spanned by a
skip-gram model of thousands of object categories. Action labels are assigned
to an object encoding of unseen video based on a convex combination of action
and object affinities. Our semantic embedding has three main characteristics to
accommodate for the specifics of actions. First, we propose a mechanism to
exploit multiple-word descriptions of actions and objects. Second, we
incorporate the automated selection of the most responsive objects per action.
And finally, we demonstrate how to extend our zero-shot approach to the
spatio-temporal localization of actions in video. Experiments on four action
datasets demonstrate the potential of our approach.
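The "convex combination of action and object affinities" can be sketched as follows: the video's object-classifier posterior weights the similarity between object and action word embeddings. This is a schematic reading of the abstract, assuming cosine similarity in the embedding space; the function and variable names are illustrative, not the paper's API.

```python
import numpy as np

def zero_shot_action_scores(video_obj_probs, obj_emb, action_emb):
    """Score unseen actions for a video without any video training examples.

    video_obj_probs: (n_objects,) posterior over object categories for the video
                     (a convex combination: non-negative, sums to 1).
    obj_emb:    (n_objects, d) word embeddings of object names.
    action_emb: (n_actions, d) word embeddings of action names.
    """
    # Cosine similarity between every object and every action embedding.
    o = obj_emb / np.linalg.norm(obj_emb, axis=1, keepdims=True)
    a = action_emb / np.linalg.norm(action_emb, axis=1, keepdims=True)
    affinity = o @ a.T                        # (n_objects, n_actions)
    # Object posterior weights the object-action affinities.
    return video_obj_probs @ affinity         # (n_actions,)
```

A video dominated by objects whose embeddings lie near an action's embedding receives a high score for that action, which is what lets labels transfer without video examples.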
Hand gesture recognition with jointly calibrated Leap Motion and depth sensor
Novel 3D acquisition devices like depth cameras and the Leap Motion have recently reached the market. Depth cameras make it possible to obtain a complete 3D description of the framed scene, while the Leap Motion sensor is a device explicitly targeted at hand gesture recognition and provides only a limited set of relevant points. This paper shows how to jointly exploit the two types of sensors for accurate gesture recognition. An ad-hoc solution for the joint calibration of the two devices is first presented. Then a set of novel feature descriptors is introduced both for the Leap Motion and for depth data. Various schemes based on the distances of the hand samples from the centroid, on the curvature of the hand contour and on the convex hull of the hand shape are employed, and the use of Leap Motion data to aid feature extraction is also considered. The proposed feature sets are fed to two different classifiers, one based on multi-class SVMs and one exploiting Random Forests. Different feature selection algorithms have also been tested in order to reduce the complexity of the approach. Experimental results show that a very high accuracy can be obtained with the proposed method. The current implementation is also able to run in real time.
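One of the cues named above, the distances of hand samples from the centroid, can be turned into a fixed-length descriptor by histogramming normalized distances. This is a minimal sketch of that general idea, not the paper's exact descriptor: the bin count, normalization scheme, and function name are assumptions.

```python
import numpy as np

def centroid_distance_descriptor(points, n_bins=16):
    """Histogram of sample distances from the hand centroid.

    points: (N, 3) array of 3D hand samples (e.g. from a depth camera).
    Returns an (n_bins,) descriptor that sums to 1; dividing by the maximum
    distance makes it roughly scale-invariant.
    """
    centroid = points.mean(axis=0)
    d = np.linalg.norm(points - centroid, axis=1)
    d = d / (d.max() + 1e-9)                  # normalize to [0, 1)
    hist, _ = np.histogram(d, bins=n_bins, range=(0.0, 1.0))
    return hist / hist.sum()
```

A descriptor like this, concatenated with curvature and convex-hull features, is the kind of fixed-length vector that can be fed directly to a multi-class SVM or Random Forest.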