82,833 research outputs found
Combining Local Appearance and Holistic View: Dual-Source Deep Neural Networks for Human Pose Estimation
We propose a new learning-based method for estimating 2D human pose from a
single image, using Dual-Source Deep Convolutional Neural Networks (DS-CNN).
Recently, many methods have been developed to estimate human pose by using pose
priors that are estimated from physiologically inspired graphical models or
learned from a holistic perspective. In this paper, we propose to integrate
both the local (body) part appearance and the holistic view of each local part
for more accurate human pose estimation. Specifically, the proposed DS-CNN
takes a set of image patches (category-independent object proposals for
training and multi-scale sliding windows for testing) as the input and then
learns the appearance of each local part by considering their holistic views in
the full body. Using DS-CNN, we achieve both joint detection, which determines
whether an image patch contains a body joint, and joint localization, which
finds the exact location of the joint in the image patch. Finally, we develop
an algorithm to combine these joint detection/localization results from all the
image patches for estimating the human pose. The experimental results show the
effectiveness of the proposed method by comparing to the state-of-the-art
human-pose estimation methods based on pose priors that are estimated from
physiologically inspired graphical models or learned from a holistic
perspective.Comment: CVPR 201
A robust and efficient video representation for action recognition
This paper introduces a state-of-the-art video representation and applies it
to efficient action recognition and detection. We first propose to improve the
popular dense trajectory features by explicit camera motion estimation. More
specifically, we extract feature point matches between frames using SURF
descriptors and dense optical flow. The matches are used to estimate a
homography with RANSAC. To improve the robustness of homography estimation, a
human detector is employed to remove outlier matches from the human body as
human motion is not constrained by the camera. Trajectories consistent with the
homography are considered as due to camera motion, and thus removed. We also
use the homography to cancel out camera motion from the optical flow. This
results in significant improvement on motion-based HOF and MBH descriptors. We
further explore the recent Fisher vector as an alternative feature encoding
approach to the standard bag-of-words histogram, and consider different ways to
include spatial layout information in these encodings. We present a large and
varied set of evaluations, considering (i) classification of short basic
actions on six datasets, (ii) localization of such actions in feature-length
movies, and (iii) large-scale recognition of complex events. We find that our
improved trajectory features significantly outperform previous dense
trajectories, and that Fisher vectors are superior to bag-of-words encodings
for video recognition tasks. In all three tasks, we show substantial
improvements over the state-of-the-art results
Deep Semantic Classification for 3D LiDAR Data
Robots are expected to operate autonomously in dynamic environments.
Understanding the underlying dynamic characteristics of objects is a key
enabler for achieving this goal. In this paper, we propose a method for
pointwise semantic classification of 3D LiDAR data into three classes:
non-movable, movable and dynamic. We concentrate on understanding these
specific semantics because they characterize important information required for
an autonomous system. Non-movable points in the scene belong to unchanging
segments of the environment, whereas the remaining classes corresponds to the
changing parts of the scene. The difference between the movable and dynamic
class is their motion state. The dynamic points can be perceived as moving,
whereas movable objects can move, but are perceived as static. To learn the
distinction between movable and non-movable points in the environment, we
introduce an approach based on deep neural network and for detecting the
dynamic points, we estimate pointwise motion. We propose a Bayes filter
framework for combining the learned semantic cues with the motion cues to infer
the required semantic classification. In extensive experiments, we compare our
approach with other methods on a standard benchmark dataset and report
competitive results in comparison to the existing state-of-the-art.
Furthermore, we show an improvement in the classification of points by
combining the semantic cues retrieved from the neural network with the motion
cues.Comment: 8 pages to be published in IROS 201
- …