19,550 research outputs found
Mixing Body-Part Sequences for Human Pose Estimation
International audienceIn this paper, we present a method for estimating articulated human poses in videos. We cast this as an optimization problem defined on body parts with spatio-temporal links between them. The resulting formulation is unfortunately intractable and previous approaches only provide approximate solutions. Although such methods perform well on certain body parts, e.g., head, their performance on lower arms, i.e., elbows and wrists, remains poor. We present a new approximate scheme with two steps dedicated to pose estimation. First, our approach takes into account temporal links with subsequent frames for the less-certain parts, namely elbows and wrists. Second, our method decomposes poses into limbs, generates limb sequences across time, and recomposes poses by mixing these body part sequences. We introduce a new dataset "Poses in the Wild", which is more challenging than the existing ones, with sequences containing background clutter, occlusions, and severe camera motion. We experimentally compare our method with recent approaches on this new dataset as well as on two other benchmark datasets, and show significant improvement
3D human pose estimation from depth maps using a deep combination of poses
Many real-world applications require the estimation of human body joints for
higher-level tasks as, for example, human behaviour understanding. In recent
years, depth sensors have become a popular approach to obtain three-dimensional
information. The depth maps generated by these sensors provide information that
can be employed to disambiguate the poses observed in two-dimensional images.
This work addresses the problem of 3D human pose estimation from depth maps
employing a Deep Learning approach. We propose a model, named Deep Depth Pose
(DDP), which receives a depth map containing a person and a set of predefined
3D prototype poses and returns the 3D position of the body joints of the
person. In particular, DDP is defined as a ConvNet that computes the specific
weights needed to linearly combine the prototypes for the given input. We have
thoroughly evaluated DDP on the challenging 'ITOP' and 'UBC3V' datasets, which
respectively depict realistic and synthetic samples, defining a new
state-of-the-art on them.Comment: Accepted for publication at "Journal of Visual Communication and
Image Representation
RGB-D datasets using microsoft kinect or similar sensors: a survey
RGB-D data has turned out to be a very useful representation of an indoor scene for solving fundamental computer vision problems. It takes the advantages of the color image that provides appearance information of an object and also the depth image that is immune to the variations in color, illumination, rotation angle and scale. With the invention of the low-cost Microsoft Kinect sensor, which was initially used for gaming and later became a popular device for computer vision, high quality RGB-D data can be acquired easily. In recent years, more and more RGB-D image/video datasets dedicated to various applications have become available, which are of great importance to benchmark the state-of-the-art. In this paper, we systematically survey popular RGB-D datasets for different applications including object recognition, scene classification, hand gesture recognition, 3D-simultaneous localization and mapping, and pose estimation. We provide the insights into the characteristics of each important dataset, and compare the popularity and the difficulty of those datasets. Overall, the main goal of this survey is to give a comprehensive description about the available RGB-D datasets and thus to guide researchers in the selection of suitable datasets for evaluating their algorithms
Skeleton-aided Articulated Motion Generation
This work make the first attempt to generate articulated human motion
sequence from a single image. On the one hand, we utilize paired inputs
including human skeleton information as motion embedding and a single human
image as appearance reference, to generate novel motion frames, based on the
conditional GAN infrastructure. On the other hand, a triplet loss is employed
to pursue appearance-smoothness between consecutive frames. As the proposed
framework is capable of jointly exploiting the image appearance space and
articulated/kinematic motion space, it generates realistic articulated motion
sequence, in contrast to most previous video generation methods which yield
blurred motion effects. We test our model on two human action datasets
including KTH and Human3.6M, and the proposed framework generates very
promising results on both datasets.Comment: ACM MM 201
MX-LSTM: mixing tracklets and vislets to jointly forecast trajectories and head poses
Recent approaches on trajectory forecasting use tracklets to predict the
future positions of pedestrians exploiting Long Short Term Memory (LSTM)
architectures. This paper shows that adding vislets, that is, short sequences
of head pose estimations, allows to increase significantly the trajectory
forecasting performance. We then propose to use vislets in a novel framework
called MX-LSTM, capturing the interplay between tracklets and vislets thanks to
a joint unconstrained optimization of full covariance matrices during the LSTM
backpropagation. At the same time, MX-LSTM predicts the future head poses,
increasing the standard capabilities of the long-term trajectory forecasting
approaches. With standard head pose estimators and an attentional-based social
pooling, MX-LSTM scores the new trajectory forecasting state-of-the-art in all
the considered datasets (Zara01, Zara02, UCY, and TownCentre) with a dramatic
margin when the pedestrians slow down, a case where most of the forecasting
approaches struggle to provide an accurate solution.Comment: 10 pages, 3 figures to appear in CVPR 201
- …