Top-down human pose estimation with depth images and domain adaptation
In this paper, a method for human pose estimation using ToF (Time of Flight) cameras is proposed. A YOLO-based object detection method was used to develop a top-down approach. In the first stage, a network detects people in the image; in the second stage, a network estimates the joints of each person, using the image region produced by the first stage. We show that a deep learning network trained from scratch on ToF images yields better results than taking a deep neural network pretrained on RGB data and retraining it with ToF data. We also show that a top-down detector, with a person detector followed by a joint detector, works better than detecting the body joints over the entire image. This work is supported by: European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project no. 002797; Funding Reference: POCI-01-0247-FEDER-002797]
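As an illustration of the two-stage pipeline this abstract describes, here is a minimal Python sketch; the `detect` and `predict` interfaces are hypothetical stand-ins, not the paper's actual networks.

```python
import numpy as np

def estimate_poses(tof_image, person_detector, joint_estimator):
    """Two-stage top-down pose estimation on a single ToF depth image."""
    poses = []
    # Stage 1: detect person bounding boxes in the full depth image.
    for (x0, y0, x1, y1) in person_detector.detect(tof_image):
        # Stage 2: estimate joints only within each person crop,
        # rather than over the entire image.
        crop = tof_image[y0:y1, x0:x1]
        joints = joint_estimator.predict(crop)     # (J, 2) in crop coordinates
        poses.append(joints + np.array([x0, y0]))  # back to image coordinates
    return poses
```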
Feature Mapping for Learning Fast and Accurate 3D Pose Inference from Synthetic Images
We propose a simple and efficient method for exploiting synthetic images when
training a Deep Network to predict a 3D pose from an image. The ability of
using synthetic images for training a Deep Network is extremely valuable as it
is easy to create a virtually infinite training set made of such images, while
capturing and annotating real images can be very cumbersome. However, synthetic
images do not resemble real images exactly, and using them for training can
result in suboptimal performance. It was recently shown that for exemplar-based
approaches, it is possible to learn a mapping from the exemplar representations
of real images to the exemplar representations of synthetic images. In this
paper, we show that this approach is more general, and that a network can also
be applied after the mapping to infer a 3D pose: At run time, given a real
image of the target object, we first compute the features for the image, map
them to the feature space of synthetic images, and finally use the resulting
features as input to another network which predicts the 3D pose. Since this
network can be trained very effectively by using synthetic images, it performs
very well in practice, and inference is faster and more accurate than with an
exemplar-based approach. We demonstrate our approach on the LINEMOD dataset for
3D object pose estimation from color images, and the NYU dataset for 3D hand
pose estimation from depth maps. We show that it allows us to outperform the
state-of-the-art on both datasets. Comment: CVPR 2018
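To make the run-time pipeline concrete, a minimal PyTorch sketch follows; the module sizes and layer choices are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

feat_dim, pose_dim = 1024, 6  # e.g. translation + rotation parameters

feature_extractor = nn.Sequential(  # stand-in for a conv backbone
    nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
mapping = nn.Sequential(            # maps real features to the synthetic space
    nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
pose_net = nn.Sequential(           # trained purely on synthetic features
    nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, pose_dim))

def infer_pose(real_image):
    f_real = feature_extractor(real_image)  # features of the real image
    f_synth = mapping(f_real)               # mapped into the synthetic domain
    return pose_net(f_synth)                # 3D pose prediction
```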
Unsupervised 3D Pose Estimation with Geometric Self-Supervision
We present an unsupervised learning approach to recover 3D human pose from 2D
skeletal joints extracted from a single image. Our method does not require any
multi-view image data, 3D skeletons, correspondences between 2D-3D points, or
use previously learned 3D priors during training. A lifting network accepts 2D
landmarks as inputs and generates a corresponding 3D skeleton estimate. During
training, the recovered 3D skeleton is reprojected on random camera viewpoints
to generate new "synthetic" 2D poses. By lifting the synthetic 2D poses back to
3D and re-projecting them in the original camera view, we can define
self-consistency loss both in 3D and in 2D. The training can thus be self
supervised by exploiting the geometric self-consistency of the
lift-reproject-lift process. We show that self-consistency alone is not sufficient to generate realistic skeletons; however, adding a 2D pose discriminator enables the lifter to output valid 3D poses. Additionally, to
learn from 2D poses "in the wild", we train an unsupervised 2D domain adapter
network to allow for an expansion of 2D data. This improves results and
demonstrates the usefulness of 2D pose data for unsupervised 3D lifting.
Results on the Human3.6M dataset for 3D human pose estimation demonstrate that our approach improves upon the previous unsupervised methods by 30% and outperforms many weakly supervised approaches that explicitly use 3D data.
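The lift-reproject-lift consistency can be sketched in a few lines. The following assumes an arbitrary `lifter` network and uses a random rotation about the vertical axis with orthographic projection for brevity, a simplification of the paper's camera model.

```python
import torch
import torch.nn.functional as F

def self_consistency_loss(lifter, pose_2d):
    """pose_2d: (B, J, 2) batch of 2D skeletons; lifter: (B, J, 2) -> (B, J, 3)."""
    pose_3d = lifter(pose_2d)                   # first lift: 2D -> 3D
    B = pose_3d.shape[0]
    # Sample a random camera as a rotation about the vertical (y) axis.
    theta = torch.rand(B) * 2 * torch.pi
    c, s = torch.cos(theta), torch.sin(theta)
    z, o = torch.zeros(B), torch.ones(B)
    R = torch.stack([torch.stack([c, z, s], -1),
                     torch.stack([z, o, z], -1),
                     torch.stack([-s, z, c], -1)], dim=1)  # (B, 3, 3)
    rotated = pose_3d @ R.transpose(1, 2)       # skeleton in the new viewpoint
    synth_2d = rotated[..., :2]                 # "synthetic" 2D pose
    back = lifter(synth_2d) @ R                 # second lift, rotated back
    # Self-consistency in 3D and in 2D against the original estimate.
    return F.mse_loss(back, pose_3d) + F.mse_loss(back[..., :2], pose_2d)
```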
Attentive Single-Tasking of Multiple Tasks
In this work we address task interference in universal networks by
considering that a network is trained on multiple tasks, but performs one task
at a time, an approach we refer to as "single-tasking multiple tasks". The
network thus modifies its behaviour through task-dependent feature adaptation,
or task attention. This gives the network the ability to accentuate the
features that are adapted to a task, while shunning irrelevant ones. We further
reduce task interference by forcing the task gradients to be statistically
indistinguishable through adversarial training, ensuring that the common
backbone architecture serving all tasks is not dominated by any of the
task-specific gradients. Results in three multi-task dense labelling problems
consistently show: (i) a large reduction in the number of parameters while
preserving, or even improving performance and (ii) a smooth trade-off between
computation and multi-task accuracy. We provide our system's code and
pre-trained models at http://vision.ee.ethz.ch/~kmaninis/astmt/. Comment: CVPR 2019 Camera Ready
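A minimal sketch of the task-attention idea follows, using a squeeze-and-excitation style gate per task; this is an illustrative reading of "task-dependent feature adaptation", not the paper's exact modules.

```python
import torch
import torch.nn as nn

class TaskAttention(nn.Module):
    """One channel-gating module per task over shared backbone features."""
    def __init__(self, channels, num_tasks, reduction=4):
        super().__init__()
        self.gates = nn.ModuleList(
            nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                          nn.Linear(channels // reduction, channels), nn.Sigmoid())
            for _ in range(num_tasks))

    def forward(self, features, task_id):
        # features: (B, C, H, W) from the shared backbone; one task at a time.
        pooled = features.mean(dim=(2, 3))           # (B, C) global squeeze
        weights = self.gates[task_id](pooled)        # task-specific channel gate
        # Accentuate channels relevant to this task, shun irrelevant ones.
        return features * weights[:, :, None, None]
```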
Real-Time Human Motion Capture with Multiple Depth Cameras
Commonly used human motion capture systems require intrusive attachment of
markers that are visually tracked with multiple cameras. In this work we
present an efficient and inexpensive solution to markerless motion capture
using only a few Kinect sensors. Unlike previous work on 3D pose estimation
using a single depth camera, we relax constraints on the camera location and do
not assume a co-operative user. We apply recent image segmentation techniques
to depth images and use curriculum learning to train our system on purely
synthetic data. Our method accurately localizes body parts without requiring an
explicit shape model. The body joint locations are then recovered by combining
evidence from multiple views in real-time. We also introduce a dataset of ~6
million synthetic depth frames for pose estimation from multiple cameras and
exceed state-of-the-art results on the Berkeley MHAD dataset. Comment: Accepted to Computer and Robot Vision 2016
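As a rough illustration of combining evidence from multiple views, a confidence-weighted fusion of per-view 3D joint proposals might look like the following; the paper's actual fusion procedure is not reproduced here.

```python
import numpy as np

def fuse_joint(proposals, confidences):
    """proposals: (V, 3) per-view 3D joint estimates; confidences: (V,) scores."""
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()                               # normalize view weights
    return (w[:, None] * np.asarray(proposals)).sum(axis=0)

# Example: three Kinect views voting on one body joint.
views = [[0.51, 1.20, 2.00], [0.49, 1.22, 2.05], [0.55, 1.18, 1.95]]
print(fuse_joint(views, confidences=[0.9, 0.8, 0.3]))
```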
SuperPoint: Self-Supervised Interest Point Detection and Description
This paper presents a self-supervised framework for training interest point
detectors and descriptors suitable for a large number of multiple-view geometry
problems in computer vision. As opposed to patch-based neural networks, our
fully-convolutional model operates on full-sized images and jointly computes
pixel-level interest point locations and associated descriptors in one forward
pass. We introduce Homographic Adaptation, a multi-scale, multi-homography
approach for boosting interest point detection repeatability and performing
cross-domain adaptation (e.g., synthetic-to-real). Our model, when trained on
the MS-COCO generic image dataset using Homographic Adaptation, is able to
repeatedly detect a much richer set of interest points than the initial
pre-adapted deep model and any other traditional corner detector. The final
system gives rise to state-of-the-art homography estimation results on HPatches
when compared to LIFT, SIFT and ORB. Comment: Camera-ready version for the CVPR 2018 Deep Learning for Visual SLAM Workshop (DL4VSLAM 2018)
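Homographic Adaptation can be sketched as: warp the image with random homographies, detect in each warp, un-warp the responses, and aggregate. The following OpenCV/NumPy sketch assumes a generic `detect_heatmap` function and is not SuperPoint's implementation.

```python
import cv2
import numpy as np

def random_homography(h, w, jitter=0.15):
    """Sample a mild random homography by perturbing the image corners."""
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    noise = np.random.uniform(-jitter, jitter, src.shape) * [w, h]
    dst = (src + noise).astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst)

def homographic_adaptation(image, detect_heatmap, num_samples=32):
    """detect_heatmap(img) -> (H, W) float interest-point response map."""
    h, w = image.shape[:2]
    acc = detect_heatmap(image).astype(np.float32)       # identity warp
    for _ in range(num_samples):
        H = random_homography(h, w)
        warped = cv2.warpPerspective(image, H, (w, h))
        heat = detect_heatmap(warped).astype(np.float32)
        # Map responses back into the original frame before aggregating.
        acc += cv2.warpPerspective(heat, np.linalg.inv(H), (w, h))
    return acc / (num_samples + 1)                       # averaged response map
```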