12,247 research outputs found
Vision-based 3D Pose Retrieval and Reconstruction
The people analysis and the understandings of their motions are the key components in many applications like sports sciences, biomechanics, medical rehabilitation, animated movie productions and the game industry. In this context, retrieval and reconstruction of the articulated 3D human poses are considered as the significant sub-elements. In this dissertation, we address the problem of retrieval and reconstruction of the 3D poses from a monocular video or even from a single RGB image. We propose a few data-driven pipelines to retrieve and reconstruct the 3D poses by exploiting the motion capture data as a prior. The main focus of our proposed approaches is to bridge the gap between the separate media of the 3D marker-based recording and the capturing of motions or photographs using a simple RGB camera. In principal, we leverage both media together efficiently for 3D pose estimation. We have shown that our proposed methodologies need not any synchronized 3D-2D pose-image pairs to retrieve and reconstruct the final 3D poses, and are flexible enough to capture motion in any studio-like indoor environment or outdoor natural environment. In first part of the dissertation, we propose model based approaches for full body human motion reconstruction from the video input by employing just 2D joint positions of the four end effectors and the head. We resolve the 3D-2D pose-image cross model correspondence by developing an intermediate container the knowledge base through the motion capture data which contains information about how people move. It includes the 3D normalized pose space and the corresponding synchronized 2D normalized pose space created by utilizing a number of virtual cameras. We first detect and track the features of these five joints from the input motion sequences using SURF, MSER and colorMSER feature detectors, which vote for the possible 2D locations for these joints in the video. The extraction of suitable feature sets from both, the input control signals and the motion capture data, enables us to retrieve the closest instances from the motion capture dataset through employing the fast searching and retrieval techniques. We develop a graphical structure online lazy neighbourhood graph in order to make the similarity search more accurate and robust by deploying the temporal coherence of the input control signals. The retrieved prior poses are exploited further in order to stabilize the feature detection and tracking process. Finally, the 3D motion sequences are reconstructed by a non-linear optimizer that takes into account multiple energy terms. We evaluate our approaches with a series of experiment scenarios designed in terms of performing actors, camera viewpoints and the noisy inputs. Only a little preprocessing is needed by our methods and the reconstruction processes run close to real time. The second part of the dissertation is dedicated to 3D human pose estimation from a monocular single image. First, we propose an efficient 3D pose retrieval strategy which leads towards a novel data driven approach to reconstruct a 3D human pose from a monocular still image. We design and devise multiple feature sets for global similarity search. At runtime, we search for the similar poses from a motion capture dataset in a definite feature space made up of specific joints. We introduce two-fold method for camera estimation, where we exploit the view directions at which we perform sampling of the MoCap dataset as well as the MoCap priors to minimize the projection error. We also benefit from the MoCap priors and the joints' weights in order to learn a low-dimensional local 3D pose model which is constrained further by multiple energies to infer the final 3D human pose. We thoroughly evaluate our approach on synthetically generated examples, the real internet images and the hand-drawn sketches. We achieve state-of-the-arts results when the test and MoCap data are from the same dataset and obtain competitive results when the motion capture data is taken from a different dataset. Second, we propose a dual source approach for 3D pose estimation from a single RGB image. One major challenge for 3D pose estimation from a single RGB image is the acquisition of sufficient training data. In particular, collecting large amounts of training data that contain unconstrained images and are annotated with accurate 3D poses is infeasible. We therefore propose to use two independent training sources. The first source consists of images with annotated 2D poses and the second source consists of accurate 3D motion capture data. To integrate both sources, we propose a dual-source approach that combines 2D pose estimation with efficient and robust 3D pose retrieval. In our experiments, we show that our approach achieves state-of-the-art results and is even competitive when the skeleton structures of the two sources differ substantially. In the last part of the dissertation, we focus on how the different techniques, developed for the human motion capturing, retrieval and reconstruction can be adapted to handle the quadruped motion capture data and which new applications may appear. We discuss some particularities which must be considered during capturing the large animal motions. For retrieval, we derive the suitable feature sets in order to perform fast searches into the MoCap dataset for similar motion segments. At the end, we present a data-driven approach to reconstruct the quadruped motions from the video input data
Understanding the Limitations of CNN-based Absolute Camera Pose Regression
Visual localization is the task of accurate camera pose estimation in a known
scene. It is a key problem in computer vision and robotics, with applications
including self-driving cars, Structure-from-Motion, SLAM, and Mixed Reality.
Traditionally, the localization problem has been tackled using 3D geometry.
Recently, end-to-end approaches based on convolutional neural networks have
become popular. These methods learn to directly regress the camera pose from an
input image. However, they do not achieve the same level of pose accuracy as 3D
structure-based methods. To understand this behavior, we develop a theoretical
model for camera pose regression. We use our model to predict failure cases for
pose regression techniques and verify our predictions through experiments. We
furthermore use our model to show that pose regression is more closely related
to pose approximation via image retrieval than to accurate pose estimation via
3D structure. A key result is that current approaches do not consistently
outperform a handcrafted image retrieval baseline. This clearly shows that
additional research is needed before pose regression algorithms are ready to
compete with structure-based methods.Comment: Initial version of a paper accepted to CVPR 201
An Appearance-Based Framework for 3D Hand Shape Classification and Camera Viewpoint Estimation
An appearance-based framework for 3D hand shape classification and simultaneous camera viewpoint estimation is presented. Given an input image of a segmented hand, the most similar matches from a large database of synthetic hand images are retrieved. The ground truth labels of those matches, containing hand shape and camera viewpoint information, are returned by the system as estimates for the input image. Database retrieval is done hierarchically, by first quickly rejecting the vast majority of all database views, and then ranking the remaining candidates in order of similarity to the input. Four different similarity measures are employed, based on edge location, edge orientation, finger location and geometric moments.National Science Foundation (IIS-9912573, EIA-9809340
Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling
We study 3D shape modeling from a single image and make contributions to it
in three aspects. First, we present Pix3D, a large-scale benchmark of diverse
image-shape pairs with pixel-level 2D-3D alignment. Pix3D has wide applications
in shape-related tasks including reconstruction, retrieval, viewpoint
estimation, etc. Building such a large-scale dataset, however, is highly
challenging; existing datasets either contain only synthetic data, or lack
precise alignment between 2D images and 3D shapes, or only have a small number
of images. Second, we calibrate the evaluation criteria for 3D shape
reconstruction through behavioral studies, and use them to objectively and
systematically benchmark cutting-edge reconstruction algorithms on Pix3D.
Third, we design a novel model that simultaneously performs 3D reconstruction
and pose estimation; our multi-task learning approach achieves state-of-the-art
performance on both tasks.Comment: CVPR 2018. The first two authors contributed equally to this work.
Project page: http://pix3d.csail.mit.ed
Deep Exemplar 2D-3D Detection by Adapting from Real to Rendered Views
This paper presents an end-to-end convolutional neural network (CNN) for
2D-3D exemplar detection. We demonstrate that the ability to adapt the features
of natural images to better align with those of CAD rendered views is critical
to the success of our technique. We show that the adaptation can be learned by
compositing rendered views of textured object models on natural images. Our
approach can be naturally incorporated into a CNN detection pipeline and
extends the accuracy and speed benefits from recent advances in deep learning
to 2D-3D exemplar detection. We applied our method to two tasks: instance
detection, where we evaluated on the IKEA dataset, and object category
detection, where we out-perform Aubry et al. for "chair" detection on a subset
of the Pascal VOC dataset.Comment: To appear in CVPR 201
- …