Learning to Generate and Reconstruct 3D Meshes with only 2D Supervision
We present a unified framework tackling two problems: class-specific 3D
reconstruction from a single image, and generation of new 3D shape samples.
These tasks have received considerable attention recently; however, existing
approaches rely on 3D supervision, annotation of 2D images with keypoints or
poses, and/or training with multiple views of each object instance. Our
framework is very general: it can be trained in similar settings to these
existing approaches, while also supporting weaker supervision scenarios.
Importantly, it can be trained purely from 2D images, without ground-truth pose
annotations, and with a single view per instance. We employ meshes as an output
representation, instead of voxels used in most prior work. This allows us to
exploit shading information during training, which previous 2D-supervised
methods cannot. Thus, our method can learn to generate and reconstruct concave
object classes. We evaluate our approach on synthetic data in various settings,
showing that (i) it learns to disentangle shape from pose; (ii) using shading
in the loss improves performance; (iii) our model is comparable or superior to
state-of-the-art voxel-based approaches on quantitative metrics, while
producing results that are visually more pleasing; (iv) it still performs well
when given supervision weaker than in prior works.
Comment: BMVC 2018 (Oral). Differentiable renderer available at https://github.com/pmh47/dir
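
To make concrete why shading supervision adds signal beyond silhouettes, here is a minimal PyTorch sketch (hypothetical functions, not the released renderer): under a Lambertian model, pixel intensity depends on the surface normal, so a photometric loss carries gradients about geometry even inside the silhouette, where a mask-only loss is flat.

```python
# A minimal sketch (hypothetical functions, not the released renderer)
# of a shading term in the loss.
import torch
import torch.nn.functional as F

def lambertian_shading(normals, light_dir):
    # Per-pixel Lambertian intensity: max(0, n . l).
    light_dir = light_dir / light_dir.norm()
    return torch.clamp((normals * light_dir).sum(dim=-1), min=0.0)

def shading_loss(pred_normals, observed_image, light_dir):
    # L2 photometric loss between the shaded prediction and the image.
    return ((lambertian_shading(pred_normals, light_dir) - observed_image) ** 2).mean()

# Toy usage: a 64x64 normal map standing in for rendered mesh normals.
normals = torch.randn(64, 64, 3, requires_grad=True)
image = torch.rand(64, 64)
loss = shading_loss(F.normalize(normals, dim=-1), image, torch.tensor([0.0, 0.0, 1.0]))
loss.backward()   # gradients flow back to the geometry
```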
Learning to Reconstruct People in Clothing from a Single RGB Camera
We present a learning-based model to infer the personalized 3D shape of people from a few frames (1-8) of a monocular video in which the person is moving, in less than 10 seconds with a reconstruction accuracy of 5mm. Our model learns to predict the parameters of a statistical body model and instance displacements that add clothing and hair to the shape. The model achieves fast and accurate predictions based on two key design choices. First, by predicting shape in a canonical T-pose space, the network learns to encode the images of the person into pose-invariant latent codes, where the information is fused. Second, based on the observation that feed-forward predictions are fast but do not always align with the input images, we predict using both bottom-up and top-down streams (one per view), allowing information to flow in both directions. Learning relies only on synthetic 3D data. Once learned, the model can take a variable number of frames as input, and is able to reconstruct shapes even from a single image with an accuracy of 6mm. Results on 3 different datasets demonstrate the efficacy and accuracy of our approach.
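
As a rough illustration of the fusion idea described above, a minimal sketch follows (module and dimensions are hypothetical, not the authors' network): each frame is encoded independently, the per-frame codes are fused into a single pose-invariant latent, and shape is decoded once in the canonical T-pose as body-model parameters plus per-vertex displacements.

```python
# A hedged sketch of multi-frame fusion in a canonical T-pose space
# (names and sizes hypothetical, not the authors' code).
import torch
import torch.nn as nn

class CanonicalShapeNet(nn.Module):
    def __init__(self, latent_dim=256, n_betas=10, n_verts=6890):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for a CNN image encoder
            nn.Flatten(), nn.Linear(3 * 64 * 64, latent_dim), nn.ReLU())
        self.beta_head = nn.Linear(latent_dim, n_betas)      # body-model shape
        self.disp_head = nn.Linear(latent_dim, n_verts * 3)  # clothing/hair offsets

    def forward(self, frames):           # frames: (F, 3, 64, 64), F in 1..8
        codes = self.encoder(frames)     # one latent code per frame
        fused = codes.mean(dim=0)        # pose-invariant fusion across frames
        betas = self.beta_head(fused)
        disps = self.disp_head(fused).view(-1, 3)
        return betas, disps              # T-pose shape = body(betas) + disps

net = CanonicalShapeNet()
betas, disps = net(torch.randn(5, 3, 64, 64))  # 5 frames of one person
```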
Physics-based Simulation of Continuous-Wave LIDAR for Localization, Calibration and Tracking
Light Detection and Ranging (LIDAR) sensors play an important role in the
perception stack of autonomous robots, supplying mapping and localization
pipelines with depth measurements of the environment. While their accuracy
outperforms other types of depth sensors, such as stereo or time-of-flight
cameras, the accurate modeling of LIDAR sensors requires laborious manual
calibration that typically does not take into account the interaction of laser
light with different surface types, incidence angles and other phenomena that
significantly influence measurements. In this work, we introduce a physically
plausible model of a 2D continuous-wave LIDAR that accounts for the
surface-light interactions and simulates the measurement process in the Hokuyo
URG-04LX LIDAR. Through automatic differentiation, we employ gradient-based
optimization to estimate model parameters from real sensor measurements.
Comment: Published at ICRA 2020
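
The calibration recipe the abstract describes, estimating sensor parameters by gradient descent through a differentiable measurement model, can be illustrated with a toy sketch (the measurement model below is invented for illustration; it is not the paper's simulator):

```python
# A minimal sketch of gradient-based calibration via autodiff,
# using an invented toy measurement model (NOT the paper's simulator).
import torch

# Hypothetical sensor parameters: reflectivity gain and a range bias.
gain = torch.tensor(1.0, requires_grad=True)
bias = torch.tensor(0.0, requires_grad=True)

def simulate(true_range, incidence_cos):
    # Toy model: measured range degrades at grazing incidence angles.
    return gain * true_range / incidence_cos.clamp(min=0.1) + bias

# Stand-in "real" measurements, synthesized here with gain=0.9, bias=0.05.
rng = torch.linspace(0.5, 4.0, 100)
cos = torch.rand(100) * 0.9 + 0.1
observed = 0.9 * rng / cos.clamp(min=0.1) + 0.05

opt = torch.optim.Adam([gain, bias], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    loss = ((simulate(rng, cos) - observed) ** 2).mean()
    loss.backward()
    opt.step()   # gain and bias converge toward the true values
```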
End-to-End Optimization of Scene Layout
We propose an end-to-end variational generative model for scene layout
synthesis conditioned on scene graphs. Unlike unconditional scene layout
generation, we use scene graphs as an abstract but general representation to
guide the synthesis of diverse scene layouts that satisfy relationships
included in the scene graph. This gives rise to more flexible control over the
synthesis process, allowing various forms of inputs such as scene layouts
extracted from sentences or inferred from a single color image. Using our
conditional layout synthesizer, we can generate various layouts that share the
same structure of the input example. In addition to this conditional generation
design, we also integrate a differentiable rendering module that enables layout
refinement using only 2D projections of the scene. Given a depth and a
semantics map, the differentiable rendering module enables optimizing over the
synthesized layout to fit the given input in an analysis-by-synthesis fashion.
Experiments suggest that our model achieves higher accuracy and diversity in
conditional scene synthesis and allows exemplar-based scene generation from
various input forms.
Comment: CVPR 2020 (Oral). Project page: http://3dsln.csail.mit.edu
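
A hedged sketch of the analysis-by-synthesis refinement step follows (the soft "renderer" below is a stand-in invented for illustration, not the paper's differentiable rendering module): layout parameters are optimized so that a differentiable projection matches the given depth map.

```python
# A toy analysis-by-synthesis loop over layout parameters.
import torch

layout = torch.randn(8, 3, requires_grad=True)   # 8 objects: (cx, cy, depth)
target_depth = torch.rand(32, 32)                # given input depth map

def soft_depth(layout, H=32, W=32):
    # Each object contributes a Gaussian blob at its projected center,
    # weighted by its depth; soft blending keeps everything differentiable.
    ys = torch.linspace(-1, 1, H).view(H, 1, 1)
    xs = torch.linspace(-1, 1, W).view(1, W, 1)
    cx, cy, d = layout[:, 0], layout[:, 1], layout[:, 2]
    w = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / 0.1)  # (H, W, 8)
    return (w * d).sum(-1) / (w.sum(-1) + 1e-6)

opt = torch.optim.Adam([layout], lr=0.01)
for _ in range(200):
    opt.zero_grad()
    loss = ((soft_depth(layout) - target_depth) ** 2).mean()
    loss.backward()
    opt.step()   # layout drifts to explain the observed depth
```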
Multi-Garment Net: Learning to Dress 3D People from Images
We present Multi-Garment Network (MGN), a method to predict body shape and clothing, layered on top of the SMPL model, from a few frames (1-8) of a video. Several experiments demonstrate that this representation allows a higher level of control compared to single-mesh or voxel representations of shape. Our model can predict garment geometry, relate it to the body shape, and transfer it to new body shapes and poses. To train MGN, we leverage a digital wardrobe containing 712 digital garments in correspondence, obtained with a novel method to register a set of clothing templates to a dataset of real 3D scans of people in different clothing and poses. Garments from the digital wardrobe, or predicted by MGN, can be used to dress any body shape in arbitrary poses. We will make publicly available the digital wardrobe, the MGN model, and code to dress SMPL with the garments.
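
The layered garment representation can be illustrated with a small sketch (tensors are hypothetical, not the released wardrobe data): storing a garment as per-vertex offsets from the unclothed body makes dressing a new shape a matter of re-applying those offsets.

```python
# A small sketch of garments as a displacement layer over the body
# (hypothetical tensors, not the released MGN code).
import torch

n_verts = 6890                        # SMPL template vertex count
body_a = torch.randn(n_verts, 3)      # unclothed body A, e.g. SMPL(betas_a)
body_b = torch.randn(n_verts, 3)      # new body shape B to be dressed
garment_idx = torch.arange(0, 3000)   # vertices covered by the garment
clothed_a = body_a.clone()
clothed_a[garment_idx] += 0.01        # body A wearing the garment

# The garment as a displacement layer relative to body A ...
offsets = clothed_a[garment_idx] - body_a[garment_idx]
# ... transferred to body B by adding the same offsets.
dressed_b = body_b.clone()
dressed_b[garment_idx] = body_b[garment_idx] + offsets
```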