26,505 research outputs found
Visual Object Networks: Image Generation with Disentangled 3D Representation
Recent progress in deep generative models has led to tremendous breakthroughs
in image generation. However, while existing models can synthesize
photorealistic images, they lack an understanding of our underlying 3D world.
We present a new generative model, Visual Object Networks (VON), synthesizing
natural images of objects with a disentangled 3D representation. Inspired by
classic graphics rendering pipelines, we unravel our image formation process
into three conditionally independent factors---shape, viewpoint, and
texture---and present an end-to-end adversarial learning framework that jointly
models 3D shapes and 2D images. Our model first learns to synthesize 3D shapes
that are indistinguishable from real shapes. It then renders the object's 2.5D
sketches (i.e., silhouette and depth map) from its shape under a sampled
viewpoint. Finally, it learns to add realistic texture to these 2.5D sketches
to generate natural images. The VON not only generates images that are more
realistic than state-of-the-art 2D image synthesis methods, but also enables
many 3D operations such as changing the viewpoint of a generated image, editing
of shape and texture, linear interpolation in texture and shape space, and
transferring appearance across different objects and viewpoints.Comment: NeurIPS 2018. Code: https://github.com/junyanz/VON Website:
http://von.csail.mit.edu
Loss-resilient Coding of Texture and Depth for Free-viewpoint Video Conferencing
Free-viewpoint video conferencing allows a participant to observe the remote
3D scene from any freely chosen viewpoint. An intermediate virtual viewpoint
image is commonly synthesized using two pairs of transmitted texture and depth
maps from two neighboring captured viewpoints via depth-image-based rendering
(DIBR). To maintain high quality of synthesized images, it is imperative to
contain the adverse effects of network packet losses that may arise during
texture and depth video transmission. Towards this end, we develop an
integrated approach that exploits the representation redundancy inherent in the
multiple streamed videos a voxel in the 3D scene visible to two captured views
is sampled and coded twice in the two views. In particular, at the receiver we
first develop an error concealment strategy that adaptively blends
corresponding pixels in the two captured views during DIBR, so that pixels from
the more reliable transmitted view are weighted more heavily. We then couple it
with a sender-side optimization of reference picture selection (RPS) during
real-time video coding, so that blocks containing samples of voxels that are
visible in both views are more error-resiliently coded in one view only, given
adaptive blending will erase errors in the other view. Further, synthesized
view distortion sensitivities to texture versus depth errors are analyzed, so
that relative importance of texture and depth code blocks can be computed for
system-wide RPS optimization. Experimental results show that the proposed
scheme can outperform the use of a traditional feedback channel by up to 0.82
dB on average at 8% packet loss rate, and by as much as 3 dB for particular
frames
3D Shape Reconstruction from Sketches via Multi-view Convolutional Networks
We propose a method for reconstructing 3D shapes from 2D sketches in the form
of line drawings. Our method takes as input a single sketch, or multiple
sketches, and outputs a dense point cloud representing a 3D reconstruction of
the input sketch(es). The point cloud is then converted into a polygon mesh. At
the heart of our method lies a deep, encoder-decoder network. The encoder
converts the sketch into a compact representation encoding shape information.
The decoder converts this representation into depth and normal maps capturing
the underlying surface from several output viewpoints. The multi-view maps are
then consolidated into a 3D point cloud by solving an optimization problem that
fuses depth and normals across all viewpoints. Based on our experiments,
compared to other methods, such as volumetric networks, our architecture offers
several advantages, including more faithful reconstruction, higher output
surface resolution, better preservation of topology and shape structure.Comment: 3DV 2017 (oral
Learning to Synthesize a 4D RGBD Light Field from a Single Image
We present a machine learning algorithm that takes as input a 2D RGB image
and synthesizes a 4D RGBD light field (color and depth of the scene in each ray
direction). For training, we introduce the largest public light field dataset,
consisting of over 3300 plenoptic camera light fields of scenes containing
flowers and plants. Our synthesis pipeline consists of a convolutional neural
network (CNN) that estimates scene geometry, a stage that renders a Lambertian
light field using that geometry, and a second CNN that predicts occluded rays
and non-Lambertian effects. Our algorithm builds on recent view synthesis
methods, but is unique in predicting RGBD for each light field ray and
improving unsupervised single image depth estimation by enforcing consistency
of ray depths that should intersect the same scene point. Please see our
supplementary video at https://youtu.be/yLCvWoQLnmsComment: International Conference on Computer Vision (ICCV) 201
- …