Perspective Plane Program Induction from a Single Image
We study the inverse graphics problem of inferring a holistic representation
for natural images. Given an input image, our goal is to induce a
neuro-symbolic, program-like representation that jointly models camera poses,
object locations, and global scene structures. Such high-level, holistic scene
representations further facilitate low-level image manipulation tasks such as
inpainting. We formulate this problem as jointly finding the camera pose and
scene structure that best describe the input image. The benefits of such joint
inference are two-fold: scene regularity serves as a new cue for perspective
correction, and in turn, accurate perspective correction leads to a simplified
scene structure, similar to how the correct shape leads to the most regular
texture in shape from texture. Our proposed framework, Perspective Plane
Program Induction (P3I), combines search-based and gradient-based algorithms to
efficiently solve the problem. P3I outperforms a set of baselines on a
collection of Internet images, across tasks including camera pose estimation,
global structure inference, and downstream image manipulation.
Comment: CVPR 2020. First two authors contributed equally. Project page: http://p3i.csail.mit.edu
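The joint inference described in the abstract above combines a coarse search with local refinement under a shared regularity objective. The Python sketch below illustrates that two-stage pattern: a grid search over candidate camera rotations scores the regularity of each rectified image, then a local optimizer refines the best candidate. The pose parameterization, the FFT-based regularity score, and all function names are illustrative assumptions, and derivative-free Nelder-Mead stands in for the paper's gradient-based refinement because OpenCV warping is not differentiable.

    # Hybrid search-plus-refinement pose inference, loosely in the spirit of
    # P3I. Everything here (score, grid, parameterization) is a simplifying
    # assumption, not the paper's actual neuro-symbolic formulation.
    import numpy as np
    import cv2
    from scipy.optimize import minimize

    def rectify(image, pitch, roll, f=500.0):
        """Warp a BGR uint8 image by the homography of a camera rotation."""
        h, w = image.shape[:2]
        K = np.array([[f, 0, w / 2], [0, f, h / 2], [0, 0, 1]])
        cp, sp = np.cos(pitch), np.sin(pitch)
        cr, sr = np.cos(roll), np.sin(roll)
        Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])  # pitch
        Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])  # roll
        H = K @ Rz @ Rx @ np.linalg.inv(K)
        return cv2.warpPerspective(image, H, (w, h))

    def regularity_score(image):
        """Proxy for scene regularity: energy concentrated in the strongest
        FFT peaks (repeated structure gives sharp spectral peaks)."""
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY).astype(np.float64)
        spec = np.abs(np.fft.fft2(gray - gray.mean()))
        top = np.sort(spec.ravel())[-32:]       # strongest 32 frequencies
        return top.sum() / (spec.sum() + 1e-8)  # higher = more regular

    def infer_pose(image):
        # Stage 1: coarse grid search over pitch/roll candidates.
        grid = np.deg2rad(np.arange(-30, 31, 10))
        best = max(((p, r) for p in grid for r in grid),
                   key=lambda pr: regularity_score(rectify(image, *pr)))
        # Stage 2: local refinement around the best grid cell.
        loss = lambda x: -regularity_score(rectify(image, x[0], x[1]))
        return minimize(loss, np.array(best), method="Nelder-Mead").x

The design point mirrored from the abstract is that one regularity objective drives both stages, so a more regular rectified image simultaneously pins down the camera pose and simplifies the scene structure.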
Robust Pose Transfer with Dynamic Details using Neural Video Rendering
Pose transfer of human videos aims to generate a high-fidelity video of a
target person imitating actions of a source person. A few studies have made
great progress either through image translation with deep latent features or
neural rendering with explicit 3D features. However, both of them rely on large
amounts of training data to generate realistic results, and the performance
degrades on more accessible internet videos due to insufficient training
frames. In this paper, we demonstrate that dynamic details can be preserved
even when training on short monocular videos. Overall, we propose a neural video
rendering framework coupled with an image-translation-based dynamic details
generation network (D2G-Net), which fully utilizes both the stability of
explicit 3D features and the capacity of learning components. To be specific, a
novel texture representation is presented to encode both the static and
pose-varying appearance characteristics, which is then mapped to the image
space and rendered as a detail-rich frame in the neural rendering stage.
Moreover, we introduce a concise temporal loss in the training stage to
suppress the detail flickering that becomes more visible given the
high-quality dynamic details our method generates. Through extensive
comparisons, we demonstrate that our neural human video renderer achieves
both clearer dynamic details and more robust performance, even on accessible
short videos with only 2k-4k frames.
Comment: Video link: https://www.bilibili.com/video/BV1y64y1C7ge
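The "concise temporal loss" mentioned above is not spelled out in the abstract. One common form of such a loss penalizes frame-to-frame changes in the generated video that are absent from the ground truth, which directly targets flicker; the PyTorch sketch below shows that assumed variant, and the loss D2G-Net actually uses may differ.

    # An assumed temporal-consistency loss: match the temporal gradient of
    # the generated frames to that of the ground-truth frames. This is a
    # common variant, not necessarily the paper's exact formulation.
    import torch

    def temporal_loss(gen_prev, gen_curr, gt_prev, gt_curr):
        """All tensors are (B, C, H, W). Penalizes generated frame-to-frame
        changes that do not occur in the ground truth (i.e., flicker)."""
        gen_delta = gen_curr - gen_prev  # change the renderer produced
        gt_delta = gt_curr - gt_prev     # change that actually happened
        return torch.mean(torch.abs(gen_delta - gt_delta))

    # Typical usage: weight it against the per-frame reconstruction term,
    # e.g. loss = recon + lambda_t * temporal_loss(g0, g1, i0, i1), where
    # lambda_t is a hypothetical hyperparameter.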
3DFill: Reference-guided Image Inpainting by Self-supervised 3D Image Alignment
Most existing image inpainting algorithms are based on a single view and
struggle with large holes or holes containing complicated scenes. Some
reference-guided algorithms fill the hole by referring to an image from
another viewpoint and using 2D image alignment. Because of the camera imaging
process, a simple 2D transformation can hardly achieve a satisfactory result.
In this paper, we
propose 3DFill, a simple and efficient method for reference-guided image
inpainting. Given a target image with arbitrary hole regions and a reference
image from another viewpoint, 3DFill first aligns the two images by a
two-stage method, 3D projection + 2D transformation, which yields better
results than 2D image alignment alone. The 3D projection is an overall alignment between
images and the 2D transformation is a local alignment focused on the hole
region. The entire process of image alignment is self-supervised. We then fill
the hole in the target image with the contents of the aligned image. Finally,
we use a conditional generation network to refine the filled image to obtain
the inpainting result. 3DFill achieves state-of-the-art performance on image
inpainting across a variety of wide view shifts and has a faster inference
speed than other inpainting models.
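To make the two-stage alignment concrete, the sketch below approximates it with classical tools: a feature-based homography stands in for the learned, self-supervised 3D projection (the global alignment), template matching around the hole stands in for the local 2D transformation, and the final conditional refinement network is omitted. All names and thresholds are illustrative.

    # Reference-guided fill in the spirit of 3DFill's global + local
    # alignment. The learned 3D projection and refinement network are
    # replaced by classical stand-ins; this is a sketch, not the method.
    import numpy as np
    import cv2

    def align_and_fill(target, reference, hole_mask):
        """target, reference: HxWx3 uint8; hole_mask: HxW bool (True = hole)."""
        # Global alignment: homography from ORB feature matches (stand-in
        # for the paper's self-supervised 3D projection).
        orb = cv2.ORB_create(2000)
        k1, d1 = orb.detectAndCompute(target, None)
        k2, d2 = orb.detectAndCompute(reference, None)
        matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d2, d1)
        src = np.float32([k2[m.queryIdx].pt for m in matches])
        dst = np.float32([k1[m.trainIdx].pt for m in matches])
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 4.0)
        h, w = target.shape[:2]
        warped = cv2.warpPerspective(reference, H, (w, h))

        # Local alignment focused on the hole: template-match a band around
        # the hole against the globally aligned reference.
        ys, xs = np.where(hole_mask)
        y0, y1 = max(ys.min() - 20, 0), min(ys.max() + 20, h)
        x0, x1 = max(xs.min() - 20, 0), min(xs.max() + 20, w)
        patch = target[y0:y1, x0:x1].copy()
        patch[hole_mask[y0:y1, x0:x1]] = 0  # ignore missing pixels
        res = cv2.matchTemplate(warped, patch, cv2.TM_CCORR_NORMED)
        _, _, _, (mx, my) = cv2.minMaxLoc(res)
        shifted = np.roll(warped, (y0 - my, x0 - mx), axis=(0, 1))

        # Fill the hole with the aligned reference content. A learned
        # refinement pass would normally follow.
        filled = target.copy()
        filled[hole_mask] = shifted[hole_mask]
        return filled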