Structure-Guided Image Completion with Image-level and Object-level Semantic Discriminators
Structure-guided image completion aims to inpaint a local region of an image
according to an input guidance map from users. While such a task enables many
practical applications for interactive editing, existing methods often struggle
to hallucinate realistic object instances in complex natural scenes. Such a
limitation is partially due to the lack of semantic-level constraints inside
the hole region as well as the lack of a mechanism to enforce realistic object
generation. In this work, we propose a learning paradigm that consists of
semantic discriminators and object-level discriminators for improving the
generation of complex semantics and objects. Specifically, the semantic
discriminators leverage pretrained visual features to improve the realism of
the generated visual concepts. Moreover, the object-level discriminators take
aligned instances as inputs to enforce the realism of individual objects. Our
proposed scheme significantly improves the generation quality and achieves
state-of-the-art results on various tasks, including segmentation-guided
completion, edge-guided manipulation and panoptically-guided manipulation on
the Places2 dataset. Furthermore, our trained model is flexible and can support
multiple editing use cases, such as object insertion, replacement, removal and
standard inpainting. In particular, our trained model combined with a novel
automatic image completion pipeline achieves state-of-the-art results on the
standard inpainting task. Comment: 18 pages, 16 figures
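The training objective described above combines a reconstruction term with adversarial terms from the image-level, semantic, and object-level discriminators. A minimal sketch of how such terms might be combined (hinge-style generator losses on scalar discriminator scores; the weights and helper names here are illustrative assumptions, not the paper's actual implementation):

```python
def hinge_g_loss(fake_score):
    # Non-saturating hinge generator loss: push the fake score up.
    return -fake_score

def generator_loss(rec_loss, img_score, sem_score, obj_scores,
                   w_img=1.0, w_sem=0.5, w_obj=0.5):
    """Combine reconstruction with image-level, semantic, and
    object-level adversarial terms (weights are illustrative)."""
    adv_img = hinge_g_loss(img_score)
    adv_sem = hinge_g_loss(sem_score)
    # The object-level discriminator scores one aligned crop per instance.
    adv_obj = sum(hinge_g_loss(s) for s in obj_scores) / max(len(obj_scores), 1)
    return rec_loss + w_img * adv_img + w_sem * adv_sem + w_obj * adv_obj
```

The key design point is that the object-level term averages over aligned instance crops, so each generated object is judged individually rather than only as part of the full image.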
GINA-3D: Learning to Generate Implicit Neural Assets in the Wild
Modeling the 3D world from sensor data for simulation is a scalable way of
developing testing and validation environments for robotic learning problems
such as autonomous driving. However, manually creating or re-creating
real-world-like environments is difficult, expensive, and not scalable. Recent
generative model techniques have shown promising progress to address such
challenges by learning 3D assets using only plentiful 2D images -- but still
suffer limitations as they leverage either human-curated image datasets or
renderings from manually-created synthetic 3D environments. In this paper, we
introduce GINA-3D, a generative model that uses real-world driving data from
camera and LiDAR sensors to create realistic 3D implicit neural assets of
diverse vehicles and pedestrians. Compared to the existing image datasets, the
real-world driving setting poses new challenges due to occlusions,
lighting variations and long-tail distributions. GINA-3D tackles these
challenges by decoupling representation learning and generative modeling into
two stages with a learned tri-plane latent structure, inspired by recent
advances in generative modeling of images. To evaluate our approach, we
construct a large-scale object-centric dataset containing over 520K images of
vehicles and pedestrians from the Waymo Open Dataset, and a new set of 80K
images of long-tail instances such as construction equipment, garbage trucks,
and cable cars. We compare our model with existing approaches and demonstrate
that it achieves state-of-the-art performance in quality and diversity for both
generated images and geometries. Comment: Accepted by CVPR 2023
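A tri-plane latent structure, as used in the two-stage design above, stores features on three axis-aligned 2D grids and queries them by projecting a 3D point onto each plane. A rough sketch with nearest-neighbour lookup (the resolution, feature aggregation by summation, and function names are assumptions for illustration, not GINA-3D's actual design):

```python
import numpy as np

def query_triplane(planes, point, res=32):
    """planes: dict of three (res, res, C) feature grids keyed by
    'xy', 'xz', 'yz'. point: 3D coordinate in [0, 1)^3.
    Returns the summed feature from the three plane projections."""
    x, y, z = (np.clip(int(c * res), 0, res - 1) for c in point)
    # Drop one coordinate per plane and gather that plane's feature.
    return planes['xy'][x, y] + planes['xz'][x, z] + planes['yz'][y, z]
```

Real systems use bilinear interpolation and a learned decoder on top of the gathered feature; the sketch only shows the projection-and-gather step that makes tri-planes cheaper than a dense 3D voxel grid.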
DiffRF: Rendering-Guided 3D Radiance Field Diffusion
We introduce DiffRF, a novel approach for 3D radiance field synthesis based
on denoising diffusion probabilistic models. While existing diffusion-based
methods operate on images, latent codes, or point cloud data, we are the first
to directly generate volumetric radiance fields. To this end, we propose a 3D
denoising model which directly operates on an explicit voxel grid
representation. However, as radiance fields generated from a set of posed
images can be ambiguous and contain artifacts, obtaining ground truth radiance
field samples is non-trivial. We address this challenge by pairing the
denoising formulation with a rendering loss, enabling our model to learn a
deviated prior that favours good image quality instead of trying to replicate
fitting errors like floating artifacts. In contrast to 2D-diffusion models, our
model learns multi-view consistent priors, enabling free-view synthesis and
accurate shape generation. Compared to 3D GANs, our diffusion-based approach
naturally enables conditional generation such as masked completion or
single-view 3D synthesis at inference time. Comment: Project page: https://sirwyver.github.io/DiffRF/ Video:
https://youtu.be/qETBcLu8SUk - CVPR 2023 Highlight - updated evaluations
after fixing an initial data-mapping error on all methods
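Denoising diffusion on an explicit voxel grid follows the standard DDPM forward process, with the rendering loss added on top of the denoising objective. A schematic version (the weighting `w_render` and helper names are placeholders; the paper's exact loss may differ):

```python
import numpy as np

def q_sample(x0, alpha_bar_t, rng):
    """Standard DDPM forward step: noise a clean radiance-field
    voxel grid x0 to timestep t with cumulative signal rate alpha_bar_t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

def diffrf_style_loss(eps, eps_pred, render_err, w_render=0.1):
    # Denoising loss plus a rendering term that favours image quality
    # over exactly replicating fitting artifacts in the training fields.
    denoise = np.mean((eps - eps_pred) ** 2)
    return denoise + w_render * render_err
```

Pairing the two terms is what lets the model learn a "deviated prior": the denoising target alone would reproduce floaters present in the fitted ground-truth fields, while the rendering term penalizes them.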
Unsupervised Learning of Efficient Geometry-Aware Neural Articulated Representations
We propose an unsupervised method for 3D geometry-aware representation
learning of articulated objects. Though photorealistic images of articulated
objects can be rendered with explicit pose control through existing 3D neural
representations, these methods require ground truth 3D pose and foreground
masks for training, which are expensive to obtain. We obviate this need by
learning the representations adversarially. From random poses and latent
vectors, the generator is trained to produce realistic images of articulated
objects. To avoid the large computational cost of GAN
training, we propose an efficient neural representation for articulated objects
based on tri-planes and then present a GAN-based framework for its unsupervised
training. Experiments demonstrate the efficiency of our method and show that
GAN-based training enables learning of controllable 3D representations without
supervision. Comment: 19 pages, project page https://nogu-atsu.github.io/ENARF-GAN
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation,
which unifies pre-trained text-image diffusion and discriminative models to
perform open-vocabulary panoptic segmentation. Text-to-image diffusion models
have shown the remarkable capability of generating high-quality images with
diverse open-vocabulary language descriptions. This demonstrates that their
internal representation space is highly correlated with open concepts in the
real world. Text-image discriminative models like CLIP, on the other hand, are
good at classifying images into open-vocabulary labels. We propose to leverage
the frozen representation of both these models to perform panoptic segmentation
of any category in the wild. Our approach outperforms the previous state of the
art by significant margins on both open-vocabulary panoptic and semantic
segmentation tasks. In particular, with COCO training only, our method achieves
23.4 PQ and 30.0 mIoU on the ADE20K dataset, with 8.3 PQ and 7.9 mIoU absolute
improvement over the previous state-of-the-art. Project page is available at
https://jerryxu.net/ODISE . Comment: CVPR 2023. Project page: https://jerryxu.net/ODISE
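The open-vocabulary step, assigning a mask's pooled feature to whichever label embedding it matches best, CLIP-style, reduces to a cosine-similarity argmax. A minimal stand-in (the real system pools frozen diffusion features per predicted mask and embeds label names with CLIP; here both are just vectors):

```python
import numpy as np

def classify_mask(mask_feat, text_feats, labels):
    """Assign a mask embedding to the open-vocabulary label whose
    text embedding has the highest cosine similarity."""
    m = mask_feat / np.linalg.norm(mask_feat)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return labels[int(np.argmax(t @ m))]
```

Because the label set is only a list of embedded strings, categories can be swapped at inference time without retraining, which is what "any category in the wild" refers to.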
Instance-Aware Image Completion
Image completion is a task that aims to fill in the missing region of a
masked image with plausible contents. However, existing image completion
methods tend to fill in the missing region with the surrounding texture instead
of hallucinating a visual instance that is suitable in accordance with the
context of the scene. In this work, we propose a novel image completion model,
dubbed ImComplete, that hallucinates the missing instance that harmonizes well
with - and thus preserves - the original context. ImComplete first adopts a
transformer architecture that considers the visible instances and the location
of the missing region. Then, ImComplete completes the semantic segmentation
masks within the missing region, providing pixel-level semantic and structural
guidance. Finally, the image synthesis blocks generate photo-realistic content.
We perform a comprehensive evaluation of the results in terms of visual quality
(LPIPS and FID) and contextual preservation scores (CLIPscore and object
detection accuracy) on the COCO-panoptic and Visual Genome datasets. Experimental
results show the superiority of ImComplete on various natural images.
SpaText: Spatio-Textual Representation for Controllable Image Generation
Recent text-to-image diffusion models are able to generate convincing results
of unprecedented quality. However, it is nearly impossible to control the
shapes of different regions/objects or their layout in a fine-grained fashion.
Previous attempts to provide such controls were hindered by their reliance on a
fixed set of labels. To this end, we present SpaText - a new method for
text-to-image generation using open-vocabulary scene control. In addition to a
global text prompt that describes the entire scene, the user provides a
segmentation map where each region of interest is annotated by a free-form
natural language description. Due to the lack of large-scale datasets with a
detailed textual description for each region in an image, we leverage
current large-scale text-to-image datasets and base our approach on a novel
CLIP-based spatio-textual representation, showing its effectiveness
on two state-of-the-art diffusion models: pixel-based and latent-based. In
addition, we show how to extend the classifier-free guidance method in
diffusion models to the multi-conditional case and present an alternative
accelerated inference algorithm. Finally, we offer several automatic evaluation
metrics and use them, in addition to FID scores and a user study, to evaluate
our method and show that it achieves state-of-the-art results on image
generation with free-form textual scene control. Comment: CVPR 2023. Project page available at:
https://omriavrahami.com/spatex
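The multi-conditional extension of classifier-free guidance mentioned above can be written as a sum of per-condition guidance directions added to the unconditional prediction. One common formulation of that idea (the per-condition scales are illustrative; the paper's exact variant may differ):

```python
import numpy as np

def multi_cond_cfg(eps_uncond, eps_conds, scales):
    """Classifier-free guidance with several conditions: start from the
    unconditional noise prediction and add one scaled guidance direction
    (eps_c - eps_uncond) per condition."""
    out = eps_uncond.copy()
    for eps_c, s in zip(eps_conds, scales):
        out += s * (eps_c - eps_uncond)
    return out
```

With a single condition and scale this reduces to ordinary classifier-free guidance; separate scales per condition let the global prompt and the per-region descriptions be weighted independently.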
Weakly Supervised Learning for Multi-Image Synthesis
Machine learning-based approaches have been achieving state-of-the-art results on many computer vision tasks. While deep learning and convolutional networks have become incredibly popular, these approaches require huge amounts of labelled data for training. Manually annotating large amounts of data, often millions of images in a single dataset, is costly and time-consuming. To deal with the problem of data annotation, the research community has been exploring approaches that require smaller amounts of labelled data.
The central problem that we consider in this research is image synthesis without any manual labeling. Image synthesis is a classic computer vision task that requires understanding of image contents and their semantic and geometric properties. We propose that image synthesis models can be trained with weakly supervised learning on image sequences from videos. Large amounts of such unlabelled data are freely available on the internet. We propose to set up the training in a multi-image setting so that we can use one of the images as the target - this allows us to rely only on images for training and removes the need for manual annotations. We demonstrate three main contributions in this work.
First, we present a method of fusing multiple noisy overhead images to make a single, artifact-free image. We present a weakly supervised method that relies on crowd-sourced labels from online maps and a completely unsupervised variant that only requires a series of satellite images as inputs. Second, we propose a single-image novel view synthesis method for complex, outdoor scenes. We propose a learning-based method that uses pairs of nearby images captured on urban roads and their respective GPS coordinates as supervision. We show that a model trained with this automatically captured data can render a new view of a scene that can be as far as 10 meters from the input image. Third, we consider the problem of synthesizing new images of a scene under different conditions, such as time of day and season, based on a single input image. As opposed to existing methods, we do not need manual annotations for transient attributes, such as fog or snow, for training. We train our model by using streams of images captured from outdoor webcams and time-lapse videos.
Through these applications, we show several settings where we can train state-of-the-art deep learning methods without manual annotations. This work focuses on three image synthesis tasks. We propose weakly supervised learning and remove the requirement for manual annotations by relying on sequences of images. Our approach is in line with research efforts that aim to minimize the labels required for training machine learning methods.
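The first contribution, fusing multiple noisy overhead images into a single artifact-free one, can be illustrated by the simplest possible baseline: per-pixel median fusion of aligned images, under which transient content (cars, clouds, shadows) falls away as outliers. The thesis's actual methods are learned and weakly supervised; this sketch only conveys the fusion intuition:

```python
import numpy as np

def fuse_overhead(images):
    """Naive per-pixel median fusion of aligned overhead images.
    Transient artifacts appear in few frames, so the median
    recovers the stable scene content."""
    stack = np.stack(images, axis=0)  # (N, H, W) or (N, H, W, C)
    return np.median(stack, axis=0)
```

A learned model improves on this baseline by handling misalignment, varying lighting, and artifacts that persist across many frames.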