The Right (Angled) Perspective: Improving the Understanding of Road Scenes Using Boosted Inverse Perspective Mapping
Many tasks performed by autonomous vehicles such as road marking detection,
object tracking, and path planning are simpler in bird's-eye view. Hence,
Inverse Perspective Mapping (IPM) is often applied to remove the perspective
effect from a vehicle's front-facing camera and to remap its images into a 2D
domain, resulting in a top-down view. Unfortunately, this leads to
unnatural blurring and stretching of objects at greater distances, owing to the
camera's limited resolution, which restricts its applicability. In this paper, we present an
adversarial learning approach for generating a significantly improved IPM from
a single camera image in real time. The generated bird's-eye-view images
contain sharper features (e.g. road markings) and a more homogeneous
illumination, while (dynamic) objects are automatically removed from the scene,
thus revealing the underlying road layout in an improved fashion. We
demonstrate our framework using real-world data from the Oxford RobotCar
Dataset and show that scene understanding tasks directly benefit from our
boosted IPM approach.
Comment: equal contribution of first two authors, 8 full pages, 6 figures,
accepted at IV 201
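The classical IPM that this abstract takes as its starting point can be sketched as a ray-to-ground intersection. This is a minimal sketch of plain geometric IPM, not the adversarial method the paper proposes. It assumes a pinhole camera over a flat road with no roll, yaw, or lens distortion, and every name and camera parameter below is hypothetical.

```python
import math

def pixel_to_ground(u, v, f, cx, cy, height, pitch):
    """Map pixel (u, v) to a ground point (X forward, Y left), in metres.

    Assumed setup: pinhole camera `height` metres above a flat road,
    pitched down by `pitch` radians. Returns None at or above the horizon.
    """
    a = (u - cx) / f
    b = (v - cy) / f
    s, c = math.sin(pitch), math.cos(pitch)
    denom = b * c + s  # downward component of the viewing ray
    if denom <= 0:
        return None
    t = height / denom  # ray parameter where the ray meets the road
    return (t * (c - b * s), -t * a)

def ground_to_pixel(X, Y, f, cx, cy, height, pitch):
    """Project a ground point back into the image (inverse of the above)."""
    s, c = math.sin(pitch), math.cos(pitch)
    x_c = -Y
    y_c = -s * X + c * height
    z_c = c * X + s * height
    return (cx + f * x_c / z_c, cy + f * y_c / z_c)

def row_spacing(v, **cam):
    """Forward distance on the road covered by one image row at row v."""
    near = pixel_to_ground(cam["cx"], v, **cam)
    far = pixel_to_ground(cam["cx"], v - 1, **cam)
    return far[0] - near[0]

# Hypothetical camera: 800 px focal length, 1280x720 image, 1.5 m high, 10 degree pitch.
CAM = dict(f=800.0, cx=640.0, cy=360.0, height=1.5, pitch=math.radians(10.0))
```

Under these assumptions, `row_spacing(300, **CAM)` is much larger than `row_spacing(700, **CAM)`: one image row near the horizon spans far more road than one row near the vehicle, which is the source of the stretching and blurring at distance that the abstract mentions.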
Scene Graph Generation with External Knowledge and Image Reconstruction
Scene graph generation has received growing attention with the advancements
in image understanding tasks such as object detection, attributes and
relationship prediction, etc. However, existing datasets are biased in terms
of object and relationship labels, or often come with noisy and missing
annotations, which makes the development of a reliable scene graph prediction
model very challenging. In this paper, we propose a novel scene graph
generation algorithm with external knowledge and image reconstruction loss to
overcome these dataset issues. In particular, we extract commonsense knowledge
from the external knowledge base to refine object and phrase features for
improving generalizability in scene graph generation. To address the bias of
noisy object annotations, we introduce an auxiliary image reconstruction path
to regularize the scene graph generation network. Extensive experiments show
that our framework can generate better scene graphs, achieving
state-of-the-art performance on two benchmark datasets: Visual Relationship
Detection and Visual Genome.
Comment: 10 pages, 5 figures, Accepted in CVPR 201
Object-Centric Image Generation from Layouts
Despite recent impressive results on single-object and single-domain image
generation, the generation of complex scenes with multiple objects remains
challenging. In this paper, we start with the idea that a model must be able to
understand individual objects and relationships between objects in order to
generate complex scenes well. Our layout-to-image-generation method, which we
call Object-Centric Generative Adversarial Network (or OC-GAN), relies on a
novel Scene-Graph Similarity Module (SGSM). The SGSM learns representations of
the spatial relationships between objects in the scene, which lead to our
model's improved layout-fidelity. We also propose changes to the conditioning
mechanism of the generator that enhance its object instance-awareness. Apart
from improving image quality, our contributions mitigate two failure modes in
previous approaches: (1) spurious objects being generated without corresponding
bounding boxes in the layout, and (2) overlapping bounding boxes in the layout
leading to merged objects in images. Extensive quantitative evaluation and
ablation studies demonstrate the impact of our contributions, with our model
outperforming previous state-of-the-art approaches on both the COCO-Stuff and
Visual Genome datasets. Finally, we address an important limitation of
evaluation metrics used in previous works by introducing SceneFID -- an
object-centric adaptation of the popular Fréchet Inception Distance metric,
that is better suited for multi-object images.
Comment: AAAI 202
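For context on the metric named above, the Fréchet distance between two Gaussians is ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)). The sketch below computes it under a simplifying assumption of diagonal covariances, where the trace term reduces to a sum of squared differences of standard deviations. The abstract does not spell out how SceneFID is computed, so feeding this function features taken from object crops is an assumption, and the function names are hypothetical.

```python
import math

def mean_and_var(features):
    """Per-dimension mean and (population) variance of a list of vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[i] for f in features) / n for i in range(dim)]
    var = [sum((f[i] - mean[i]) ** 2 for f in features) / n for i in range(dim)]
    return mean, var

def frechet_distance_diag(feats_a, feats_b):
    """Squared Fréchet distance between Gaussians fitted to two feature sets.

    Simplification (assumption): covariances are treated as diagonal, so
    Tr(S_a + S_b - 2 (S_a S_b)^(1/2)) becomes sum((sd_a - sd_b)^2).
    """
    mu_a, var_a = mean_and_var(feats_a)
    mu_b, var_b = mean_and_var(feats_b)
    mean_term = sum((x - y) ** 2 for x, y in zip(mu_a, mu_b))
    cov_term = sum((math.sqrt(x) - math.sqrt(y)) ** 2
                   for x, y in zip(var_a, var_b))
    return mean_term + cov_term

# Toy 2-D "features"; real FID uses Inception activations and full covariances.
real = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
shifted = [[x + 3.0, y] for x, y in real]
```

With these toy inputs the two sets have identical spread, so the distance comes entirely from the mean shift of 3 in one dimension.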
CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graphs
Controllable scene synthesis aims to create interactive environments for
various industrial use cases. Scene graphs provide a highly suitable interface
to facilitate these applications by abstracting the scene context in a compact
manner. Existing methods, reliant on retrieval from extensive databases or
pre-trained shape embeddings, often overlook scene-object and object-object
relationships, leading to inconsistent results due to their limited generation
capacity. To address this issue, we present CommonScenes, a fully generative
model that converts scene graphs into corresponding controllable 3D scenes,
which are semantically realistic and conform to commonsense. Our pipeline
consists of two branches, one predicting the overall scene layout via a
variational auto-encoder and the other generating compatible shapes via latent
diffusion, capturing global scene-object and local inter-object relationships
while preserving shape diversity. The generated scenes can be manipulated by
editing the input scene graph and sampling the noise in the diffusion model.
Because no existing scene graph dataset offers high-quality object-level meshes
with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor
dataset 3D-FRONT with additional scene graph labels. Extensive experiments are
conducted on SG-FRONT where CommonScenes shows clear advantages over other
methods regarding generation consistency, quality, and diversity. Code and the
dataset will be released upon acceptance.
Learning Segmentation Masks with the Independence Prior
An instance with a bad mask might make a composite image that uses it look
fake. This encourages us to learn segmentation by generating realistic
composite images. To achieve this, we propose a novel framework, based on
Generative Adversarial Networks (GANs), that exploits a newly proposed prior
called the independence prior. The generator produces an image with multiple
category-specific instance providers, a layout module and a composition module.
Firstly, each provider independently outputs a category-specific instance image
with a soft mask. Then the provided instances' poses are corrected by the
layout module. Lastly, the composition module combines these instances into a
final image. Trained with an adversarial loss and a penalty on mask area, each
provider learns a mask that is as small as possible but large enough to cover a
complete category-specific instance. Weakly supervised semantic segmentation
methods widely use grouping cues modeling the association between image parts,
which are either artificially designed or learned with costly segmentation
labels or only modeled on local pairs. Unlike them, our method automatically
models the dependence between any parts and learns instance segmentation. We
apply our framework in two cases: (1) foreground segmentation on
category-specific images with box-level annotation, and (2) unsupervised learning
of instance appearances and masks from only one image of a homogeneous object
cluster (HOC). We obtain appealing results in both tasks, which shows that the
independence prior is useful for instance segmentation and that it is possible to
learn instance masks without supervision from only one image.
Comment: 7+5 pages, 13 figures, Accepted to AAAI 201
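The composition step and the mask-area penalty described in this abstract can be illustrated with ordinary alpha compositing. This is a sketch under stated assumptions: the abstract does not give the composition formula or the exact penalty, so standard "over" blending and a mean-mask-value penalty are stand-ins, images are grayscale nested lists, and all names are hypothetical.

```python
def composite(background, instances):
    """Blend (image, soft_mask) pairs over a background, in list order.

    Assumption: standard alpha compositing, out = m * inst + (1 - m) * out,
    with mask values m in [0, 1].
    """
    out = [row[:] for row in background]  # copy so the input is untouched
    for image, mask in instances:
        for y, row in enumerate(out):
            for x in range(len(row)):
                m = mask[y][x]
                row[x] = m * image[y][x] + (1.0 - m) * row[x]
    return out

def mask_area_penalty(mask):
    """Mean mask value: smaller masks give a smaller penalty (assumed form)."""
    total = sum(sum(row) for row in mask)
    return total / (len(mask) * len(mask[0]))

# Toy 2x2 example: one bright instance over a black background.
background = [[0.0, 0.0], [0.0, 0.0]]
instance = [[1.0, 1.0], [1.0, 1.0]]
soft_mask = [[1.0, 0.0], [0.5, 0.0]]
result = composite(background, [(instance, soft_mask)])
```

A penalty of this kind pushes each mask to shrink, while the adversarial loss on the composite would penalize masks that cut off part of an instance, which matches the "as small as possible but large enough" behaviour the abstract describes.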