Analysis and Modeling of 3D Indoor Scenes
We live in a 3D world, performing activities and interacting with objects in
indoor environments every day. Indoor scenes are among the most familiar and
essential environments in everyone's life. In the virtual world, 3D indoor
scenes are likewise ubiquitous in 3D games and interior design. With the rapid
development of VR/AR devices and their emerging applications, the demand for
realistic 3D indoor scenes keeps growing. Currently, designing detailed
3D indoor scenes requires proficient 3D design and modeling skills and is
often time-consuming. For novice users, creating realistic and complex 3D
indoor scenes is even more challenging.
Many efforts have been made in different research communities, e.g., computer
graphics, vision, and robotics, to capture, analyze, and generate 3D indoor
data. This report focuses on recent research progress in graphics on the
geometry, structure, and semantic analysis of 3D indoor data and on different
modeling techniques for creating plausible and realistic indoor scenes. We
first review works on understanding and semantic modeling of scenes from
captured 3D data of the real world. Then, we focus on virtual scenes
composed of 3D CAD models and study methods for 3D scene analysis and
processing. After that, we survey various modeling paradigms for creating 3D
indoor scenes and investigate human-centric scene analysis and modeling, which
bridge indoor scene studies across graphics, vision, and robotics. Finally, we
discuss open problems in indoor scene processing that may be of interest to
graphics and related communities.
SmartAnnotator: An Interactive Tool for Annotating RGBD Indoor Images
RGBD images with high-quality annotations in the form of geometric (i.e.,
segmentation) and structural (i.e., how the segments are mutually related in
3D) information provide valuable priors for a large number of scene and image
manipulation applications. While it is now simple to acquire RGBD images,
annotating them, automatically or manually, remains challenging especially in
cluttered noisy environments. We present SmartAnnotator, an interactive system
to facilitate annotating RGBD images. The system performs the tedious tasks of
grouping pixels, creating potential abstracted cuboids, and inferring object
interactions in 3D, and then proposes various labeling hypotheses. The user simply has
to flip through a list of suggestions for segment labels, finalize a selection,
and the system updates the remaining hypotheses. As objects are finalized, the
process speeds up with fewer ambiguities to resolve. Further, as more scenes
are annotated, the system makes better suggestions based on structural and
geometric priors learned from previous annotation sessions. We test our
system on a large number of database scenes and report significant improvements
over naive low-level annotation tools.
Comment: 10 pages
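The interaction loop can be pictured as follows. This is a minimal sketch of
hypothesis re-ranking under an invented co-occurrence prior with placeholder
scores, not SmartAnnotator's actual model:

```python
# Pairwise prior P(label_a co-occurs with label_b), assumed learned from
# earlier annotation sessions; values here are invented.
COOCCURRENCE = {
    ("table", "chair"): 0.9,
    ("bed", "nightstand"): 0.8,
}

def prior(a, b):
    return COOCCURRENCE.get((a, b), COOCCURRENCE.get((b, a), 0.1))

def rerank(hypotheses, finalized):
    """Boost each remaining segment's label scores by co-occurrence with
    the labels the user has already finalized."""
    out = {}
    for seg, labels in hypotheses.items():
        rescored = [(lab, score * (1.0 + sum(prior(lab, f)
                                             for f in finalized.values())))
                    for lab, score in labels]
        out[seg] = sorted(rescored, key=lambda t: -t[1])
    return out

# Per-segment label hypotheses produced by the automatic grouping stage
# (scores are placeholders).
hypotheses = {"seg1": [("stool", 0.50), ("chair", 0.45)]}
finalized = {}

# The user confirms "table" for another segment; suggestions update.
finalized["seg2"] = "table"
hypotheses = rerank(hypotheses, finalized)
print(hypotheses["seg1"])  # "chair" now outranks "stool"
```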
Configurable 3D Scene Synthesis and 2D Image Rendering with Per-Pixel Ground Truth using Stochastic Grammars
We propose a systematic learning-based approach to the generation of massive
quantities of synthetic 3D scenes and arbitrary numbers of photorealistic 2D
images thereof, with associated ground truth information, for the purposes of
training, benchmarking, and diagnosing learning-based computer vision and
robotics algorithms. In particular, we devise a learning-based pipeline of
algorithms capable of automatically generating and rendering a potentially
infinite variety of indoor scenes by using a stochastic grammar, represented as
an attributed Spatial And-Or Graph, in conjunction with state-of-the-art
physics-based rendering. Our pipeline is capable of synthesizing scene layouts
with high diversity, and it is configurable inasmuch as it enables the precise
customization and control of important attributes of the generated scenes. It
renders photorealistic RGB images of the generated scenes while automatically
synthesizing detailed, per-pixel ground truth data, including visible surface
depth and normals, object identity, and material information (detailed to object
parts), as well as environment information (e.g., illumination and camera viewpoints). We
demonstrate the value of our synthesized dataset, by improving performance in
certain machine-learning-based scene understanding tasks--depth and surface
normal prediction, semantic segmentation, reconstruction, etc.--and by
providing benchmarks for and diagnostics of trained models by modifying object
attributes and scene properties in a controllable manner.
Comment: Accepted in IJCV 2018
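As a rough illustration of grammar-based scene sampling, consider the toy
sketch below; the grammar, probabilities, and attributes are invented and far
simpler than the paper's attributed Spatial And-Or Graph:

```python
import random

# Toy attributed And-Or grammar: "room" is an Or-node (pick one expansion
# by probability); the areas are And-nodes (expand into all children).
GRAMMAR = {
    "room": [(["sleeping_area"], 0.5), (["working_area"], 0.5)],
    "sleeping_area": [(["bed", "nightstand"], 1.0)],
    "working_area": [(["desk", "chair"], 1.0)],
}

def sample(symbol):
    """Recursively expand a symbol; terminal objects get random attributes."""
    if symbol not in GRAMMAR:  # terminal: emit an object with attributes
        return [{"category": symbol,
                 "position": (round(random.uniform(0, 4), 2),
                              round(random.uniform(0, 4), 2)),
                 "rotation_deg": random.choice([0, 90, 180, 270])}]
    expansions = GRAMMAR[symbol]
    children = random.choices([c for c, _ in expansions],
                              weights=[p for _, p in expansions])[0]
    objects = []
    for child in children:
        objects.extend(sample(child))
    return objects

print(sample("room"))  # one sampled scene layout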
SeeThrough: Finding Chairs in Heavily Occluded Indoor Scene Images
Discovering 3D arrangements of objects from single indoor images is important
given its many applications, including interior design and content creation.
Although heavily researched in recent years, existing approaches break down
under medium or heavy occlusion, as the core object detection module starts
failing in the absence of directly visible cues. Instead, we take into account
holistic contextual 3D information, exploiting the fact that objects in indoor
scenes co-occur mostly in typical near-regular configurations. First, we use a
neural network trained on real indoor annotated images to extract 2D keypoints,
and feed them to a 3D candidate object generation stage. Then, we solve a
global selection problem among these 3D candidates using pairwise co-occurrence
statistics discovered from a large 3D scene database. We iterate the process
allowing for candidates with low keypoint response to be incrementally detected
based on the location of the already discovered nearby objects. Focusing on
chairs, we demonstrate significant performance improvement over combinations of
state-of-the-art methods, especially for scenes with moderately to severely
occluded objects.
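To make the global selection step concrete, here is a deliberately simplified
greedy sketch: each 3D candidate carries a keypoint score, and pairwise
co-occurrence compatibility rescues weakly detected candidates near confident
ones. The objective and statistics are invented, and the paper solves the
selection jointly rather than greedily:

```python
import math

def pairwise(a, b):
    """Toy compatibility: chairs in near-regular layouts sit close together,
    but not on top of each other and not isolated far away."""
    d = math.dist(a["pos"], b["pos"])
    return 1.0 if 0.4 < d < 2.0 else -1.0

def select(candidates, threshold=0.5):
    """Greedily add the candidate with the best unary + pairwise gain."""
    chosen = []
    while True:
        best, best_gain = None, threshold
        for c in candidates:
            if c in chosen:
                continue
            gain = c["keypoint_score"] + sum(pairwise(c, s) for s in chosen)
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:
            return chosen
        chosen.append(best)

candidates = [
    {"pos": (0.0, 0.0), "keypoint_score": 0.9},  # clearly visible chair
    {"pos": (0.6, 0.0), "keypoint_score": 0.2},  # occluded, rescued by context
    {"pos": (5.0, 5.0), "keypoint_score": 0.3},  # isolated false positive
]
print(select(candidates))  # keeps the first two, drops the outlier
```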
Automatic Generation of Constrained Furniture Layouts
Efficient authoring of vast virtual environments hinges on algorithms that
are able to automatically generate content while also being controllable. We
propose a method to automatically generate furniture layouts for indoor
environments. Our method is simple, efficient, human-interpretable and amenable
to a wide variety of constraints. We model the composition of rooms into
classes of objects and learn joint (co-occurrence) statistics from a database
of training layouts. We generate new layouts by performing a sequence of
conditional sampling steps, exploiting the statistics learned from the
database. The generated layouts are specified as 3D object models, along with
their positions and orientations. We show they are of equivalent perceived
quality to the training layouts, and compare favorably to a state-of-the-art
method. We incorporate constraints using a general mechanism -- rejection
sampling -- which provides great flexibility at the cost of extra computation.
We demonstrate the versatility of our method by applying a wide variety of
constraints relevant to real-world applications.
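The rejection-sampling mechanism the abstract describes is easy to sketch.
Below, the generator is a random stub standing in for the learned conditional
sampler, and the constraint is a hypothetical example:

```python
import random

def sample_layout():
    """Stand-in for the learned conditional sampler: random placements."""
    return [{"category": c,
             "x": random.uniform(0, 5), "y": random.uniform(0, 5)}
            for c in ("bed", "wardrobe", "desk")]

def satisfies(layout, constraints):
    return all(check(layout) for check in constraints)

# Example user constraint: the bed must touch the left wall (x close to 0).
constraints = [
    lambda L: any(o["category"] == "bed" and o["x"] < 0.5 for o in L),
]

def constrained_sample(max_tries=10000):
    for _ in range(max_tries):  # rejection: flexible, but wastes samples
        layout = sample_layout()
        if satisfies(layout, constraints):
            return layout
    raise RuntimeError("constraints too tight for rejection sampling")

print(constrained_sample())
```

As the abstract notes, this generality comes at the cost of extra computation:
tighter constraints mean more rejected draws before one is accepted.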
Meta-Sim: Learning to Generate Synthetic Datasets
Training models to high-end performance requires the availability of large
labeled datasets, which are expensive to obtain. The goal of our work is to
automatically synthesize labeled datasets that are relevant for a downstream
task. We propose Meta-Sim, which learns a generative model of synthetic scenes
and obtains images, along with their corresponding ground truth, via a graphics
engine. We parametrize our dataset generator with a neural network, which
learns to modify attributes of scene graphs obtained from probabilistic scene
grammars, so as to minimize the distribution gap between its rendered outputs
and target data. If the real dataset comes with a small labeled validation set,
we additionally aim to optimize a meta-objective, i.e. downstream task
performance. Experiments show that the proposed method can greatly improve
content generation quality over a human-engineered probabilistic scene grammar,
both qualitatively and quantitatively as measured by performance on a
downstream task.
Comment: Webpage: https://nv-tlabs.github.io/meta-sim
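A conceptual sketch of the core optimization (assuming PyTorch is available):
a small network edits scene-graph attributes so that statistics of the
rendered features match the target data. The "renderer" here is a toy
differentiable stand-in rather than a graphics engine, and the loss is a
simple moment-matching proxy for the paper's distribution gap:

```python
import torch

torch.manual_seed(0)

# "Scene graphs" reduced to attribute vectors (e.g., positions, sizes).
prior_attrs = torch.rand(256, 4)                 # samples from the grammar
target_feats = torch.randn(256, 8) * 0.5 + 1.0   # features of real data

modifier = torch.nn.Sequential(                  # learns residual edits
    torch.nn.Linear(4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
render = torch.nn.Linear(4, 8)                   # toy fixed "renderer"
for p in render.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(modifier.parameters(), lr=1e-2)

def distribution_gap(x, y):
    """Crude gap measure: match feature means and variances."""
    return ((x.mean(0) - y.mean(0)) ** 2).sum() + \
           ((x.var(0) - y.var(0)) ** 2).sum()

for step in range(200):
    attrs = prior_attrs + modifier(prior_attrs)  # edited scene graphs
    loss = distribution_gap(render(attrs), target_feats)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))  # gap shrinks as attribute edits are learned
```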
Joint Layout Estimation and Global Multi-View Registration for Indoor Reconstruction
In this paper, we propose a novel method to jointly solve scene layout
estimation and global registration problems for accurate indoor 3D
reconstruction. Given a sequence of range data, we first build a set of scene
fragments using KinectFusion and register them through pose graph optimization.
Afterwards, we alternate between layout estimation and layout-based global
registration in an iterative fashion so that the two processes complement each
other. We extract the scene layout through hierarchical agglomerative
clustering and energy-based multi-model fitting, taking noisy measurements
into account. With the estimated scene layout in hand, we register all the range data through
the global iterative closest point algorithm where the positions of 3D points
that belong to the layout such as walls and a ceiling are constrained to be
close to the layout. We experimentally verify the proposed method on publicly
available synthetic and real-world datasets, both quantitatively and
qualitatively.
Comment: Accepted to the 2017 IEEE International Conference on Computer Vision (ICCV)
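To illustrate how a layout constraint can enter the registration objective,
here is a toy 2D NumPy sketch: a point-to-point ICP data term is augmented
with a term pulling points labeled as wall onto an estimated wall line. The
correspondences, wall parameters, and values are fabricated; the paper works
with 3D fragments and a global ICP formulation:

```python
import numpy as np

def rigid(theta, t):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]), np.asarray(t)

# Source scan, its matches in the reference frame, and a wall line x = 0.
src = np.array([[0.1, 0.0], [0.1, 1.0], [1.0, 0.5]])
dst = np.array([[0.0, 0.0], [0.0, 1.0], [0.9, 0.5]])
is_wall = np.array([True, True, False])
wall_normal, wall_offset = np.array([1.0, 0.0]), 0.0

def residuals(params, lam=1.0):
    R, t = rigid(params[0], params[1:])
    moved = src @ R.T + t
    data = (moved - dst).ravel()                         # ICP data term
    layout = moved[is_wall] @ wall_normal - wall_offset  # distance to wall
    return np.concatenate([data, lam * layout])

# Gauss-Newton with a numerical Jacobian (fine for a 3-parameter toy).
params = np.zeros(3)  # [theta, tx, ty]
eps = 1e-6
for _ in range(20):
    r = residuals(params)
    J = np.stack([(residuals(params + eps * e) - r) / eps
                  for e in np.eye(3)], axis=1)
    params -= np.linalg.lstsq(J, r, rcond=None)[0]
print(params)  # pose that fits matches while honoring the wall constraint
```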
Complete 3D Scene Parsing from an RGBD Image
One major goal of vision is to infer physical models of objects, surfaces,
and their layout from sensors. In this paper, we aim to interpret indoor scenes
from one RGBD image. Our representation encodes the layout of orthogonal walls
and the extent of objects, modeled with CAD-like 3D shapes. We parse both the
visible and occluded portions of the scene and all observable objects,
producing a complete 3D parse. Such a scene interpretation is useful for
robotics and visual reasoning, but difficult to produce due to the well-known
challenge of segmentation, the high degree of occlusion, and the diversity of
objects in indoor scenes. We take a data-driven approach, generating sets of
potential object regions, matching to regions in training images, and
transferring and aligning associated 3D models while encouraging fit to
observations and spatial consistency. We use support inference to aid
interpretation and propose a retrieval scheme that uses convolutional neural
networks (CNNs) to classify regions and retrieve objects with similar shapes.
We demonstrate the performance of our method on our newly annotated NYUd v2
dataset with detailed 3D shapes.
Comment: Accepted to the International Journal of Computer Vision (IJCV), 2018.
arXiv admin note: text overlap with arXiv:1504.0243
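The retrieval scheme can be pictured as a nearest-neighbor lookup in an
embedding space. The sketch below uses random placeholder vectors and made-up
model names where the paper uses CNN region features and a real shape
database:

```python
import numpy as np

rng = np.random.default_rng(0)
db_embeddings = rng.normal(size=(1000, 128))         # training-region features
db_models = [f"model_{i:04d}" for i in range(1000)]  # associated 3D shapes

def retrieve(query, k=3):
    """Cosine-similarity top-k lookup over the training database."""
    q = query / np.linalg.norm(query)
    d = db_embeddings / np.linalg.norm(db_embeddings, axis=1, keepdims=True)
    top = np.argsort(d @ q)[::-1][:k]
    return [(db_models[i], float(d[i] @ q)) for i in top]

query_feature = rng.normal(size=128)  # stand-in for a CNN region descriptor
print(retrieve(query_feature))        # candidate shapes to align to the scene
```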
Visualizing Natural Language Descriptions: A Survey
A natural language interface exploits the conceptual simplicity and
naturalness of the language to create a high-level user-friendly communication
channel between humans and machines. One of the promising applications of such
interfaces is generating visual interpretations of the semantic content of a
given natural language text, which can then be visualized either as a static
scene or a dynamic animation. This survey discusses the requirements and
challenges of developing such systems and reports on 26 graphical systems that
exploit natural language interfaces, addressing both artificial intelligence
and visualization aspects. This work serves as a frame of reference for
researchers and aims to enable further advances in the field.
Comment: Due to copyright, most of the figures only appear in the journal version
The Stixel world: A medium-level representation of traffic scenes
Recent progress in advanced driver assistance systems and the race towards
autonomous vehicles is mainly driven by two factors: (1) increasingly
sophisticated algorithms that interpret the environment around the vehicle and
react accordingly, and (2) the continuous improvements of sensor technology
itself. In terms of cameras, these improvements typically include higher
spatial resolution, which as a consequence requires more data to be processed.
The trend toward adding multiple cameras to cover the entire surroundings of
the vehicle only compounds this issue. At the same time, an increasing number
of special purpose algorithms need access to the sensor input data to correctly
interpret the various complex situations that can occur, particularly in urban
traffic.
These trends make it clear that a key challenge for vision
architectures in intelligent vehicles is to share computational resources. We
believe this challenge should be faced by introducing a representation of the
sensory data that provides compressed and structured access to all relevant
visual content of the scene. The Stixel World discussed in this paper is such a
representation. It is a medium-level model of the environment that is
specifically designed to compress information about obstacles by leveraging the
typical layout of outdoor traffic scenes. It has proven useful for a multitude
of automotive vision applications, including object detection, tracking,
segmentation, and mapping.
In this paper, we summarize the ideas behind the model and generalize it to
take into account multiple dense input streams: the image itself, stereo depth
maps, and semantic class probability maps that can be generated, e.g., by CNNs.
Our generalization is embedded into a novel mathematical formulation for the
Stixel model. We further sketch how the free parameters of the model can be
learned using structured SVMs.
Comment: Accepted for publication in Image and Vision Computing
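To give a flavor of the column-wise inference behind such a model, here is a
heavily simplified single-column sketch: dynamic programming segments a
disparity column into piecewise-constant pieces, trading data fit against a
per-segment cost. The actual Stixel model additionally distinguishes ground,
object, and sky classes, uses priors, and fuses the semantic channel:

```python
def segment_column(disparities, seg_cost=2.0):
    """Optimal piecewise-constant segmentation of one disparity column."""
    n = len(disparities)
    prefix, prefix_sq = [0.0], [0.0]
    for d in disparities:                      # prefix sums for O(1) SSE
        prefix.append(prefix[-1] + d)
        prefix_sq.append(prefix_sq[-1] + d * d)

    def sse(i, j):  # fit error of a constant segment over [i, j)
        s, sq, m = prefix[j] - prefix[i], prefix_sq[j] - prefix_sq[i], j - i
        return sq - s * s / m

    best = [0.0] + [float("inf")] * n          # best[j]: cost of first j rows
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):                     # choose last segment [i, j)
            c = best[i] + sse(i, j) + seg_cost
            if c < best[j]:
                best[j], cut[j] = c, i

    segments, j = [], n
    while j > 0:                               # backtrack the boundaries
        i = cut[j]
        segments.append((i, j, (prefix[j] - prefix[i]) / (j - i)))
        j = i
    return segments[::-1]

# Column with a near obstacle (high disparity) between distant regions.
column = [1, 1, 1, 1, 8, 8, 8, 2, 2, 2]
print(segment_column(column))  # [(start, end, mean disparity), ...]
```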