2,968 research outputs found
Stylistic scene enhancement GAN: Mixed stylistic enhancement generation for 3D indoor scenes
In this paper, we present stylistic scene enhancement GAN, SSE-GAN, a conditional Wasserstein GAN-based approach to automatic generation of mixed stylistic enhancements for 3D indoor scenes. An enhancement indicates factors that can influence the style of an indoor scene such as furniture colors and occurrence of small objects. To facilitate network training, we propose a novel enhancement feature encoding method, which represents an enhancement by a multi-one-hot vector, and effectively accommodates different enhancement factors. A Gumbel-Softmax module is introduced in the generator network to enable the generation of high fidelity enhancement features that can better confuse the discriminator. Experiments show that our approach is superior to the other baseline methods and successfully models the relationship between the style distribution and scene enhancements. Thus, although only trained with a dataset of room images in single styles, the trained generator can generate mixed stylistic enhancements by specifying multiple styles as the condition. Our approach is the first to apply a Gumbel-Softmax module in conditional Wasserstein GANs, as well as the first to explore the application of GAN-based models in the scene enhancement field
Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation
Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss the unresolved issues and provide
several potential avenues for future research
Deep Bilateral Learning for Real-Time Image Enhancement
Performance is a critical challenge in mobile image processing. Given a
reference imaging pipeline, or even human-adjusted pairs of images, we seek to
reproduce the enhancements and enable real-time evaluation. For this, we
introduce a new neural network architecture inspired by bilateral grid
processing and local affine color transforms. Using pairs of input/output
images, we train a convolutional neural network to predict the coefficients of
a locally-affine model in bilateral space. Our architecture learns to make
local, global, and content-dependent decisions to approximate the desired image
transformation. At runtime, the neural network consumes a low-resolution
version of the input image, produces a set of affine transformations in
bilateral space, upsamples those transformations in an edge-preserving fashion
using a new slicing node, and then applies those upsampled transformations to
the full-resolution image. Our algorithm processes high-resolution images on a
smartphone in milliseconds, provides a real-time viewfinder at 1080p
resolution, and matches the quality of state-of-the-art approximation
techniques on a large class of image operators. Unlike previous work, our model
is trained off-line from data and therefore does not require access to the
original operator at runtime. This allows our model to learn complex,
scene-dependent transformations for which no reference implementation is
available, such as the photographic edits of a human retoucher.Comment: 12 pages, 14 figures, Siggraph 201
Towards Learning Low-Light Indoor Semantic Segmentation with Illumination-Invariant Features
Semantic segmentation models are often affected by illumination changes, and fail to predict correct labels. Although there has been a lot of research on indoor semantic segmentation, it has not been studied in low-light environments. In this paper we propose a new framework, LISU, for Low-light Indoor Scene Understanding. We first decompose the low-light images into reflectance and illumination components, and then jointly learn reflectance restoration and semantic segmentation. To train and evaluate the proposed framework, we propose a new data set, namely LLRGBD, which consists of a large synthetic low-light indoor data set (LLRGBD-synthetic) and a small real data set (LLRGBD-real). The experimental results show that the illumination-invariant features effectively improve the performance of semantic segmentation. Compared with the baseline model, the mIoU of the proposed LISU framework has increased by 11.5%. In addition, pre-training on our synthetic data set increases the mIoU by 7.2%. Our data sets and models are available on our project website
Sparse-to-Continuous: Enhancing Monocular Depth Estimation using Occupancy Maps
This paper addresses the problem of single image depth estimation (SIDE),
focusing on improving the quality of deep neural network predictions. In a
supervised learning scenario, the quality of predictions is intrinsically
related to the training labels, which guide the optimization process. For
indoor scenes, structured-light-based depth sensors (e.g. Kinect) are able to
provide dense, albeit short-range, depth maps. On the other hand, for outdoor
scenes, LiDARs are considered the standard sensor, which comparatively provides
much sparser measurements, especially in areas further away. Rather than
modifying the neural network architecture to deal with sparse depth maps, this
article introduces a novel densification method for depth maps, using the
Hilbert Maps framework. A continuous occupancy map is produced based on 3D
points from LiDAR scans, and the resulting reconstructed surface is projected
into a 2D depth map with arbitrary resolution. Experiments conducted with
various subsets of the KITTI dataset show a significant improvement produced by
the proposed Sparse-to-Continuous technique, without the introduction of extra
information into the training stage.Comment: Accepted. (c) 2019 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other work
- …