Tempered Adversarial Networks
Generative adversarial networks (GANs) have been shown to produce realistic
samples from high-dimensional distributions, but training them is considered
hard. A possible explanation for training instabilities is the inherent
imbalance between the networks: While the discriminator is trained directly on
both real and fake samples, the generator only has control over the fake
samples it produces since the real data distribution is fixed by the choice of
a given dataset. We propose a simple modification that gives the generator
control over the real samples which leads to a tempered learning process for
both generator and discriminator. The real data distribution passes through a
lens before being revealed to the discriminator, balancing the generator and
discriminator by gradually revealing more detailed features necessary to
produce high-quality results. The proposed module automatically adjusts the
learning process to the current strength of the networks, yet is generic and
easy to add to any GAN variant. In a number of experiments, we show that this
can improve quality, stability and/or convergence speed across a range of
different GAN architectures (DCGAN, LSGAN, WGAN-GP).
Comment: accepted to ICML 2018
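The abstract describes placing a learned "lens" between the real data and the discriminator so that real samples are only gradually revealed in full detail. The snippet below is a minimal sketch of that wiring; the `Lens` architecture and the residual formulation are illustrative assumptions, not the paper's exact module or schedule.

```python
import torch
import torch.nn as nn

class Lens(nn.Module):
    """Illustrative lens applied to real samples before the discriminator
    sees them (a hypothetical architecture, not the paper's exact module)."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, channels, 3, padding=1),
        )

    def forward(self, real: torch.Tensor) -> torch.Tensor:
        # The lens perturbs the real data; driving this perturbation towards
        # zero over training gradually reveals the full detail of the dataset.
        return real + self.net(real)

# Hypothetical use inside a GAN training step: the discriminator is never
# shown raw real samples, only lensed ones.
lens = Lens()
real = torch.randn(4, 3, 32, 32)
lensed_real = lens(real)  # fed to the discriminator in place of `real`
```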
Frame-Recurrent Video Super-Resolution
Recent advances in video super-resolution have shown that convolutional
neural networks combined with motion compensation are able to merge information
from multiple low-resolution (LR) frames to generate high-quality images.
Current state-of-the-art methods process a batch of LR frames to generate a
single high-resolution (HR) frame and run this scheme in a sliding window
fashion over the entire video, effectively treating the problem as a large
number of separate multi-frame super-resolution tasks. This approach has two
main weaknesses: 1) Each input frame is processed and warped multiple times,
increasing the computational cost, and 2) each output frame is estimated
independently conditioned on the input frames, limiting the system's ability to
produce temporally consistent results.
In this work, we propose an end-to-end trainable frame-recurrent video
super-resolution framework that uses the previously inferred HR estimate to
super-resolve the subsequent frame. This naturally encourages temporally
consistent results and reduces the computational cost by warping only one image
in each step. Furthermore, due to its recurrent nature, the proposed method has
the ability to assimilate a large number of previous frames without increased
computational demands. Extensive evaluations and comparisons with previous
methods validate the strengths of our approach and demonstrate that the
proposed framework is able to significantly outperform the current state of the
art.
Comment: Accepted at CVPR 2018
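The core recurrence described above, warping only the previously inferred HR estimate and fusing it with the current LR frame, can be sketched as follows. This is a simplified reading of the abstract: `flow_net` and `sr_net` are placeholder callables, and the flow upsampling details are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp img (N, C, H, W) with a dense flow field (N, 2, H, W)."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().expand(n, -1, -1, -1)
    coords = base + flow
    gx = 2 * coords[:, 0] / (w - 1) - 1          # normalize to [-1, 1]
    gy = 2 * coords[:, 1] / (h - 1) - 1
    return F.grid_sample(img, torch.stack((gx, gy), dim=-1), align_corners=True)

def frame_recurrent_step(lr_t, lr_prev, hr_prev, flow_net, sr_net, scale=4):
    """One recurrent step: estimate LR flow, upsample it, warp the previous HR
    estimate once, and fuse it with the current LR frame. `flow_net` and
    `sr_net` stand in for trained networks (hypothetical interfaces)."""
    lr_flow = flow_net(lr_t, lr_prev)                                  # (N, 2, h, w)
    hr_flow = F.interpolate(lr_flow, scale_factor=scale,
                            mode="bilinear", align_corners=False) * scale
    hr_warped = warp(hr_prev, hr_flow)           # the only warp in this step
    return sr_net(lr_t, hr_warped)               # next HR estimate, reused at t+1
```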
RePAST: Relative Pose Attention Scene Representation Transformer
The Scene Representation Transformer (SRT) is a recent method to render novel
views at interactive rates. Since SRT uses camera poses with respect to an
arbitrarily chosen reference camera, it is not invariant to the order of the
input views. As a result, SRT is not directly applicable to large-scale scenes
where the reference frame would need to be changed regularly. In this work, we
propose Relative Pose Attention SRT (RePAST): Instead of fixing a reference
frame at the input, we inject pairwise relative camera pose information
directly into the attention mechanism of the Transformers. This leads to a
model that is by definition invariant to the choice of any global reference
frame, while still retaining the full capabilities of the original method.
Empirical results show that adding this invariance to the model does not lead
to a loss in quality. We believe that this is a step towards applying fully
latent transformer-based rendering methods to large-scale scenes.
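A minimal sketch of the idea of injecting pairwise relative poses into attention follows. The exact injection point inside SRT's transformer is not specified here; adding the relative-pose embedding to the keys and the linear `pose_mlp` are assumptions for illustration.

```python
import torch
import torch.nn as nn

def relative_pose_attention(q, k, v, cam_q, cam_k, pose_mlp):
    """Attention in which each query/key pair sees only the *relative* pose
    between their cameras, so no global reference frame is needed.

    q, k, v:      (Nq, D), (Nk, D), (Nk, D) token features
    cam_q, cam_k: (Nq, 4, 4), (Nk, 4, 4) camera-to-world matrices
    pose_mlp:     maps a flattened 4x4 relative pose to a D-dim embedding
    """
    nq, nk, d = q.shape[0], k.shape[0], q.shape[1]
    # Pose of each key camera expressed in each query camera's frame.
    rel = torch.linalg.inv(cam_q)[:, None] @ cam_k[None]           # (Nq, Nk, 4, 4)
    rel_emb = pose_mlp(rel.reshape(nq, nk, 16))                    # (Nq, Nk, D)
    # Keys are augmented per query with the relative pose embedding; the
    # output is unchanged if all cameras are moved by a common rigid transform.
    logits = (q[:, None] * (k[None] + rel_emb)).sum(-1) / d ** 0.5
    return logits.softmax(dim=-1) @ v

# Tiny usage example with random data and a linear pose embedding.
D, Nq, Nk = 32, 5, 7
out = relative_pose_attention(
    torch.randn(Nq, D), torch.randn(Nk, D), torch.randn(Nk, D),
    torch.eye(4).expand(Nq, 4, 4), torch.eye(4).expand(Nk, 4, 4),
    nn.Linear(16, D),
)
```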
Sensitivity of Slot-Based Object-Centric Models to their Number of Slots
Self-supervised methods for learning object-centric representations have
recently been applied successfully to various datasets. This progress is
largely fueled by slot-based methods, whose ability to cluster visual scenes
into meaningful objects holds great promise for compositional generalization
and downstream learning. In these methods, the number of slots (clusters) K
is typically chosen to match the number of ground-truth objects in the data,
even though this quantity is unknown in real-world settings. Indeed, the
sensitivity of slot-based methods to K, and how this affects their learned
correspondence to objects in the data, has largely been ignored in the
literature. In this work, we address this issue through a systematic study of
slot-based methods. We propose using analogs to precision and recall based on
the Adjusted Rand Index to accurately quantify model behavior over a large
range of K. We find that, especially during training, incorrect choices of K
do not yield the desired object decomposition and, in fact, cause
substantial oversegmentation or merging of separate objects
(undersegmentation). We demonstrate that the choice of the objective function
and incorporating instance-level annotations can moderately mitigate this
behavior while still falling short of fully resolving this issue. Indeed, we
show how this issue persists across multiple methods and datasets and stress
its importance for future slot-based models.
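The study's precision- and recall-style analogs are built on the Adjusted Rand Index; their exact definitions are not restated in the abstract, so the sketch below only shows the kind of underlying measurement involved: a plain ARI between per-pixel slot assignments and ground-truth instance masks, evaluated at different slot counts K. The toy data and the splitting scheme are purely illustrative.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def slot_ari(true_instances: np.ndarray, slot_assignments: np.ndarray) -> float:
    """ARI between a ground-truth instance map and per-pixel slot assignments.
    Both arrays contain integer labels and share the same spatial shape."""
    return adjusted_rand_score(true_instances.ravel(), slot_assignments.ravel())

# Toy example: a scene with 3 ground-truth objects, evaluated at two slot counts.
rng = np.random.default_rng(0)
gt = rng.integers(0, 3, size=(32, 32))
pred_k3 = gt.copy()                                   # perfect decomposition
pred_k6 = gt * 2 + rng.integers(0, 2, size=(32, 32))  # each object split in two
print(slot_ari(gt, pred_k3))  # 1.0
print(slot_ari(gt, pred_k6))  # < 1.0: oversegmentation is penalized
```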
DyST: Towards Dynamic Neural Scene Representations on Real-World Videos
Visual understanding of the world goes beyond the semantics and flat
structure of individual images. In this work, we aim to capture both the 3D
structure and dynamics of real-world scenes from monocular real-world videos.
Our Dynamic Scene Transformer (DyST) model leverages recent work in neural
scene representation to learn a latent decomposition of monocular real-world
videos into scene content, per-view scene dynamics, and camera pose. This
separation is achieved through a novel co-training scheme on monocular videos
and our new synthetic dataset DySO. DyST learns tangible latent representations
for dynamic scenes that enable view generation with separate control over the
camera and the content of the scene.
Comment: Project website: https://dyst-paper.github.io
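The described decomposition into scene content, per-view dynamics, and camera pose can be pictured with the toy sketch below. Every encoder, dimension, and the decoder interface here is an assumption made for illustration; it only mirrors the structure of the decomposition, not the DyST architecture.

```python
import torch
import torch.nn as nn

class LatentDecompositionSketch(nn.Module):
    """Illustrative only: encoders, dimensions, and the decoder interface are
    assumptions, not the DyST model."""
    def __init__(self, feat_dim: int = 64, latent_dim: int = 16):
        super().__init__()
        self.content_enc = nn.Linear(feat_dim, latent_dim)   # whole-scene content
        self.dynamics_enc = nn.Linear(feat_dim, latent_dim)  # per-view dynamics
        self.pose_enc = nn.Linear(feat_dim, latent_dim)      # per-view camera pose
        self.decoder = nn.Linear(3 * latent_dim, 3)          # renders a colour

    def forward(self, scene_feats, view_feats):
        content = self.content_enc(scene_feats.mean(dim=1))   # (B, latent)
        dynamics = self.dynamics_enc(view_feats)              # (B, V, latent)
        pose = self.pose_enc(view_feats)                      # (B, V, latent)
        # Swapping `pose` or `dynamics` across views gives separate control
        # over the camera and the scene state at generation time.
        z = torch.cat([content[:, None].expand_as(pose), dynamics, pose], dim=-1)
        return self.decoder(z)                                # (B, V, 3)

model = LatentDecompositionSketch()
out = model(torch.randn(2, 4, 64), torch.randn(2, 4, 64))     # (2, 4, 3)
```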
DORSal: Diffusion for Object-centric Representations of Scenes et al.
Recent progress in 3D scene understanding enables scalable learning of
representations across large datasets of diverse scenes. As a consequence,
generalization to unseen scenes and objects, rendering novel views from just a
single or a handful of input images, and controllable scene generation that
supports editing are now possible. However, training jointly on a large number
of scenes typically compromises rendering quality when compared to single-scene
optimized models such as NeRFs. In this paper, we leverage recent progress in
diffusion models to equip 3D scene representation learning models with the
ability to render high-fidelity novel views, while retaining benefits such as
object-level scene editing to a large degree. In particular, we propose DORSal,
which adapts a video diffusion architecture for 3D scene generation conditioned
on frozen object-centric slot-based representations of scenes. On both complex
synthetic multi-object scenes and on the real-world large-scale Street View
dataset, we show that DORSal enables scalable neural rendering of 3D scenes
with object-level editing and improves upon existing approaches.
Comment: Project page: https://www.sjoerdvansteenkiste.com/dorsa
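The key mechanism named in the abstract is conditioning a diffusion model on frozen object-centric slot representations. The sketch below shows one common way such conditioning can be wired, via cross-attention from image tokens to slots; the patchify/unpatchify layers, dimensions, and the cross-attention placement are assumptions, not DORSal's actual architecture.

```python
import torch
import torch.nn as nn

class SlotConditionedDenoiser(nn.Module):
    """Illustrative diffusion denoiser conditioned on frozen slot
    representations via cross-attention (architecture details are assumed)."""
    def __init__(self, dim: int = 64, n_heads: int = 4):
        super().__init__()
        self.to_tokens = nn.Conv2d(3, dim, 4, stride=4)          # patchify noisy view
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.to_image = nn.ConvTranspose2d(dim, 3, 4, stride=4)  # predict noise

    def forward(self, noisy_view, slots):
        # slots: (B, num_slots, dim) from a *frozen* object-centric encoder.
        tokens = self.to_tokens(noisy_view).flatten(2).transpose(1, 2)  # (B, T, dim)
        tokens, _ = self.cross_attn(tokens, slots, slots)
        b, t, d = tokens.shape
        h = w = int(t ** 0.5)
        return self.to_image(tokens.transpose(1, 2).reshape(b, d, h, w))

denoiser = SlotConditionedDenoiser()
noisy = torch.randn(2, 3, 32, 32)
slots = torch.randn(2, 7, 64)          # frozen slot representation of the scene
eps_pred = denoiser(noisy, slots)      # (2, 3, 32, 32) predicted noise
```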
RUST: Latent Neural Scene Representations from Unposed Imagery
Inferring the structure of 3D scenes from 2D observations is a fundamental
challenge in computer vision. Recently popularized approaches based on neural
scene representations have achieved tremendous impact and have been applied
across a variety of applications. One of the major remaining challenges in this
space is training a single model which can provide latent representations which
effectively generalize beyond a single scene. Scene Representation Transformer
(SRT) has shown promise in this direction, but scaling it to a larger set of
diverse scenes is challenging and necessitates accurately posed ground truth
data. To address this problem, we propose RUST (Really Unposed Scene
representation Transformer), a pose-free approach to novel view synthesis
trained on RGB images alone. Our main insight is that one can train a Pose
Encoder that peeks at the target image and learns a latent pose embedding which
is used by the decoder for view synthesis. We perform an empirical
investigation into the learned latent pose structure and show that it allows
meaningful test-time camera transformations and accurate explicit pose
readouts. Perhaps surprisingly, RUST achieves similar quality as methods which
have access to perfect camera pose, thereby unlocking the potential for
large-scale training of amortized neural scene representations.
Comment: CVPR 2023 Highlight. Project website: https://rust-paper.github.io
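The main insight, a Pose Encoder that peeks at the target view and produces a latent pose embedding consumed by the decoder, can be sketched as below. The backbone, the pose dimensionality, and the choice to peek at half of the target image are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PoseEncoderSketch(nn.Module):
    """Illustrative Pose Encoder: it peeks at (part of) the target view and
    emits a low-dimensional latent pose (dimensions are assumptions)."""
    def __init__(self, pose_dim: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, pose_dim),
        )

    def forward(self, target_view: torch.Tensor) -> torch.Tensor:
        return self.backbone(target_view)         # (B, pose_dim) latent pose

# Hypothetical training-time usage: no ground-truth camera poses are needed.
pose_enc = PoseEncoderSketch()
target = torch.randn(2, 3, 64, 64)
latent_pose = pose_enc(target[..., :32])          # peek at part of the target
# A decoder (not shown) would render the full target from the scene
# representation and `latent_pose`; the reconstruction loss supervises both.
```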