In-Place Activated BatchNorm for Memory-Optimized Training of DNNs
In this work we present In-Place Activated Batch Normalization (InPlace-ABN)
- a novel approach to drastically reduce the training memory footprint of
modern deep neural networks in a computationally efficient way. Our solution
replaces the conventional succession of BatchNorm + Activation layers with a
single plugin layer, hence avoiding invasive framework surgery while remaining
straightforward to apply in existing deep learning frameworks.
We obtain memory savings of up to 50% by dropping intermediate results and by
recovering required information during the backward pass through the inversion
of stored forward results, with only a minor increase (0.8-2%) in computation
time. We also demonstrate how frequently used checkpointing approaches can be
made computationally as efficient as InPlace-ABN. In our experiments on image
classification, we demonstrate results on ImageNet-1k on par with
state-of-the-art approaches. On the memory-demanding task of semantic
segmentation, we report results for COCO-Stuff, Cityscapes and Mapillary
Vistas, obtaining new state-of-the-art results on the latter without additional
training data, in a single-scale, single-model setting. Code is available at
https://github.com/mapillary/inplace_abn
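To make the inversion idea concrete, below is a minimal sketch (not the
paper's actual implementation, which is in the repository above): batch
normalization is simplified to a fixed-statistics affine transform over (N, C)
inputs, only the layer output z is stored, and the backward pass inverts the
leaky ReLU and the affine transform to recover the quantities it needs. The
class name and the simplifications are ours.

    import torch

    class InPlaceABNSketch(torch.autograd.Function):
        # Simplified BN (fixed statistics) + leaky ReLU storing only the
        # output z; backward inverts z instead of keeping x and y around.
        SLOPE, EPS = 0.01, 1e-5

        @staticmethod
        def forward(ctx, x, gamma, beta, mean, var):
            inv_std = (var + InPlaceABNSketch.EPS).rsqrt()
            y = (x - mean) * inv_std * gamma + beta                # affine BN
            z = torch.where(y > 0, y, InPlaceABNSketch.SLOPE * y)  # leaky ReLU
            ctx.save_for_backward(z, gamma, beta, inv_std)
            return z

        @staticmethod
        def backward(ctx, grad_z):
            z, gamma, beta, inv_std = ctx.saved_tensors
            s = InPlaceABNSketch.SLOPE
            pos = z > 0
            y = torch.where(pos, z, z / s)           # invert the activation
            grad_y = torch.where(pos, grad_z, grad_z * s)
            x_hat = (y - beta) / gamma               # recover normalized input
            grad_gamma = (grad_y * x_hat).sum(0)     # assumes gamma has no zeros
            grad_beta = grad_y.sum(0)
            grad_x = grad_y * gamma * inv_std        # fixed-statistics BN grad
            return grad_x, grad_gamma, grad_beta, None, None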
Online Learning with Bayesian Classification Trees
Randomized classification trees are among the most popular machine learning tools and have found successful applications in many areas. Although this classifier was originally designed as an offline learning algorithm, there has been increased interest in recent years in providing an online variant. In this paper, we propose an online learning algorithm for classification trees that adheres to Bayesian principles. In contrast to state-of-the-art approaches that produce large forests of complex trees, we aim at constructing small ensembles of shallow trees with high generalization capability. Experiments on benchmark machine learning and body part recognition datasets show superior performance over state-of-the-art approaches.
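To illustrate the Bayesian, one-sample-at-a-time flavor of such a method, here
is a hypothetical sketch of an online Dirichlet-multinomial update at a single
leaf (names and model are our illustration; the paper's algorithm additionally
handles tree construction online):

    import numpy as np

    class BayesianLeaf:
        # Dirichlet-multinomial class counts, updated one sample at a time.
        def __init__(self, n_classes, alpha=1.0):
            self.alpha = alpha                 # symmetric Dirichlet prior
            self.counts = np.zeros(n_classes)

        def update(self, label):
            self.counts[label] += 1            # conjugate online update

        def predict_proba(self):
            post = self.counts + self.alpha    # posterior mean of class probs
            return post / post.sum()

    leaf = BayesianLeaf(n_classes=3)
    for y in [0, 0, 2]:
        leaf.update(y)
    print(leaf.predict_proba())                # [0.5, 0.1667, 0.3333]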
Learning Multi-Object Tracking and Segmentation from Automatic Annotations
In this work we contribute a novel pipeline to automatically generate
training data, and to improve over state-of-the-art multi-object tracking and
segmentation (MOTS) methods. Our proposed track mining algorithm turns raw
street-level videos into high-fidelity MOTS training data, is scalable, and
removes the need for expensive and time-consuming manual annotation.
We leverage state-of-the-art instance segmentation results in
combination with optical flow predictions, also trained on automatically
harvested training data. Our second major contribution is MOTSNet, a deep
tracking-by-detection architecture for MOTS that deploys a novel mask-pooling
layer for improved object association over time. Training MOTSNet
with our automatically extracted data leads to significantly improved sMOTSA
scores on the novel KITTI MOTS dataset (+1.9%/+7.5% on cars/pedestrians), and
MOTSNet improves by +4.1% over previously best methods on the MOTSChallenge
dataset. Our most impressive finding is that we can improve over previous
best-performing works, even in the complete absence of manually annotated MOTS
training data.
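One plausible reading of the mask-pooling idea is to average backbone features
under each instance mask, yielding a per-object embedding that can be matched
across frames; the sketch below reflects this assumption rather than the exact
MOTSNet layer:

    import torch

    def mask_pool(features, masks, eps=1e-6):
        # features: (C, H, W) backbone feature map
        # masks:    (N, H, W) binary instance masks at the same resolution
        # Returns an (N, C) embedding per instance for cross-frame association.
        m = masks.float().unsqueeze(1)                        # (N, 1, H, W)
        pooled = (features.unsqueeze(0) * m).sum(dim=(2, 3))  # (N, C)
        return pooled / (m.sum(dim=(2, 3)) + eps)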
Towards Generalization Across Depth for Monocular 3D Object Detection
While expensive LiDAR and stereo camera rigs have enabled the development of
successful 3D object detection methods, monocular RGB-only approaches lag far
behind. This work advances the state of the art by introducing MoVi-3D, a
novel, single-stage deep architecture for monocular 3D object detection.
MoVi-3D builds upon a novel approach which leverages geometrical information to
generate, both at training and test time, virtual views where the object
appearance is normalized with respect to distance. These virtually generated
views facilitate the detection task as they significantly reduce the visual
appearance variability associated with objects placed at different distances
from the camera. As a consequence, the deep model is relieved of learning
depth-specific representations and its complexity can be significantly reduced.
In particular, we show that, thanks to our virtual view generation process, a
lightweight, single-stage architecture suffices to set new state-of-the-art
results on the popular KITTI3D benchmark.
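As a rough illustration of the virtual view idea, the sketch below crops an
image band expected to contain objects around a given depth and rescales it so
those objects appear at a canonical virtual depth; the band selection and the
scale factor z_bin / z_virtual are simplifying assumptions, not MoVi-3D's exact
procedure:

    import torch.nn.functional as F

    def virtual_view(image, y0, y1, z_bin, z_virtual):
        # image: (1, 3, H, W). Crop rows y0:y1, which are expected to contain
        # objects around depth z_bin, then rescale by z_bin / z_virtual so
        # those objects appear as if observed at the canonical depth z_virtual.
        crop = image[:, :, y0:y1, :]
        scale = z_bin / z_virtual
        out_h = max(1, int(crop.shape[2] * scale))
        out_w = max(1, int(crop.shape[3] * scale))
        return F.interpolate(crop, size=(out_h, out_w),
                             mode="bilinear", align_corners=False)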
DiffRF: Rendering-Guided 3D Radiance Field Diffusion
We introduce DiffRF, a novel approach for 3D radiance field synthesis based
on denoising diffusion probabilistic models. While existing diffusion-based
methods operate on images, latent codes, or point cloud data, we are the first
to directly generate volumetric radiance fields. To this end, we propose a 3D
denoising model which directly operates on an explicit voxel grid
representation. However, as radiance fields generated from a set of posed
images can be ambiguous and contain artifacts, obtaining ground truth radiance
field samples is non-trivial. We address this challenge by pairing the
denoising formulation with a rendering loss, enabling our model to learn a
deviated prior that favours good image quality instead of trying to replicate
fitting errors like floating artifacts. In contrast to 2D-diffusion models, our
model learns multi-view consistent priors, enabling free-view synthesis and
accurate shape generation. Compared to 3D GANs, our diffusion-based approach
naturally enables conditional generation such as masked completion or
single-view 3D synthesis at inference time.
Comment: Project page: https://sirwyver.github.io/DiffRF/ Video:
https://youtu.be/qETBcLu8SUk - CVPR 2023 Highlight - updated evaluations
after fixing an initial data mapping error for all methods.
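A hedged sketch of pairing the denoising objective with a rendering loss on
voxel radiance fields might look as follows; the model, the noise-schedule
buffers, and the differentiable renderer render are hypothetical placeholders,
not DiffRF's actual formulation:

    import torch
    import torch.nn.functional as F

    def diffrf_step(model, render, grid, pose, gt_img,
                    sqrt_ab, sqrt_1mab, lam=0.1):
        # grid: (B, C, D, H, W) explicit voxel radiance fields.
        t = torch.randint(0, sqrt_ab.shape[0], (grid.shape[0],),
                          device=grid.device)
        a = sqrt_ab[t].view(-1, 1, 1, 1, 1)
        b = sqrt_1mab[t].view(-1, 1, 1, 1, 1)
        noise = torch.randn_like(grid)
        noisy = a * grid + b * noise                   # forward diffusion
        pred_noise = model(noisy, t)
        loss_denoise = F.mse_loss(pred_noise, noise)   # standard DDPM loss
        grid0_hat = (noisy - b * pred_noise) / a       # predicted clean field
        loss_render = F.mse_loss(render(grid0_hat, pose), gt_img)
        return loss_denoise + lam * loss_render        # rendering-guided prior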
GANeRF: Leveraging Discriminators to Optimize Neural Radiance Fields
Neural Radiance Fields (NeRF) have shown impressive novel view synthesis
results; nonetheless, even thorough recordings yield imperfections in
reconstructions, for instance due to poorly observed areas or minor lighting
changes. Our goal is to mitigate these imperfections from various sources with
a joint solution: we take advantage of the ability of generative adversarial
networks (GANs) to produce realistic images and use them to enhance realism in
3D scene reconstruction with NeRFs. To this end, we learn the patch
distribution of a scene using an adversarial discriminator, which provides
feedback to the radiance field reconstruction, thus improving realism in a
3D-consistent fashion. Thereby, rendering artifacts are repaired directly in
the underlying 3D representation by imposing multi-view patch rendering
constraints. In addition, we condition a generator on multi-resolution NeRF
renderings and train it adversarially to further improve rendering quality.
We demonstrate that our approach significantly improves rendering quality,
e.g., nearly halving LPIPS scores compared to Nerfacto while at the same time
improving PSNR by 1.4 dB on the advanced indoor scenes of Tanks and Temples.
Comment: Video: https://youtu.be/EUWW8nUxpl
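The patch-level adversarial feedback can be sketched with a standard
non-saturating GAN loss in which discriminator gradients flow back into the
radiance field through a differentiable patch renderer; D, render_patch, and
the exact loss form are assumptions for illustration, not GANeRF's verbatim
objective:

    import torch.nn.functional as F

    def adversarial_patch_losses(D, render_patch, field, pose, real_patch):
        # D judges rendered patches against patches from the input images;
        # the generator loss backpropagates into the radiance field itself.
        fake = render_patch(field, pose)
        d_loss = (F.softplus(D(fake.detach())).mean()
                  + F.softplus(-D(real_patch)).mean())
        g_loss = F.softplus(-D(fake)).mean()  # pushes the field toward realism
        return d_loss, g_loss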