Learning Sampling-Based 6D Object Pose Estimation
The task of 6D object pose estimation, i.e. estimating an object's position (three degrees of freedom) and orientation (three degrees of freedom) from images, is an essential building block of many modern applications, such as robotic grasping, autonomous driving, or augmented reality. Automatic pose estimation systems have to overcome a variety of visual ambiguities, including texture-less objects, clutter, and occlusion. Since many applications demand real-time performance, the efficient use of computational resources is an additional challenge.
In this thesis, we take a probabilistic approach to overcoming these issues. We build on a highly successful automatic pose estimation framework based on predicting pixel-wise correspondences between the camera coordinate system and the local coordinate system of the object. These dense correspondences are used to generate a pool of hypotheses, which in turn serve as a starting point for a final search procedure. We present three systems that each use probabilistic modeling and sampling to improve upon different aspects of the framework.
The goal of the first system, System I, is to enable pose tracking, i.e. estimating the pose of an object in a sequence of frames instead of a single image. By including information from previous frames, tracking systems can resolve many visual ambiguities and reduce computation time. System I is a particle filter (PF) approach. The PF represents its belief about the pose in each frame by propagating a set of samples through time. Our system uses the process of hypothesis generation from the original framework as part of a proposal distribution that efficiently concentrates samples in the appropriate areas.
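The predict-weight-resample cycle that underlies any particle filter can be illustrated in one dimension. The sketch below is a toy illustration only: the thesis's filter operates in 6D pose space with a learned, hypothesis-driven proposal distribution, whereas here we assume a simple random-walk motion model and a Gaussian observation likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, observation,
                         obs_std=0.5, motion_std=0.2):
    """One predict-weight-resample cycle of a toy 1D particle filter."""
    # Predict: propagate particles through a random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # Weight: Gaussian observation likelihood around the measured value.
    weights = weights * np.exp(-0.5 * ((particles - observation) / obs_std) ** 2)
    weights = weights / weights.sum()
    # Resample: concentrate particles in high-probability regions.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# Start with no knowledge (uniform particles), then filter a short sequence
# of noisy observations of a slowly drifting state.
particles = rng.uniform(-5.0, 5.0, size=500)
weights = np.full(500, 1.0 / 500)
for obs in [1.0, 1.1, 1.2, 1.3]:
    particles, weights = particle_filter_step(particles, weights, obs)

estimate = particles.mean()   # posterior mean as the point estimate
```

After a few steps the particle cloud collapses from the uninformative prior onto the region around the observations, which is exactly the behavior a good proposal distribution accelerates.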
In System II, we focus on the problem of evaluating the quality of pose hypotheses. This task plays an essential role in the final search procedure of the original framework. We use a convolutional neural network (CNN) to assess the quality of a hypothesis by comparing rendered and observed images. To train the CNN, we view it as part of an energy-based probability distribution in pose space. This probabilistic perspective allows us to train the system under the maximum likelihood paradigm. We use a sampling approach to approximate the required gradients. The resulting system for pose estimation yields superior results, in particular for highly occluded objects.
In System III, we take the idea of machine learning a step further. Instead of learning to predict a hypothesis-quality measure to be used in a search procedure, we present a way of learning the search procedure itself. We train a reinforcement learning (RL) agent, termed PoseAgent, to steer the search process and make optimal use of a given computational budget. PoseAgent dynamically decides which hypothesis should be refined next, and which one should ultimately be output as the final estimate. Since the search procedure includes discrete, non-differentiable choices, training the system via gradient descent is not straightforward. To solve this problem, we model the behavior of PoseAgent as a stochastic policy, which is ultimately governed by a CNN. This allows us to use a sampling-based stochastic policy gradient training procedure.
We believe that some of the ideas developed in this thesis, such as the sampling-driven, probabilistically motivated training of a CNN for the comparison of images, or the search procedure implemented by PoseAgent, have the potential to be applied in fields beyond pose estimation as well.
Direct Unsupervised Denoising
Traditional supervised denoisers are trained using pairs of noisy input and clean target images. They learn to predict a central tendency of the posterior distribution over possible clean images. When, e.g., trained with the popular quadratic loss function, the network's output will correspond to the minimum mean square error (MMSE) estimate. Unsupervised denoisers based on Variational AutoEncoders (VAEs) have succeeded in achieving state-of-the-art results while requiring only unpaired noisy data as training input. In contrast to the traditional supervised approach, unsupervised denoisers do not directly produce a single prediction, such as the MMSE estimate, but allow us to draw samples from the posterior distribution of clean solutions corresponding to the noisy input. To approximate the MMSE estimate during inference, unsupervised methods have to create and draw a large number of samples, a computationally expensive process, rendering the approach inapplicable in many situations. Here, we present an alternative approach that trains a deterministic network alongside the VAE to directly predict a central tendency. Our method achieves results that surpass those of the sampling-based unsupervised approach at a fraction of the computational cost.
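The connection between the quadratic loss and the MMSE estimate can be verified numerically. In the toy sketch below, the VAE's posterior over clean values for a single pixel is replaced by an assumed skewed distribution (a gamma, so that mean, median, and mode all differ); the posterior mean is the estimate that minimizes the expected squared error, and approximating it requires averaging many samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the posterior over clean values at one pixel (an assumption;
# in the real method these samples would come from a trained VAE).
posterior_samples = rng.gamma(shape=2.0, scale=1.0, size=100_000)

# The MMSE estimate is the posterior mean.
mmse = posterior_samples.mean()

def expected_sq_error(a):
    """Monte Carlo estimate of E[(x - a)^2] under the posterior."""
    return np.mean((posterior_samples - a) ** 2)

# The mean minimizes the quadratic loss; shifted estimates and even the
# median (the minimizer of absolute error) incur a strictly higher loss.
assert expected_sq_error(mmse) < expected_sq_error(mmse + 0.5)
assert expected_sq_error(mmse) < expected_sq_error(np.median(posterior_samples))
```

Drawing 100,000 samples per pixel is exactly the cost that a directly trained deterministic predictor avoids: it outputs the central tendency in a single forward pass.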
Unsupervised Denoising for Signal-Dependent and Row-Correlated Imaging Noise
Accurate analysis of microscopy images is hindered by the presence of noise. This noise is usually signal-dependent and often additionally correlated along rows or columns of pixels. Current self- and unsupervised denoisers can address signal-dependent noise, but none can reliably remove noise that is also row- or column-correlated. Here, we present the first fully unsupervised deep learning-based denoiser capable of handling imaging noise that is row-correlated as well as signal-dependent. Our approach uses a Variational Autoencoder (VAE) with a specially designed autoregressive decoder. This decoder is capable of modeling row-correlated and signal-dependent noise but is incapable of independently modeling the underlying clean signal. The VAE therefore produces latent variables containing only clean-signal information, and these are mapped back into image space by a second decoder network. Our method does not require a pre-trained noise model and can be trained from scratch using unpaired noisy data. We show that our approach achieves competitive results when applied to a range of different sensor types and imaging modalities.
PoseAgent: Budget-Constrained 6D Object Pose Estimation via Reinforcement Learning
State-of-the-art computer vision algorithms often achieve efficiency by
making discrete choices about which hypotheses to explore next. This allows
allocation of computational resources to promising candidates; however, such
decisions are non-differentiable. As a result, these algorithms are hard to
train in an end-to-end fashion. In this work we propose to learn an efficient
algorithm for the task of 6D object pose estimation. Our system optimizes the
parameters of an existing state-of-the-art pose estimation system using
reinforcement learning, where the pose estimation system now becomes the
stochastic policy, parametrized by a CNN. Additionally, we present an efficient
training algorithm that dramatically reduces computation time. We show
empirically that our learned pose estimation procedure makes better use of
limited resources and improves upon the state-of-the-art on a challenging
dataset. Our approach enables differentiable end-to-end training of complex
algorithmic pipelines and learns to make optimal use of a given computational
budget.
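The core training mechanism described above, a score-function (REINFORCE-style) policy gradient that handles discrete, non-differentiable choices, can be sketched in a minimal setting. The toy below is an assumption-laden stand-in: two fixed hypotheses with known rewards replace the pose search, and a two-parameter logit vector replaces the CNN that parameterizes the policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: two candidate hypotheses; refining hypothesis 1 yields
# a higher reward (in the paper, reward reflects final pose quality).
rewards = np.array([0.2, 1.0])

theta = np.zeros(2)  # policy logits (in the paper these come from a CNN)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for _ in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)      # sample a discrete, non-differentiable choice
    r = rewards[a]
    # Score-function gradient: d log pi(a) / d theta = onehot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi   # ascend the expected reward

probs = softmax(theta)              # the policy now strongly prefers action 1
```

Because the gradient is taken of the log-probability of the *sampled* action, weighted by its reward, no gradient ever needs to flow through the discrete choice itself; this is what makes end-to-end training of such pipelines possible.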
Learning Analysis-by-Synthesis for 6D Pose Estimation in RGB-D Images
Analysis-by-synthesis has been a successful approach for many tasks in
computer vision, such as 6D pose estimation of an object in an RGB-D image
which is the topic of this work. The idea is to compare the observation with
the output of a forward process, such as a rendered image of the object of
interest in a particular pose. Due to occlusion or complicated sensor noise, it
can be difficult to perform this comparison in a meaningful way. We propose an
approach that "learns to compare", while taking these difficulties into
account. This is done by describing the posterior density of a particular
object pose with a convolutional neural network (CNN) that compares an observed
and rendered image. The network is trained with the maximum likelihood
paradigm. We observe empirically that the CNN does not specialize to the
geometry or appearance of specific objects, and it can be used with objects of
vastly different shapes and appearances, and in different backgrounds. Compared
to the state of the art, we demonstrate a significant improvement on two different
datasets which include a total of eleven objects, cluttered background, and
heavy occlusion.
Comment: 16 pages, 8 figures
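The maximum likelihood training of an energy-based posterior, as used here and in System II of the thesis, has a characteristic two-term gradient: the energy gradient evaluated on observed data minus its expectation under the current model, with the latter approximated by sampling. The sketch below shows this in a deliberately simple 1D case where the energy E_theta(x) = 0.5 (x - theta)^2 makes the model an exact Gaussian; the CNN-based image comparison of the paper is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Energy E_theta(x) = 0.5 * (x - theta)^2, so p_theta(x) ∝ exp(-E) = N(theta, 1).
def grad_energy_wrt_theta(x, theta):
    return -(x - theta)

data = rng.normal(3.0, 1.0, size=2000)   # "observed" samples; true mean is 3.0
theta = 0.0
lr = 0.05
for _ in range(200):
    x_data = rng.choice(data, size=64)
    # Samples from the current model; here exact, in general obtained by MCMC.
    x_model = rng.normal(theta, 1.0, size=64)
    # ML gradient of log-likelihood: E_model[dE/dtheta] - E_data[dE/dtheta].
    grad = (grad_energy_wrt_theta(x_model, theta).mean()
            - grad_energy_wrt_theta(x_data, theta).mean())
    theta += lr * grad                    # gradient ascent on the likelihood

# theta has converged toward the data mean, the ML solution for this model.
```

The model-expectation term is exactly what the sampling approach approximates; when exact sampling is impossible, it is the only part of the gradient that becomes approximate.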
Fully Unsupervised Probabilistic Noise2Void
Image denoising is the first step in many biomedical image analysis pipelines
and Deep Learning (DL) based methods are currently best performing. A new
category of DL methods such as Noise2Void or Noise2Self can be used fully
unsupervised, requiring nothing but the noisy data. However, this comes at the
price of reduced reconstruction quality. The recently proposed Probabilistic
Noise2Void (PN2V) improves results, but requires an additional noise model for
which calibration data needs to be acquired. Here, we present improvements to
PN2V that (i) replace histogram based noise models by parametric noise models,
and (ii) show how suitable noise models can be created even in the absence of
calibration data. This is a major step since it actually renders PN2V fully
unsupervised. We demonstrate that all proposed improvements are not only
academic but indeed relevant.
Comment: Accepted at ISBI 202
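The step from histogram-based to parametric noise models can be illustrated with a simplified example. The sketch below assumes a signal-dependent, Poisson-like variance law var(s) = a*s + b and fits its two parameters by least squares on squared residuals; the actual PN2V parametric models (Gaussian-mixture noise models) and the bootstrapping without calibration data are richer than this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated calibration-style data: noisy observations of known clean signals,
# with signal-dependent noise variance var(s) = a*s + b (an assumed model).
signal = rng.uniform(10.0, 100.0, size=50_000)
a_true, b_true = 0.8, 4.0
noisy = signal + rng.normal(0.0, np.sqrt(a_true * signal + b_true))

# Parametric fit: regress squared residuals on the signal, instead of
# building per-bin histograms of p(noisy | signal).
resid_sq = (noisy - signal) ** 2
A = np.stack([signal, np.ones_like(signal)], axis=1)
(a_fit, b_fit), *_ = np.linalg.lstsq(A, resid_sq, rcond=None)

# (a_fit, b_fit) now define a smooth noise model usable at any signal level,
# including levels never observed during calibration.
```

A two-parameter model like this extrapolates smoothly and needs far less data than a histogram with hundreds of bins, which is the practical advantage of the parametric route.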
µSplit: efficient image decomposition for microscopy data
We present µSplit, a dedicated approach for trained image decomposition
in the context of fluorescence microscopy images. We find that best results
using regular deep architectures are achieved when large image patches are used
during training, making memory consumption the limiting factor to further
improving performance. We therefore introduce lateral contextualization (LC), a
memory efficient way to train powerful networks and show that LC leads to
consistent and significant improvements on the task at hand. We integrate LC
with U-Nets, Hierarchical AEs, and Hierarchical VAEs, for which we formulate a
modified ELBO loss. Additionally, LC enables training deeper hierarchical
models than otherwise possible and, interestingly, helps to reduce tiling
artefacts that are inherently impossible to avoid when using tiled VAE
predictions. We apply µSplit to five decomposition tasks: one on a
synthetic dataset and four derived from real microscopy data. LC achieves
SOTA results (average improvements to the best baseline of 2.36 dB PSNR), while
simultaneously requiring considerably less GPU memory.Comment: Published at ICCV 2023. 10 pages, 7 figures, 9 pages supplement, 8
supplementary figure
DenoiSeg: Joint Denoising and Segmentation
Microscopy image analysis often requires the segmentation of objects, but
training data for this task is typically scarce and hard to obtain. Here we
propose DenoiSeg, a new method that can be trained end-to-end on only a few
annotated ground truth segmentations. We achieve this by extending Noise2Void,
a self-supervised denoising scheme that can be trained on noisy images alone,
to also predict dense 3-class segmentations. The reason for the success of our
method is that segmentation can profit from denoising, especially when
performed jointly within the same network. The network becomes a denoising
expert by seeing all available raw data, while co-learning to segment, even if
only a few segmentation labels are available. This hypothesis is additionally
fueled by our observation that the best segmentation results on high quality
(very low noise) raw data are obtained when moderate amounts of synthetic noise
are added. This renders the denoising-task non-trivial and unleashes the
desired co-learning effect. We believe that DenoiSeg offers a viable way to
circumvent the tremendous hunger for high quality training data and effectively
enables few-shot learning of dense segmentations.
Comment: 10 pages, 4 figures, 2 pages supplement (4 figures)
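The joint training described above amounts to a combined objective: a self-supervised denoising term computed on every pixel, plus a 3-class segmentation term computed only on the few annotated pixels. The sketch below shows this weighting under simplifying assumptions: a plain MSE stands in for the Noise2Void masking loss, the network itself is omitted, and the weight `alpha` and the label convention (-1 for unannotated) are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_loss(pred_denoised, noisy_target, pred_seg_logits, seg_labels,
               alpha=0.5):
    """Toy DenoiSeg-style objective: denoising loss on all pixels plus
    3-class cross-entropy on the (few) pixels carrying segmentation labels."""
    denoise_loss = np.mean((pred_denoised - noisy_target) ** 2)
    mask = seg_labels >= 0                    # -1 marks unannotated pixels
    if mask.any():
        logits = pred_seg_logits[mask]
        logits = logits - logits.max(axis=1, keepdims=True)   # stable softmax
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        seg_loss = -log_probs[np.arange(mask.sum()), seg_labels[mask]].mean()
    else:
        seg_loss = 0.0   # unlabeled images still contribute the denoising term
    return alpha * denoise_loss + (1.0 - alpha) * seg_loss

# One toy 8x8 image, flattened: network outputs and sparse annotations.
n = 64
pred_denoised = rng.normal(size=n)
noisy_target = rng.normal(size=n)
pred_seg_logits = rng.normal(size=(n, 3))     # foreground / background / border
seg_labels = np.full(n, -1)
seg_labels[:5] = rng.integers(0, 3, size=5)   # only 5 annotated pixels

loss = joint_loss(pred_denoised, noisy_target, pred_seg_logits, seg_labels)
```

Because the denoising term is defined on all raw pixels, every image contributes to training even when no segmentation labels exist for it, which is what makes the few-shot regime workable.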