Raw or Cooked? Object Detection on RAW Images
Images fed to a deep neural network have in general undergone several
handcrafted image signal processing (ISP) operations, all of which have been
optimized to produce visually pleasing images. In this work, we investigate the
hypothesis that the intermediate representation of visually pleasing images is
sub-optimal for downstream computer vision tasks compared to the RAW image
representation. We suggest that the operations of the ISP instead should be
optimized towards the end task, by learning the parameters of the operations
jointly during training. We extend previous works on this topic and propose a
new learnable operation that enables an object detector to achieve superior
performance when compared to both previous works and traditional RGB images. In
experiments on the open PASCALRAW dataset, we empirically confirm our
hypothesis.
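
The abstract does not specify the proposed learnable operation, so the following is only a minimal sketch of the general idea: an ISP stage with trainable parameters (here a hypothetical per-channel gain and gamma) placed in front of the detector, so that the detection loss, rather than visual quality, drives its parameters.

```python
# Minimal sketch (not the paper's exact operation) of a learnable ISP stage:
# a per-channel gain and gamma applied to normalized RAW data, whose parameters
# receive gradients from the downstream detection loss.
import torch
import torch.nn as nn

class LearnableISP(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        # Trainable gain and gamma per channel, initialized to the identity mapping.
        self.gain = nn.Parameter(torch.ones(channels))
        self.log_gamma = nn.Parameter(torch.zeros(channels))

    def forward(self, raw: torch.Tensor) -> torch.Tensor:
        # raw: (B, C, H, W) demosaicked RAW intensities in [0, 1].
        gamma = self.log_gamma.exp().view(1, -1, 1, 1)
        gain = self.gain.view(1, -1, 1, 1)
        return (gain * raw).clamp(min=1e-6) ** gamma

# Usage (illustrative): the ISP is optimized jointly with the detector.
# isp = LearnableISP(); detector = ...  # any detector taking (B, 3, H, W) input
# optimizer = torch.optim.Adam(list(isp.parameters()) + list(detector.parameters()))
# loss = detection_loss(detector(isp(raw_batch)), targets); loss.backward()
```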
Learning Future Object Prediction with a Spatiotemporal Detection Transformer
We explore future object prediction -- a challenging problem where all
objects visible in a future video frame are to be predicted. We propose to
tackle this problem end-to-end by training a detection transformer to directly
output future objects. In order to make accurate predictions about the future,
it is necessary to capture the dynamics in the scene, both of other objects and
of the ego-camera. We extend existing detection transformers in two ways to
capture the scene dynamics. First, we experiment with three different
mechanisms that enable the model to spatiotemporally process multiple frames.
Second, we feed ego-motion information to the model via cross-attention. We
show that both of these cues substantially improve future object prediction
performance. Our final approach learns to capture the dynamics and make
predictions on par with an oracle for 100 ms prediction horizons, and
outperforms baselines for longer prediction horizons.
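
As an illustration of the second mechanism, the sketch below shows one plausible way to inject ego-motion via cross-attention: object queries attend to tokens embedded from the ego-motion signal. The module name, dimensions, and the residual update are assumptions, not the paper's exact architecture.

```python
# Hedged sketch of feeding ego-motion to a detection transformer via cross-attention.
import torch
import torch.nn as nn

class EgoMotionCrossAttention(nn.Module):
    def __init__(self, d_model: int = 256, ego_dim: int = 6, n_heads: int = 8):
        super().__init__()
        self.ego_embed = nn.Linear(ego_dim, d_model)   # ego-motion vector -> token
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries: torch.Tensor, ego_motion: torch.Tensor) -> torch.Tensor:
        # queries: (B, N, d_model) object queries; ego_motion: (B, T, ego_dim),
        # e.g. velocity and yaw rate over the observed frames.
        ego_tokens = self.ego_embed(ego_motion)                  # (B, T, d_model)
        attended, _ = self.cross_attn(queries, ego_tokens, ego_tokens)
        return self.norm(queries + attended)                     # residual update
```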
Hinge-Wasserstein: Mitigating Overconfidence in Regression by Classification
Modern deep neural networks are prone to being overconfident despite their
drastically improved performance. In ambiguous or even unpredictable real-world
scenarios, this overconfidence can pose a major risk to the safety of
applications. For regression tasks, the regression-by-classification approach
has the potential to alleviate these ambiguities by instead predicting a
discrete probability density over the desired output. However, a density
estimator still tends to be overconfident when trained with the common NLL
loss. To mitigate the overconfidence problem, we propose a loss function,
hinge-Wasserstein, based on the Wasserstein Distance. This loss significantly
improves the quality of both aleatoric and epistemic uncertainty, compared to
previous work. We demonstrate the capabilities of the new loss on a synthetic
dataset, where both types of uncertainty are controlled separately. Moreover,
as a demonstration for real-world scenarios, we evaluate our approach on the
benchmark dataset Horizon Lines in the Wild. On this benchmark, using the
hinge-Wasserstein loss reduces the Area Under Sparsification Error (AUSE) for
horizon parameters slope and offset by 30.47% and 65.00%, respectively.
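
The abstract does not give the exact formulation, but a hedged sketch of a Wasserstein-based loss for regression-by-classification looks as follows: the one-dimensional Wasserstein-1 distance between the predicted and target histograms (the L1 distance between their CDFs), with a hinge so that distances below a margin are not penalized. The margin and the exact combination used in the paper may differ.

```python
# Hedged sketch of a hinge-Wasserstein-style loss over K discrete bins.
import torch
import torch.nn.functional as F

def hinge_wasserstein_loss(logits: torch.Tensor,
                           target_probs: torch.Tensor,
                           margin: float = 0.0) -> torch.Tensor:
    # logits, target_probs: (B, K) over K bins discretizing the regression range.
    pred_probs = F.softmax(logits, dim=-1)
    # In 1-D, W1 equals the L1 distance between the cumulative distributions.
    cdf_diff = torch.cumsum(pred_probs - target_probs, dim=-1)
    w1 = cdf_diff.abs().sum(dim=-1)
    # Hinge: only distances exceeding the margin contribute to the loss.
    return F.relu(w1 - margin).mean()
```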
You can have your ensemble and run it too -- Deep Ensembles Spread Over Time
Ensembles of independently trained deep neural networks yield uncertainty
estimates that rival Bayesian networks in performance. They also offer sizable
improvements in terms of predictive performance over single models. However,
deep ensembles are not commonly used in environments with limited computational
budget -- such as autonomous driving -- since the complexity grows linearly
with the number of ensemble members. An important observation that can be made
for robotics applications, such as autonomous driving, is that data is
typically sequential. For instance, when an object is to be recognized, an
autonomous vehicle typically observes a sequence of images, rather than a
single image. This raises the question: could the deep ensemble be spread over time?
In this work, we propose and analyze Deep Ensembles Spread Over Time (DESOT).
The idea is to apply only a single ensemble member to each data point in the
sequence, and fuse the predictions over a sequence of data points. We implement
and experiment with DESOT for traffic sign classification, where sequences of
tracked image patches are to be classified. We find that DESOT obtains the
benefits of deep ensembles, in terms of predictive and uncertainty estimation
performance, while avoiding the added computational cost. Moreover, DESOT is
simple to implement and does not require sequences during training. Finally, we
find that DESOT, like deep ensembles, outperforms single models for
out-of-distribution detection.
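
A minimal sketch of the DESOT idea, assuming round-robin member selection and fusion by averaging the per-frame class probabilities (the paper may use a different fusion rule):

```python
# Sketch: one ensemble member per time step, predictions fused over the sequence.
import torch

@torch.no_grad()
def desot_predict(members, frames):
    # members: list of M independently trained classifiers (nn.Module).
    # frames: (T, C, H, W) tracked image patches of the same object.
    probs = []
    for t, frame in enumerate(frames):
        member = members[t % len(members)]      # only one member runs per frame
        logits = member(frame.unsqueeze(0))     # (1, num_classes)
        probs.append(logits.softmax(dim=-1))
    return torch.cat(probs, dim=0).mean(dim=0)  # fused class distribution
```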
A Generative Appearance Model for End-to-end Video Object Segmentation
One of the fundamental challenges in video object segmentation is to find an
effective representation of the target and background appearance. The best
performing approaches resort to extensive fine-tuning of a convolutional neural
network for this purpose. Besides being prohibitively expensive, this strategy
cannot be truly trained end-to-end since the online fine-tuning procedure is
not integrated into the offline training of the network.
To address these issues, we propose a network architecture that learns a
powerful representation of the target and background appearance in a single
forward pass. The introduced appearance module learns a probabilistic
generative model of target and background feature distributions. Given a new
image, it predicts the posterior class probabilities, providing a highly
discriminative cue, which is processed in later network modules. Both the
learning and prediction stages of our appearance module are fully
differentiable, enabling true end-to-end training of the entire segmentation
pipeline. Comprehensive experiments demonstrate the effectiveness of the
proposed approach on three video object segmentation benchmarks. We close the
gap to approaches based on online fine-tuning on DAVIS17, while operating at 15
FPS on a single GPU. Furthermore, our method outperforms all published
approaches on the large-scale YouTube-VOS dataset.
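
As a rough illustration of the appearance-module idea, the sketch below fits class-conditional diagonal Gaussians to target and background features from an annotated frame and computes per-pixel posteriors with Bayes' rule under equal priors; the paper's actual generative model and its differentiable fitting are likely more elaborate.

```python
# Sketch: Gaussian generative model of target/background features + posterior map.
import torch

def fit_gaussians(feats: torch.Tensor, mask: torch.Tensor):
    # feats: (N, D) per-pixel features; mask: (N,) with 1 for target, 0 for background.
    stats = []
    for cls in (1, 0):
        sel = feats[mask == cls]
        stats.append((sel.mean(0), sel.var(0) + 1e-5))  # diagonal covariance
    return stats  # [(mu_fg, var_fg), (mu_bg, var_bg)]

def posterior(feats: torch.Tensor, stats) -> torch.Tensor:
    # Diagonal-Gaussian log-likelihoods per class; softmax gives the posterior
    # under equal class priors.
    log_liks = []
    for mu, var in stats:
        log_liks.append(-0.5 * (((feats - mu) ** 2) / var + var.log()).sum(-1))
    return torch.stack(log_liks, dim=-1).softmax(dim=-1)  # (N, 2): P(target), P(bg)
```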
Towards trustworthy multi-modal motion prediction: Holistic evaluation and interpretability of outputs
Predicting the motion of other road agents enables autonomous vehicles to
perform safe and efficient path planning. This task is very complex, as the
behaviour of road agents depends on many factors and the number of possible
future trajectories can be considerable (multi-modal). Most prior approaches
proposed to address multi-modal motion prediction are based on complex machine
learning systems that have limited interpretability. Moreover, the metrics used
in current benchmarks do not evaluate all aspects of the problem, such as the
diversity and admissibility of the output. In this work, we aim to advance
towards the design of trustworthy motion prediction systems, based on some of
the requirements for the design of Trustworthy Artificial Intelligence. We
focus on evaluation criteria, robustness, and interpretability of outputs.
First, we comprehensively analyse the evaluation metrics, identify the main
gaps of current benchmarks, and propose a new holistic evaluation framework. We
then introduce a method for the assessment of spatial and temporal robustness
by simulating noise in the perception system. To enhance the interpretability
of the outputs and generate more balanced results in the proposed evaluation
framework, we propose an intent prediction layer that can be attached to
multi-modal motion prediction models. The effectiveness of this approach is
assessed through a survey that explores different elements in the visualization
of the multi-modal trajectories and intentions. The proposed approach and
findings make a significant contribution to the development of trustworthy
motion prediction systems for autonomous vehicles, advancing the field towards
greater safety and reliability.
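
As an illustration of the robustness assessment, the sketch below perturbs observed agent trajectories with spatial noise and randomly dropped detections before re-running a motion predictor; the noise model and magnitudes are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch: simulate perception noise on observed trajectories for robustness tests.
import numpy as np

def perturb_history(history: np.ndarray, pos_sigma: float = 0.2,
                    drop_prob: float = 0.1, rng=None) -> np.ndarray:
    # history: (num_agents, T, 2) observed x/y positions (float).
    rng = rng or np.random.default_rng()
    noisy = history + rng.normal(0.0, pos_sigma, size=history.shape)  # spatial noise
    dropped = rng.random(history.shape[:2]) < drop_prob               # temporal noise:
    noisy[dropped] = np.nan                                           # missing detections
    return noisy
```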
Visual Tracking with Deformable Continuous Convolution Operators
Visual Object Tracking is the computer vision problem of estimating a target trajectory in a video given only its initial state. A visual tracker often acts as a component in the intelligent vision systems seen in, for instance, surveillance, autonomous vehicles or robots, and unmanned aerial vehicles. Applications may require robust tracking performance on difficult sequences depicting targets undergoing large changes in appearance, while enforcing a real-time constraint. Discriminative correlation filters have shown promising tracking performance in recent years, and have consistently improved the state-of-the-art. With the advent of deep learning, new robust deep features have improved tracking performance considerably. However, methods based on discriminative correlation filters learn a rigid template describing the target appearance. This implies an assumption of target rigidity which is not fulfilled in practice. This thesis introduces an approach which integrates deformability into a state-of-the-art tracker. The approach is thoroughly tested on three challenging visual tracking benchmarks, achieving state-of-the-art performance.
Dynamic Visual Learning
Autonomous robots act in a dynamic world where both the robots and other objects may move. The surround sensing systems of such robots therefore work with dynamic input data and need to estimate both the current state of the environment and its dynamics. One of the key elements in obtaining a high-level understanding of the environment is to track dynamic objects. This enables the system to understand what the objects are doing, predict where they will be in the future, and, over time, better estimate where they are. In this thesis, I focus on input from visual cameras: images. Images have, with the advent of neural networks, become a cornerstone in sensing systems. Image-processing neural networks are optimized to perform a specific computer vision task -- such as recognizing cats and dogs -- on vast datasets of annotated examples. This is usually referred to as offline training, and given a well-designed neural network, enough high-quality data, and a suitable offline training formulation, the neural network is expected to become adept at the specific task.

This thesis starts with a study of object tracking. The tracking is based on the visual appearance of the object, achieved via discriminative correlation filters (DCFs). The first contribution of this thesis is to decompose the filter into multiple subfilters. This serves to increase robustness during object deformations or rotations. Moreover, it provides a more fine-grained representation of the object state, as the subfilters are expected to roughly track object parts.

In the second contribution, a neural network is trained directly for object tracking. In order to obtain a fine-grained representation of the object state, it is represented as a segmentation. The main challenge lies in the design of a neural network able to tackle this task. While common neural networks excel at recognizing patterns seen during offline training, they struggle to store novel patterns in order to later recognize them. To overcome this limitation, a novel appearance learning mechanism is proposed. The mechanism extends the state-of-the-art and is shown to generalize remarkably well to novel data. In the third contribution, the method is used together with a novel fusion strategy and failure detection criterion to semi-automatically annotate visual and thermal videos.

Sensing systems need to not only track objects, but also detect them. The fourth contribution of this thesis strives to tackle joint detection, tracking, and segmentation of all objects from a predefined set of object classes. The challenge here lies not only in the neural network design, but also in the design of the offline training formulation. The final approach, a recurrent graph neural network, outperforms prior works that have a runtime of the same order of magnitude.

Last, this thesis studies dynamic learning of novel visual concepts. It is observed that the learning mechanisms used for object tracking essentially learn the appearance of the tracked object. It is natural to ask whether this appearance learning could be extended beyond individual objects to entire semantic classes, enabling the system to learn new concepts from just a few training examples. Such an ability is desirable in autonomous systems, as it removes the need to manually annotate thousands of examples of each class that needs to be recognized. Instead, the system is trained to efficiently learn to recognize new classes. In the fifth contribution, we propose a novel learning mechanism based on Gaussian process regression. With this mechanism, our neural network outperforms the state-of-the-art, and the performance gap is especially large when multiple training examples are given.

To summarize, this thesis studies and makes several contributions to learning systems that parse dynamic visuals and that dynamically learn visual appearances or concepts.
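
As an illustration of the fifth contribution's learning mechanism, the sketch below uses Gaussian process regression from support-set features to class labels and evaluates the posterior mean on query features; the RBF kernel and the one-vs-rest labels are assumptions made for illustration, not the thesis's exact formulation.

```python
# Sketch: few-shot recognition of a new class via GP regression on deep features.
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, lengthscale: float = 1.0) -> torch.Tensor:
    # a: (P, D), b: (R, D) feature matrices.
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-0.5 * d2 / lengthscale ** 2)

def gp_regression_predict(support_feats, support_labels, query_feats, noise=1e-2):
    # support_feats: (S, D); support_labels: (S,) with +1.0 for the new class, -1.0 otherwise.
    K = rbf_kernel(support_feats, support_feats)
    K_star = rbf_kernel(query_feats, support_feats)                      # (Q, S)
    eye = torch.eye(K.shape[0], dtype=K.dtype)
    alpha = torch.linalg.solve(K + noise * eye, support_labels.unsqueeze(1))
    return (K_star @ alpha).squeeze(1)                                   # posterior mean score per query
```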