439 research outputs found
Multi-task near-field perception for autonomous driving using surround-view fisheye cameras
The formation of eyes led to the big bang of evolution. The dynamics changed from a primitive organism waiting for food to come into contact with it to an organism that seeks out food using visual sensors. The human eye is one of the most sophisticated developments of evolution, but it still has defects. Over millions of years, humans have evolved a biological perception algorithm capable of driving cars, operating machinery, piloting aircraft, and navigating ships. Automating these capabilities for computers is critical for various applications, including self-driving cars, augmented reality, and architectural surveying. Near-field visual perception in the context of self-driving cars covers the environment within a range of 0 to 10 meters, with 360° coverage around the vehicle. It is a critical decision-making component in the development of safer automated driving. Recent advances in computer vision and deep learning, in conjunction with high-quality sensors such as cameras and LiDARs, have fueled mature visual perception solutions. Until now, far-field perception has been the primary focus. Another significant issue is the limited processing power available for developing real-time applications. Because of this bottleneck, there is frequently a trade-off between performance and run-time efficiency. We concentrate on the following issues in order to address them: 1) Developing near-field perception algorithms with high performance and low computational complexity for various visual perception tasks, such as geometric and semantic tasks, using convolutional neural networks.
2) Using Multi-Task Learning to overcome computational bottlenecks by sharing initial convolutional layers between tasks and developing optimization strategies that balance tasks.
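The idea in point 2 of sharing initial convolutional layers between tasks can be illustrated with a minimal NumPy sketch. It is not the thesis's architecture: the single 3x3 layer, the two heads (`depth_head`, `seg_head`), and the fixed loss weights are hypothetical stand-ins chosen for illustration, and a plain weighted sum is only one simple way to balance tasks.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv3x3(x, kernel):
    """Valid 3x3 cross-correlation of a 2D array with a 3x3 kernel."""
    windows = sliding_window_view(x, (3, 3))          # (H-2, W-2, 3, 3)
    return np.einsum("ijkl,kl->ij", windows, kernel)


def shared_encoder(image, kernel):
    """Shared initial layer: computed once and reused by every task head."""
    return np.maximum(conv3x3(image, kernel), 0.0)     # ReLU


def depth_head(features, scale):
    """Hypothetical geometric head: a per-pixel linear readout."""
    return scale * features


def seg_head(features, scale):
    """Hypothetical semantic head: a per-pixel foreground probability."""
    return 1.0 / (1.0 + np.exp(-scale * features))


def multitask_loss(losses, weights):
    """Balance tasks with a weighted sum (weights are assumed, not learned)."""
    return sum(weights[name] * losses[name] for name in losses)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((8, 8))
    features = shared_encoder(image, rng.normal(size=(3, 3)))  # one pass
    depth = depth_head(features, 2.0)
    seg = seg_head(features, 1.5)
    losses = {
        "depth": float(np.mean(np.abs(depth - 1.0))),   # dummy targets
        "seg": float(np.mean((seg - 0.5) ** 2)),
    }
    print(multitask_loss(losses, {"depth": 0.5, "seg": 0.5}))
```

The encoder runs once per image regardless of how many heads consume its output, which is where the computational saving comes from.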
Don't Forget The Past: Recurrent Depth Estimation from Monocular Video
Autonomous cars need continuously updated depth information. Thus far, depth
is mostly estimated independently for a single frame at a time, even if the
method starts from video input. Our method produces a time series of depth
maps, which makes it an ideal candidate for online learning approaches. In
particular, we put three different types of depth estimation (supervised depth
prediction, self-supervised depth prediction, and self-supervised depth
completion) into a common framework. We integrate the corresponding networks
with a ConvLSTM such that the spatiotemporal structures of depth across frames
can be exploited to yield a more accurate depth estimation. Our method is
flexible. It can be applied to monocular videos only or be combined with
different types of sparse depth patterns. We carefully study the architecture
of the recurrent network and its training strategy. We are the first to
successfully exploit recurrent networks for real-time self-supervised monocular
depth estimation and completion. Extensive experiments show that our recurrent
method outperforms its image-based counterpart consistently and significantly
in both self-supervised scenarios. It also outperforms previous depth
estimation methods of the three popular groups. Please refer to
https://www.trace.ethz.ch/publications/2020/rec_depth_estimation/ for details.
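As a rough illustration of how a ConvLSTM carries spatial state from frame to frame, here is a minimal single-channel cell in NumPy. It is a sketch of the generic ConvLSTM update (without peephole connections), not the paper's network: the 3x3 kernels, the random initialization, and the helper names `conv_same` and `convlstm_step` are assumptions made for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv_same(x, kernel):
    """3x3 cross-correlation with zero padding, so output shape equals input."""
    windows = sliding_window_view(np.pad(x, 1), (3, 3))
    return np.einsum("ijkl,kl->ij", windows, kernel)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def convlstm_step(x, h, c, params):
    """One ConvLSTM update for a single-channel feature map.

    params maps each gate name ("i", "f", "o", "g") to a tuple
    (input kernel, hidden kernel, scalar bias).
    """
    pre = {
        name: conv_same(x, wx) + conv_same(h, wh) + b
        for name, (wx, wh, b) in params.items()
    }
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    g = np.tanh(pre["g"])
    c_new = f * c + i * g            # keep part of the old memory, add new
    h_new = o * np.tanh(c_new)       # hidden state passed to the next frame
    return h_new, c_new


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = {
        name: (0.1 * rng.normal(size=(3, 3)), 0.1 * rng.normal(size=(3, 3)), 0.0)
        for name in ("i", "f", "o", "g")
    }
    h = np.zeros((6, 8))
    c = np.zeros((6, 8))
    for _ in range(4):               # four consecutive "frames"
        frame_features = rng.normal(size=(6, 8))
        h, c = convlstm_step(frame_features, h, c, params)
    print(h.shape)
```

Because the gates are computed by convolution rather than by dense layers, the hidden state keeps the spatial layout of the frame, which is what allows it to feed a per-pixel depth decoder.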
An Approach Of Features Extraction And Heatmaps Generation Based Upon CNNs And 3D Object Models
The rapid advancements in artificial intelligence have enabled recent progress in self-driving vehicles. However, the dependence on 3D object models and their annotations, which are collected and owned by individual companies, has become a major problem for the development of new algorithms. This thesis proposes an approach that directly uses graphics models created from open-source datasets as the virtual representation of real-world objects. The approach uses machine learning techniques to extract 3D feature points and to create annotations from graphics models for the recognition of dynamic objects, such as cars, and for the verification of stationary and variable objects, such as buildings and trees. Moreover, it generates heat maps for the elimination of stationary/variable objects in real-time images before working on the recognition of dynamic objects. The proposed approach helps to bridge the gap between the virtual and physical worlds and to facilitate the development of new algorithms for self-driving vehicles.
Multi-Domain Adaptation for Image Classification, Depth Estimation, and Semantic Segmentation
The appearance of scenes may change for many reasons, including the viewpoint, the time of day, the weather, and the seasons. Traditionally, deep neural networks are trained and evaluated using images from the same scene and domain to avoid the domain gap. Recent advances in domain adaptation have led to a new type of method that bridges such domain gaps and learns from multiple domains.
This dissertation proposes methods for multi-domain adaptation for various computer vision tasks, including image classification, depth estimation, and semantic segmentation. The first work focuses on semi-supervised domain adaptation. I address this semi-supervised setting and propose to use dynamic feature alignment to address both inter- and intra-domain discrepancy. The second work addresses the task of monocular depth estimation in the multi-domain setting. I propose to address this task with a unified approach that includes adversarial knowledge distillation and uncertainty-guided self-supervised reconstruction. The third work considers the problem of semantic segmentation for aerial imagery with diverse environments and viewing geometries. I present CrossSeg: a novel framework that learns a semantic segmentation network that can generalize well in a cross-scene setting with only a few labeled samples. I believe this line of work can be applicable to many domain adaptation scenarios and aerial applications
Self-supervised monocular depth estimation from oblique UAV videos
UAVs have become an essential photogrammetric measurement tool as they are
affordable, easily accessible and versatile. Aerial images captured from UAVs
have applications in small and large scale texture mapping, 3D modelling,
object detection tasks, DTM and DSM generation etc. Photogrammetric techniques
are routinely used for 3D reconstruction from UAV images where multiple images
of the same scene are acquired. Developments in computer vision and deep
learning techniques have made Single Image Depth Estimation (SIDE) a field of
intense research. Using SIDE techniques on UAV images can overcome the need for
multiple images for 3D reconstruction. This paper aims to estimate depth from a
single UAV aerial image using deep learning. We follow a self-supervised
learning approach, Self-Supervised Monocular Depth Estimation (SMDE), which
does not need ground truth depth or any extra information other than images for
learning to estimate depth. Monocular video frames are used for training the
deep learning model which learns depth and pose information jointly through two
different networks, one each for depth and pose. The predicted depth and pose
are used to reconstruct one image from the viewpoint of another image utilising
the temporal information from videos. We propose a novel architecture with two
2D CNN encoders and a 3D CNN decoder for extracting information from
consecutive temporal frames. A contrastive loss term is introduced for
improving the quality of image generation. Our experiments are carried out on
the public UAVid video dataset. The experimental results demonstrate that our
model outperforms the state-of-the-art methods in estimating the depths.
Comment: Submitted to ISPRS Journal of Photogrammetry and Remote Sensing
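The reconstruction step described in the abstract above, where predicted depth and pose are used to view one image from another viewpoint, rests on a standard pinhole reprojection. The NumPy sketch below shows only that geometric step under assumed inputs: the intrinsics matrix `K`, the 4x4 target-to-source pose `T`, and the function name `reproject` are illustrative and not taken from the paper. Image sampling and the photometric and contrastive losses are omitted.

```python
import numpy as np


def reproject(depth, K, T):
    """Map every target pixel to its (u, v) location in the source view.

    depth: (H, W) depth of the target frame
    K:     (3, 3) pinhole camera intrinsics
    T:     (4, 4) rigid transform from target camera to source camera
    Returns an (H, W, 2) array of source pixel coordinates.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)
    pixels = pixels.reshape(-1, 3).T.astype(float)           # (3, N)

    # Back-project pixels to 3D points in the target camera frame.
    points = (np.linalg.inv(K) @ pixels) * depth.reshape(1, -1)

    # Move the points into the source camera frame.
    points_h = np.vstack([points, np.ones((1, points.shape[1]))])
    points_src = (T @ points_h)[:3]

    # Project onto the source image plane.
    projected = K @ points_src
    uv = projected[:2] / projected[2:3]
    return uv.T.reshape(h, w, 2)


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 2.0],
                  [0.0, 100.0, 1.5],
                  [0.0, 0.0, 1.0]])
    depth = np.full((3, 4), 5.0)
    T = np.eye(4)
    T[0, 3] = 0.1                       # small sideways camera motion
    print(reproject(depth, K, T)[0, 0])
```

In a self-supervised setup, the source image would be sampled at these coordinates and compared with the target frame, so that errors in depth or pose show up as reconstruction error.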
Saliency-based approaches for multidimensional explainability of deep networks
In deep learning, visualization techniques extract the salient patterns exploited by deep networks to perform a task (e.g. image classification), focusing on single images. These methods allow a better understanding of these complex models and help identify the most informative parts of the input data. Beyond understanding deep networks, visual saliency is useful for many quantitative purposes and applications, in both the 2D and 3D domains, such as analyzing the generalization capabilities of a classifier and autonomous navigation. In this thesis, we describe an approach to the interpretability problem of a convolutional neural network and propose ideas on how to exploit visualization for applications such as image classification and active object recognition. After a brief overview of common visualization methods that produce attention/saliency maps, we address two separate points. First, we describe how visual saliency can be used effectively in the 2D domain (e.g. RGB images) to boost image classification performance: visual summaries, i.e. compact representations of an ensemble of saliency maps, can be used to improve the classification accuracy of a network through summary-driven specializations. Then, we present a 3D active recognition system that can consider different views of a target object, overcoming the single-view hypothesis of classical object recognition and making the classification problem much easier in principle. Here we use such attention maps in a quantitative fashion, by building a dense 3D saliency volume that fuses saliency maps obtained from different viewpoints, yielding a continuous proxy for which parts of an object are more discriminative for a given classifier. Finally, we show how to inject these representations into a real-world application, so that an agent (e.g. a robot) can move knowing the capabilities of its classifier.