89 research outputs found
SimFIR: A Simple Framework for Fisheye Image Rectification with Self-supervised Representation Learning
In fisheye images, rich distinct distortion patterns are regularly
distributed in the image plane. These distortion patterns are independent of
the visual content and provide informative cues for rectification. To make the
best of such rectification cues, we introduce SimFIR, a simple framework for
fisheye image rectification based on self-supervised representation learning.
Technically, we first split a fisheye image into multiple patches and extract
their representations with a Vision Transformer (ViT). To learn fine-grained
distortion representations, we then associate different image patches with
their specific distortion patterns based on the fisheye model, and further
design a unified distortion-aware pretext task for learning
them. The transfer performance on the downstream rectification task is
remarkably boosted, which verifies the effectiveness of the learned
representations. Extensive experiments are conducted, and the quantitative and
qualitative results demonstrate the superiority of our method over the
state-of-the-art algorithms as well as its strong generalization ability on
real-world fisheye images.
Comment: Accepted to ICCV 202
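The patch-to-distortion association described in the SimFIR abstract can be illustrated with a minimal sketch. It assumes a radially symmetric fisheye model, so that the distortion of a patch depends only on how far its center lies from the image center. The function name `patch_distortion_labels`, the ring-based labels, and the parameter `num_levels` are hypothetical and are not taken from the paper.

```python
import math

def patch_distortion_labels(height, width, patch, num_levels):
    """Label each patch with a ring index (0 = center, num_levels - 1 = border).

    Assumption: distortion grows with the distance of the patch center
    from the image center, as in radially symmetric fisheye models.
    """
    cy, cx = height / 2.0, width / 2.0
    max_r = math.hypot(cy, cx)  # distance from the center to an image corner
    labels = []
    for top in range(0, height - patch + 1, patch):
        row = []
        for left in range(0, width - patch + 1, patch):
            py, px = top + patch / 2.0, left + patch / 2.0
            r = math.hypot(py - cy, px - cx) / max_r  # normalized radius
            row.append(min(int(r * num_levels), num_levels - 1))
        labels.append(row)
    return labels

# A 64x64 image split into 16x16 patches, with three distortion levels.
labels = patch_distortion_labels(64, 64, 16, 3)
for row in labels:
    print(row)
```

Labels of this kind could serve as targets for a pretext classification task on patch representations; the pretext task actually used by SimFIR may differ.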
UnRectDepthNet: Self-Supervised Monocular Depth Estimation using a Generic Framework for Handling Common Camera Distortion Models
In classical computer vision, rectification is an integral part of multi-view
depth estimation. It typically includes epipolar rectification and lens
distortion correction. This process simplifies the depth estimation
significantly, and thus it has been adopted in CNN approaches. However,
rectification has several side effects, including a reduced field of view
(FOV), resampling distortion, and sensitivity to calibration errors. The
effects are particularly pronounced in case of significant distortion (e.g.,
wide-angle fisheye cameras). In this paper, we propose a generic scale-aware
self-supervised pipeline for estimating depth, Euclidean distance, and visual
odometry from unrectified monocular videos. On the unrectified KITTI dataset
with barrel distortion, we demonstrate a level of precision comparable to that
on the rectified KITTI dataset. The intuition is that the rectification step
can be implicitly absorbed within the CNN model, which learns the distortion
model without increasing complexity. Our approach does not suffer from a
reduced field of view and avoids computational costs for rectification at
inference time. To further illustrate the general applicability of the proposed
framework, we apply it to wide-angle fisheye cameras with a 190°
horizontal field of view. The training framework UnRectDepthNet takes in the
camera distortion model as an argument and adapts projection and unprojection
functions accordingly. The proposed algorithm is evaluated further on the KITTI
rectified dataset, and we achieve state-of-the-art results that improve upon
our previous work FisheyeDistanceNet. Qualitative results on a distorted test
scene video sequence indicate excellent performance:
https://youtu.be/K6pbx3bU4Ss.
Comment: Minor fixes added after IROS 2020 camera-ready submission. IROS 2020
presentation video - https://www.youtube.com/watch?v=3Br2KSWZRr
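The idea of passing the camera distortion model as an argument and switching the projection and unprojection functions accordingly can be sketched as follows. This is an illustration only: it uses the textbook pinhole and equidistant fisheye models, which are not necessarily the models used in UnRectDepthNet, and the names `project`, `unproject`, and the `model` strings are hypothetical.

```python
import math

def project(point, focal, cx, cy, model):
    """Map a 3D camera-frame point (x, y, z) to pixel coordinates (u, v)."""
    x, y, z = point
    if model == "pinhole":
        if z <= 0:
            raise ValueError("pinhole model cannot project points behind the camera")
        return (focal * x / z + cx, focal * y / z + cy)
    if model == "equidistant":
        rho = math.hypot(x, y)
        if rho == 0:
            return (cx, cy)
        theta = math.atan2(rho, z)  # angle between the ray and the optical axis
        r = focal * theta           # equidistant mapping: radius proportional to angle
        return (r * x / rho + cx, r * y / rho + cy)
    raise ValueError("unknown model: " + model)

def unproject(pixel, distance, focal, cx, cy, model):
    """Map a pixel and a Euclidean distance along its ray back to a 3D point."""
    du, dv = pixel[0] - cx, pixel[1] - cy
    if model == "pinhole":
        ray = (du / focal, dv / focal, 1.0)
    elif model == "equidistant":
        r = math.hypot(du, dv)
        if r == 0:
            ray = (0.0, 0.0, 1.0)
        else:
            theta = r / focal
            ray = (math.sin(theta) * du / r, math.sin(theta) * dv / r, math.cos(theta))
    else:
        raise ValueError("unknown model: " + model)
    norm = math.sqrt(sum(c * c for c in ray))
    return tuple(distance * c / norm for c in ray)

# Round trip: pixel -> 3D point -> pixel, for both models.
for model in ("pinhole", "equidistant"):
    p = unproject((500.0, 260.0), 5.0, 300.0, 320.0, 240.0, model)
    print(model, project(p, 300.0, 320.0, 240.0, model))
```

A point more than 90° off the optical axis, such as `(1.0, 0.0, -0.1)`, is rejected by the pinhole branch but accepted by the equidistant one, which is why a fisheye model is needed for fields of view beyond 180°.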
Multi-task near-field perception for autonomous driving using surround-view fisheye cameras
The formation of eyes led to the big bang of evolution. The dynamics changed from a primitive organism waiting for food to come into contact with it to an organism that seeks food out using visual sensors. The human eye is one of the most sophisticated developments of evolution, but it still has defects. Over millions of years, humans have evolved a biological perception algorithm capable of driving cars, operating machinery, piloting aircraft, and navigating ships. Automating these capabilities for computers is critical for various applications, including self-driving cars, augmented reality, and architectural surveying. Near-field visual perception in the context of self-driving cars covers the environment within a range of 0 to 10 meters, with 360° coverage around the vehicle. It is a critical decision-making component in the development of safer automated driving. Recent advances in computer vision and deep learning, in conjunction with high-quality sensors such as cameras and LiDARs, have produced mature visual perception solutions. Until now, far-field perception has been the primary focus. Another significant issue is the limited processing power available for developing real-time applications. Because of this bottleneck, there is frequently a trade-off between performance and run-time efficiency. We concentrate on the following topics in order to address these issues: 1) Developing near-field perception algorithms with high performance and low computational complexity for various visual perception tasks, such as geometric and semantic tasks, using convolutional neural networks.
2) Using multi-task learning to overcome computational bottlenecks by sharing initial convolutional layers between tasks and developing optimization strategies that balance the tasks.
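The two ingredients named in point 2), shared initial layers and a strategy for balancing tasks, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `shared_encoder` is a one-dimensional convolution standing in for shared convolutional layers, the two heads are placeholders, and `balanced_loss` implements one simple heuristic (dividing each task loss by its initial value), not the optimization strategy developed in the thesis.

```python
def shared_encoder(signal, kernel):
    """Valid 1-D convolution standing in for the shared initial layers."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def depth_head(features):
    """Placeholder geometric head: mean of the shared features."""
    return sum(features) / len(features)

def segmentation_head(features):
    """Placeholder semantic head: maximum of the shared features."""
    return max(features)

def balanced_loss(losses, initial_losses):
    """Average of the task losses, each divided by its initial value.

    Assumption: normalizing by the initial loss keeps tasks with very
    different loss scales from dominating the combined objective.
    """
    return sum(losses[t] / initial_losses[t] for t in losses) / len(losses)

# The encoder runs once; both heads reuse its output.
features = shared_encoder([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0])
outputs = {"depth": depth_head(features), "seg": segmentation_head(features)}
total = balanced_loss({"depth": 2.0, "seg": 0.1}, {"depth": 4.0, "seg": 0.2})
print(outputs, total)
```

Computing `features` once and feeding it to every head is where the saving comes from: the cost of the shared layers is paid a single time regardless of the number of tasks.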
ChiTransformer: Towards Reliable Stereo from Cues
Current stereo matching techniques are challenged by restricted searching
space, occluded regions and sheer size. While single image depth estimation is
spared from these challenges and can achieve satisfactory results with the
extracted monocular cues, the lack of stereoscopic relationship renders the
monocular prediction less reliable on its own, especially in highly dynamic or
cluttered environments. To address these issues in both scenarios, we present
an optic-chiasm-inspired self-supervised binocular depth estimation method,
wherein a vision transformer (ViT) with gated positional cross-attention (GPCA)
layers is designed to enable feature-sensitive pattern retrieval between views
while retaining the extensive context information aggregated through
self-attentions. Monocular cues from a single view are thereafter conditionally
rectified by a blending layer with the retrieved pattern pairs. This crossover
design is biologically analogous to the optic-chiasm structure in the human
visual system, hence the name ChiTransformer. Our experiments show that
this architecture yields a substantial 11% improvement over state-of-the-art
self-supervised stereo approaches, and can be used on both rectilinear
and non-rectilinear (e.g., fisheye) images.
Comment: 11 pages, 3 figures, CVPR202
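The abstract does not spell out the GPCA layer, so the following is only a generic sketch of gated cross-attention between two views, not the ChiTransformer design. A query from one view retrieves a pattern from the other view through scaled dot-product attention, and a gate blends the retrieved pattern with the view's own monocular feature. The function names and the fixed scalar `gate` are hypothetical; in a trained network the gate would normally be learned.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    """Scaled dot-product attention of one query over the other view's tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def gated_blend(own, retrieved, gate):
    """Blend a view's own feature with the retrieved one; gate lies in [0, 1]."""
    return [(1.0 - gate) * o + gate * r for o, r in zip(own, retrieved)]

# One token from the reference view attends over two tokens from the other view.
retrieved = cross_attend([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]])
blended = gated_blend([1.0, 1.0], retrieved, 0.5)
print(retrieved, blended)
```

With the gate at 0 the output reduces to the monocular feature alone, which matches the abstract's description of monocular cues being rectified only conditionally.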
- …