Practical Auto-Calibration for Spatial Scene-Understanding from Crowdsourced Dashcamera Videos
Spatial scene-understanding, including dense depth and ego-motion estimation,
is an important problem in computer vision for autonomous vehicles and advanced
driver assistance systems. Thus, it is beneficial to design perception modules
that can utilize crowdsourced videos collected from arbitrary vehicular onboard
or dashboard cameras. However, the intrinsic parameters corresponding to such
cameras are often unknown or change over time. Typical manual calibration
approaches require objects such as a chessboard or additional scene-specific
information. On the other hand, automatic camera calibration does not have such
requirements. Yet, the automatic calibration of dashboard cameras is
challenging as forward and planar navigation results in critical motion
sequences with reconstruction ambiguities. Structure reconstruction of complete
visual sequences that may contain tens of thousands of images is also
computationally untenable. Here, we propose a system for practical monocular
onboard camera auto-calibration from crowdsourced videos. We show the
effectiveness of our proposed system on the KITTI raw, Oxford RobotCar, and the
crowdsourced D-City datasets in varying conditions. Finally, we demonstrate
its application for accurate monocular dense depth and ego-motion estimation on
uncalibrated videos.
Comment: Accepted at the 16th International Conference on Computer Vision Theory and Applications (VISAPP 2021)
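As a concrete illustration of monocular auto-calibration (a generic baseline heuristic, not necessarily the system proposed above), one can grid-search candidate focal lengths and keep the one whose essential-matrix fit over matched frame pairs has the strongest RANSAC inlier support. The image-center principal point and the frame-pair sampling below are illustrative assumptions.

```python
# Hedged sketch: grid search over candidate focal lengths for a monocular
# dashcam, scoring each by epipolar-inlier support. A generic
# self-calibration heuristic, NOT the paper's method.
import cv2
import numpy as np

def score_focal(pts1, pts2, f, center):
    """Inlier count of a RANSAC essential matrix under intrinsics (f, center)."""
    K = np.array([[f, 0, center[0]],
                  [0, f, center[1]],
                  [0, 0, 1.0]])
    E, inliers = cv2.findEssentialMat(pts1, pts2, K,
                                      method=cv2.RANSAC, threshold=1.0)
    return 0 if inliers is None else int(np.count_nonzero(inliers))

def autocalibrate(frame_pairs, image_size, focal_grid):
    """Pick the focal length best supported across matched frame pairs."""
    center = (image_size[0] / 2.0, image_size[1] / 2.0)  # assumed principal point
    scores = {f: 0 for f in focal_grid}
    for pts1, pts2 in frame_pairs:  # pts: Nx2 float32 arrays of matched points
        for f in focal_grid:
            scores[f] += score_focal(pts1, pts2, f, center)
    return max(scores, key=scores.get)
```

As the abstract notes, forward planar motion produces critical sequences where such two-view scoring turns ambiguous, so frame pairs exhibiting richer motion (e.g. turns) are preferable.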
3D Visual Perception for Self-Driving Cars using a Multi-Camera System: Calibration, Mapping, Localization, and Obstacle Detection
Cameras are a crucial exteroceptive sensor for self-driving cars as they are
low-cost and small, provide appearance information about the environment, and
work in various weather conditions. They can be used for multiple purposes such
as visual navigation and obstacle detection. We can use a surround multi-camera
system to cover the full 360-degree field-of-view around the car. In this way,
we avoid blind spots which can otherwise lead to accidents. To minimize the
number of cameras needed for surround perception, we utilize fisheye cameras.
Consequently, standard vision pipelines for 3D mapping, visual localization,
obstacle detection, etc. need to be adapted to take full advantage of the
availability of multiple cameras rather than treat each camera individually. In
addition, processing of fisheye images has to be supported. In this paper, we
describe the camera calibration and subsequent processing pipeline for
multi-fisheye-camera systems developed as part of the V-Charge project. This
project seeks to enable automated valet parking for self-driving cars. Our
pipeline is able to precisely calibrate multi-camera systems, build sparse 3D
maps for visual navigation, visually localize the car with respect to these
maps, generate accurate dense maps, as well as detect obstacles based on
real-time depth map extraction.
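A minimal sketch of the single-camera intrinsic step using OpenCV's equidistant fisheye model is shown below; the V-Charge pipeline goes further (multi-camera extrinsics, map building), and the board geometry, square size, and flags here are illustrative assumptions.

```python
# Hedged sketch: per-camera intrinsic calibration with OpenCV's equidistant
# fisheye model from checkerboard views. Only illustrates the single-fisheye
# intrinsic step, not the full multi-camera pipeline described above.
import cv2
import numpy as np

def calibrate_fisheye(images, board=(9, 6), square=0.04):
    """Estimate fisheye intrinsics K and distortion D from checkerboard images."""
    objp = np.zeros((1, board[0] * board[1], 3), np.float64)
    objp[0, :, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square
    obj_pts, img_pts, size = [], [], None
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        size = gray.shape[::-1]
        ok, corners = cv2.findChessboardCorners(gray, board)
        if ok:
            obj_pts.append(objp)
            img_pts.append(corners.reshape(1, -1, 2).astype(np.float64))
    K, D = np.zeros((3, 3)), np.zeros((4, 1))
    flags = (cv2.fisheye.CALIB_RECOMPUTE_EXTRINSIC
             | cv2.fisheye.CALIB_FIX_SKEW)
    rms, K, D, _, _ = cv2.fisheye.calibrate(obj_pts, img_pts, size, K, D,
                                            flags=flags)
    return K, D, rms
```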
Mapping and Semantic Perception for Service Robotics
In order to perform a task, robots need to be able to locate themselves in the environment. If a robot does not know where it is, it is impossible for it to move, reach its goal, and complete the task. Simultaneous Localization and Mapping, known as SLAM, is a problem extensively studied in the literature that enables robots to locate themselves in unknown environments. The goal of this thesis is to develop techniques that allow a service robot to understand the environment by incorporating semantic information. This information also improves the localization and navigation of robotic platforms. In addition, we demonstrate how a robot with limited capabilities can reliably and efficiently build the semantic maps needed to perform its everyday tasks.
The mapping system presented has the following features. On the map-building side, we propose externalizing expensive computations to a cloud server, and we propose methods to register relevant semantic information with respect to the estimated geometric maps. Regarding the reuse of the maps built, we propose a method that combines map building with robot navigation to better explore an environment and obtain a semantic map containing the objects relevant to a given mission.
First, we develop a semantic visual SLAM algorithm that merges the otherwise meaningless estimated map points with known objects. We use a monocular EKF (Extended Kalman Filter) SLAM system that has mainly been focused on producing geometric maps composed simply of points or edges, without any associated meaning or semantic content. The non-annotated map is built using only the information extracted from a monocular image sequence. The semantic or annotated parts of the map, the objects, are estimated using the information in the image sequence together with precomputed object models.
As a second step, we improve the EKF SLAM presented above by designing and implementing a distributed visual SLAM system. The expensive map optimization and storage are allocated as a service in the cloud, while a lightweight camera-tracking client runs on a local computer on board the robot. The robot's onboard computers are freed from most of the computation, the only extra requirement being an internet connection.
The next step is to exploit the semantic information we are able to generate to improve the navigation of the robot. The contribution here focuses on 3D sensing, which we use to design and implement a semantic mapping system.
We then design and implement a visual SLAM system able to perform robustly in populated environments, since service robots work in spaces shared with people. The system masks out the image regions occupied by people from the rigid SLAM pipeline, which boosts the robustness, relocation, accuracy, and reusability of the geometric map. In addition, it estimates the full trajectory of each detected person with respect to the global scene map, irrespective of the location of the moving camera when the person was imaged.
Finally, we focus our research on rescue and security applications. Deploying a multi-robot team in confined environments poses multiple challenges involving task planning, motion planning, localization and mapping, safe navigation, coordination, and communications among all the robots. The proposed architecture integrates, jointly with all the above-mentioned functionalities, several novel features to achieve real exploration: localization based on semantic-topological features, deployment planning in terms of the semantic features learned and recognized, and map building.
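To illustrate the people-masking idea from the populated-environment SLAM work above, here is a minimal sketch that keeps only keypoints outside detected person regions; the off-the-shelf Mask R-CNN, the score threshold, and the ORB front end are assumptions for illustration, not the thesis's exact components.

```python
# Hedged sketch: masking person regions out of the feature-extraction front
# end of a keypoint-based SLAM pipeline. Detector and threshold are
# illustrative stand-ins.
import cv2
import numpy as np
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()
PERSON = 1  # COCO class id for "person"

def person_mask(image_bgr, score_thr=0.7):
    """Binary mask (255 = person) predicted for one BGR image."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = model([tensor])[0]
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    for label, score, m in zip(out["labels"], out["scores"], out["masks"]):
        if label.item() == PERSON and score.item() > score_thr:
            mask |= (m[0].numpy() > 0.5).astype(np.uint8) * 255
    return mask

def static_keypoints(image_bgr):
    """ORB keypoints restricted to regions not occupied by people."""
    keep = cv2.bitwise_not(person_mask(image_bgr))  # nonzero = detect here
    orb = cv2.ORB_create(2000)
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return orb.detectAndCompute(gray, keep)
```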
Collaborative Dynamic 3D Scene Graphs for Automated Driving
Maps have played an indispensable role in enabling safe and automated
driving. Although there have been many advances on different fronts ranging
from SLAM to semantics, building an actionable hierarchical semantic
representation of urban dynamic scenes from multiple agents is still a
challenging problem. In this work, we present Collaborative URBan Scene Graphs
(CURB-SG) that enable higher-order reasoning and efficient querying for many
functions of automated driving. CURB-SG leverages panoptic LiDAR data from
multiple agents to build large-scale maps using an effective graph-based
collaborative SLAM approach that detects inter-agent loop closures. To
semantically decompose the obtained 3D map, we build a lane graph from the
paths of ego agents and their panoptic observations of other vehicles. Based on
the connectivity of the lane graph, we segregate the environment into
intersecting and non-intersecting road areas. Subsequently, we construct a
multi-layered scene graph that includes lane information, the position of
static landmarks and their assignment to certain map sections, other vehicles
observed by the ego agents, and the pose graph from SLAM including 3D panoptic
point clouds. We extensively evaluate CURB-SG in urban scenarios using a
photorealistic simulator. We release our code at
http://curb.cs.uni-freiburg.de.
Comment: Refined manuscript and extended supplementary material
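A toy sketch of a multi-layered urban scene graph in the spirit of CURB-SG, using networkx; the layer names, node identifiers, and attributes are illustrative assumptions rather than the paper's actual schema.

```python
# Hedged sketch: a minimal multi-layered urban scene graph. Layers are
# modeled as node attributes on one directed graph; edge kinds are
# illustrative.
import networkx as nx

G = nx.DiGraph()

# Pose-graph layer: agent keyframes linked by odometry / loop closures.
G.add_node("kf/agent0/42", layer="pose", pose=(12.3, 4.5, 0.1))
G.add_node("kf/agent1/17", layer="pose", pose=(12.9, 4.1, 0.2))
G.add_edge("kf/agent0/42", "kf/agent1/17", kind="loop_closure")

# Lane-graph layer: lane segments from ego paths and observed vehicles.
G.add_node("lane/7", layer="lane", centerline=[(10, 4), (20, 4)])
G.add_node("intersection/2", layer="lane", kind="intersecting_area")
G.add_edge("lane/7", "intersection/2", kind="connects")

# Landmark layer: static landmarks assigned to map sections.
G.add_node("landmark/pole/3", layer="landmark", position=(15.0, 6.2))
G.add_edge("landmark/pole/3", "lane/7", kind="assigned_to")

# Higher-order query, e.g. all landmarks attached to a given lane:
poles = [n for n, _ in G.in_edges("lane/7")
         if G.nodes[n]["layer"] == "landmark"]
```

Keeping all layers in one graph makes cross-layer queries (e.g. landmarks per lane section) a single hop, which is the kind of efficient querying the abstract emphasizes.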
Object-Aware Tracking and Mapping
Reasoning about geometric properties of digital cameras and optical physics enabled
researchers to build methods that localise cameras in 3D space from a video
stream, while – often simultaneously – constructing a model of the environment.
Related techniques have evolved substantially since the 1980s, leading to increasingly
accurate estimations. Traditionally, however, the quality of results is strongly
affected by the presence of moving objects, incomplete data, or difficult surfaces
– i.e. surfaces that are not Lambertian or lack texture. One insight of this work is
that these problems can be addressed by going beyond geometrical and optical constraints,
in favour of object-level and semantic constraints. Incorporating specific
types of prior knowledge in the inference process, such as motion or shape priors,
leads to approaches with distinct advantages and disadvantages.
After introducing relevant concepts in Chapter 1 and Chapter 2, methods for building
object-centric maps in dynamic environments using motion priors are investigated
in Chapter 5. Chapter 6 addresses the same problem as Chapter 5, but presents
an approach which relies on semantic priors rather than motion cues. To fully exploit
semantic information, Chapter 7 discusses the conditioning of shape representations
on prior knowledge and the practical application to monocular, object-aware
reconstruction systems.
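To make the shape-prior idea concrete, the following minimal sketch optimises a latent shape code of a DeepSDF-style decoder against depth-derived surface samples; the untrained decoder, the random stand-in samples, and the loss weights are assumptions for illustration, not the thesis's actual representation.

```python
# Hedged sketch: conditioning reconstruction on a learned shape prior by
# optimising a latent code so surface samples fall on the zero level set.
# A real system would first train the decoder on a shape collection.
import torch
import torch.nn as nn

class SDFDecoder(nn.Module):
    """Maps (latent code, 3D point) -> signed distance."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, z, pts):
        return self.net(torch.cat([z.expand(pts.shape[0], -1), pts], dim=1))

decoder = SDFDecoder()
surface_pts = torch.randn(512, 3)            # stand-in for back-projected depth
z = torch.zeros(1, 64, requires_grad=True)   # shape code to infer
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    sdf = decoder(z, surface_pts)
    # Surface samples should lie on the zero level set; the L2 term keeps
    # the code near the prior distribution.
    loss = sdf.abs().mean() + 1e-3 * z.pow(2).sum()
    loss.backward()
    opt.step()
```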
Inertial-aided Visual Perception of Geometry and Semantics
We describe components of a visual perception system that understands the geometry and semantics of a three-dimensional scene using monocular cameras and inertial measurement units (IMUs). The use of the two sensor modalities is motivated by the wide availability of camera-IMU sensor packages in mobile devices from phones to cars, and by their complementary sensing capabilities: IMUs accurately track the motion of the sensor platform over short periods of time and provide a scaled, gravity-aligned global reference frame, while cameras capture rich photometric signatures of the scene and provide relative motion constraints between images up to scale.
We first show that visual 3D reconstruction can be improved by leveraging the global orientation frame, easily inferred from inertials. In the gravity-aligned global orientation frame, a shape prior can be imposed in depth prediction from a single image, where the normal vectors to surfaces of objects of certain classes tend to align with gravity or lie orthogonal to it. Adding such a prior to baseline methods for monocular depth prediction yields improvements beyond the state of the art and illustrates the power of utilizing inertials in 3D reconstruction.
The global reference provided by inertials is not only gravity-aligned but also scaled, which we exploit in depth completion: we describe a method to infer dense metric depth from camera motion and sparse depth as estimated by a visual-inertial odometry system. Unlike scenarios using point clouds from lidar or structured-light sensors, we have only a few hundred to a few thousand points, insufficient to inform the topology of the scene. Our method first constructs a piecewise planar scaffolding of the scene and then uses it to infer dense depth from the image together with the sparse points. We use a predictive cross-modal criterion, akin to "self-supervision," measuring photometric consistency across time, forward-backward pose consistency, and geometric compatibility with the sparse point cloud. We also launch the first visual-inertial + depth dataset (dubbed "VOID"), which we hope will foster further exploration into combining the complementary strengths of visual and inertial sensors. To compare our method with prior work, we adopt the unsupervised KITTI depth completion benchmark and show state-of-the-art performance on it.
In addition to dense geometry, the camera-IMU sensor package can also be used to recover the semantics of the scene. We present two methods to augment a point-cloud map with class-labeled objects represented either as scaled, oriented bounding boxes or as CAD models. The tradeoff between the two shape representations lies in their generality and their capability to model detailed structures: while more generic, 3D bounding boxes fail to model the details of objects, whereas CAD models preserve the finest shape details but require more computation and are limited to previously seen objects. Both methods populate an unknown environment with 3D objects placed in a Euclidean reference frame inferred causally and online from monocular video and inertial sensors, and both include bottom-up and top-down components, whereby deep networks trained for detection provide likelihood scores for object hypotheses maintained by a nonlinear filter, whose state serves as memory. We test our methods on the KITTI and SceneNN datasets, and also introduce the VISMA dataset, which contains ground-truth pose, a point-cloud map, and object models, along with time-stamped inertial measurements.
To reduce the drift of the visual-inertial SLAM system -- a building block of all the visual perception systems we have built -- we introduce an efficient loop closure detection approach based on hierarchical pooling of image descriptors. We also open-source a full-fledged SLAM system with mapping and loop closure capabilities. The code is publicly available at https://github.com/ucla-vision/xivo.
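As an illustration of the gravity-aligned shape prior described above, this sketch penalises predicted-depth normals that are neither parallel nor orthogonal to the IMU-provided gravity direction. The finite-difference normal approximation, the penalty form, and the camera-frame gravity axis are assumptions, not the dissertation's exact formulation.

```python
# Hedged sketch: a gravity-aligned surface-normal prior for monocular depth
# prediction. Normals come from finite differences of the depth map.
import torch
import torch.nn.functional as F

def normals_from_depth(depth, fx, fy):
    """Per-pixel unit normals from a (1,1,H,W) depth map via finite differences."""
    dz_dx = (depth[..., :, 2:] - depth[..., :, :-2]) / 2.0
    dz_dy = (depth[..., 2:, :] - depth[..., :-2, :]) / 2.0
    dz_dx = F.pad(dz_dx, (1, 1, 0, 0))
    dz_dy = F.pad(dz_dy, (0, 0, 1, 1))
    n = torch.cat([-dz_dx * fx, -dz_dy * fy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def gravity_prior_loss(depth, gravity, fx, fy):
    """Encourage normals to be parallel or orthogonal to gravity."""
    n = normals_from_depth(depth, fx, fy)
    cos = (n * gravity.view(1, 3, 1, 1)).sum(dim=1).abs()  # |cos(angle)|
    # Zero when aligned (cos ~ 1) or orthogonal (cos ~ 0), maximal in between:
    return (cos * (1.0 - cos)).mean()

# Usage: gravity from the IMU, expressed in the camera frame (assumed -y).
depth = torch.rand(1, 1, 64, 80) + 1.0
g_cam = torch.tensor([0.0, -1.0, 0.0])
loss = gravity_prior_loss(depth, g_cam, fx=300.0, fy=300.0)
```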
Mapping and Localization in Urban Environments Using Cameras
In this work, we present a system to fully automatically create a highly accurate visual feature map from image data acquired from within a moving vehicle. Moreover, a system for high-precision self-localization is presented. Furthermore, we present a method to automatically learn a visual descriptor. The map-relative self-localization is centimeter-accurate and enables autonomous driving.
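A minimal sketch of map-relative localization against such a visual feature map: match query-image descriptors to mapped 3D points and solve PnP with RANSAC. The brute-force L2 matcher is an illustrative stand-in; the system above learns its own descriptor.

```python
# Hedged sketch: map-relative pose estimation via descriptor matching + PnP.
import cv2
import numpy as np

def localize(query_desc, query_kpts, map_desc, map_pts3d, K):
    """query_desc: NxD, query_kpts: Nx2, map_desc: MxD, map_pts3d: Mx3."""
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(query_desc, map_desc)
    if len(matches) < 6:
        return None
    img_pts = np.float32([query_kpts[m.queryIdx] for m in matches])
    obj_pts = np.float32([map_pts3d[m.trainIdx] for m in matches])
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts, img_pts, K, None, reprojectionError=3.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec  # camera pose: x_cam = R @ x_map + tvec
```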
End to End Learning in Autonomous Driving Systems
Convolutional neural networks have advanced visual perception significantly in recent years. Two major ingredients that enable this success are the composition of simple modules into a complex network and end-to-end optimization. However, this success has not yet revolutionized robotics as much as vision, even though robotics suffers from similar problems to traditional computer vision, i.e., the imperfection of manually designed pipelines. This thesis investigates end-to-end learning for autonomous driving, a concrete robotic application. End-to-end learning can produce reasonable driving behaviors, even in complex urban driving scenarios. Representation learning in end-to-end driving models is crucial, and auxiliary vision tasks such as semantic segmentation can help form a more informative driving representation, especially when training data is limited. Naive convolutional neural networks are usually only capable of reactive control and cannot perform complex reasoning about a particular scenario. This thesis also studies how to handle scene-conditioned driving behavior, which goes beyond the capability of reactive control. Alongside the end-to-end structure, learning methods also play a critical role. Imitation learning methods acquire meaningful behaviors, but the robot usually cannot master the skill. Reinforcement learning, on the contrary, either barely learns anything if the environment is too complex, or masters the skill otherwise. To get the best of both worlds, this thesis proposes an algorithmically unified method to learn from both demonstration data and the environment.
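A minimal sketch of the kind of architecture discussed above: a command-conditioned driving network with an auxiliary semantic-segmentation head shaping the representation. Layer sizes, the branch layout, and the loss weighting are illustrative assumptions, not the thesis's exact model.

```python
# Hedged sketch: end-to-end driving net with an auxiliary segmentation head.
import torch
import torch.nn as nn

class DrivingNet(nn.Module):
    def __init__(self, n_commands=4, n_classes=19):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        # Auxiliary head: per-pixel semantic classes shape the representation.
        self.seg_head = nn.Conv2d(128, n_classes, 1)
        # One control branch per high-level command (e.g. turn left / right).
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
            for _ in range(n_commands))  # outputs: (steer, throttle)

    def forward(self, image, command):
        feat = self.encoder(image)
        seg = self.seg_head(feat)                       # (B, C, H/8, W/8)
        emb = self.pool(feat).flatten(1)                # (B, 128)
        ctrl = torch.stack([b(emb) for b in self.branches], dim=1)
        return ctrl[torch.arange(image.shape[0]), command], seg

# Training would combine an imitation loss on the controls with the auxiliary
# loss, e.g. total = l1(ctrl, expert) + 0.1 * cross_entropy(seg, labels).
```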
Localization in urban environments. A hybrid interval-probabilistic method
Ensuring safety has become a paramount concern with the increasing autonomy of vehicles and the advent of autonomous driving. One of the most fundamental tasks of increased autonomy is localization, which is essential for safe operation. To quantify safety requirements, the concept of integrity has been introduced in aviation, based on the ability of a system to provide timely and correct alerts when its safe operation can no longer be guaranteed. Therefore, it is necessary to assess the localization uncertainty to determine the system's operability.
In the literature, probability and set-membership theory are two predominant approaches that provide mathematical tools to assess uncertainty. Probabilistic approaches often provide accurate point-valued results but tend to underestimate the uncertainty. Set-membership approaches reliably estimate the uncertainty but can be overly pessimistic, producing inappropriately large uncertainties and no point-valued results. While underestimating the uncertainty can lead to misleading information and dangerous system failure without warnings, overly pessimistic uncertainty estimates render the system inoperative for practical purposes as warnings are fired more often.
This doctoral thesis aims to study the symbiotic relationship between set-membership-based and probabilistic localization approaches and combine them into a unified hybrid localization approach. This approach enables safe operation while not being overly pessimistic regarding the uncertainty estimation. In the scope of this work, a novel Hybrid Probabilistic- and Set-Membership-based Coarse and Refined (HyPaSCoRe) Localization method is introduced. This method localizes a robot in a building map in real-time and considers two types of hybridizations. On the one hand, set-membership approaches are used to robustify and control probabilistic approaches. On the other hand, probabilistic approaches are used to reduce the pessimism of set-membership approaches by augmenting them with further probabilistic constraints.
The method consists of three modules: visual odometry, coarse localization, and refined localization. The HyPaSCoRe Localization method uses a stereo camera system, a LiDAR sensor, and GNSS data, focusing on localization in urban canyons where GNSS data can be inaccurate. The visual odometry module computes the relative motion of the vehicle, while the coarse localization module uses set-membership approaches to narrow down the feasible set of poses and provides the set of most likely poses inside the feasible set using a probabilistic approach. The refined localization module then refines the coarse result, reducing the pessimism of the uncertainty estimate by incorporating probabilistic constraints into the set-membership approach.
The experimental evaluation of the HyPaSCoRe shows that it maintains the integrity of the uncertainty estimation while providing accurate, most likely point-valued solutions in real-time. Introducing this new hybrid localization approach contributes to the development of safe and reliable algorithms in the context of autonomous driving.
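A toy 1D sketch of the hybridisation idea: each sensor contributes both a guaranteed error bound (set-membership model) and a standard deviation (probabilistic model); intervals are intersected into a feasible set, inverse-variance fusion gives the most likely point, and the point estimate is constrained to the feasible set. The measurement model and numbers are illustrative, far simpler than HyPaSCoRe's pose-level machinery.

```python
# Hedged sketch: set-membership interval intersection controlling a
# probabilistic point estimate, in the spirit of the two hybridisations.
from dataclasses import dataclass

@dataclass
class Measurement:
    value: float
    bound: float  # guaranteed |error| <= bound  (set-membership model)
    sigma: float  # standard deviation           (probabilistic model)

def hybrid_fuse(meas):
    # Set-membership: intersect the guaranteed intervals.
    lo = max(m.value - m.bound for m in meas)
    hi = min(m.value + m.bound for m in meas)
    if lo > hi:
        raise ValueError("empty feasible set: a bound assumption is violated")
    # Probabilistic: inverse-variance weighted mean.
    w = [1.0 / m.sigma**2 for m in meas]
    mean = sum(wi * m.value for wi, m in zip(w, meas)) / sum(w)
    # Hybrid: keep the most likely value, but never leave the feasible set.
    return min(max(mean, lo), hi), (lo, hi)

est, feasible = hybrid_fuse([Measurement(10.2, 0.5, 0.3),
                             Measurement(10.6, 0.4, 0.2)])
```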