
    Semantics and Planar Geometry for self-supervised Road Scene Understanding

    In this thesis we leverage domain knowledge, specifically of road scenes, to provide a self-supervision signal, reduce labelling requirements, improve training convergence, and introduce interpretable parameters based on vastly simplified models. Specifically, we investigate the value of applying domain knowledge to the popular tasks of semantic segmentation and relative pose estimation towards better understanding road scenes. We leverage semantic and geometric scene understanding separately in the first two contributions, and combine them in the third. Firstly, we show that hierarchical structure in class labels can boost performance and accelerate training of networks for tasks such as semantic segmentation. We present a hierarchical loss implementation which differentiates between minor and serious errors, and evaluate our method on the Vistas road scene dataset. Secondly, for self-supervised monocular relative pose estimation, we propose a ground-relative formulation for the network output which roots the problem in a locally planar geometry. Current self-supervised methods generally require over-parameterised training of both a pose and a depth network; our method entirely removes the need for depth estimation, dramatically simplifying the problem, while obtaining competitive results on the KITTI visual odometry dataset. Thirdly, we combine semantics with our geometric formulation by extracting the road plane with semantic segmentation and robustly fitting homographies to fine-scale correspondences between coarsely aligned image pairs. With the aid of our geometric knowledge and a known analytical method, we decompose these homographies into camera-relative pose, providing a self-supervision signal that significantly improves our visual odometry performance at both training and test time. In particular, we form a non-differentiable module which computes pseudo-labels in real time, avoiding additional training complexity and allowing for test-time performance boosting, which helps tackle the bias present in deep learning methods.
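
    As a rough illustration of the homography-to-pose step described above (a sketch, not the thesis code), the following uses OpenCV to robustly fit a homography to road-plane correspondences and decompose it analytically into candidate camera-relative poses; the function name, the road-mask point selection, and the normal-based disambiguation are illustrative assumptions.

```python
import cv2
import numpy as np

def pose_from_road_homography(pts_prev, pts_curr, K):
    """Fit a homography to road-plane correspondences between two
    frames and decompose it into a candidate relative camera pose.

    pts_prev, pts_curr: (N, 2) float32 arrays of matched pixel
    coordinates assumed to lie on the road plane (e.g. selected
    with a semantic road mask); K: (3, 3) camera intrinsics.
    """
    # Robustly fit the plane-induced homography with RANSAC.
    H, inliers = cv2.findHomography(pts_prev, pts_curr,
                                    cv2.RANSAC, ransacReprojThreshold=1.0)
    # Analytical decomposition yields up to four (R, t, n) solutions.
    n_sols, Rs, ts, normals = cv2.decomposeHomographyMat(H, K)
    # Simple disambiguation heuristic: keep the solution whose plane
    # normal best matches the expected road normal, which points
    # "up" (negative y) in a standard camera frame.
    up = np.array([0.0, -1.0, 0.0])
    best = max(range(n_sols), key=lambda i: float(normals[i].ravel() @ up))
    return Rs[best], ts[best], inliers
```

    Note that the translation recovered from a homography is defined only up to scale; a known camera height above the road plane is one common way to fix the metric scale in road-scene settings.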

    Learning, Moving, And Predicting With Global Motion Representations

    In order to effectively respond to and influence the world they inhabit, animals and other intelligent agents must understand and predict the state of the world and its dynamics. An agent that can characterize how the world moves is better equipped to engage with it. Current methods of motion computation rely on local representations of motion (such as optical flow) or simple, rigid global representations (such as camera motion). These representations are useful, but they are difficult to estimate reliably and limited in their applicability to real-world settings, where agents frequently must reason about complex, highly nonrigid motion over long time horizons. In this dissertation, I present methods developed with the goal of building the more flexible and powerful notions of motion needed by agents facing the challenges of a dynamic, nonrigid world. This work is organized around a view of motion as a global phenomenon that is not adequately addressed by local or low-level descriptions, but that is best understood when analyzed at the level of whole images and scenes. I develop methods to: (i) robustly estimate camera motion from noisy optical flow estimates by exploiting the global, statistical relationship between the optical flow field and camera motion under projective geometry; (ii) learn representations of visual motion directly from unlabeled image sequences using learning rules derived from a formulation of image transformation in terms of its group properties; (iii) predict future frames of a video by learning a joint representation of the instantaneous state of the visual world and its motion, using a view of motion as transformations of world state. I situate this work in the broader context of ongoing computational and biological investigations into the problem of estimating motion for intelligent perception and action.
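
    As a hedged sketch of the flavour of problem (i), the snippet below estimates camera motion from a noisy dense flow field by treating subsampled flow vectors as point correspondences and letting RANSAC-based essential-matrix estimation reject outliers; this is a standard geometric baseline, not the dissertation's statistical method, and the function name is illustrative.

```python
import cv2
import numpy as np

def camera_motion_from_flow(flow, K, step=8):
    """Estimate relative camera rotation R and unit-scale translation
    t from a dense optical flow field.

    flow: (H, W, 2) array giving each pixel's displacement from
    frame 1 to frame 2; K: (3, 3) camera intrinsics.
    """
    h, w = flow.shape[:2]
    # Subsample the field into sparse point correspondences.
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    p1 = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    p2 = p1 + flow[ys.ravel(), xs.ravel()].astype(np.float32)
    # RANSAC rejects flow vectors inconsistent with a single rigid
    # camera motion (independently moving objects, bad estimates).
    E, mask = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # The cheirality check selects the (R, t) that places points in
    # front of both cameras; t is recovered only up to scale.
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K, mask=mask)
    return R, t
```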

    Bridging the gap between reconstruction and synthesis

    3D reconstruction and image synthesis are two of the main pillars in computer vision. Early works focused on simple tasks such as multi-view reconstruction and texture synthesis. Spurred by Deep Learning, the field has progressed rapidly, making it possible to tackle more complex and higher-level tasks. For example, 3D reconstructions that traditionally required multi-view approaches can now be obtained with single-view methods. Similarly, early pattern-based texture synthesis works have led to techniques that can generate novel high-resolution images. In this thesis we have developed a hierarchy of tools that covers this whole range of problems, lying at the intersection of computer vision, graphics, and machine learning. We tackle the problem of 3D reconstruction and synthesis in the wild. Importantly, we advocate a paradigm in which not everything should be learned. Instead of applying Deep Learning naively, we propose novel representations, layers, and architectures that directly embed prior 3D geometric knowledge for the tasks of 3D reconstruction and synthesis. We apply these techniques to problems including scene/person reconstruction and photo-realistic rendering. We first address methods to reconstruct a scene and the clothed people in it while estimating the camera position. Then, we tackle image and video synthesis for clothed people in the wild. Finally, we bridge the gap between reconstruction and synthesis under the umbrella of a unique novel formulation. Extensive experiments conducted throughout this thesis show that the proposed techniques improve the performance of Deep Learning models in terms of the quality of the reconstructed 3D shapes and synthesised images, while reducing the amount of supervision and training data required to train them. In summary, we provide a variety of low-, mid-, and high-level algorithms that can be used to incorporate prior knowledge into different stages of the Deep Learning pipeline and improve performance in tasks of 3D reconstruction and image synthesis.
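
    To make the "not everything should be learned" idea concrete, here is a minimal sketch (an assumption of ours, not the thesis's actual layer) of the kind of fixed geometric module the abstract refers to: a differentiable pinhole projection with known intrinsics, so the network predicts 3D structure while the camera model itself is never learned.

```python
import torch

class PinholeProjection(torch.nn.Module):
    """Fixed (non-learned) layer that projects 3D camera-frame points
    to pixel coordinates using known intrinsics.

    Embedding this known geometry keeps gradients flowing to the 3D
    prediction while removing the camera model from what the network
    must learn.
    """
    def __init__(self, fx, fy, cx, cy):
        super().__init__()
        K = torch.tensor([[fx, 0.0, cx],
                          [0.0, fy, cy],
                          [0.0, 0.0, 1.0]])
        self.register_buffer("K", K)  # constant buffer, not a parameter

    def forward(self, points):               # points: (..., 3), z > 0
        uvw = points @ self.K.T              # apply intrinsics
        return uvw[..., :2] / uvw[..., 2:3]  # perspective divide
```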

    Robust and Efficient Camera-based Scene Reconstruction

    For the simultaneous reconstruction of 3D scene geometry and camera poses from images or videos, there are two major approaches. On the one hand, a sparse reconstruction can be performed by extracting recognizable features that correspond to the same 3D scene points from multiple images; from these features, the positions of the 3D points as well as the camera poses can be estimated such that they best explain the observed feature positions in the images. On the other hand, on video data, a dense reconstruction can be obtained by alternating, for each frame of the video, between tracking the camera pose and updating a depth map that represents the scene. In this dissertation, we introduce several improvements to both reconstruction strategies. We start by improving the reliability of image feature matches, which leads to faster and more robust subsequent processing. We then present a sparse reconstruction pipeline completely optimized for high-resolution, high-frame-rate video, exploiting the redundancy in the data for greater efficiency. For (semi-)dense reconstruction on camera rigs, which is prone to calibration inaccuracies, we show how to model the rig calibration and recover it online during the reconstruction process. Finally, we explore the applicability of machine learning based on neural networks to the relative camera pose problem, focusing mainly on generating optimal training data. Robust and fast 3D reconstruction of the environment is in demand in several currently emerging applications, ranging from set scanning for movies and computer games, through augmented reality devices based on inside-out tracking, to autonomous robots and drones as well as self-driving cars.
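
    As a hedged illustration of what "more reliable feature matches" can mean in practice, the sketch below combines Lowe's ratio test with a mutual-consistency check on SIFT descriptors; this is a standard recipe rather than necessarily the dissertation's specific method, and the function name is illustrative.

```python
import cv2

def reliable_matches(img1, img2, ratio=0.75):
    """Match SIFT features with a ratio test plus a mutual check.

    Both filters discard ambiguous correspondences before they can
    corrupt downstream pose estimation or triangulation.
    """
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    bf = cv2.BFMatcher(cv2.NORM_L2)

    def ratio_filter(knn):
        # Keep a match only if clearly better than its runner-up.
        return {m.queryIdx: m for m, n in knn
                if m.distance < ratio * n.distance}

    fwd = ratio_filter(bf.knnMatch(des1, des2, k=2))
    bwd = ratio_filter(bf.knnMatch(des2, des1, k=2))
    # Mutual check: keep forward matches whose partner maps back.
    return [m for m in fwd.values()
            if bwd.get(m.trainIdx) is not None
            and bwd[m.trainIdx].trainIdx == m.queryIdx]
```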

    Contributions to shared control and coordination of single and multiple robots

    The work presented in this habilitation concerns the interfacing of a human operator with one or several semi-autonomous robots, also known as the "shared control" problem. The first chapter deals with providing visual/vestibular cues to a human operator for the remote control of mobile robots. The second chapter addresses the more classical problem of providing the operator with visual cues or haptic feedback for the control of one or several mobile robots (in particular quadrotor UAVs). The third chapter focuses on some of the algorithmic challenges encountered when developing multi-robot coordination techniques. The fourth chapter introduces a novel mechanical design for an over-actuated quadrotor UAV, with the long-term goal of obtaining six degrees of freedom on an otherwise classical (but under-actuated) quadrotor platform. Finally, the fifth chapter presents a general framework for active vision in which camera motion is optimized so as to optimize, online, the performance (in terms of convergence speed and final accuracy) of vision-based estimation processes.

    Visual motion estimation and tracking of rigid bodies by physical simulation

    This thesis applies knowledge of the physical dynamics of objects to estimating object motion from vision when estimation from vision alone fails. It differentiates itself from existing physics-based vision by building in robustness to situations where existing visual estimation tends to fail: fast motion, blur, glare, distractors, and partial or full occlusion. A real-time physics simulator is incorporated into a stochastic framework by adding several different models of how noise is injected into the dynamics. Several algorithms are proposed and experimentally validated on two problems: motion estimation and object tracking. The performance of visual motion estimation from colour histograms of a ball moving in two dimensions is improved considerably when a physics simulator is integrated into a MAP procedure involving non-linear optimisation and RANSAC-like methods. Process noise or initial-condition noise, in conjunction with physics-based dynamics, results in improved robustness on hard visual problems. A particle filter applied to full 6D visual tracking of the pose of an object being pushed by a robot in a table-top environment is improved on difficult visual problems by incorporating a simulator as the dynamics model and injecting noise as forces into the simulator.
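
    The following is a minimal skeleton of that last idea, assuming hypothetical `simulate_step` and `observation_likelihood` functions (a rigid-body simulator step and an image-based likelihood, e.g. from colour-histogram comparison): one cycle of a particle filter whose motion model is a physics simulator, with process noise injected as random forces.

```python
import numpy as np

def physics_particle_filter_step(particles, weights, observation,
                                 simulate_step, observation_likelihood,
                                 force_std=0.5, rng=np.random):
    """One predict/update/resample cycle where the motion model is a
    physics simulator and noise enters as random perturbing forces.

    particles: (N, D) array of object states (pose, velocity, ...).
    simulate_step(state, force) -> next state (hypothetical simulator).
    observation_likelihood(state, observation) -> scalar likelihood
    of the current image given a state (hypothetical).
    """
    n = len(particles)
    # Predict: physically plausible propagation, randomised by
    # sampling a 3D force perturbation per particle.
    forces = rng.normal(0.0, force_std, size=(n, 3))
    particles = np.array([simulate_step(s, f)
                          for s, f in zip(particles, forces)])
    # Update: reweight particles by how well they explain the image.
    weights = weights * np.array([observation_likelihood(s, observation)
                                  for s in particles])
    weights = weights / weights.sum()
    # Resample to concentrate particles on likely states.
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)
```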

    Aerial Vehicles

    This book contains 35 chapters written by experts in developing techniques for making aerial vehicles more intelligent, more reliable, more flexible in use, and safer in operation. It will also serve as an inspiration for further improvement of the design and application of aerial vehicles. The advanced techniques and research described here may also be applicable to other high-tech areas such as robotics, avionics, vetronics, and space.