1,213 research outputs found

    Blending Learning and Inference in Structured Prediction

    Full text link
    In this paper we derive an efficient algorithm to learn the parameters of structured predictors in general graphical models. This algorithm blends the learning and inference tasks, which results in a significant speedup over traditional approaches, such as conditional random fields and structured support vector machines. For this purpose we utilize the structures of the predictors to describe a low dimensional structured prediction task which encourages local consistencies within the different structures while learning the parameters of the model. Convexity of the learning task provides the means to enforce the consistencies between the different parts. The inference-learning blending algorithm that we propose is guaranteed to converge to the optimum of the low dimensional primal and dual programs. Unlike many of the existing approaches, the inference-learning blending allows us to learn efficiently high-order graphical models, over regions of any size, and very large number of parameters. We demonstrate the effectiveness of our approach, while presenting state-of-the-art results in stereo estimation, semantic segmentation, shape reconstruction, and indoor scene understanding

    Combining Features and Semantics for Low-level Computer Vision

    Get PDF
    Visual perception of depth and motion plays a significant role in understanding and navigating the environment. Reconstructing outdoor scenes in 3D and estimating the motion from video cameras are of utmost importance for applications like autonomous driving. The corresponding problems in computer vision have witnessed tremendous progress over the last decades, yet some aspects still remain challenging today. Striking examples are reflecting and textureless surfaces or large motions which cannot be easily recovered using traditional local methods. Further challenges include occlusions, large distortions and difficult lighting conditions. In this thesis, we propose to overcome these challenges by modeling non-local interactions leveraging semantics and contextual information. Firstly, for binocular stereo estimation, we propose to regularize over larger areas on the image using object-category specific disparity proposals which we sample using inverse graphics techniques based on a sparse disparity estimate and a semantic segmentation of the image. The disparity proposals encode the fact that objects of certain categories are not arbitrarily shaped but typically exhibit regular structures. We integrate them as non-local regularizer for the challenging object class 'car' into a superpixel-based graphical model and demonstrate its benefits especially in reflective regions. Secondly, for 3D reconstruction, we leverage the fact that the larger the reconstructed area, the more likely objects of similar type and shape will occur in the scene. This is particularly true for outdoor scenes where buildings and vehicles often suffer from missing texture or reflections, but share similarity in 3D shape. We take advantage of this shape similarity by localizing objects using detectors and jointly reconstructing them while learning a volumetric model of their shape. This allows to reduce noise while completing missing surfaces as objects of similar shape benefit from all observations for the respective category. Evaluations with respect to LIDAR ground-truth on a novel challenging suburban dataset show the advantages of modeling structural dependencies between objects. Finally, motivated by the success of deep learning techniques in matching problems, we present a method for learning context-aware features for solving optical flow using discrete optimization. Towards this goal, we present an efficient way of training a context network with a large receptive field size on top of a local network using dilated convolutions on patches. We perform feature matching by comparing each pixel in the reference image to every pixel in the target image, utilizing fast GPU matrix multiplication. The matching cost volume from the network's output forms the data term for discrete MAP inference in a pairwise Markov random field. Extensive evaluations reveal the importance of context for feature matching.Die visuelle Wahrnehmung von Tiefe und Bewegung spielt eine wichtige Rolle bei dem Verständnis und der Navigation in unserer Umwelt. Die 3D Rekonstruktion von Szenen im Freien und die Schätzung der Bewegung von Videokameras sind von größter Bedeutung für Anwendungen, wie das autonome Fahren. Die Erforschung der entsprechenden Probleme des maschinellen Sehens hat in den letzten Jahrzehnten enorme Fortschritte gemacht, jedoch bleiben einige Aspekte heute noch ungelöst. Beispiele hierfür sind reflektierende und texturlose Oberflächen oder große Bewegungen, bei denen herkömmliche lokale Methoden häufig scheitern. Weitere Herausforderungen sind niedrige Bildraten, Verdeckungen, große Verzerrungen und schwierige Lichtverhältnisse. In dieser Arbeit schlagen wir vor nicht-lokale Interaktionen zu modellieren, die semantische und kontextbezogene Informationen nutzen, um diese Herausforderungen zu meistern. Für die binokulare Stereo Schätzung schlagen wir zuallererst vor zusammenhängende Bereiche mit objektklassen-spezifischen Disparitäts Vorschlägen zu regularisieren, die wir mit inversen Grafik Techniken auf der Grundlage einer spärlichen Disparitätsschätzung und semantischen Segmentierung des Bildes erhalten. Die Disparitäts Vorschläge kodieren die Tatsache, dass die Gegenstände bestimmter Kategorien nicht willkürlich geformt sind, sondern typischerweise regelmäßige Strukturen aufweisen. Wir integrieren sie für die komplexe Objektklasse 'Auto' in Form eines nicht-lokalen Regularisierungsterm in ein Superpixel-basiertes grafisches Modell und zeigen die Vorteile vor allem in reflektierenden Bereichen. Zweitens nutzen wir für die 3D-Rekonstruktion die Tatsache, dass mit der Größe der rekonstruierten Fläche auch die Wahrscheinlichkeit steigt, Objekte von ähnlicher Art und Form in der Szene zu enthalten. Dies gilt besonders für Szenen im Freien, in denen Gebäude und Fahrzeuge oft vorkommen, die unter fehlender Textur oder Reflexionen leiden aber ähnlichkeit in der Form aufweisen. Wir nutzen diese ähnlichkeiten zur Lokalisierung von Objekten mit Detektoren und zur gemeinsamen Rekonstruktion indem ein volumetrisches Modell ihrer Form erlernt wird. Dies ermöglicht auftretendes Rauschen zu reduzieren, während fehlende Flächen vervollständigt werden, da Objekte ähnlicher Form von allen Beobachtungen der jeweiligen Kategorie profitieren. Die Evaluierung auf einem neuen, herausfordernden vorstädtischen Datensatz in Anbetracht von LIDAR-Entfernungsdaten zeigt die Vorteile der Modellierung von strukturellen Abhängigkeiten zwischen Objekten. Zuletzt, motiviert durch den Erfolg von Deep Learning Techniken bei der Mustererkennung, präsentieren wir eine Methode zum Erlernen von kontextbezogenen Merkmalen zur Lösung des optischen Flusses mittels diskreter Optimierung. Dazu stellen wir eine effiziente Methode vor um zusätzlich zu einem Lokalen Netzwerk ein Kontext-Netzwerk zu erlernen, das mit Hilfe von erweiterter Faltung auf Patches ein großes rezeptives Feld besitzt. Für das Feature Matching vergleichen wir mit schnellen GPU-Matrixmultiplikation jedes Pixel im Referenzbild mit jedem Pixel im Zielbild. Das aus dem Netzwerk resultierende Matching Kostenvolumen bildet den Datenterm für eine diskrete MAP Inferenz in einem paarweisen Markov Random Field. Eine umfangreiche Evaluierung zeigt die Relevanz des Kontextes für das Feature Matching

    Deep learning based RGB-D vision tasks

    Get PDF
    Depth is an important source of information in computer vision. However, depth is usually discarded in most vision tasks. In this thesis, we study the tasks of estimating depth from single monocular images, and incorporating depth for object detection and semantic segmentation. Recently, a significant number of breakthroughs have been introduced to the vision community by deep convolutional neural networks (CNNs). All of our algorithms in this thesis are built upon deep CNNs. The first part of this thesis addresses the task of incorporating depth for object detection and semantic segmentation. The aim is to improve the performance of vision tasks that are only based on RGB data. Two approaches for object detection and two approaches for semantic segmentation are presented. These approaches are based on existing depth estimation, object detection and semantic segmentation algorithms. The second part of this thesis addresses the task of depth estimation. Depth estimation is often formulated as a regression task due to the continuous property of depths. Deep CNNs for depth estimation are trained by iteratively minimizing regression errors between predicted and ground-truth depths. A drawback of regression is that it predicts depths without confidence. In this thesis, we propose to formulate depth estimation as a classification task which naturally predicts depths with confidence. The confidence can be used during training and post-processing. We also propose to exploit ordinal depth relationships from stereo videos to improve the performance of metric depth estimation. By doing so we propose a Relative Depth in Stereo (RDIS) dataset that is densely annotated with relative depths.Thesis (Ph.D.) -- University of Adelaide,School of Computer Science , 201

    Layered Interpretation of Street View Images

    Full text link
    We propose a layered street view model to encode both depth and semantic information on street view images for autonomous driving. Recently, stixels, stix-mantics, and tiered scene labeling methods have been proposed to model street view images. We propose a 4-layer street view model, a compact representation over the recently proposed stix-mantics model. Our layers encode semantic classes like ground, pedestrians, vehicles, buildings, and sky in addition to the depths. The only input to our algorithm is a pair of stereo images. We use a deep neural network to extract the appearance features for semantic classes. We use a simple and an efficient inference algorithm to jointly estimate both semantic classes and layered depth values. Our method outperforms other competing approaches in Daimler urban scene segmentation dataset. Our algorithm is massively parallelizable, allowing a GPU implementation with a processing speed about 9 fps.Comment: The paper will be presented in the 2015 Robotics: Science and Systems Conference (RSS

    Past, Present, and Future of Simultaneous Localization And Mapping: Towards the Robust-Perception Age

    Get PDF
    Simultaneous Localization and Mapping (SLAM)consists in the concurrent construction of a model of the environment (the map), and the estimation of the state of the robot moving within it. The SLAM community has made astonishing progress over the last 30 years, enabling large-scale real-world applications, and witnessing a steady transition of this technology to industry. We survey the current state of SLAM. We start by presenting what is now the de-facto standard formulation for SLAM. We then review related work, covering a broad set of topics including robustness and scalability in long-term mapping, metric and semantic representations for mapping, theoretical performance guarantees, active SLAM and exploration, and other new frontiers. This paper simultaneously serves as a position paper and tutorial to those who are users of SLAM. By looking at the published research with a critical eye, we delineate open challenges and new research issues, that still deserve careful scientific investigation. The paper also contains the authors' take on two questions that often animate discussions during robotics conferences: Do robots need SLAM? and Is SLAM solved

    Deep learning in remote sensing: a review

    Get PDF
    Standing at the paradigm shift towards data-intensive science, machine learning techniques are becoming increasingly important. In particular, as a major breakthrough in the field, deep learning has proven as an extremely powerful tool in many fields. Shall we embrace deep learning as the key to all? Or, should we resist a 'black-box' solution? There are controversial opinions in the remote sensing community. In this article, we analyze the challenges of using deep learning for remote sensing data analysis, review the recent advances, and provide resources to make deep learning in remote sensing ridiculously simple to start with. More importantly, we advocate remote sensing scientists to bring their expertise into deep learning, and use it as an implicit general model to tackle unprecedented large-scale influential challenges, such as climate change and urbanization.Comment: Accepted for publication IEEE Geoscience and Remote Sensing Magazin

    Vision-based estimation of altitude from aerial images

    Get PDF
    One of the wide engineering fields is aircraft technologies and one of the most common needs for Airplane or UAV is estimating the altitude, which is some time difficult to estimate due to weather fluctuations and instability of the main parameters like pressure and speed. However, a combination of different sensors has been used to estimate altitude to guarantee an accurate reading and it is the method used these days. To overcome this problem is to use more capable technology such as machine vision based system to estimate the altitude, as advantages light weight, intelligence and accuracy, cheaper than commercial sensors as well as, computationally inexpensive. In this paper, we propose a vision-based system that can perform altitude estimation from aerial images. The satisfactory experimental results demonstrate the effectiveness of the proposed system

    A factorization approach to inertial affine structure from motion

    Full text link
    We consider the problem of reconstructing a 3-D scene from a moving camera with high frame rate using the affine projection model. This problem is traditionally known as Affine Structure from Motion (Affine SfM), and can be solved using an elegant low-rank factorization formulation. In this paper, we assume that an accelerometer and gyro are rigidly mounted with the camera, so that synchronized linear acceleration and angular velocity measurements are available together with the image measurements. We extend the standard Affine SfM algorithm to integrate these measurements through the use of image derivatives
    • …
    corecore