5 research outputs found

    PatchMatch Belief Propagation for Correspondence Field Estimation and its Applications

    Get PDF
    Correspondence fields estimation is an important process that lies at the core of many different applications. Is it often seen as an energy minimisation problem, which is usually decomposed into the combined minimisation of two energy terms. The first is the unary energy, or data term, which reflects how well the solution agrees with the data. The second is the pairwise energy, or smoothness term, and ensures that the solution displays a certain level of smoothness, which is crucial for many applications. This thesis explores the possibility of combining two well-established algorithms for correspondence field estimation, PatchMatch and Belief Propagation, in order to benefit from the strengths of both and overcome some of their weaknesses. Belief Propagation is a common algorithm that can be used to optimise energies comprising both unary and pairwise terms. It is however computational expensive and thus not adapted to continuous spaces which are often needed in imaging applications. On the other hand, PatchMatch is a simple, yet very efficient method for optimising the unary energy of such problems on continuous and high dimensional spaces. The algorithm has two main components: the update of the solution space by sampling and the use of the spatial neighbourhood to propagate samples. We show how these components are related to the components of a specific form of Belief Propagation, called Particle Belief Propagation (PBP). PatchMatch however suffers from the lack of an explicit smoothness term. We show that unifying the two approaches yields a new algorithm, PMBP, which has improved performance compared to PatchMatch and is orders of magnitude faster than PBP. We apply our new optimiser to two different applications: stereo matching and optical flow. We validate the benefits of PMBP through series of experiments and show that we consistently obtain lower errors than both PatchMatch and Belief Propagation

    Combining Features and Semantics for Low-level Computer Vision

    Get PDF
    Visual perception of depth and motion plays a significant role in understanding and navigating the environment. Reconstructing outdoor scenes in 3D and estimating the motion from video cameras are of utmost importance for applications like autonomous driving. The corresponding problems in computer vision have witnessed tremendous progress over the last decades, yet some aspects still remain challenging today. Striking examples are reflecting and textureless surfaces or large motions which cannot be easily recovered using traditional local methods. Further challenges include occlusions, large distortions and difficult lighting conditions. In this thesis, we propose to overcome these challenges by modeling non-local interactions leveraging semantics and contextual information. Firstly, for binocular stereo estimation, we propose to regularize over larger areas on the image using object-category specific disparity proposals which we sample using inverse graphics techniques based on a sparse disparity estimate and a semantic segmentation of the image. The disparity proposals encode the fact that objects of certain categories are not arbitrarily shaped but typically exhibit regular structures. We integrate them as non-local regularizer for the challenging object class 'car' into a superpixel-based graphical model and demonstrate its benefits especially in reflective regions. Secondly, for 3D reconstruction, we leverage the fact that the larger the reconstructed area, the more likely objects of similar type and shape will occur in the scene. This is particularly true for outdoor scenes where buildings and vehicles often suffer from missing texture or reflections, but share similarity in 3D shape. We take advantage of this shape similarity by localizing objects using detectors and jointly reconstructing them while learning a volumetric model of their shape. This allows to reduce noise while completing missing surfaces as objects of similar shape benefit from all observations for the respective category. Evaluations with respect to LIDAR ground-truth on a novel challenging suburban dataset show the advantages of modeling structural dependencies between objects. Finally, motivated by the success of deep learning techniques in matching problems, we present a method for learning context-aware features for solving optical flow using discrete optimization. Towards this goal, we present an efficient way of training a context network with a large receptive field size on top of a local network using dilated convolutions on patches. We perform feature matching by comparing each pixel in the reference image to every pixel in the target image, utilizing fast GPU matrix multiplication. The matching cost volume from the network's output forms the data term for discrete MAP inference in a pairwise Markov random field. Extensive evaluations reveal the importance of context for feature matching.Die visuelle Wahrnehmung von Tiefe und Bewegung spielt eine wichtige Rolle bei dem VerstĂ€ndnis und der Navigation in unserer Umwelt. Die 3D Rekonstruktion von Szenen im Freien und die SchĂ€tzung der Bewegung von Videokameras sind von grĂ¶ĂŸter Bedeutung fĂŒr Anwendungen, wie das autonome Fahren. Die Erforschung der entsprechenden Probleme des maschinellen Sehens hat in den letzten Jahrzehnten enorme Fortschritte gemacht, jedoch bleiben einige Aspekte heute noch ungelöst. Beispiele hierfĂŒr sind reflektierende und texturlose OberflĂ€chen oder große Bewegungen, bei denen herkömmliche lokale Methoden hĂ€ufig scheitern. Weitere Herausforderungen sind niedrige Bildraten, Verdeckungen, große Verzerrungen und schwierige LichtverhĂ€ltnisse. In dieser Arbeit schlagen wir vor nicht-lokale Interaktionen zu modellieren, die semantische und kontextbezogene Informationen nutzen, um diese Herausforderungen zu meistern. FĂŒr die binokulare Stereo SchĂ€tzung schlagen wir zuallererst vor zusammenhĂ€ngende Bereiche mit objektklassen-spezifischen DisparitĂ€ts VorschlĂ€gen zu regularisieren, die wir mit inversen Grafik Techniken auf der Grundlage einer spĂ€rlichen DisparitĂ€tsschĂ€tzung und semantischen Segmentierung des Bildes erhalten. Die DisparitĂ€ts VorschlĂ€ge kodieren die Tatsache, dass die GegenstĂ€nde bestimmter Kategorien nicht willkĂŒrlich geformt sind, sondern typischerweise regelmĂ€ĂŸige Strukturen aufweisen. Wir integrieren sie fĂŒr die komplexe Objektklasse 'Auto' in Form eines nicht-lokalen Regularisierungsterm in ein Superpixel-basiertes grafisches Modell und zeigen die Vorteile vor allem in reflektierenden Bereichen. Zweitens nutzen wir fĂŒr die 3D-Rekonstruktion die Tatsache, dass mit der GrĂ¶ĂŸe der rekonstruierten FlĂ€che auch die Wahrscheinlichkeit steigt, Objekte von Ă€hnlicher Art und Form in der Szene zu enthalten. Dies gilt besonders fĂŒr Szenen im Freien, in denen GebĂ€ude und Fahrzeuge oft vorkommen, die unter fehlender Textur oder Reflexionen leiden aber Ă€hnlichkeit in der Form aufweisen. Wir nutzen diese Ă€hnlichkeiten zur Lokalisierung von Objekten mit Detektoren und zur gemeinsamen Rekonstruktion indem ein volumetrisches Modell ihrer Form erlernt wird. Dies ermöglicht auftretendes Rauschen zu reduzieren, wĂ€hrend fehlende FlĂ€chen vervollstĂ€ndigt werden, da Objekte Ă€hnlicher Form von allen Beobachtungen der jeweiligen Kategorie profitieren. Die Evaluierung auf einem neuen, herausfordernden vorstĂ€dtischen Datensatz in Anbetracht von LIDAR-Entfernungsdaten zeigt die Vorteile der Modellierung von strukturellen AbhĂ€ngigkeiten zwischen Objekten. Zuletzt, motiviert durch den Erfolg von Deep Learning Techniken bei der Mustererkennung, prĂ€sentieren wir eine Methode zum Erlernen von kontextbezogenen Merkmalen zur Lösung des optischen Flusses mittels diskreter Optimierung. Dazu stellen wir eine effiziente Methode vor um zusĂ€tzlich zu einem Lokalen Netzwerk ein Kontext-Netzwerk zu erlernen, das mit Hilfe von erweiterter Faltung auf Patches ein großes rezeptives Feld besitzt. FĂŒr das Feature Matching vergleichen wir mit schnellen GPU-Matrixmultiplikation jedes Pixel im Referenzbild mit jedem Pixel im Zielbild. Das aus dem Netzwerk resultierende Matching Kostenvolumen bildet den Datenterm fĂŒr eine diskrete MAP Inferenz in einem paarweisen Markov Random Field. Eine umfangreiche Evaluierung zeigt die Relevanz des Kontextes fĂŒr das Feature Matching

    Pixel- and Frame-level Video Labeling using Spatial and Temporal Convolutional Networks

    Get PDF
    This dissertation addresses the problem of video labeling at both the frame and pixel levels using deep learning. For pixel-level video labeling, we have studied two problems: i) Spatiotemporal video segmentation and ii) Boundary detection and boundary flow estimation. For the problem of spatiotemporal video segmentation, we have developed recurrent temporal deep field (RTDF). RTDF is a conditional random field (CRF) that combines a deconvolution neural network and a recurrent temporal restricted Boltzmann machine (RTRBM), which can be jointly trained end-to-end. We have derived a mean- field inference algorithm to jointly predict all latent variables in both RTRBM and CRF. For the problem of boundary detection and boundary flow estimation, we have proposed a fully convolutional Siamese network (FCSN). The FCSN first estimates object boundaries in two consecutive frames, and then predicts boundary correspondences in the two frames. For frame-level video labeling, we have specified a temporal deformable residual network (TDRN) for temporal action segmentation. TDRN computes two parallel tem- poral processes: i) Residual stream that analyzes video information at its full temporal resolution, and ii) Pooling/unpooling stream that captures long-range visual cues. The former facilitates local, fine-scale action segmentation, and the latter uses multiscale context for improving the accuracy of frame classification. All of our networks have been empirically evaluated on challenging benchmark datasets and compared with the state of the art. Each of the above approaches has outperformed the state of the art at the time of our evaluation

    Model-Based Environmental Visual Perception for Humanoid Robots

    Get PDF
    The visual perception of a robot should answer two fundamental questions: What? and Where? In order to properly and efficiently reply to these questions, it is essential to establish a bidirectional coupling between the external stimuli and the internal representations. This coupling links the physical world with the inner abstraction models by sensor transformation, recognition, matching and optimization algorithms. The objective of this PhD is to establish this sensor-model coupling
    corecore