
    Encoding and estimation of first- and second-order binocular disparity in natural images

    Research supported by BBSRC Grant Nos. BB/G004803/1 (RG) and BB/K018973/1 (PH/DH). The first stage of processing of binocular information in the visual cortex is performed by mechanisms that are bandpass-tuned for spatial frequency and orientation. Psychophysical and physiological evidence has also demonstrated the existence of second-order mechanisms in binocular processing, which can encode disparities that are not directly accessible to first-order mechanisms. We compared the responses of first- and second-order binocular filters to natural images. We found that the responses of the second-order mechanisms are to some extent correlated with those of the first-order mechanisms, and that they can contribute to increasing both the accuracy and the depth range of binocular stereopsis.
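
    As a rough illustration of the two channel types compared above, the sketch below implements the first-order channel as a linear Gabor filter and the second-order channel as a filter-rectify-filter cascade, a standard model of second-order processing. All kernel sizes, frequencies, and function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of first- vs. second-order binocular filtering, assuming a
# filter-rectify-filter (FRF) cascade for the second-order channel.
import numpy as np
from scipy.ndimage import convolve

def gabor(size, freq, theta, phase=0.0, sigma=None):
    """Bandpass (Gabor) kernel tuned for spatial frequency and orientation."""
    sigma = sigma or 0.5 / freq
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * freq * xr + phase)

def first_order_response(img, freq=0.1, theta=0.0):
    """Linear bandpass response: carries luminance-defined disparity."""
    return convolve(img, gabor(15, freq, theta))

def second_order_response(img, carrier_freq=0.25, env_freq=0.05, theta=0.0):
    """Filter-rectify-filter: a fine-scale carrier response is rectified,
    then a coarse-scale filter extracts the contrast envelope, exposing
    disparities invisible to the linear channel."""
    carrier = convolve(img, gabor(9, carrier_freq, theta))
    rectified = np.abs(carrier)              # static nonlinearity
    return convolve(rectified, gabor(31, env_freq, theta))

def binocular_energy(resp_left, resp_right, disparity):
    """Crude binocular combination at one tested disparity: shift the right
    eye's response and square the sum (quadrature pairs omitted for brevity)."""
    shifted = np.roll(resp_right, disparity, axis=1)
    return (resp_left + shifted) ** 2
```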

    The spatial averaging of disparities in brief, static random-dot stereograms

    Visual images from the two eyes are transmitted to the brain. Because the eyes are horizontally separated, there is a horizontal disparity between the two images. The amount of disparity between the images of a given point depends on the distance of that point from the viewer's point of fixation. A natural visual environment contains surfaces at many different depths. Therefore, the brain must process a spatial distribution of disparities. How are these disparities combined across space? Brief (about 200 msec) static cyclopean random-dot stereograms were used as stimuli for vergence and depth discrimination to answer this question. The results indicated a large averaging region for vergence and a smaller pooling region for depth discrimination. Vergence responded to the mean disparity of two transparent planes. When a disparate target was present in a fixation-plane surround, vergence improved as target size was increased, saturating at 3-6 degrees. Depth discrimination thresholds improved with target size, reaching a minimum at 1-3 degrees, but increased for larger targets. Depth discrimination showed a dependence on the extent of a disparity pedestal surrounding the target, consistent with vergence facilitation. Vergence might, therefore, implement a coarse-to-fine reduction in binocular matching noise. Interocular decorrelation can be considered as multiple chance matches at different disparities. The spatial pooling limits found for disparity were replicated when interocular decorrelation was discriminated. The disparity of the random dots also influenced the apparent horizontal alignment of neighbouring monocular lines. This finding suggests that disparity averaging takes place at an early stage of visual processing. The following possible explanations were considered: 1) Disparities are detected in different spatial frequency channels (Marr and Poggio, 1979). 2) Second-order luminance patterns are matched between the two eyes using non-linear channels. 3) Secondary disparity filters process disparities extracted from linear filters.
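
    The two pooling scales reported above (a large averaging region for vergence, a smaller one for depth discrimination) can be caricatured as box filters of different sizes over a disparity map. A toy sketch follows; the pixels-per-degree conversion and region sizes are illustrative assumptions drawn from the ranges quoted above.

```python
# Toy sketch of two pooling scales acting on a disparity map.
import numpy as np
from scipy.ndimage import uniform_filter

PX_PER_DEG = 20                      # hypothetical display resolution

def vergence_signal(disparity_map):
    """Vergence tracks mean disparity over a large (~3-6 deg) region."""
    return uniform_filter(disparity_map, size=int(4.5 * PX_PER_DEG))

def depth_discrimination_signal(disparity_map):
    """Depth judgments pool over a smaller (~1-3 deg) region."""
    return uniform_filter(disparity_map, size=int(2 * PX_PER_DEG))

# Two transparent planes at +/-0.2 deg: the vergence signal converges on
# their mean disparity (~0), matching the averaging behaviour described.
planes = np.where(np.random.rand(256, 256) < 0.5, 0.2, -0.2)
print(vergence_signal(planes).mean())   # close to 0.0
```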

    Combining Features and Semantics for Low-level Computer Vision

    Visual perception of depth and motion plays a significant role in understanding and navigating the environment. Reconstructing outdoor scenes in 3D and estimating the motion of video cameras are of utmost importance for applications like autonomous driving. The corresponding problems in computer vision have witnessed tremendous progress over the last decades, yet some aspects remain challenging today. Striking examples are reflective and textureless surfaces, or large motions which cannot be easily recovered using traditional local methods. Further challenges include occlusions, large distortions and difficult lighting conditions. In this thesis, we propose to overcome these challenges by modeling non-local interactions, leveraging semantics and contextual information. Firstly, for binocular stereo estimation, we propose to regularize over larger areas of the image using object-category-specific disparity proposals, which we sample using inverse graphics techniques based on a sparse disparity estimate and a semantic segmentation of the image. The disparity proposals encode the fact that objects of certain categories are not arbitrarily shaped but typically exhibit regular structures. We integrate them as a non-local regularizer for the challenging object class 'car' into a superpixel-based graphical model and demonstrate its benefits especially in reflective regions. Secondly, for 3D reconstruction, we leverage the fact that the larger the reconstructed area, the more likely objects of similar type and shape are to occur in the scene. This is particularly true for outdoor scenes, where buildings and vehicles often suffer from missing texture or reflections but share similarity in 3D shape. We take advantage of this shape similarity by localizing objects using detectors and jointly reconstructing them while learning a volumetric model of their shape. This reduces noise while completing missing surfaces, as objects of similar shape benefit from all observations for the respective category. Evaluations with respect to LIDAR ground truth on a novel, challenging suburban dataset show the advantages of modeling structural dependencies between objects. Finally, motivated by the success of deep learning techniques in matching problems, we present a method for learning context-aware features for solving optical flow using discrete optimization. Towards this goal, we present an efficient way of training a context network with a large receptive field on top of a local network, using dilated convolutions on patches. We perform feature matching by comparing each pixel in the reference image to every pixel in the target image, utilizing fast GPU matrix multiplication. The matching cost volume formed from the network's output serves as the data term for discrete MAP inference in a pairwise Markov random field. Extensive evaluations reveal the importance of context for feature matching.
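
    The exhaustive matching step described above (every reference pixel compared against every target pixel via fast GPU matrix multiplication) reduces to a single matrix product over flattened feature maps. Below is a minimal NumPy sketch; the feature network (local plus dilated context layers) is abstracted away, and all shapes and names are illustrative.

```python
# Minimal sketch of exhaustive feature matching via one matrix product.
import numpy as np

def matching_cost_volume(feat_ref, feat_tgt):
    """feat_*: (H, W, C) feature maps, assumed L2-normalized per pixel.
    Returns an (H*W, H*W) volume of matching costs (negative correlation),
    which forms the data term of the pairwise MRF mentioned above."""
    h, w, c = feat_ref.shape
    f_ref = feat_ref.reshape(-1, c)          # (H*W, C)
    f_tgt = feat_tgt.reshape(-1, c)          # (H*W, C)
    similarity = f_ref @ f_tgt.T             # one big (GPU-friendly) matmul
    return -similarity                       # lower cost = better match

# Usage with random stand-in features.
feats = np.random.randn(32, 48, 64).astype(np.float32)
feats /= np.linalg.norm(feats, axis=2, keepdims=True)
cost = matching_cost_volume(feats, feats)
print(cost.shape)                            # (1536, 1536)
```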

    3D Least Squares Based Surface Reconstruction

    This thesis presents a fully three-dimensional (3D) surface reconstruction algorithm for wide-baseline image sequences. Triangle meshes represent the reconstructed surfaces, allowing for an easy integration of image- and geometry-based constraints. We extend the successful approach for 2.5D reconstruction of Heipke (1990) to full 3D. To take into account occlusion and non-Lambertian reflection, we apply robust least squares adjustment to estimate the model. The input to our approach consists of images taken from different positions, accurate image orientations derived from them, and sparse 3D points (Bartelsen and Mayer 2010). The first novelty of our approach is the way we position additional 3D points (unknowns) in the triangle meshes constructed from the given 3D points. Owing to the precise positions of these additional points, we obtain reconstructed surfaces that are more precise and accurate in terms of shape and fit of texture. The second novelty is the use of individual bias parameters for different images and adapted weights for different image observations, accounting for intensity differences between images as well as for outliers in the estimation. The third novelty is the way we factorize the design matrix and divide the meshes into layers to reduce the run time. The essential element of our model is the variance of the intensity values of the image observations inside a triangle. Applying the approach, we can reconstruct accurate 3D surfaces for different types of scenes. Results are presented in the form of VRML (Virtual Reality Modeling Language) models, demonstrating both the potential of the approach and its current shortcomings.
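
    Robust least squares adjustment of the kind invoked above (tolerating occlusion and non-Lambertian reflection by down-weighting outlying observations) can be sketched as iteratively reweighted least squares with a Huber weight. The design matrix and observations below are generic placeholders, not the thesis' triangle-mesh intensity model.

```python
# Minimal IRLS sketch with Huber weights for robust least squares.
import numpy as np

def huber_weights(residuals, k=1.345):
    # Weight 1 for small residuals, k/|r| for large ones (assumes residuals
    # are roughly unit-scale; a robust scale estimate could be added).
    a = np.abs(residuals)
    return np.where(a <= k, 1.0, k / np.maximum(a, 1e-12))

def robust_least_squares(A, b, iters=10):
    x = np.linalg.lstsq(A, b, rcond=None)[0]      # ordinary LS start
    for _ in range(iters):
        sw = np.sqrt(huber_weights(b - A @ x))    # per-observation weights
        x = np.linalg.lstsq(A * sw[:, None], sw * b, rcond=None)[0]
    return x

# Toy usage: a line fit with one gross outlier.
A = np.column_stack([np.ones(10), np.arange(10.0)])
b = A @ np.array([1.0, 2.0])
b[3] += 50.0                                      # outlying observation
print(robust_least_squares(A, b))                 # close to [1, 2]
```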

    Novel Dense Stereo Algorithms for High-Quality Depth Estimation from Images

    This dissertation addresses the problem of inferring scene depth information from a collection of calibrated images taken from different viewpoints via stereo matching. Although it has been heavily investigated for decades, depth from stereo remains a long-standing challenge and a popular research topic for several reasons. First of all, to be of practical use for many real-time applications such as autonomous driving, accurate depth estimation in real time is of great importance and one of the core challenges in stereo. Second, for applications such as 3D reconstruction and view synthesis, high-quality depth estimation is crucial to achieve photorealistic results. However, due to matching ambiguities, accurate dense depth estimates are difficult to achieve. Last but not least, most stereo algorithms rely on the identification of corresponding points among images and only work effectively when scenes are Lambertian. For non-Lambertian surfaces, the brightness constancy assumption is no longer valid. This dissertation contributes three novel stereo algorithms motivated by the specific requirements and limitations imposed by different applications. In addressing high-speed depth estimation from images, we present a stereo algorithm that achieves high-quality results while maintaining real-time performance. We introduce an adaptive aggregation step in a dynamic-programming framework: matching costs are aggregated in the vertical direction using a computationally expensive weighting scheme based on color and distance proximity, and we exploit the vector processing capability and parallelism of commodity graphics hardware to speed up this process by over two orders of magnitude. In addressing high-accuracy depth estimation, we present a stereo model that makes use of constraints from points with known depths, referred to in the stereo literature as Ground Control Points (GCPs). Our formulation explicitly models the influence of GCPs in a Markov Random Field. A novel regularization prior is naturally integrated into a global inference framework in a principled way using Bayes' rule. Our probabilistic framework allows GCPs to be obtained from various modalities and provides a natural way to integrate information from various sensors. In addressing non-Lambertian reflectance, we introduce a new invariant for stereo correspondence which allows completely arbitrary scene reflectance (bidirectional reflectance distribution functions, BRDFs). This invariant can be used to formulate a rank constraint on stereo matching when the scene is observed under several lighting configurations in which only the lighting intensity varies.
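
    The adaptive vertical aggregation step described above can be sketched as follows: per-pixel matching costs are averaged along a vertical window with weights that decay with color difference and spatial distance, in the spirit of adaptive support weights. This is a plain CPU sketch; the parameter names and values are illustrative, and the GPU vectorization that yields the reported speedup is omitted.

```python
# Minimal sketch of adaptive cost aggregation along the vertical direction.
import numpy as np

def aggregate_vertical(cost, image, radius=8, gamma_c=10.0, gamma_d=8.0):
    """cost: (H, W, D) per-pixel, per-disparity matching costs;
    image: (H, W, 3) color image used to derive support weights.
    Returns costs averaged over a vertical window around each pixel."""
    h, w, d = cost.shape
    out = np.zeros_like(cost)
    for y in range(h):
        ys = np.arange(max(0, y - radius), min(h, y + radius + 1))
        # Color proximity: distance between each window row and the center row.
        color_dist = np.linalg.norm(image[ys] - image[y], axis=-1)   # (len(ys), W)
        # Combined weight from color similarity and spatial distance.
        weight = np.exp(-color_dist / gamma_c
                        - np.abs(ys - y)[:, None] / gamma_d)
        out[y] = ((weight[..., None] * cost[ys]).sum(0)
                  / weight.sum(0)[..., None])
    return out
```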

    Single View Modeling and View Synthesis

    This thesis develops new algorithms to produce 3D content from a single camera. Today, amateurs can use hand-held camcorders to capture and display the 3D world in 2D, using mature technologies. However, there is always a strong desire to record and re-explore the 3D world in 3D. To achieve this goal, current approaches usually make use of a camera array, which suffers from tedious setup and calibration processes as well as a lack of portability, limiting its application to lab experiments. In this thesis, I try to produce 3D content using a single camera, making it as simple as shooting pictures. This requires a new front-end capturing device rather than a regular camcorder, as well as more sophisticated algorithms. First, in order to capture highly detailed object surfaces, I designed and developed a depth camera based on a novel technique called light fall-off stereo (LFS). The LFS depth camera outputs color+depth image sequences at 30 fps, which is necessary for capturing dynamic scenes. Based on the output color+depth images, I developed a new approach that builds 3D models of dynamic and deformable objects. While the camera can only capture part of a whole object at any instant, partial surfaces are assembled into a complete 3D model by a novel warping algorithm. Inspired by the success of single-view 3D modeling, I extended my exploration to 2D-to-3D video conversion that does not require a depth camera. I developed a semi-automatic system that converts monocular videos into stereoscopic videos via view synthesis. It combines motion analysis with user interaction, aiming to transfer as much of the depth-inference work as possible from the user to the computer. I developed two new methods that analyze the optical flow in order to provide additional qualitative depth constraints; the automatically extracted depth information is presented in the user interface to assist the user's labeling work. Depending on the input data, my algorithms can build high-fidelity 3D models of dynamic and deformable objects when depth maps are provided; otherwise, they can turn video clips into stereoscopic videos.
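
    Light fall-off stereo exploits the inverse-square decay of illumination: imaging the scene with the light source at two distances lets depth be recovered from the per-pixel intensity ratio, since surface albedo cancels. The sketch below is a simplified illustration under the assumption that the light moves by a known offset along the viewing axis; the actual device and calibration are more involved.

```python
# Simplified sketch of depth from light fall-off, assuming a point light
# displaced by `delta` along the viewing axis between the two exposures.
import numpy as np

def lfs_depth(img_near, img_far, delta, eps=1e-6):
    """img_near / img_far: images lit from distances r and r + delta.
    Inverse-square law gives I_near / I_far = ((r + delta) / r)^2,
    hence r = delta / (sqrt(I_near / I_far) - 1)."""
    ratio = np.sqrt(np.maximum(img_near, eps) / np.maximum(img_far, eps))
    return delta / np.maximum(ratio - 1.0, eps)

# Synthetic check: a surface at depth 2.0 with the light displaced by 0.5.
r, delta = 2.0, 0.5
albedo = np.random.rand(64, 64)
near, far = albedo / r**2, albedo / (r + delta)**2
print(lfs_depth(near, far, delta).mean())   # ~2.0 (albedo cancels)
```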