Encoding and estimation of first- and second-order binocular disparity in natural images
Research supported by BBSRC Grant Nos. BB/G004803/1 (RG) and BB/K018973/1 (PH/DH). The first stage of processing of binocular information in the visual cortex is performed by mechanisms that are bandpass-tuned for spatial frequency and orientation. Psychophysical and physiological evidence has also demonstrated the existence of second-order mechanisms in binocular processing, which can encode disparities that are not directly accessible to first-order mechanisms. We compared the responses of first- and second-order binocular filters to natural images. We found that the responses of the second-order mechanisms are to some extent correlated with the responses of the first-order mechanisms, and that they can contribute to increasing both the accuracy and the depth range of binocular stereopsis.
The spatial averaging of disparities in brief, static random-dot stereograms
Visual images from the two eyes are transmitted to the brain. Because the eyes are horizontally separated, there is a horizontal disparity between the two images. The amount of disparity between the images of a given point depends on the distance of that point from the viewer's point of fixation. A natural visual environment contains surfaces at many different depths. Therefore, the brain must process a spatial distribution of disparities. How are these disparities spatially put together? Brief (about 200 msec) static Cyclopean random-dot stereograms were used as stimuli for vergence and depth discrimination to answer this question. The results indicated a large averaging region for vergence, and a smaller pooling region for depth discrimination. Vergence responded to the mean disparity of two transparent planes. When a disparate target was present in a fixation plane surround, vergence improved as target size was increased, with a saturation at 3-6 degrees. Depth discrimination thresholds improved with target size, reaching a minimum at 1-3 degrees, but increased for larger targets. Depth discrimination showed a dependence on the extent of a disparity pedestal surrounding the target, consistent with vergence facilitation. Vergence might, therefore, implement a coarse-to-fine reduction in binocular matching noise. Interocular decorrelation can be considered as multiple chance matches at different disparities. The spatial pooling limits found for disparity were replicated when interocular decorrelation was discriminated. The disparity of the random dots also influenced the apparent horizontal alignment of neighbouring monocular lines. This finding suggests that disparity averaging takes place at an early stage of visual processing. The following possible explanations were considered: 1) Disparities are detected in different spatial frequency channels (Marr and Poggio, 1979). 2) Second-order luminance patterns are matched between the two eyes using non-linear channels. 3) Secondary disparity filters process disparities extracted from linear filters.
Combining Features and Semantics for Low-level Computer Vision
Visual perception of depth and motion plays a significant role in understanding and navigating the environment.
Reconstructing outdoor scenes in 3D and estimating the motion from video cameras are of utmost importance for applications like autonomous driving.
The corresponding problems in computer vision have witnessed tremendous progress over the last decades, yet some aspects still remain challenging today. Striking examples are reflecting and textureless surfaces or large motions which cannot be easily recovered using traditional local methods. Further challenges include occlusions, large distortions and difficult lighting conditions. In this thesis, we propose to overcome these challenges by modeling non-local interactions leveraging semantics and contextual information.
Firstly, for binocular stereo estimation, we propose to regularize over larger areas on the image using object-category specific disparity proposals which we sample using inverse graphics techniques based on a sparse disparity estimate and a semantic segmentation of the image. The disparity proposals encode the fact that objects of certain categories are not arbitrarily shaped but typically exhibit regular structures. We integrate them as non-local regularizer for the challenging object class 'car' into a superpixel-based graphical model and demonstrate its benefits especially in reflective regions.
Secondly, for 3D reconstruction, we leverage the fact that the larger the reconstructed area, the more likely objects of similar type and shape will occur in the scene. This is particularly true for outdoor scenes where buildings and vehicles often suffer from missing texture or reflections, but share similarity in 3D shape. We take advantage of this shape similarity by localizing objects using detectors and jointly reconstructing them while learning a volumetric model of their shape. This allows us to reduce noise while completing missing surfaces, as objects of similar shape benefit from all observations for the respective category. Evaluations with respect to LIDAR ground-truth on a novel challenging suburban dataset show the advantages of modeling structural dependencies between objects.
Finally, motivated by the success of deep learning techniques in matching problems, we present a method for learning context-aware features for solving optical flow using discrete optimization. Towards this goal, we present an efficient way of training a context network with a large receptive field size on top of a local network using dilated convolutions on patches. We perform feature matching by comparing each pixel in the reference image to every pixel in the target image, utilizing fast GPU matrix multiplication. The matching cost volume from the network's output forms the data term for discrete MAP inference in a pairwise Markov random field. Extensive evaluations reveal the importance of context for feature matching.
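The all-pairs feature comparison described in this abstract can be illustrated with a minimal sketch. It assumes each pixel is already described by a feature vector and that similarity is a plain dot product; the function names and the toy feature values are hypothetical and do not come from the thesis. Where the thesis feeds the cost volume into discrete MAP inference in a Markov random field, the sketch simply picks the cheapest match per pixel, and it uses pure Python in place of a GPU matrix multiplication.

```python
# Hypothetical sketch: all-pairs feature matching as one matrix product.
# Names and feature values are illustrative, not taken from the thesis.

def matmul_transposed(a, b):
    """Return a @ b^T for row-major lists of lists."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]

def matching_costs(ref_feats, tgt_feats):
    """Cost of matching each reference pixel to each target pixel.

    Each row is the feature vector of one pixel. The cost is the negated
    dot product, so similar features receive a low cost.
    """
    scores = matmul_transposed(ref_feats, tgt_feats)
    return [[-s for s in row] for row in scores]

def best_matches(costs):
    """Index of the cheapest target pixel for every reference pixel.

    This winner-take-all step stands in for the MRF inference
    used in the thesis.
    """
    return [min(range(len(row)), key=row.__getitem__) for row in costs]

# Three reference pixels and three target pixels with 2-D unit features.
ref = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
tgt = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
costs = matching_costs(ref, tgt)
matches = best_matches(costs)
```

Because every entry of the cost matrix comes from one product of the two feature matrices, the same computation maps directly onto a batched matrix multiplication on a GPU.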
3D Least Squares Based Surface Reconstruction
This thesis presents a fully three-dimensional (3D) surface reconstruction algorithm from wide-baseline image sequences. Triangle meshes represent the reconstructed surfaces, allowing for an easy integration of image- and geometry-based constraints. We extend the successful approach for 2.5D reconstruction of Heipke (1990) to full 3D. To take into account occlusion and non-Lambertian reflection, we apply robust least squares adjustment to estimate the model. The inputs to our approach are images taken from different positions, accurate derived image orientations, and sparse 3D points (Bartelsen and Mayer 2010). The first novelty of our approach is the way we position additional 3D points (unknowns) in the triangle meshes constructed from given 3D points. Owing to the precise positions of these additional 3D points, we obtain more precise and accurate reconstructed surfaces in terms of shape and fit of texture. The second novelty is to apply individual bias parameters for different images and adapted weights for different image observations to account for differences in the intensity values for different images as well as to consider outliers in the estimation. The third novelty is the way we factorize the design matrix and divide the meshes into layers to reduce the run time. The essential element for our model is the variance of the intensity values of image observations inside a triangle. Applying the approach, we can reconstruct accurate 3D surfaces for different types of scenes. Results are presented in the form of VRML (Virtual Reality Modeling Language) models, demonstrating the potential of the approach as well as its current shortcomings.
High-quality dense stereo vision for whole body imaging and obesity assessment
The prevalence of obesity has necessitated the development of safe and convenient tools for timely assessment and monitoring of this condition across a broad population. Three-dimensional (3D) body imaging has become a new means for obesity assessment. Moreover, it generates body shape information that is meaningful for fitness, ergonomics, and personalized clothing. In the previous work of our lab, we developed a prototype active stereo vision system that demonstrated a potential to fulfill this goal. But the prototype required four computer projectors to cast artificial textures on the body, which facilitate stereo matching on texture-deficient images (e.g., skin). This decreases the mobility of the system when it is used to collect data from a large population. In addition, the resolution of the generated 3D images is limited by both the cameras and the projectors available during the project. The study reported in this dissertation highlights our continued effort in improving the capability of 3D body imaging through simplified hardware for passive stereo and advanced computation techniques.
The system utilizes high-resolution single-lens reflex (SLR) cameras, which have recently become widely available, and is configured in a two-stance design to image the front and back surfaces of a person. A total of eight cameras are used to form four pairs of stereo units. Each unit covers a quarter of the body surface. The stereo units are individually calibrated with a specific pattern to determine the cameras' intrinsic and extrinsic parameters for stereo matching. The global orientation and position of each stereo unit within a common world coordinate system is calculated through a 3D registration step. The stereo calibration and 3D registration procedures do not need to be repeated for a deployed system if the cameras' relative positions have not changed. This property contributes to the portability of the system, and tremendously alleviates the maintenance task. The image acquisition time is around two seconds for a whole-body capture. The system works in an indoor environment with moderate ambient light.
Advanced stereo computation algorithms are developed by taking advantage of high-resolution images and by tackling the ambiguity problem in stereo matching. A multi-scale, coarse-to-fine matching framework is proposed to match large-scale textures at a low resolution and refine the matched results over higher resolutions. This matching strategy reduces the complexity of the computation and avoids ambiguous matching at the native resolution. The pixel-to-pixel stereo matching algorithm follows a classic, four-step strategy which consists of matching cost computation, cost aggregation, disparity computation and disparity refinement.
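The classic four-step strategy named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's algorithm: it uses absolute intensity difference as the matching cost, a horizontal box window for aggregation, winner-take-all for disparity computation, and a 3-tap median for refinement. All function names and the toy scanline are hypothetical, and the multi-scale coarse-to-fine framework is not shown.

```python
# Hypothetical sketch of the four-step stereo pipeline on grey-value images
# stored as lists of rows. Not the dissertation's implementation.

def cost_volume(left, right, max_disp):
    """Step 1: absolute-difference cost per pixel and candidate disparity."""
    big = 255.0  # cost for candidates that fall outside the right image
    return [[[abs(left[y][x] - right[y][x - d]) if x - d >= 0 else big
              for d in range(max_disp + 1)]
             for x in range(len(left[0]))]
            for y in range(len(left))]

def aggregate(costs, radius):
    """Step 2: average the costs over a horizontal window, per disparity."""
    out = []
    for row in costs:
        w, nd = len(row), len(row[0])
        agg_row = []
        for x in range(w):
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            agg_row.append([sum(row[i][d] for i in xs) / len(xs)
                            for d in range(nd)])
        out.append(agg_row)
    return out

def winner_take_all(costs):
    """Step 3: choose the disparity with the lowest aggregated cost."""
    return [[min(range(len(c)), key=c.__getitem__) for c in row]
            for row in costs]

def median_refine(disp):
    """Step 4: 3-tap horizontal median to suppress isolated outliers."""
    out = []
    for row in disp:
        refined = []
        for x in range(len(row)):
            window = sorted(row[max(0, x - 1):x + 2])
            refined.append(window[len(window) // 2])
        out.append(refined)
    return out

# Toy scanline: the left image is the right image shifted by two pixels.
right = [[10, 50, 90, 30, 70, 20, 80, 40]]
left = [[0, 0, 10, 50, 90, 30, 70, 20]]
disparity = median_refine(
    winner_take_all(aggregate(cost_volume(left, right, 3), 1)))
```

On this toy input the interior pixels recover the true shift of two pixels, while the leftmost pixels are unreliable because their true matches lie outside the right image.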
The system performance has been evaluated on mannequins and human subjects in comparison with other measurement methods. It was found that the geometrical measurements from reconstructed 3D body models, including body circumferences and whole volume, are highly repeatable and consistent with manual and other instrumental measurements (CV 0.99). The agreement of percent body fat (%BF) estimation on human subjects between stereo and dual-energy X-ray absorptiometry (DEXA) was found to be improved over the previous active stereo system, and the limits of agreement with 95% confidence were reduced by half. Our achieved %BF estimation agreement is among the lowest ones of other comparative studies with commercialized air displacement plethysmography (ADP) and DEXA. In practice, %BF estimation through a two-component model is sensitive to body volume measurement, and the estimation of lung volume could be a source of variation. Protocols for this type of measurement should still be created with an awareness of this factor.
NOVEL DENSE STEREO ALGORITHMS FOR HIGH-QUALITY DEPTH ESTIMATION FROM IMAGES
This dissertation addresses the problem of inferring scene depth information from a collection of calibrated images taken from different viewpoints via stereo matching. Although it has been heavily investigated for decades, depth from stereo remains a long-standing challenge and popular research topic for several reasons. First of all, in order to be of practical use for many real-time applications such as autonomous driving, accurate depth estimation in real time is of great importance and one of the core challenges in stereo. Second, for applications such as 3D reconstruction and view synthesis, high-quality depth estimation is crucial to achieve photorealistic results. However, due to the matching ambiguities, accurate dense depth estimates are difficult to achieve. Last but not least, most stereo algorithms rely on identification of corresponding points among images and only work effectively when scenes are Lambertian. For non-Lambertian surfaces, the brightness constancy assumption is no longer valid. This dissertation contributes three novel stereo algorithms that are motivated by the specific requirements and limitations imposed by different applications.
In addressing high speed depth estimation from images, we present a stereo algorithm that achieves high quality results while maintaining real-time performance. We introduce an adaptive aggregation step in a dynamic-programming framework. Matching costs are aggregated in the vertical direction using a computationally expensive weighting scheme based on color and distance proximity. We utilize the vector processing capability and parallelism in commodity graphics hardware to speed up this process over two orders of magnitude.
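A weighting scheme based on color and distance proximity can be sketched as below. The exponential form of the weight and the parameters `gamma_c` and `gamma_d` are assumptions chosen for illustration; the abstract does not state the exact formula. The sketch runs on the CPU in pure Python and handles a single disparity level, whereas the dissertation accelerates this step on graphics hardware.

```python
import math

# Hypothetical sketch of adaptive cost aggregation along image columns.
# The weight formula and parameter values are illustrative assumptions.

def vertical_adaptive_aggregate(costs, image, radius,
                                gamma_c=10.0, gamma_d=5.0):
    """Aggregate matching costs vertically with adaptive weights.

    costs[y][x] is the matching cost at one fixed disparity and
    image[y][x] is a grey value standing in for color. Neighbours that
    look similar to the centre pixel and lie close to it weigh more.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            num = den = 0.0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                weight = math.exp(-abs(image[yy][x] - image[y][x]) / gamma_c
                                  - abs(yy - y) / gamma_d)
                num += weight * costs[yy][x]
                den += weight
            out[y][x] = num / den
    return out

# One column with an intensity edge between rows 2 and 3.
image = [[0], [0], [0], [200], [200], [200]]
costs = [[1.0], [1.0], [1.0], [9.0], [9.0], [9.0]]
aggregated = vertical_adaptive_aggregate(costs, image, radius=2)
```

At row 2, a plain box average over the same window would mix in the high costs from across the edge, while the adaptive weights keep the aggregated cost close to 1.0 because the dissimilar neighbours receive almost no weight.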
In addressing high accuracy depth estimation, we present a stereo model that makes use of constraints from points with known depths - the Ground Control Points (GCPs) as referred to in stereo literature. Our formulation explicitly models the influences of GCPs in a Markov Random Field. A novel regularization prior is naturally integrated into a global inference framework in a principled way using the Bayes rule. Our probabilistic framework allows GCPs to be obtained from various modalities and provides a natural way to integrate information from various sensors.
In addressing non-Lambertian reflectance, we introduce a new invariant for stereo correspondence which allows completely arbitrary scene reflectance (bidirectional reflectance distribution functions - BRDFs). This invariant can be used to formulate a rank constraint on stereo matching when the scene is observed by several lighting configurations in which only the lighting intensity varies
View synthesis for kinetic depth X-ray imaging
This thesis reports the development and analysis of feature-based synthesis of transmission X-ray images. The synthetic imagery is formed through matching and morphing or warping line-scan format images produced by a novel multi-view X-ray machine. In this way video type sequences, which periodically alternate between synthetic and detector based views, may be formed. The purpose of these sequences is to provide depth from motion or kinetic depth effect (KDE) in a visual display, while the role of the synthesis is to reduce the total number of detector arrays, associated collimators and X-ray flux per inspection. A specific challenge is to explore the bounds for producing synthetic imagery that can be seamlessly introduced into the resultant sequences. This work is distinct from the image collection and display technique, termed KDEX, previously undertaken by the Imaging Science Group at NTU. The ultimate aim of the research programme in collaboration with The UK Home Office and The US Dept. of Homeland Security is to enhance the detection and identification of threats in X-ray scans of luggage. A multi-view 'KDEX scanner' was employed to collect greyscale and colour coded image sequences of 30 different bags; each sequence comprised 7 perspective views separated from one another by 10. This imagery was organised and stored in a database to enable a coherent series of experiments to be conducted. Corresponding features in sequential pairs of images, at various different angular separations, were identified by applying a scale invariant feature transform (SIFT)
Single View Modeling and View Synthesis
This thesis develops new algorithms to produce 3D content from a single camera. Today, amateurs can use hand-held camcorders to capture and display the 3D world in 2D, using mature technologies. However, there is always a strong desire to record and re-explore the 3D world in 3D. To achieve this goal, current approaches usually make use of a camera array, which suffers from tedious setup and calibration processes, as well as lack of portability, limiting its application to lab experiments.
In this thesis, I try to produce 3D content using a single camera, making it as simple as shooting pictures. It requires a new front-end capturing device rather than a regular camcorder, as well as more sophisticated algorithms. First, in order to capture the highly detailed object surfaces, I designed and developed a depth camera based on a novel technique called light fall-off stereo (LFS). The LFS depth camera outputs color+depth image sequences and achieves 30 fps, which is necessary for capturing dynamic scenes. Based on the output color+depth images, I developed a new approach that builds 3D models of dynamic and deformable objects. While the camera can only capture part of a whole object at any instant, partial surfaces are assembled together to form a complete 3D model by a novel warping algorithm.
Inspired by the success of single view 3D modeling, I extended my exploration into 2D-3D video conversion that does not utilize a depth camera. I developed a semi-automatic system that converts monocular videos into stereoscopic videos, via view synthesis. It combines motion analysis with user interaction, aiming to transfer as much depth-inferring work as possible from the user to the computer. I developed two new methods that analyze the optical flow in order to provide additional qualitative depth constraints. The automatically extracted depth information is presented in the user interface to assist with user labeling work.
In this thesis, I developed new algorithms to produce 3D content from a single camera. Depending on the input data, my algorithm can build high fidelity 3D models for dynamic and deformable objects if depth maps are provided. Otherwise, it can turn the video clips into stereoscopic video