Fast GPU Accelerated Stereo Correspondence for Embedded Surveillance Camera Systems
Many surveillance applications could benefit from stereo cameras for depth perception. While state-of-the-art methods provide high-quality scene depth information, many of them are very time-consuming and unsuitable for real-time use on limited embedded systems. This study examined stereo correlation methods to find an algorithm suitable for real-time or near real-time depth perception through disparity maps in a stereo video surveillance camera with an embedded GPU. Moreover, novel refinements and alterations were investigated to further improve performance and quality. Quality tests were conducted in Octave, while GPU suitability and performance tests were done in C++ with the OpenGL ES 2.0 library. The result is a local stereo correlation method using Normalized Cross Correlation together with sparse support windows and a suggested improvement for pixel-wise matching confidence. Applying sparse support windows increased the frame rate by 35% with a minimal quality penalty compared to using full support windows.
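The core combination described above, NCC matching over sparse support windows, can be sketched as follows. The sampling pattern (every second pixel) and the brute-force search loop are illustrative assumptions, not the implementation evaluated in the study:

```python
import numpy as np

def ncc(patch_l, patch_r):
    """Normalized cross-correlation between two equally sized patches."""
    a = patch_l - patch_l.mean()
    b = patch_r - patch_r.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def sparse_window(patch, step=2):
    """Keep every `step`-th pixel of the support window (a sparse window)."""
    return patch[::step, ::step]

def best_disparity(left, right, y, x, radius, max_disp, step=2):
    """Pick the disparity maximizing sparse-window NCC along the scanline."""
    pl = sparse_window(left[y-radius:y+radius+1, x-radius:x+radius+1], step)
    best, best_d = -2.0, 0
    for d in range(0, min(max_disp, x - radius) + 1):
        pr = sparse_window(right[y-radius:y+radius+1,
                                 x-d-radius:x-d+radius+1], step)
        score = ncc(pl, pr)
        if score > best:
            best, best_d = score, d
    return best_d
```

Skipping pixels shrinks the per-window arithmetic roughly quadratically in the step, which is consistent with the reported frame-rate gain at little quality cost.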
Cascaded Scene Flow Prediction using Semantic Segmentation
Given two consecutive frames from a pair of stereo cameras, 3D scene flow
methods simultaneously estimate the 3D geometry and motion of the observed
scene. Many existing approaches use superpixels for regularization, but may
predict inconsistent shapes and motions inside rigidly moving objects. We
instead assume that scenes consist of foreground objects rigidly moving in
front of a static background, and use semantic cues to produce pixel-accurate
scene flow estimates. Our cascaded classification framework accurately models
3D scenes by iteratively refining semantic segmentation masks, stereo
correspondences, 3D rigid motion estimates, and optical flow fields. We
evaluate our method on the challenging KITTI autonomous driving benchmark, and
show that accounting for the motion of segmented vehicles leads to
state-of-the-art performance.
Comment: International Conference on 3D Vision (3DV), 2017 (oral presentation).
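The iterative cascade can be pictured as alternating stage updates, each stage consuming the others' current outputs. The toy stage models below (a hypothetical segmentation and flow refinement) only illustrate that structure, not the paper's actual models:

```python
import numpy as np

def refine_segmentation(seg, flow):
    """Toy update: pull the mask toward a thresholded motion cue."""
    return 0.5 * (seg + (flow > flow.mean()).astype(float))

def refine_flow(flow, seg, motion):
    """Toy update: blend the rigid-motion value into the segmented region."""
    out = flow.copy()
    mask = seg > 0.5
    out[mask] = 0.7 * out[mask] + 0.3 * motion
    return out

def cascade(seg, flow, motion, iters=3):
    """Alternate the stage updates, mirroring the iterative refinement."""
    for _ in range(iters):
        seg = refine_segmentation(seg, flow)
        flow = refine_flow(flow, seg, motion)
    return seg, flow
```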
Exploiting High Level Scene Cues in Stereo Reconstruction
We present a novel approach to 3D reconstruction which is inspired by the human visual system. This system unifies standard appearance matching and triangulation techniques with higher-level reasoning and scene understanding in order to resolve ambiguities between different interpretations of the scene. The types of reasoning integrated in the approach include recognising common configurations of surface normals and semantic edges (e.g. convex, concave and occlusion boundaries). We also recognise the coplanar, collinear and symmetric structures which are especially common in man-made environments.
Classic Mosaics and Visual Correspondence via Graph-Cut based Energy Optimization
Computer graphics and computer vision were traditionally two distinct research fields focusing on opposite topics. Lately, they have been increasingly borrowing ideas and tools from each other. In this thesis, we investigate two problems in computer vision and graphics that rely on the same tool, namely energy optimization with graph cuts.
In the area of computer graphics, we address the problem of generating artificial classic mosaics, both still and animated. The main purpose of artificial mosaics is to help a user create digital art. First, we reformulate our previous static mosaic work in a more principled global optimization framework. Then, relying on our still mosaic algorithm, we develop a method for producing animated mosaics directly from real video sequences, which we believe is the first such method. Our mosaic animation style is uniquely expressive. Our method estimates the motion of the pixels in the video and renders the frames with a mosaic effect based on both the colour and motion information from the input video. This algorithm relies extensively on our novel motion segmentation approach, which is a computer vision problem.
To improve the quality of our animated mosaics, we need to improve the motion segmentation algorithm. Since motion and stereo problems have a similar setup, we start with the problem of finding visual correspondence for stereo, which has the advantage of datasets with ground truth, useful for evaluation. Most previous methods for stereo correspondence do not provide any measure of reliability in their estimates. We aim to find the regions for which correspondence can be determined reliably. Our main idea is to find corresponding regions that have a sufficiently strong texture cue on the boundary, since texture is a reliable cue for matching. Unlike previous work, we allow the disparity range within each such region to vary smoothly instead of being constant. This produces blob-like semi-dense visual features for which we have high confidence in their estimated ranges of disparities.
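The tool shared by both parts of the thesis, energy optimization with graph cuts, minimizes an energy of the standard data-plus-smoothness form. The sketch below only evaluates such an energy for a given disparity map; the graph-cut minimization itself, and the texture-based confidence regions, are not reproduced here:

```python
import numpy as np

def stereo_energy(left, right, disp, lam=1.0):
    """Evaluate a standard MRF stereo energy E(d) = data + lam * smoothness.

    Data term: absolute intensity difference between matched pixels.
    Smoothness term: absolute disparity difference between horizontal
    and vertical neighbours. Graph cuts would minimize this energy; here
    we only score a candidate disparity map.
    """
    h, w = left.shape
    data = 0.0
    for y in range(h):
        for x in range(w):
            xr = max(0, x - int(disp[y, x]))
            data += abs(float(left[y, x]) - float(right[y, xr]))
    smooth = (np.abs(np.diff(disp, axis=0)).sum()
              + np.abs(np.diff(disp, axis=1)).sum())
    return data + lam * smooth
```

A correct constant-disparity labeling of a shifted image pair scores lower (better) than a wrong one, which is the property the graph-cut optimizer exploits.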
On the Synergies between Machine Learning and Binocular Stereo for Depth Estimation from Images: a Survey
Stereo matching is one of the longest-standing problems in computer vision
with close to 40 years of studies and research. Throughout the years the
paradigm has shifted from local, pixel-level decision to various forms of
discrete and continuous optimization to data-driven, learning-based methods.
Recently, the rise of machine learning and the rapid proliferation of deep
learning enhanced stereo matching with new exciting trends and applications
unthinkable until a few years ago. Interestingly, the relationship between
these two worlds is two-way. While machine, and especially deep, learning
advanced the state-of-the-art in stereo matching, stereo itself enabled new
ground-breaking methodologies such as self-supervised monocular depth
estimation based on deep networks. In this paper, we review recent research in
the field of learning-based depth estimation from single and binocular images
highlighting the synergies, the successes achieved so far and the open
challenges the community is going to face in the immediate future.
Comment: Accepted to TPAMI. Paper version of our CVPR 2019 tutorial:
"Learning-based depth estimation from stereo and monocular images: successes,
limitations and future challenges"
(https://sites.google.com/view/cvpr-2019-depth-from-image/home)
Video-based Pedestrian Intention Recognition and Path Prediction for Advanced Driver Assistance Systems
Advanced driver assistance systems (ADAS) play a very important role in future vehicles in increasing safety for the driver, the passengers, and vulnerable road users such as pedestrians and cyclists. Within limits, such systems attempt to avoid collisions in dangerous situations involving an inattentive driver and pedestrian by triggering automatic emergency braking. Due to the high variability of pedestrian motion patterns, existing systems are designed conservatively, restricting themselves to controllable environments in order to drastically reduce possible false-trigger rates, e.g. in scenarios in which pedestrians suddenly stop and thereby defuse the situation. To overcome this problem, reliable pedestrian intention recognition and path prediction are of great value.
This work describes the complete processing chain of a stereo-video-based system for pedestrian intention estimation and path prediction, which is used in a subsequent function decision for automatic emergency braking.
In the first of three main components, a real-time method is proposed that localizes the heads of pedestrians and estimates their pose in low-resolution images from complex and highly dynamic inner-city scenarios. Single-frame estimates are derived from the probability outputs of eight trained head-pose-specific detectors applied to the image region of a pedestrian candidate. Further robustness in head localization is achieved by incorporating stereo depth information. In addition, the head positions and their poses are smoothed over time by means of a particle filter.
For pedestrian intention estimation, the use of a robust and powerful machine learning approach is examined in different scenarios. For time series of observations, this approach is able to model the inner substructures of a given intention class and, in addition, to capture the extrinsic dynamics between different intention classes. The method integrates meaningful features extracted from the pedestrian dynamics as well as contextual information via the human head pose.
Finally, a path prediction method is presented that steers the prediction steps of a multiple-motion-model filter over a time horizon of roughly one second by incorporating the estimated pedestrian intentions. By helping the filter choose the appropriate motion model, the resulting path prediction error can be reduced significantly. A multitude of scenarios is covered, including laterally crossing or stopping pedestrians, as well as persons who initially walk along the sidewalk but then suddenly turn towards the road.
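The idea of steering a multiple-motion-model filter with an estimated intention can be illustrated with a deliberately simplified sketch: two hypothetical motion models, constant velocity and exponential deceleration, blended by a stop-intention probability. The names, dynamics and deceleration factor are assumptions, not the thesis implementation:

```python
import numpy as np

def predict_path(pos, vel, p_stop, horizon_s=1.0, dt=0.1):
    """Blend constant-velocity and stopping predictions by p_stop."""
    path = []
    pos_cv = np.array(pos, float)   # constant-velocity hypothesis
    pos_st = np.array(pos, float)   # stopping hypothesis
    v_cv = np.array(vel, float)
    v_st = np.array(vel, float)
    for _ in range(int(round(horizon_s / dt))):
        pos_cv = pos_cv + v_cv * dt
        v_st = v_st * 0.6           # exponential deceleration toward a stop
        pos_st = pos_st + v_st * dt
        path.append((1 - p_stop) * pos_cv + p_stop * pos_st)
    return np.array(path)
```

With a reliable stop-intention estimate the predicted endpoint moves toward the true stopping point, which is the mechanism behind the reported reduction in path prediction error.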
Stereoscopic high dynamic range imaging
Two modern technologies show promise to dramatically increase immersion in
virtual environments. Stereoscopic imaging captures two images representing
the views of both eyes and allows for better depth perception. High dynamic
range (HDR) imaging accurately represents real world lighting as opposed to
traditional low dynamic range (LDR) imaging. HDR provides a better contrast
and more natural looking scenes. The combination of the two technologies in
order to gain advantages of both has been, until now, mostly unexplored due to
the current limitations in the imaging pipeline. This thesis reviews both
fields, proposes a stereoscopic high dynamic range (SHDR) imaging pipeline
outlining the challenges that need to be resolved to enable SHDR, and focuses
on the capture and compression aspects of that pipeline.
The problems of capturing SHDR images, which would potentially require two
HDR cameras and introduce ghosting, are mitigated by capturing an HDR and
LDR pair and using it to generate SHDR images. A detailed user study compared
four different methods of generating SHDR images. Results demonstrated that
one of the methods may produce images perceptually indistinguishable from the
ground truth.
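One simple baseline for building such a pair, shown here purely for illustration and not one of the four methods compared in the thesis, is to expand the LDR view with an inverse tone curve so its luminance range matches the HDR view:

```python
import numpy as np

def expand_ldr(ldr_8bit, peak_nits=1000.0, gamma=2.2):
    """Inverse-gamma expansion of an 8-bit LDR image to linear HDR values.

    peak_nits and gamma are illustrative parameters, not values from
    the thesis.
    """
    lin = (ldr_8bit.astype(float) / 255.0) ** gamma
    return lin * peak_nits

def make_shdr_pair(hdr_left, ldr_right_8bit, peak_nits=1000.0):
    """Pair the captured HDR left view with the expanded right view."""
    return hdr_left, expand_ldr(ldr_right_8bit, peak_nits)
```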
Insights obtained while developing static image operators guided the design
of SHDR video techniques. Three methods for generating SHDR video from an
HDR-LDR video pair are proposed and compared to the ground truth SHDR
videos. Results showed little overall error and identified a method with the least
error.
Once captured, SHDR content needs to be efficiently compressed. Five SHDR
compression methods that are backward compatible are presented. The proposed
methods can encode SHDR content to little more than that of a traditional single
LDR image (18% larger for one method) and the backward compatibility property
encourages early adoption of the format.
The work presented in this thesis has introduced and advanced capture and
compression methods for the adoption of SHDR imaging. In general, this research
paves the way for the novel field of SHDR imaging, which should lead to improved
and more realistic representations of captured scenes.
Single View Modeling and View Synthesis
This thesis develops new algorithms to produce 3D content from a single camera. Today, amateurs can use hand-held camcorders to capture and display the 3D world in 2D using mature technologies. However, there is always a strong desire to record and re-explore the 3D world in 3D. To achieve this goal, current approaches usually make use of a camera array, which suffers from tedious setup and calibration processes as well as a lack of portability, limiting its application to lab experiments.
In this thesis, I try to produce 3D content using a single camera, making it as simple as shooting pictures. It requires a new front-end capture device rather than a regular camcorder, as well as more sophisticated algorithms. First, in order to capture highly detailed object surfaces, I designed and developed a depth camera based on a novel technique called light fall-off stereo (LFS). The LFS depth camera outputs color+depth image sequences at 30 fps, which is necessary for capturing dynamic scenes. Based on the output color+depth images, I developed a new approach that builds 3D models of dynamic and deformable objects. While the camera can only capture part of a whole object at any instant, partial surfaces are assembled into a complete 3D model by a novel warping algorithm.
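The light fall-off principle rests on the inverse-square law: if the light source is moved back by a known baseline along the viewing axis, the intensity ratio of the two captures determines depth. A minimal sketch, assuming ideal Lambertian reflection and ignoring the actual camera optics of the thesis's device:

```python
import numpy as np

def lfs_depth(i_near, i_far, baseline):
    """Depth from the inverse-square law.

    With the light moved back by `baseline` along the viewing axis,
    i_near / i_far = ((r + baseline) / r)**2, which solves to
    r = baseline / (sqrt(i_near / i_far) - 1).
    """
    ratio = np.sqrt(i_near / i_far)   # equals (r + baseline) / r
    return baseline / (ratio - 1.0)
```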
Inspired by the success of single view 3D modeling, I extended my exploration into 2D-3D video conversion that does not utilize a depth camera. I developed a semi-automatic system that converts monocular videos into stereoscopic videos via view synthesis. It combines motion analysis with user interaction, aiming to transfer as much depth-inference work as possible from the user to the computer. I developed two new methods that analyze the optical flow in order to provide additional qualitative depth constraints. The automatically extracted depth information is presented in the user interface to assist with user labeling work.
In this thesis, I developed new algorithms to produce 3D content from a single camera. Depending on the input data, my algorithm can build high-fidelity 3D models for dynamic and deformable objects if depth maps are provided. Otherwise, it can turn video clips into stereoscopic video.
Stereoscopic depth estimation for online vision systems
A lot of work has been done in the area of machine stereo vision, but a severe drawback of today's algorithms is that they either achieve high accuracy and robustness at the expense of real-time speed, or they are real-time capable but with major deficiencies in quality. To tackle this problem, this thesis presents two new methods which exhibit a very good balance between computational effort and depth accuracy.
First, the summed normalized cross-correlation is proposed, which constitutes a new cost function for block-matching stereo processing. In contrast to most standard cost functions, it hardly suffers from the fattening effect while being computationally very efficient. Second, direct surface fitting, a new algorithm for fitting parametric surface models to stereo images, is introduced. This algorithm is inspired by homography-constrained gradient descent methods but, in contrast to these, also allows for the estimation of non-planar surfaces. Experimental evaluations demonstrate that both newly introduced algorithms are competitive with the state of the art in terms of accuracy while requiring much less computation time.
Human visual perception is influenced to a high degree by stereoscopic vision, with three-dimensional perception arising from the slightly different viewpoints of the two eyes. It is a natural assumption that machine vision systems can likewise profit from a comparable sense. Although there are already numerous works in the field of machine stereoscopic vision, today's algorithms either do not meet the requirements for efficient computation or offer only low accuracy and robustness.
The goal of this doctoral thesis is the development of stereoscopic algorithms capable of real-time operation under real-world conditions. In particular, the computation should be lightweight enough to run on mobile platforms. To this end, two new methods are presented in this work which are characterized by a good balance between speed and accuracy.
First, the "Summed Normalized Cross-Correlation" (SNCC) is presented, a new cost function for block-matching stereoscopic depth estimation. In contrast to most other cost functions, SNCC is not susceptible to the quality-degrading "fattening" effect, yet can still be computed very efficiently. The evaluation of accuracy on standard benchmarks shows that with SNCC the accuracy of local, block-matching stereoscopic computation comes close to that of globally optimizing methods based on "graph cuts" or "belief propagation".
The second method presented is "Direct Surface Fitting", a new algorithm for fitting parametric surface models to stereo images. This algorithm is inspired by homography-constrained gradient descent, which is frequently used to estimate the pose of planar surfaces in space. By replacing the gradient descent with the direct search method of Hooke and Jeeves, the planar estimation is extended to arbitrary parametric surface models and arbitrary cost functions. A comparison on standard benchmarks shows that "Direct Surface Fitting" achieves accuracy comparable to state-of-the-art methods while exhibiting higher robustness in challenging situations.
To substantiate the real-world suitability and efficiency of the presented methods, they were integrated into an automotive and a robotic system. The experiments carried out with these mobile systems demonstrate the high robustness and stability of the introduced methods.
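The two-stage structure of SNCC, normalized cross-correlation over small patches followed by summation over a larger window, can be sketched as follows. The window sizes and the brute-force filtering are illustrative choices, not the thesis code:

```python
import numpy as np

def box_filter(img, r):
    """Brute-force box filter: mean over a (2r+1)^2 neighbourhood."""
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            out[y, x] = img[max(0, y-r):y+r+1, max(0, x-r):x+r+1].mean()
    return out

def ncc_map(left, right, d, r=1, eps=1e-9):
    """Per-pixel NCC over small (2r+1)^2 patches at disparity d."""
    h, w = left.shape
    out = np.zeros((h, w))
    shifted = np.roll(right, d, axis=1)
    for y in range(r, h - r):
        for x in range(r + d, w - r):
            a = left[y-r:y+r+1, x-r:x+r+1]
            b = shifted[y-r:y+r+1, x-r:x+r+1]
            a = a - a.mean()
            b = b - b.mean()
            out[y, x] = (a * b).sum() / (
                np.sqrt((a * a).sum() * (b * b).sum()) + eps)
    return out

def sncc_score(left, right, d, r_small=1, r_big=4):
    """SNCC: average the small-window NCC scores over a larger window."""
    return box_filter(ncc_map(left, right, d, r_small), r_big)
```

Normalizing in small patches first is what limits the fattening effect: a depth edge only corrupts the few tiny patches that straddle it, rather than one large correlation window.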
Advances in Stereo Vision
Stereopsis is a vision process whose geometrical foundation has been known for a long time, ever since the experiments by Wheatstone in the 19th century. Nevertheless, its inner workings in biological organisms, as well as its emulation by computer systems, have proven elusive, and stereo vision remains a very active and challenging area of research nowadays. In this volume we have attempted to present a limited but relevant sample of the work being carried out in stereo vision, covering significant aspects from both the applied and the theoretical standpoints.
- âŠ