
    Acquiring 3D scene information from 2D images

    In recent years, people have become increasingly acquainted with 3D technologies such as 3DTV, 3D movies and 3D virtual navigation of city environments in their daily life. Commercial 3D movies are now commonly available to consumers. Virtual navigation of our living environment on a personal computer has become a reality thanks to well-known web-based geographic applications that use advanced imaging technologies. To enable such 3D applications, many technological challenges, such as 3D content creation, 3D display technology and 3D content transmission, need to be tackled and deployed at low cost. This thesis concentrates on the reconstruction of 3D scene information from multiple 2D images, aiming at automatic and low-cost production of 3D content.

    Two multiple-view 3D reconstruction systems are proposed: a 3D modeling system for reconstructing a sparse 3D scene model from long video sequences captured with a hand-held consumer camcorder, and a depth reconstruction system for creating depth maps from multiple-view videos taken by multiple synchronized cameras. Both systems are designed to compute the 3D scene information in an automated way with minimal human intervention, in order to reduce the production cost of 3D content. Experimental results on real videos of hundreds and thousands of frames show that the two systems are able to accurately and automatically reconstruct 3D scene information from 2D image data. The findings of this research are useful for emerging 3D applications such as 3D games, 3D visualization and 3D content production. Apart from designing and implementing the two proposed systems, we have made three key scientific contributions that enable them.

    The first contribution is a novel feature point matching algorithm that uses only a smoothness constraint for matching the points, which states that neighboring feature points in images tend to move with similar directions and magnitudes. The employed smoothness assumption is not only valid but also robust for most images with limited image motion, regardless of the camera motion and scene structure. Because of this, the algorithm obtains two major advantages. First, it is robust to illumination changes, as the smoothness constraint does not rely on any texture information. Second, it handles drift of the feature points over time well, as drift can hardly lead to a violation of the smoothness constraint. This results in a large number of feature points being matched and tracked by the proposed algorithm, which significantly helps the subsequent 3D modeling process. The algorithm is specifically designed for matching and tracking feature points in image/video sequences where the image motion is limited. Our extensive experimental results show that it tracks at least 2.5 times as many feature points as state-of-the-art algorithms, with comparable or higher accuracy, which contributes significantly to the robustness of the 3D reconstruction process.
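    The neighborhood-consistency idea behind such a smoothness constraint can be illustrated with a small sketch (illustrative only, not the algorithm from the thesis; the function name and thresholds are made up): a candidate match is kept only if its displacement agrees with the median displacement of its nearest neighboring matches.

```python
# Minimal sketch of smoothness-based match filtering (illustrative only,
# not the thesis implementation). Candidate matches are given as arrays of
# corresponding 2D points in two consecutive frames.
import numpy as np

def filter_matches_by_smoothness(pts_a, pts_b, k=8, max_dev=3.0):
    """Keep a match only if its displacement is close (in pixels) to the
    median displacement of its k nearest neighboring matches."""
    pts_a = np.asarray(pts_a, dtype=float)    # (N, 2) points in frame t
    pts_b = np.asarray(pts_b, dtype=float)    # (N, 2) points in frame t+1
    disp = pts_b - pts_a                      # per-match motion vectors
    keep = np.zeros(len(pts_a), dtype=bool)
    for i in range(len(pts_a)):
        # k nearest neighbors of point i in frame t (excluding itself)
        d = np.linalg.norm(pts_a - pts_a[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]
        median_disp = np.median(disp[nbrs], axis=0)
        # accept the match if it moves like its neighborhood
        keep[i] = np.linalg.norm(disp[i] - median_disp) < max_dev
    return keep

# Example: 100 consistent matches plus one gross outlier
rng = np.random.default_rng(0)
a = rng.uniform(0, 640, size=(100, 2))
b = a + np.array([2.0, 1.0]) + rng.normal(0, 0.2, size=(100, 2))
b[0] += 40.0                                   # inconsistent match
print(filter_matches_by_smoothness(a, b)[:5])  # first entry is rejected
```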
    The second contribution is a set of algorithms to detect critical configurations in which factorization-based 3D reconstruction degenerates. Based on this detection, we propose a sequence-dividing algorithm that splits a long sequence into subsequences, such that a successful 3D reconstruction can be performed on each subsequence with high confidence. The partial reconstructions are later merged to obtain the 3D model of the complete scene. Four critical configurations are detected: (1) coplanar 3D scene points, (2) pure camera rotation, (3) rotation around two camera centers, and (4) the presence of excessive noise and outliers in the measurements. Configurations (1), (2) and (4) affect the rank of the Scaled Measurement Matrix (SMM), while the number of camera centers in case (3) affects the number of independent rows of the SMM. By examining the rank and the row space of the SMM, these critical configurations are detected. Based on the detection results, the sequence-dividing algorithm divides a long sequence into subsequences that are free of the four critical configurations, so that successful 3D reconstructions can be obtained on each of them. Experimental results on both synthetic and real sequences demonstrate that the four critical configurations are robustly detected and that a long sequence of thousands of frames is automatically divided into subsequences, yielding successful 3D reconstructions. The proposed critical configuration detection and sequence-dividing algorithms provide an essential processing block for automatic 3D reconstruction of long sequences.

    The third contribution is a coarse-to-fine multiple-view depth labeling algorithm that computes depth maps from multiple-view videos, where the accuracy of the resulting depth maps is gradually refined in multiple optimization passes. Multiple-view depth reconstruction is formulated as an image-based labeling problem within the Maximum A Posteriori (MAP) framework on Markov Random Fields (MRF). The MAP-MRF framework allows various objective and heuristic depth cues to be combined in the local penalty and interaction energies, which provides a straightforward and computationally tractable formulation. Furthermore, the globally optimal MAP solution to the depth labeling can be found by minimizing the local energies using existing MRF optimization algorithms. The algorithm contains three key elements. (1) A graph construction algorithm is proposed to build triangular meshes on over-segmentation maps, in order to exploit color and texture information for depth labeling. (2) Multiple depth cues are combined to define the local energies, and these energies are adapted to the local image content to account for its varying nature and achieve accurate depth labeling. (3) Both the density of the graph nodes and the intervals of the depth labels are gradually refined in multiple labeling passes, which improves both the computational efficiency and the robustness of the depth labeling process. Experimental results on real multiple-view videos show that the depth maps for the selected reference views are accurately reconstructed and that depth discontinuities are very well preserved.
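    As a rough illustration of the rank-based idea behind the critical configuration detection (a toy sketch under idealized, noise-free assumptions, not the thesis' detection rules), the numerical rank of a measurement matrix can be probed through its singular values; a degenerate configuration such as coplanar scene points shows up as a drop in the effective rank.

```python
# Illustrative numerical-rank probe for a measurement matrix (a sketch only;
# the actual tests on the Scaled Measurement Matrix are more involved).
import numpy as np

def effective_rank(M, rel_tol=1e-3):
    """Number of singular values above rel_tol times the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

# Toy example: ideal projections of general vs. coplanar 3D points
rng = np.random.default_rng(1)
X_general = rng.normal(size=(4, 50)); X_general[3] = 1.0  # homogeneous 3D points
X_planar = X_general.copy(); X_planar[2] = 0.0            # force Z = 0 (coplanar)
P = rng.normal(size=(3 * 5, 4))                           # 5 stacked 3x4 cameras
for name, X in [("general", X_general), ("coplanar", X_planar)]:
    M = P @ X                                              # ideal scaled measurements
    print(name, "effective rank:", effective_rank(M))      # 4 vs. 3
```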

    Web service for automatic generation of 3D models through video

    This dissertation aims to create a scalable web service capable of generating 3D models from video. In order to improve usability and reduce costs, the video is expected to be acquired with a monocular acquisition system. To make the tool as simple and flexible as possible, a web service is developed, which allows developers to build other applications that can be accessed from less powerful devices, such as smartphones, provided they have an Internet connection. The major contribution of this dissertation is the design of the cloud system architecture and the adaptation of existing 3D reconstruction algorithms to this new setting, since most computer vision algorithms are not distributed.
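    The kind of interface such a service exposes can be sketched as follows (a minimal, hypothetical sketch; the endpoint names, job store and framework choice are assumptions, not the dissertation's actual implementation): a client uploads a video, the service queues a reconstruction job and returns a job id that can later be polled for the finished 3D model.

```python
# Minimal Flask sketch of a video-to-3D reconstruction web service
# (hypothetical endpoints; illustrative only).
import uuid
from flask import Flask, request, jsonify

app = Flask(__name__)
JOBS = {}  # job_id -> status; a real deployment would use a distributed queue

@app.route("/reconstruct", methods=["POST"])
def submit_video():
    video = request.files.get("video")
    if video is None:
        return jsonify(error="no video uploaded"), 400
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "queued", "filename": video.filename}
    # here a worker pool would run the (monocular) 3D reconstruction pipeline
    return jsonify(job_id=job_id), 202

@app.route("/jobs/<job_id>", methods=["GET"])
def job_status(job_id):
    job = JOBS.get(job_id)
    if job is None:
        return jsonify(error="unknown job"), 404
    return jsonify(job)

if __name__ == "__main__":
    app.run()
```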

    Real Time Sequential Non Rigid Structure from motion using a single camera

    Applications that rely on accurate localization and reconstruction within a real 3D environment have attracted great interest in recent years, both from the research community and from industry. These applications range from augmented reality and robotics to simulation and video games. Depending on the application and the level of detail required for the reconstruction, various devices are used, some of them specialized, more complex and expensive, such as stereo cameras, color-and-depth (RGBD) cameras based on structured light or Time of Flight (ToF), as well as laser scanners and other more advanced sensors. For simple applications, everyday devices such as smartphones are sufficient; by applying computer vision techniques, 3D models of the environment can be obtained and, in the case of augmented reality, augmented information can be displayed at the selected location. In robotics, simultaneously localizing the camera and building a 3D map of the environment is a fundamental task for achieving autonomous navigation. This problem is known in the state of the art as Simultaneous Localization And Mapping (SLAM) or Structure from Motion (SfM). For these techniques to apply, the object must not change its shape over time; the reconstruction is then unique up to a scale factor for monocular capture without an external reference. If the rigidity condition does not hold, the object's shape changes over time, and the problem becomes equivalent to performing one reconstruction per frame, which cannot be done directly, since different shapes combined with different camera poses can produce similar projections. This is why the reconstruction of deformable objects is still a developing area. SfM methods have been adapted by applying physical models and temporal, spatial, geometric or other constraints to reduce the ambiguity of the solutions, giving rise to the techniques known as Non-Rigid SfM (NRSfM).

    This thesis proposes to start from a rigid reconstruction technique that is well known in the state of the art, PTAM (Parallel Tracking and Mapping), and adapt it to include NRSfM techniques based on a linear basis model, so as to estimate the deformations of the modeled object dynamically, apply temporal and spatial constraints to improve the reconstructions, and adapt to changes in deformation that appear in the sequence. To this end, PTAM has to be modified so that each of its processing threads handles non-rigid data. The tracking thread already performed tracking based on a 3D point map provided a priori. The most important modification here is the integration of a linear deformation model so that the deformation of the object is computed in real time, assuming the basis deformation shapes are fixed. The camera pose computation is based on the rigid estimation system, so the pose and the deformation coefficients are estimated in alternation using the E-M (Expectation-Maximization) algorithm. Temporal and shape constraints are also imposed to restrict the ambiguities inherent in the solutions and to improve the quality of the 3D estimation. As for the thread that manages the map, it is updated over time so that it can improve the deformation bases when they are no longer able to explain the shapes observed in the current images. To do so, the rigid-model optimization included in this thread is replaced by a batch NRSfM processing method that refines the bases according to the images reported by the tracking thread as having a large reconstruction error. In this way, the model adapts to new deformations, allowing the system to evolve and remain stable in the long term. Unlike a large part of the methods in the literature, the proposed system handles perspective projection natively, minimizing the ambiguity and object-distance issues present in orthographic projection. The proposed system handles hundreds of points and is designed to meet real-time constraints for use on systems with limited hardware resources.
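    The linear basis deformation model referred to above expresses the shape at each frame as a mean shape plus a weighted sum of basis shapes, S_t = B_0 + sum_k c_{t,k} B_k. A toy numpy sketch (made-up dimensions and camera; not the thesis implementation) of generating such a deformed shape and its perspective projection:

```python
# Toy sketch of the linear-basis deformation model used in NRSfM
# (illustrative only; point count, basis count and camera are made up).
import numpy as np

rng = np.random.default_rng(2)
P, K = 200, 3                                  # number of points, deformation bases
B0 = rng.normal(size=(3, P))                   # mean (rigid) shape
B = rng.normal(scale=0.1, size=(K, 3, P))      # deformation basis shapes
c = rng.normal(size=K)                         # per-frame deformation coefficients

# Deformed shape for this frame: S = B0 + sum_k c_k * B_k
S = B0 + np.tensordot(c, B, axes=1)

# Perspective projection with a simple pinhole camera
f, R, t = 500.0, np.eye(3), np.array([0.0, 0.0, 10.0])
Xc = R @ S + t[:, None]                        # points in camera coordinates
uv = f * Xc[:2] / Xc[2]                        # projected image coordinates, shape (2, P)
print(uv.shape)
```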

    Harnessing Big Data and Machine Learning for Event Detection and Localization

    Anomalous events are rare and deviate significantly from expected patterns and from other data instances, which makes them hard to predict. Detecting anomalous severe events correctly and in a timely manner can help reduce risks and losses. Many anomalous event detection techniques have been studied in the literature. Recently, techniques based on big data and machine learning have shown remarkable success in a wide range of fields. It is important to tailor such techniques to each application; otherwise they may result in expensive computation, slow prediction, false alarms, and improper prediction granularity. We aim to address these challenges by harnessing big data and machine learning techniques for fast and reliable prediction and localization of severe events.

    First, to improve storage failure prediction, we develop a new lightweight and high-performing tensor decomposition-based method, named SEFEE, for storage error forecasting in large-scale enterprise storage systems. SEFEE employs tensor decomposition to capture latent spatio-temporal information embedded in storage event logs. By utilizing this latent spatio-temporal information, we can make accurate storage error forecasts without the training requirements of typical machine learning techniques. The training-free method allows live prediction of storage errors and their locations in the storage system, based on the previous observations used in the tensor decomposition pipeline to extract meaningful latent correlations. Moreover, we propose an extension that includes the severity of the errors as contextual information, which improves the accuracy of the tensor decomposition and in turn the prediction accuracy. We also provide a detailed characterization of the NetApp dataset to give the community additional insight into the dynamics of typical large-scale enterprise storage systems.

    Next, we focus on another application: AI-driven wildfire prediction. Wildfires cause billions of dollars in property damage and loss of lives, along with harmful health threats. We aim to correctly detect and localize wildfire events at an early stage and to classify wildfire smoke based on the perceived pixel density in camera images. Due to the lack of a publicly available dataset for early wildfire smoke detection, we first collect and process images from the AlertWildfire camera network. The images are annotated with bounding boxes and densities for use by deep learning methods. We then adapt a transformer-based end-to-end object detection model for wildfire detection using our dataset. The dataset and detection model together form a benchmark named the Nevada smoke detection benchmark, or Nemo for short. Nemo is the first open-source benchmark for wildfire smoke detection focused on the early incipient stage. We further provide a weakly supervised version of Nemo to enable wider support as a benchmark.
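    The idea of forecasting from latent spatio-temporal structure can be sketched with a much simpler low-rank stand-in (illustrative only; SEFEE's tensor decomposition pipeline is more elaborate, and the data here are synthetic): build a device x error-type x time count tensor, fit a low-rank model to the history, and flag the cell of the latest time slice that deviates most from the model.

```python
# Simplified low-rank stand-in for spatio-temporal error forecasting
# (illustrative only, synthetic data; not the SEFEE method itself).
import numpy as np

rng = np.random.default_rng(3)
devices, types, steps = 20, 5, 100
tensor = rng.poisson(1.0, size=(devices, types, steps)).astype(float)
tensor[7, 2, -1] += 25                      # inject an anomalous error burst

# Unfold the history along time and keep a rank-r approximation
history = tensor[..., :-1].reshape(devices * types, steps - 1)
U, s, Vt = np.linalg.svd(history, full_matrices=False)
r = 3
low_rank = (U[:, :r] * s[:r]) @ Vt[:r]

# Score the latest slice against the typical per-cell level of the model
typical = low_rank.mean(axis=1).reshape(devices, types)
residual = tensor[..., -1] - typical
dev, typ = np.unravel_index(np.argmax(residual), residual.shape)
print("most anomalous cell:", dev, typ)     # expected: device 7, error type 2
```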

    Detection and Simulation of Dangerous Human Crowd Behavior

    Tragically, gatherings of large human crowds quite often end in crowd disasters such as the recent catastrophe at the Loveparade 2010. In the past, research on pedestrian and crowd dynamics focused on the simulation of pedestrian motion. As of yet, however, no automatic system exists that can detect hazardous situations in crowds and thus help prevent such tragic incidents. In the thesis at hand, we analyze pedestrian behavior in large crowds and observe characteristic motion patterns. Based on our findings, we present a computer vision system that detects unusual events and critical situations from video streams and thus alerts security personnel so that the necessary actions can be taken. We evaluate the system's performance on synthetic and experimental as well as real-world data. In particular, we show its effectiveness on the surveillance videos recorded at the Loveparade crowd stampede. Since our method is based on optical flow computations, it meets two crucial prerequisites in video surveillance: firstly, it works in real time and, secondly, the privacy of the people being monitored is preserved. In addition, we integrate the observed motion patterns into models for simulating pedestrian motion and show that the proposed simulation model produces realistic trajectories. We employ this model to simulate large human crowds and use techniques from computer graphics to render synthetic videos for further evaluation of our automatic video surveillance system.
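    An optical-flow-based monitoring loop of this kind can be sketched roughly as follows (illustrative only; the thesis' detector derives more refined motion-pattern features from the flow, and the input file name and thresholds here are made up): dense flow is computed per frame and a simple statistic of the flow field is thresholded to raise an alert.

```python
# Rough sketch of optical-flow-based crowd monitoring (illustrative only).
import cv2
import numpy as np

cap = cv2.VideoCapture("crowd.mp4")            # hypothetical input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # dense Farneback optical flow between consecutive frames
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # crude indicator: strong and highly variable motion across the frame
    if mag.mean() > 2.0 and mag.std() > 4.0:
        print("possible critical situation in this frame")
    prev_gray = gray
cap.release()
```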

    Combining Features and Semantics for Low-level Computer Vision

    Visual perception of depth and motion plays a significant role in understanding and navigating the environment. Reconstructing outdoor scenes in 3D and estimating the motion of video cameras are of utmost importance for applications like autonomous driving. The corresponding problems in computer vision have witnessed tremendous progress over the last decades, yet some aspects still remain challenging today. Striking examples are reflective and textureless surfaces or large motions, which cannot easily be recovered using traditional local methods. Further challenges include occlusions, large distortions and difficult lighting conditions. In this thesis, we propose to overcome these challenges by modeling non-local interactions that leverage semantics and contextual information.

    Firstly, for binocular stereo estimation, we propose to regularize over larger areas of the image using object-category-specific disparity proposals, which we sample using inverse graphics techniques based on a sparse disparity estimate and a semantic segmentation of the image. The disparity proposals encode the fact that objects of certain categories are not arbitrarily shaped but typically exhibit regular structures. We integrate them as a non-local regularizer for the challenging object class 'car' into a superpixel-based graphical model and demonstrate its benefits especially in reflective regions.

    Secondly, for 3D reconstruction, we leverage the fact that the larger the reconstructed area, the more likely it is that objects of similar type and shape occur in the scene. This is particularly true for outdoor scenes, where buildings and vehicles often suffer from missing texture or reflections but share similarity in 3D shape. We take advantage of this shape similarity by localizing objects using detectors and jointly reconstructing them while learning a volumetric model of their shape. This allows noise to be reduced while missing surfaces are completed, as objects of similar shape benefit from all observations for the respective category. Evaluations against LIDAR ground truth on a novel, challenging suburban dataset show the advantages of modeling structural dependencies between objects.

    Finally, motivated by the success of deep learning techniques in matching problems, we present a method for learning context-aware features for solving optical flow using discrete optimization. Towards this goal, we present an efficient way of training a context network with a large receptive field size on top of a local network, using dilated convolutions on patches. We perform feature matching by comparing each pixel in the reference image to every pixel in the target image, utilizing fast GPU matrix multiplication. The matching cost volume formed from the network's output serves as the data term for discrete MAP inference in a pairwise Markov random field. Extensive evaluations reveal the importance of context for feature matching.
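    The exhaustive matching step described in the last paragraph boils down to a dense similarity between two feature maps, which a single matrix multiplication can express. A small numpy sketch (toy dimensions and random features, not the learned context network from the thesis):

```python
# Toy sketch of building a matching cost volume by comparing every pixel of a
# reference feature map with every pixel of a target feature map through one
# matrix multiplication (illustrative only; the thesis uses learned features
# and GPU matrix multiplication at full image resolution).
import numpy as np

rng = np.random.default_rng(4)
H, W, C = 32, 48, 64                           # toy feature map size and channels
f_ref = rng.normal(size=(H * W, C))
f_tgt = rng.normal(size=(H * W, C))

# L2-normalize so the dot product becomes a cosine similarity
f_ref /= np.linalg.norm(f_ref, axis=1, keepdims=True)
f_tgt /= np.linalg.norm(f_tgt, axis=1, keepdims=True)

# (H*W) x (H*W) similarity matrix; its negative acts as the matching cost
# volume that feeds the data term of the pairwise MRF
cost_volume = -f_ref @ f_tgt.T
best_match = cost_volume.argmin(axis=1)        # most similar target pixel per reference pixel
print(cost_volume.shape, best_match.shape)
```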

    Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)


    Self-Calibration of Multi-Camera Systems for Vehicle Surround Sensing

    Multi-camera systems are being deployed in a variety of vehicles and mobile robots today. To eliminate the need for cost- and labor-intensive maintenance and calibration, continuous self-calibration is highly desirable. In this book we present such an approach for the self-calibration of multi-camera systems for vehicle surround sensing. In an extensive evaluation we assess our algorithm quantitatively using real-world data.

    From Calibration to Large-Scale Structure from Motion with Light Fields

    Classic pinhole cameras project the multi-dimensional information of the light flowing through a scene onto a single 2D snapshot. This projection limits the information that can be reconstructed from the 2D acquisition. Plenoptic (or light field) cameras, on the other hand, capture a 4D slice of the plenoptic function, termed the “light field”. These cameras provide both spatial and angular information on the light flowing through a scene; multiple views are captured in a single photographic exposure, facilitating various applications. This thesis is concerned with the modelling of light field (or plenoptic) cameras and the development of structure from motion pipelines using such cameras. Specifically, we develop a geometric model for a multi-focus plenoptic camera, followed by a complete pipeline for the calibration of the suggested model. Given a calibrated light field camera, we then remap the captured light field to a grid of pinhole images. We use these images to obtain a metric 3D reconstruction through a novel framework for structure from motion with light fields. Finally, we suggest a linear and efficient approach for absolute pose estimation with light fields.
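    For intuition about the 4D light field and the grid of pinhole views mentioned above, a toy numpy sketch (made-up dimensions; not the calibration or remapping pipeline developed in the thesis) shows how sub-aperture views are simply slices of the 4D array L(u, v, s, t):

```python
# Toy sketch: sub-aperture (pinhole-like) views as slices of a 4D light field
# L(u, v, s, t) (illustrative only, with made-up dimensions).
import numpy as np

U, V, S, T = 9, 9, 256, 256                     # angular (u, v) and spatial (s, t) resolution
lf = np.zeros((U, V, S, T), dtype=np.float32)   # a captured light field would go here

# Each fixed angular coordinate (u, v) yields one view of the scene,
# i.e. a grid of U x V pinhole images captured in a single exposure.
center_view = lf[U // 2, V // 2]                # central sub-aperture image, shape (S, T)
views = lf.reshape(U * V, S, T)                 # the full grid of views for structure from motion
print(center_view.shape, views.shape)
```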