
    MegaParallax: Casual 360° Panoramas with Motion Parallax


    Outdoor Dynamic 3-D Scene Reconstruction


    Semantic Mapping of Road Scenes

    The problem of understanding road scenes has been at the forefront of the computer vision community for the last couple of years. Solving it enables autonomous systems to navigate and understand the surroundings in which they operate. It involves reconstructing the scene and estimating the objects present in it, such as ‘vehicles’, ‘road’, ‘pavements’ and ‘buildings’. This thesis focusses on these aspects and proposes solutions to address them. First, we propose a solution to generate a dense semantic map from multiple street-level images. This map can be imagined as a bird’s-eye view of the region, with associated semantic labels, for tens of kilometres of street-level data. We generate the overhead semantic view from street-level images, in contrast to existing approaches that use satellite/overhead imagery to classify urban regions; this allows us to produce a detailed semantic map for a large-scale urban area. Then we describe a method to perform large-scale dense 3D reconstruction of road scenes with associated semantic labels. Our method fuses the depth maps generated from stereo pairs across time into a global 3D volume in an online fashion, in order to accommodate arbitrarily long image sequences. The object class labels estimated from the street-level stereo image sequence are used to annotate the reconstructed volume. Then we exploit the scene structure in object class labelling by performing inference over the meshed representation of the scene. Performing labelling over the mesh solves two issues. Firstly, images often contain redundant information, with multiple images describing the same scene. Labelling these images separately is slow; our method is approximately an order of magnitude faster in the inference stage than normal inference in the image domain. Secondly, multiple images, even though they describe the same scene, often result in inconsistent labelling. 
By labelling a single mesh, we remove the inconsistency of labelling across the images. Our mesh-based labelling also takes into account the object layout in the scene, which is often ambiguous in the image domain, thereby increasing the accuracy of object labelling. Finally, we perform labelling and structure computation through a hierarchical robust P^N Markov Random Field defined on voxels and super-voxels given by an octree. This allows us to infer the 3D structure and the object-class labels in a principled manner, through bounded approximate minimisation of a well-defined and well-studied energy functional. In this thesis, we also introduce two object-labelled datasets created from real-world data. The 15 kilometre Yotta Labelled dataset consists of 8,000 images per camera view of the roadways of the United Kingdom, with a subset of them annotated with object class labels; the second dataset comprises ground truth object labels for the publicly available KITTI dataset. Both datasets are publicly available, and we hope they will be helpful to the vision research community.
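The online fusion of depth maps into a global volume mentioned in the abstract above can be illustrated with a small sketch. The abstract does not state the update rule, so this assumes the common running weighted average of truncated signed distances per voxel; the class name `FusedVolume`, the `truncation` value, and the sparse dictionary layout are all hypothetical choices made for illustration, not the thesis's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Voxel = Tuple[int, int, int]


@dataclass
class FusedVolume:
    """Sparse voxel grid holding one running weighted average per voxel."""

    truncation: float = 0.3  # hypothetical truncation distance, in metres
    values: Dict[Voxel, float] = field(default_factory=dict)
    weights: Dict[Voxel, float] = field(default_factory=dict)

    def integrate(self, voxel: Voxel, signed_distance: float,
                  weight: float = 1.0) -> None:
        """Fold one new observation into a voxel (weight must be positive).

        Only the current average and its accumulated weight are stored,
        so memory does not grow with the number of frames.
        """
        # Clamp the observation to the truncation band (assumed behaviour).
        d = max(-self.truncation, min(self.truncation, signed_distance))
        w_old = self.weights.get(voxel, 0.0)
        v_old = self.values.get(voxel, 0.0)
        w_new = w_old + weight
        self.values[voxel] = (w_old * v_old + weight * d) / w_new
        self.weights[voxel] = w_new


# Three frames observe the same voxel; the stored value is their mean.
volume = FusedVolume()
for observation in (0.10, 0.20, 0.30):
    volume.integrate((4, 2, 7), observation)
```

Because each update touches only the stored average and weight, the cost per observation is constant, which is one way an arbitrarily long sequence could be accommodated.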

    Image-Based Rendering Of Real Environments For Virtual Reality


    NOVEL DENSE STEREO ALGORITHMS FOR HIGH-QUALITY DEPTH ESTIMATION FROM IMAGES

    This dissertation addresses the problem of inferring scene depth from a collection of calibrated images taken from different viewpoints via stereo matching. Although it has been heavily investigated for decades, depth from stereo remains a long-standing challenge and a popular research topic for several reasons. First, to be of practical use in real-time applications such as autonomous driving, depth estimation must be both accurate and fast, which is one of the core challenges in stereo. Second, for applications such as 3D reconstruction and view synthesis, high-quality depth estimation is crucial to achieving photorealistic results. However, due to matching ambiguities, accurate dense depth estimates are difficult to achieve. Last but not least, most stereo algorithms rely on identifying corresponding points among images and only work effectively when scenes are Lambertian. For non-Lambertian surfaces, the brightness constancy assumption is no longer valid. This dissertation contributes three novel stereo algorithms that are motivated by the specific requirements and limitations imposed by different applications. To address high-speed depth estimation from images, we present a stereo algorithm that achieves high-quality results while maintaining real-time performance. We introduce an adaptive aggregation step in a dynamic-programming framework. Matching costs are aggregated in the vertical direction using a computationally expensive weighting scheme based on color and distance proximity. We utilize the vector processing capability and parallelism of commodity graphics hardware to speed up this process by over two orders of magnitude. To address high-accuracy depth estimation, we present a stereo model that makes use of constraints from points with known depths, referred to in the stereo literature as Ground Control Points (GCPs). 
Our formulation explicitly models the influence of GCPs in a Markov Random Field. A novel regularization prior is naturally integrated into a global inference framework in a principled way using Bayes' rule. Our probabilistic framework allows GCPs to be obtained from various modalities and provides a natural way to integrate information from various sensors. To address non-Lambertian reflectance, we introduce a new invariant for stereo correspondence that allows completely arbitrary scene reflectance (bidirectional reflectance distribution functions, BRDFs). This invariant can be used to formulate a rank constraint on stereo matching when the scene is observed under several lighting configurations in which only the lighting intensity varies.
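The vertical cost aggregation with color and distance weighting described in the abstract above can be sketched for a single image column. The abstract does not give the weighting formula, so this assumes an exponential falloff in both terms and uses a scalar intensity in place of a color vector; the function name `aggregate_column` and the parameters `radius`, `gamma_color` and `gamma_dist` are hypothetical.

```python
import math
from typing import List, Sequence


def aggregate_column(costs: Sequence[float], intensity: Sequence[float],
                     radius: int = 2, gamma_color: float = 10.0,
                     gamma_dist: float = 5.0) -> List[float]:
    """Aggregate per-pixel matching costs along one image column.

    A neighbour's weight decays with its intensity difference from the
    centre pixel and with its vertical distance to it (assumed exponential
    form). The weighted sum is normalised by the total weight.
    """
    n = len(costs)
    aggregated = []
    for y in range(n):
        weighted_sum = 0.0
        weight_total = 0.0
        for q in range(max(0, y - radius), min(n, y + radius + 1)):
            w = math.exp(-abs(intensity[y] - intensity[q]) / gamma_color
                         - abs(y - q) / gamma_dist)
            weighted_sum += w * costs[q]
            weight_total += w
        aggregated.append(weighted_sum / weight_total)
    return aggregated


# A column crossing an intensity edge: costs on either side barely mix.
edge_intensity = [10.0, 10.0, 10.0, 200.0, 200.0, 200.0]
edge_costs = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
smoothed = aggregate_column(edge_costs, edge_intensity)
```

Pixels on opposite sides of the intensity edge receive near-zero weight, so the aggregation smooths costs within a surface without blurring them across a likely depth boundary. Each output pixel is independent of the others, which is the kind of workload that maps well onto parallel graphics hardware.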

    Real-Time Accurate Visual SLAM with Place Recognition

    The Simultaneous Localization and Mapping (SLAM) problem consists of localizing a sensor in a map that is built online. SLAM technology makes it possible to localize a robot in an environment unknown to it by processing the information from its on-board sensors, and therefore without depending on external infrastructure. A map allows localization at any time without accumulating drift, unlike odometry, in which incremental motions are integrated. This type of technology is critical for the navigation of service robots and autonomous vehicles, and for localizing the user in augmented or virtual reality applications. The main contribution of this thesis is ORB-SLAM, a feature-based monocular SLAM system that runs in real time in small and large, indoor and outdoor environments. The system is robust to dynamic elements in the scene, can close loops and relocalize the camera even when the viewpoint has changed significantly, and includes a fully automatic initialization method. ORB-SLAM is currently the most complete, accurate and reliable monocular SLAM solution using a camera as the only sensor. Being based on features and bundle adjustment, the system has demonstrated unprecedented accuracy and robustness on standard public sequences. In addition, ORB-SLAM has been extended to reconstruct the environment semi-densely. Our solution decouples the semi-dense reconstruction from the estimation of the camera trajectory, which results in a system that combines the accuracy and robustness of feature-based SLAM with the more complete reconstructions of direct methods. Furthermore, the monocular solution has been extended to exploit information from stereo cameras, RGB-D cameras and inertial sensors, achieving higher accuracy than other state-of-the-art solutions. 
To contribute to the scientific community, we have released as open source an implementation of our SLAM solution for monocular, stereo and RGB-D cameras, making it the first open-source solution able to work with these three types of camera. Bibliography: R. Mur-Artal and J. D. Tardós. Fast Relocalisation and Loop Closing in Keyframe-Based SLAM. IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, June 2014. R. Mur-Artal and J. D. Tardós. ORB-SLAM: Tracking and Mapping Recognizable Features. RSS Workshop on Multi VIew Geometry in RObotics (MVIGRO), Berkeley, USA, July 2014. R. Mur-Artal and J. D. Tardós. Probabilistic Semi-Dense Mapping from Highly Accurate Feature-Based Monocular SLAM. Robotics: Science and Systems (RSS), Rome, Italy, July 2015. R. Mur-Artal, J. M. M. Montiel and J. D. Tardós. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147-1163, October 2015 (2015 IEEE Transactions on Robotics Best Paper Award). R. Mur-Artal and J. D. Tardós. Visual-Inertial Monocular SLAM with Map Reuse. IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 796-803, April 2017 (to be presented at ICRA 17). R. Mur-Artal and J. D. Tardós. ORB-SLAM2: an Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras. ArXiv preprint arXiv:1610.06475, 2016 (under review).

    Acquiring 3D scene information from 2D images

    In recent years, people have become increasingly acquainted with 3D technologies such as 3DTV, 3D movies and 3D virtual navigation of city environments in their daily lives. Commercial 3D movies are now commonly available to consumers. Virtual navigation of our living environment on a personal computer has become a reality thanks to well-known web-based geographic applications that use advanced imaging technologies. To enable such 3D applications, many technological challenges, such as 3D content creation, 3D display technology and 3D content transmission, need to be tackled and the solutions deployed at low cost. This thesis concentrates on the reconstruction of 3D scene information from multiple 2D images, aiming for automatic and low-cost production of 3D content. In this thesis, two multiple-view 3D reconstruction systems are proposed: a 3D modeling system for reconstructing a sparse 3D scene model from long video sequences captured with a hand-held consumer camcorder, and a depth reconstruction system for creating depth maps from multiple-view videos taken by multiple synchronized cameras. Both systems are designed to compute the 3D scene information in an automated way with minimal human intervention, in order to reduce the production cost of 3D content. Experimental results on real videos of hundreds to thousands of frames have shown that the two systems are able to accurately and automatically reconstruct the 3D scene information from 2D image data. The findings of this research are useful for emerging 3D applications such as 3D games, 3D visualization and 3D content production. Apart from designing and implementing the two proposed systems, we have made three key scientific contributions that enable them. 
The first contribution is a novel feature point matching algorithm that uses only a smoothness constraint for matching the points, which states that neighboring feature points in images tend to move in similar directions and with similar magnitudes. The employed smoothness assumption is not only valid but also robust for most images with limited image motion, regardless of the camera motion and scene structure. Because of this, the algorithm obtains two major advantages. First, the algorithm is robust to illumination changes, as the employed smoothness constraint does not rely on any texture information. Second, the algorithm handles the drift of the feature points over time well, as the drift can hardly lead to a violation of the smoothness constraint. As a result, the proposed algorithm matches and tracks a large number of feature points, which significantly helps the subsequent 3D modeling process. Our feature point matching algorithm is specifically designed for matching and tracking feature points in image/video sequences where the image motion is limited. Our extensive experimental results show that the proposed algorithm is able to track at least 2.5 times as many feature points as the state-of-the-art algorithms, with comparable or higher accuracy. This contributes significantly to the robustness of the 3D reconstruction process. The second contribution is a set of algorithms to detect critical configurations where factorization-based 3D reconstruction degenerates. Based on the detection, we have proposed a sequence-dividing algorithm to divide a long sequence into subsequences, such that successful 3D reconstructions can be performed on individual subsequences with high confidence. The partial reconstructions are merged later to obtain the 3D model of the complete scene. 
In the critical configuration detection algorithm, four critical configurations are detected: (1) coplanar 3D scene points, (2) pure camera rotation, (3) rotation around two camera centers, and (4) presence of excessive noise and outliers in the measurements. The configurations in cases (1), (2) and (4) affect the rank of the Scaled Measurement Matrix (SMM). The number of camera centers in case (3) affects the number of independent rows of the SMM. By examining the rank and the row space of the SMM, the above-mentioned critical configurations are detected. Based on the detection results, the proposed sequence-dividing algorithm divides a long sequence into subsequences, such that each subsequence is free of the four critical configurations, in order to obtain successful 3D reconstructions on individual subsequences. Experimental results on both synthetic and real sequences have demonstrated that the four critical configurations are robustly detected, and that a long sequence of thousands of frames is automatically divided into subsequences, yielding successful 3D reconstructions. The proposed critical configuration detection and sequence-dividing algorithms provide an essential processing block for automatic 3D reconstruction on long sequences. The third contribution is a coarse-to-fine multiple-view depth labeling algorithm to compute depth maps from multiple-view videos, where the accuracy of the resulting depth maps is gradually refined in multiple optimization passes. In the proposed algorithm, multiple-view depth reconstruction is formulated as an image-based labeling problem using the framework of Maximum A Posteriori (MAP) estimation on Markov Random Fields (MRF). The MAP-MRF framework allows the combination of various objective and heuristic depth cues to define the local penalty and the interaction energies, which provides a straightforward and computationally tractable formulation. 
Furthermore, the globally optimal MAP solution to depth labeling can be found by minimizing the local energies, using existing MRF optimization algorithms. The proposed algorithm contains the following three key contributions. (1) A graph construction algorithm is proposed to construct triangular meshes on over-segmentation maps, in order to exploit the color and the texture information for depth labeling. (2) Multiple depth cues are combined to define the local energies. Furthermore, the local energies are adapted to the local image content, in order to account for the varying nature of the image content and achieve accurate depth labeling. (3) Both the density of the graph nodes and the intervals of the depth labels are gradually refined in multiple labeling passes. By doing so, both the computational efficiency and the robustness of the depth labeling process are improved. The experimental results on real multiple-view videos show that the depth maps for the selected reference views are accurately reconstructed. Depth discontinuities are very well preserved.
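The rank-based detection of critical configuration (1), coplanar scene points, from the same abstract can be illustrated on synthetic data. This is a minimal sketch under stated assumptions: the projective depths are taken as known (each entry is simply a 3x4 camera matrix applied to a homogeneous point), the data are noise-free, and the rank is computed by Gaussian elimination with a fixed tolerance. The helper names `matrix_rank` and `scaled_measurement_matrix` are hypothetical, and the thesis's actual detection procedure, including its handling of noise and of the row space, is not reproduced here.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matrix_rank(rows: Sequence[Sequence[float]], tol: float = 1e-9) -> int:
    """Numerical rank via Gaussian elimination with partial pivoting."""
    a = [list(map(float, r)) for r in rows]
    n_rows, n_cols = len(a), len(a[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) <= tol:
            continue  # no usable pivot: this column adds no rank
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, n_rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, n_cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
    return rank


def scaled_measurement_matrix(cameras: Sequence[Matrix],
                              points: Sequence[Sequence[float]]) -> Matrix:
    """Stack P_i X_j for every 3x4 camera P_i and homogeneous point X_j."""
    return [[sum(P[r][k] * X[k] for k in range(4)) for X in points]
            for P in cameras for r in range(3)]


# Three synthetic cameras with distinct centres (hypothetical values).
cameras = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
    [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]],
    [[1, 0, 0, 0], [0, 1, 0, 1], [0, 0, 1, 0]],
]
general_points = [[0, 0, 1, 1], [1, 0, 2, 1], [0, 1, 3, 1],
                  [1, 1, 5, 1], [2, 1, 4, 1]]
coplanar_points = [[0, 0, 2, 1], [1, 0, 2, 1], [0, 1, 2, 1],
                   [1, 1, 2, 1], [2, 1, 2, 1]]  # all on the plane z = 2

rank_general = matrix_rank(scaled_measurement_matrix(cameras, general_points))
rank_coplanar = matrix_rank(scaled_measurement_matrix(cameras, coplanar_points))
```

For points in general position the matrix has rank 4, while the coplanar set drops it to 3, which is the kind of rank deficiency the abstract associates with a degenerate factorization. With real, noisy measurements a singular-value threshold would be a more robust test than exact elimination.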