    Hierarchical structure-and-motion recovery from uncalibrated images

    This paper addresses the structure-and-motion problem, which requires recovering camera motion and 3D structure from point matches. A new pipeline, dubbed Samantha, is presented that departs from the prevailing sequential paradigm and instead embraces a hierarchical approach. This method has several advantages, such as a provably lower computational complexity, which is necessary to achieve true scalability, and better error containment, leading to more stability and less drift. Moreover, a practical autocalibration procedure makes it possible to process images without ancillary information. Experiments with real data assess the accuracy and the computational efficiency of the method. Comment: Accepted for publication in CVI

    Stereo Reconstruction using Induced Symmetry and 3D scene priors

    Doctoral thesis in Electrical and Computer Engineering presented to the Faculty of Sciences and Technology of the University of Coimbra. Recovering the 3D geometry from two or more views, known as stereo reconstruction, is one of the earliest and most investigated topics in computer vision. The computation of 3D models of an environment is useful for a very large number of applications, ranging from robotics and consumer use to medical procedures. The principle for recovering the 3D scene structure is quite simple; however, some situations considerably complicate the reconstruction process. Objects with low or repetitive texture and highly slanted surfaces still pose difficulties for state-of-the-art algorithms. This PhD thesis tackles these issues and introduces a new stereo framework that is completely different from conventional approaches. We propose to use symmetry instead of photo-similarity for assessing the likelihood of two image locations being a match. The framework is called SymStereo, and is based on the mirroring effect that arises whenever one view is mapped into the other using the homography induced by a virtual cut plane that intersects the baseline. Extensive experiments in dense stereo show that our symmetry-based cost functions compare favorably against the best performing photo-similarity matching costs. In addition, we investigate the possibility of accomplishing Stereo-Rangefinding, which consists of using passive stereo to recover depth exclusively along a scan plane. Thorough experiments provide evidence that stereo from induced symmetry is especially well suited for this purpose.
As a second research line, we propose to overcome the previous issues by using priors about the 3D scene to increase the robustness of the reconstruction process. For this purpose, we present a new global approach for detecting vanishing points and groups of mutually orthogonal vanishing directions in man-made (Manhattan) environments. Experiments on both synthetic and real images show that our algorithms outperform the state-of-the-art methods while keeping computation tractable. In addition, we show for the first time results in simultaneously detecting multiple Manhattan-world configurations. This prior information about the scene structure is then included in a reconstruction pipeline that generates piecewise planar models of man-made environments from two calibrated views. Our formulation combines SymStereo and PEARL clustering [3], and alternates between a discrete optimization step, which merges planar surface hypotheses and discards detections with poor support, and a continuous optimization step, which refines the plane poses. Experiments with both indoor and outdoor stereo pairs show significant improvements over state-of-the-art methods with respect to accuracy and robustness.
Finally, as a third contribution to improve stereo matching in the presence of surface slant, we extend the recent framework of Histogram Aggregation [4]. The original algorithm uses a fronto-parallel support window for cost aggregation, leading to inaccurate results in the presence of significant surface slant. We address the problem by considering discrete orientation hypotheses. The experimental results confirm the effectiveness of the approach, which improves matching accuracy while preserving a low computational complexity.
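
As a rough illustration of the homography induced by a plane, the sketch below builds H = K2 (R + t n^T / d) K1^{-1} for a plane n·X = d expressed in the first camera's frame, assuming the convention X2 = R X1 + t. It is not the SymStereo implementation. The intrinsics, baseline, plane, and all helper names are hypothetical values chosen for the example. The chosen plane is vertical and crosses the baseline at its midpoint; in this particular configuration the middle factor reduces to a reflection of the x coordinate, which is consistent with the mirroring effect the abstract mentions.

```python
def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def plane_homography(K1_inv, K2, R, t, n, d):
    """Homography mapping view-1 pixels to view-2 pixels for points on n.X = d."""
    M = [[R[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]
    return matmul(matmul(K2, M), K1_inv)

def project(K, X):
    """Pinhole projection of a 3D point given in the camera frame."""
    u = matvec(K, X)
    return [u[0] / u[2], u[1] / u[2]]

def apply_h(H, p):
    """Apply a homography to a pixel and dehomogenize."""
    u = matvec(H, [p[0], p[1], 1.0])
    return [u[0] / u[2], u[1] / u[2]]

# Hypothetical rectified stereo rig: focal length 500 px, baseline 0.1 m.
f, cx, cy, b = 500.0, 320.0, 240.0, 0.1
K = [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]
K_inv = [[1 / f, 0.0, -cx / f], [0.0, 1 / f, -cy / f], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [-b, 0.0, 0.0]                   # camera 2 sits b metres to the right

# Vertical plane x = b/2: it crosses the baseline at its midpoint.
n, d = [1.0, 0.0, 0.0], b / 2
H = plane_homography(K_inv, K, R, t, n, d)

X1 = [b / 2, -0.1, 2.0]              # a 3D point lying on the plane
X2 = [X1[0] + t[0], X1[1] + t[1], X1[2] + t[2]]
x1 = project(K, X1)
x2_true = project(K, X2)
x2_mapped = apply_h(H, x1)           # should coincide with x2_true
```

Points off the plane would not satisfy this mapping, which is why a plane-induced homography is only exact for the surface it is built from.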

    Physically based geometry and reflectance recovery from images

    An image is a projection of the three-dimensional world taken at an instant in space and time. Its formation involves a complex interplay between geometry, illumination and material properties of objects in the scene. Given image data and knowledge of some scene properties, the recovery of the remaining components can be cast as a set of physically based inverse problems. This thesis investigates three inverse problems on the recovery of scene properties and discusses how we can develop appropriate physical constraints and build them into effective algorithms. Firstly, we study the problem of geometry recovery from a single image with repeated texture. Our technique leverages the PatchMatch algorithm to detect and match repeated patterns undergoing geometric transformations. This allows effective enforcement of a translational symmetry constraint in the recovery of the texture lattice. Secondly, we study the problem of computational relighting using RGB-D data, where the depth data is acquired through a Kinect sensor and is often noisy. We show how the inclusion of noisy depth input helps to resolve ambiguities in the recovery of shape and reflectance in the inverse rendering problem. Our results show that the complementary nature of RGB and depth is highly beneficial for a practical relighting system. Lastly, in the third problem, we exploit geometric constraints relating two views to address a challenging problem in Internet image matching. Our solution is robust to geometric and photometric distortions over wide baselines. It also accommodates repeated structures that are commonly found in modern environments. Building on the image correspondence, we also investigate the use of color transfer as an additional global constraint in relating Internet images. This shows promising results in obtaining more accurate and denser correspondence.

    Stereo Visual SLAM for Mobile Robots Navigation

    This thesis focuses on the combination of mobile robotics and computer vision, with the aim of developing methods that allow a mobile robot to localize itself within its environment while building a map of it, using a set of images as the only input. This problem is known as visual SLAM (Simultaneous Localization And Mapping) and remains an open topic despite the great research effort of recent years. Specifically, in this thesis we use stereo cameras to capture, simultaneously, two images from slightly different positions, thus providing 3D information directly. Among robot localization problems, this thesis addresses two: robot tracking and simultaneous localization and mapping (SLAM). The first does not take the map of the environment into account; instead, it computes the robot trajectory by incrementally composing the estimates of its motion between consecutive time instants. When images are used to compute this trajectory, the problem is called "visual odometry", and it is simpler to solve than visual SLAM. In fact, it is often integrated as part of a complete SLAM system. This thesis contributes two visual odometry systems. One is based on an efficient closed-form solution, while the other is based on a non-linear optimization process that implements a new method for fast outlier detection and removal. SLAM methods, in turn, also address the construction of a map of the environment in order to significantly improve robot localization, thereby avoiding the error accumulation incurred by visual odometry.
In addition, the constructed map can be used to cope with demanding situations such as recovering the localization after the robot gets lost, or performing global localization. This thesis presents two complete visual SLAM systems. One has been implemented within the framework of non-parametric probabilistic filters, while the other is based on a new relative "bundle adjustment" method that has been integrated with some recent computer vision techniques. Another contribution of this thesis is the publication of two datasets containing stereo images captured in unmodified urban environments, together with a GPS-based estimate of the robot's actual path (the "ground truth"). These datasets serve as a testbed for validating visual odometry and visual SLAM methods.
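
The incremental composition of motion estimates that defines visual odometry can be sketched as follows. This is a simplified planar (SE(2)) illustration, not either of the thesis's systems, which work with stereo data in 3D. The function names and the square test path are assumptions made for the example, and the relative motions are taken as given where a real system would estimate them from image pairs.

```python
import math

def compose(pose, delta):
    """Apply a relative motion (expressed in the robot frame) to a global pose.

    Both arguments are (x, y, theta) tuples; theta is wrapped to [-pi, pi).
    """
    x, y, th = pose
    dx, dy, dth = delta
    return (x + math.cos(th) * dx - math.sin(th) * dy,
            y + math.sin(th) * dx + math.cos(th) * dy,
            (th + dth + math.pi) % (2 * math.pi) - math.pi)

def integrate(deltas, start=(0.0, 0.0, 0.0)):
    """Chain frame-to-frame motion estimates into a trajectory."""
    trajectory = [start]
    for delta in deltas:
        trajectory.append(compose(trajectory[-1], delta))
    return trajectory

# Hypothetical input: four identical steps, each moving 1 m forward and then
# turning 90 degrees to the left, which traces a closed square.
square = [(1.0, 0.0, math.pi / 2)] * 4
trajectory = integrate(square)
final_pose = trajectory[-1]          # back at the origin, up to rounding
```

Because every pose depends on all the previous estimates, any error in a single `delta` propagates to the rest of the trajectory, which is the accumulation that the map in a SLAM system is meant to counteract.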

    On unifying sparsity and geometry for image-based 3D scene representation

    Demand has emerged for next generation visual technologies that go beyond conventional 2D imaging. Such technologies should capture and communicate all perceptually relevant three-dimensional information about an environment to a distant observer, providing a satisfying, immersive experience. Camera networks offer a low cost solution to the acquisition of 3D visual information, by capturing multi-view images from different viewpoints. However, the camera's representation of the data is not ideal for common tasks such as data compression or 3D scene analysis, as it does not make the 3D scene geometry explicit. Image-based scene representations fundamentally require a multi-view image model that facilitates extraction of underlying geometrical relationships between the cameras and scene components. Developing new, efficient multi-view image models is thus one of the major challenges in image-based 3D scene representation methods. This dissertation focuses on defining and exploiting a new method for multi-view image representation, from which the 3D geometry information is easily extractable, and which is additionally highly compressible. The method is based on sparse image representation using an overcomplete dictionary of geometric features, where a single image is represented as a linear combination of few fundamental image structure features (edges for example). We construct the dictionary by applying a unitary operator to an analytic function, which introduces a composition of geometric transforms (translations, rotation and anisotropic scaling) to that function. The advantage of this approach is that the features across multiple views can be related with a single composition of transforms. We then establish a connection between image components and scene geometry by defining the transforms that satisfy the multi-view geometry constraint, and obtain a new geometric multi-view correlation model. 
We first address the construction of dictionaries for images acquired by omnidirectional cameras, which are particularly convenient for scene representation due to their wide field of view. Since most omnidirectional images can be uniquely mapped to spherical images, we form a dictionary by applying motions on the sphere, rotations, and anisotropic scaling to a function that lives on the sphere. We have used this dictionary and a sparse approximation algorithm, Matching Pursuit, for compression of omnidirectional images, and additionally for coding 3D objects represented as spherical signals. Both methods offer better rate-distortion performance than state of the art schemes at low bit rates. The novel multi-view representation method and the dictionary on the sphere are then exploited for the design of a distributed coding method for multi-view omnidirectional images. In a distributed scenario, cameras compress acquired images without communicating with each other. Using a reliable model of correlation between views, distributed coding can achieve higher compression ratios than independent compression of each image. However, the lack of a proper model has been an obstacle for distributed coding in camera networks for many years. We propose to use our geometric correlation model for distributed multi-view image coding with side information. The encoder employs a coset coding strategy, developed by dictionary partitioning based on atom shape similarity and multi-view geometry constraints. Our method results in significant rate savings compared to independent coding. An additional contribution of the proposed correlation model is that it gives information about the scene geometry, leading to a new camera pose estimation method using an extremely small amount of data from each camera. Finally, we develop a method for learning stereo visual dictionaries based on the new multi-view image model. 
Although dictionary learning for still images has received a lot of attention recently, dictionary learning for stereo images has been investigated only sparingly. Our method maximizes the likelihood that a set of natural stereo images is efficiently represented with selected stereo dictionaries, where the multi-view geometry constraint is included in the probabilistic modeling. Experimental results demonstrate that including the geometric constraints in learning leads to stereo dictionaries with better distributed stereo matching and better approximation properties than randomly selected dictionaries. We show that learning dictionaries for optimal scene representation based on the novel correlation model improves camera pose estimation and can be beneficial for distributed coding.
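
For readers unfamiliar with Matching Pursuit, the sparse approximation algorithm named in the abstract, a minimal generic sketch follows. It assumes unit-norm atoms stored as plain Python lists. The tiny three-dimensional dictionary is a made-up example and has nothing to do with the geometric atoms on the sphere that the dissertation constructs.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(signal, atoms, n_iter):
    """Greedy sparse approximation of `signal` over unit-norm `atoms`.

    Returns one coefficient per atom and the final residual.
    """
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_iter):
        # Select the atom most correlated with the current residual.
        projections = [dot(residual, atom) for atom in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(projections[i]))
        c = projections[k]
        coeffs[k] += c
        # Remove the selected component from the residual.
        residual = [r - c * a for r, a in zip(residual, atoms[k])]
    return coeffs, residual

# Hypothetical overcomplete dictionary: the canonical basis plus one diagonal atom.
s = 1 / math.sqrt(2)
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [s, s, 0.0]]
signal = [3.0, 3.0, 0.5]
coeffs, residual = matching_pursuit(signal, atoms, n_iter=2)
residual_norm = math.sqrt(dot(residual, residual))
```

In this example two atoms are enough to represent the signal, whereas the canonical basis alone would need three; that saving is the basic reason an overcomplete dictionary can help compression.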

    Enhancing low-level features with mid-level cues

    Local features have become an essential tool in visual recognition. Much of the progress in computer vision over the past decade has built on simple, local representations such as SIFT or HOG. SIFT in particular shifted the paradigm in feature representation. Subsequent works have often focused on improving either computational efficiency, or invariance properties. This thesis belongs to the latter group. Invariance is a particularly relevant aspect if we intend to work with dense features. The traditional approach to sparse matching is to rely on stable interest points, such as corners, where scale and orientation can be reliably estimated, enforcing invariance; dense features need to be computed on arbitrary points. Dense features have been shown to outperform sparse matching techniques in many recognition problems, and form the bulk of our work. In this thesis we present strategies to enhance low-level, local features with mid-level, global cues. We devise techniques to construct better features, and use them to handle complex ambiguities, occlusions and background changes. To deal with ambiguities, we explore the use of motion to enforce temporal consistency with optical flow priors. We also introduce a novel technique to exploit segmentation cues, and use it to extract features invariant to background variability. For this, we downplay image measurements most likely to belong to a region different from that where the descriptor is computed. In both cases we follow the same strategy: we incorporate mid-level, "big picture" information into the construction of local features, and proceed to use them in the same manner as we would the baseline features. We apply these techniques to different feature representations, including SIFT and HOG, and use them to address canonical vision problems such as stereo and object detection, demonstrating that the introduction of global cues yields consistent improvements. 
We prioritize solutions that are simple, general, and efficient. Our main contributions are as follows: (a) An approach to dense stereo reconstruction with spatiotemporal features, which unlike existing works remains applicable to wide baselines. (b) A technique to exploit segmentation cues to construct dense descriptors invariant to background variability, such as occlusions or background motion. (c) A technique to integrate bottom-up segmentation with recognition efficiently, amenable to sliding window detectors.

    Toward Efficient and Robust Large-Scale Structure-from-Motion Systems

    The ever-increasing number of images that are uploaded and shared on the Internet has recently been leveraged by computer vision researchers to extract 3D information about the content seen in these images. One key mechanism to extract this information is structure-from-motion, which is the process of recovering the 3D geometry (structure) of a scene via a set of images from different viewpoints (camera motion). However, when dealing with crowdsourced datasets comprising tens or hundreds of millions of images, the magnitude and diversity of the imagery pose challenges such as robustness, scalability, completeness, and correctness for existing structure-from-motion systems. This dissertation focuses on these challenges and demonstrates practical methods to address the problems of data association and verification within structure-from-motion systems. Data association within structure-from-motion systems consists of the discovery of pairwise image overlap within the input dataset. In order to perform this discovery, previous systems assumed that information about every image in the input dataset could be stored in memory, which is prohibitive for large-scale photo collections. To address this issue, we propose a novel streaming-based framework for the discovery of related sets of images, and demonstrate our approach on a crowdsourced dataset containing 100 million images from all around the world. Results illustrate that our streaming-based approach does not compromise model completeness, but achieves unprecedented levels of efficiency and scalability. The verification of individual data associations is difficult to perform during the process of structure-from-motion, as standard methods have limited scope when determining image overlap. Therefore, it is possible for erroneous associations to form, especially when there are symmetric, repetitive, or duplicate structures which can be incorrectly associated with each other.
The consequences of these errors are incorrectly placed cameras and scene geometry within the 3D reconstruction. We present two methods that can detect these local inconsistencies and successfully resolve them into a globally consistent 3D model. In our evaluation, we show that our techniques are efficient, are robust to a variety of scenes, and outperform existing approaches. Doctor of Philosophy

    Generating depth maps from stereo image pairs

