1,249 research outputs found

    Representations for Cognitive Vision : a Review of Appearance-Based, Spatio-Temporal, and Graph-Based Approaches

    Get PDF
    The emerging discipline of cognitive vision requires a proper representation of visual information including spatial and temporal relationships, scenes, events, semantics and context. This review article summarizes existing representational schemes in computer vision which might be useful for cognitive vision, a and discusses promising future research directions. The various approaches are categorized according to appearance-based, spatio-temporal, and graph-based representations for cognitive vision. While the representation of objects has been covered extensively in computer vision research, both from a reconstruction as well as from a recognition point of view, cognitive vision will also require new ideas how to represent scenes. We introduce new concepts for scene representations and discuss how these might be efficiently implemented in future cognitive vision systems

    Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs

    Full text link
    Humans are able to form a complex mental model of the environment they move in. This mental model captures geometric and semantic aspects of the scene, describes the environment at multiple levels of abstractions (e.g., objects, rooms, buildings), includes static and dynamic entities and their relations (e.g., a person is in a room at a given time). In contrast, current robots' internal representations still provide a partial and fragmented understanding of the environment, either in the form of a sparse or dense set of geometric primitives (e.g., points, lines, planes, voxels) or as a collection of objects. This paper attempts to reduce the gap between robot and human perception by introducing a novel representation, a 3D Dynamic Scene Graph(DSG), that seamlessly captures metric and semantic aspects of a dynamic environment. A DSG is a layered graph where nodes represent spatial concepts at different levels of abstraction, and edges represent spatio-temporal relations among nodes. Our second contribution is Kimera, the first fully automatic method to build a DSG from visual-inertial data. Kimera includes state-of-the-art techniques for visual-inertial SLAM, metric-semantic 3D reconstruction, object localization, human pose and shape estimation, and scene parsing. Our third contribution is a comprehensive evaluation of Kimera in real-life datasets and photo-realistic simulations, including a newly released dataset, uHumans2, which simulates a collection of crowded indoor and outdoor scenes. Our evaluation shows that Kimera achieves state-of-the-art performance in visual-inertial SLAM, estimates an accurate 3D metric-semantic mesh model in real-time, and builds a DSG of a complex indoor environment with tens of objects and humans in minutes. Our final contribution shows how to use a DSG for real-time hierarchical semantic path-planning. The core modules in Kimera are open-source.Comment: 34 pages, 25 figures, 9 tables. arXiv admin note: text overlap with arXiv:2002.0628

    Grounding semantics in robots for Visual Question Answering

    Get PDF
    In this thesis I describe an operational implementation of an object detection and description system that incorporates in an end-to-end Visual Question Answering system and evaluated it on two visual question answering datasets for compositional language and elementary visual reasoning

    Colour videos with depth : acquisition, processing and evaluation

    Get PDF
    The human visual system lets us perceive the world around us in three dimensions by integrating evidence from depth cues into a coherent visual model of the world. The equivalent in computer vision and computer graphics are geometric models, which provide a wealth of information about represented objects, such as depth and surface normals. Videos do not contain this information, but only provide per-pixel colour information. In this dissertation, I hence investigate a combination of videos and geometric models: videos with per-pixel depth (also known as RGBZ videos). I consider the full life cycle of these videos: from their acquisition, via filtering and processing, to stereoscopic display. I propose two approaches to capture videos with depth. The first is a spatiotemporal stereo matching approach based on the dual-cross-bilateral grid – a novel real-time technique derived by accelerating a reformulation of an existing stereo matching approach. This is the basis for an extension which incorporates temporal evidence in real time, resulting in increased temporal coherence of disparity maps – particularly in the presence of image noise. The second acquisition approach is a sensor fusion system which combines data from a noisy, low-resolution time-of-flight camera and a high-resolution colour video camera into a coherent, noise-free video with depth. The system consists of a three-step pipeline that aligns the video streams, efficiently removes and fills invalid and noisy geometry, and finally uses a spatiotemporal filter to increase the spatial resolution of the depth data and strongly reduce depth measurement noise. I show that these videos with depth empower a range of video processing effects that are not achievable using colour video alone. These effects critically rely on the geometric information, like a proposed video relighting technique which requires high-quality surface normals to produce plausible results. In addition, I demonstrate enhanced non-photorealistic rendering techniques and the ability to synthesise stereoscopic videos, which allows these effects to be applied stereoscopically. These stereoscopic renderings inspired me to study stereoscopic viewing discomfort. The result of this is a surprisingly simple computational model that predicts the visual comfort of stereoscopic images. I validated this model using a perceptual study, which showed that it correlates strongly with human comfort ratings. This makes it ideal for automatic comfort assessment, without the need for costly and lengthy perceptual studies

    Advances in Stereo Vision

    Get PDF
    Stereopsis is a vision process whose geometrical foundation has been known for a long time, ever since the experiments by Wheatstone, in the 19th century. Nevertheless, its inner workings in biological organisms, as well as its emulation by computer systems, have proven elusive, and stereo vision remains a very active and challenging area of research nowadays. In this volume we have attempted to present a limited but relevant sample of the work being carried out in stereo vision, covering significant aspects both from the applied and from the theoretical standpoints

    Robust perceptual organization techniques for analysis of color images

    Get PDF
    Esta tesis aborda el desarrollo de nuevas técnicas de análisis robusto de imágenes estrechamente relacionadas con el comportamiento del sistema visual humano. Uno de los pilares de la tesis es la votación tensorial, una técnica robusta que propaga y agrega información codificada en tensores mediante un proceso similar a la convolución. Su robustez y adaptabilidad han sido claves para su uso en esta tesis. Ambas propiedades han sido verificadas en tres nuevas aplicaciones de la votación tensorial: estimación de estructura, detección de bordes y segmentación de imágenes adquiridas mediante estereovisión.El mayor problema de la votación tensorial es su elevado coste computacional. En esta línea, esta tesis propone dos nuevas implementaciones eficientes de la votación tensorial derivadas de un análisis en profundidad de esta técnica.A pesar de su capacidad de adaptación, esta tesis muestra que la formulación original de la votación tensorial (a partir de aquí, votación tensorial clásica) no es adecuada para algunas aplicaciones, dado que las hipótesis en las que se basa no se ajustan a todas ellas. Esto ocurre particularmente en el filtrado de imágenes en color. Así, esta tesis muestra que, más que un método, la votación tensorial es una metodología en la que la codificación y el proceso de votación pueden ser adaptados específicamente para cada aplicación, manteniendo el espíritu de la votación tensorial.En esta línea, esta tesis propone un marco unificado en el que se realiza a la vez el filtrado de imágenes y la detección robusta de bordes. Este marco de trabajo es una extensión de la votación tensorial clásica en la que el color y la probabilidad de encontrar un borde en cada píxel se codifican mediante tensores, y en el que el proceso de votación se basa en un conjunto de criterios perceptuales relacionados con el modo en que el sistema visual humano procesa información. Los avances recientes en la percepción del color han sido esenciales en el diseño de dicho proceso de votación.Este nuevo enfoque ha sido efectivo, obteniendo excelentes resultados en ambas aplicaciones. En concreto, el nuevo método aplicado al filtrado de imágenes tiene un mejor rendimiento que los métodos del estado del arte para ruido real. Esto lo hace más adecuado para aplicaciones reales, donde los algoritmos de filtrado son imprescindibles. Además, el método aplicado a detección de bordes produce resultados más robustos que las técnicas del estado del arte y tiene un rendimiento competitivo con relación a la completitud, discriminabilidad, precisión y rechazo de falsas alarmas.Además, esta tesis demuestra que este nuevo marco de trabajo puede combinarse con otras técnicas para resolver el problema de segmentación robusta de imágenes. Los tensores obtenidos mediante el nuevo método se utilizan para clasificar píxeles como probablemente homogéneos o no homogéneos. Ambos tipos de píxeles se segmentan a continuación por medio de una variante de un algoritmo eficiente de segmentación de imágenes basada en grafos. Los experimentos muestran que el algoritmo propuesto obtiene mejores resultados en tres de las cinco métricas de evaluación aplicadas en comparación con las técnicas del estado del arte, con un coste computacional competitivo.La tesis también propone nuevas técnicas de evaluación en el ámbito del procesamiento de imágenes. En concreto, se proponen dos métricas de filtrado de imágenes con el fin de medir el grado en que un método es capaz de preservar los bordes y evitar la introducción de defectos. Asimismo, se propone una nueva metodología para la evaluación de detectores de bordes que evita posibles sesgos introducidos por el post-procesado. Esta metodología se basa en cinco métricas para estimar completitud, discriminabilidad, precisión, rechazo de falsas alarmas y robustez. Por último, se proponen dos nuevas métricas no paramétricas para estimar el grado de sobre e infrasegmentación producido por los algoritmos de segmentación de imágenes.This thesis focuses on the development of new robust image analysis techniques more closely related to the way the human visual system behaves. One of the pillars of the thesis is the so called tensor voting technique. This is a robust perceptual organization technique that propagates and aggregates information encoded by means of tensors through a convolution like process. Its robustness and adaptability have been one of the key points for using tensor voting in this thesis. These two properties are verified in the thesis by applying tensor voting to three applications where it had not been applied so far: image structure estimation, edge detection and image segmentation of images acquired through stereo vision.The most important drawback of tensor voting is that its usual implementations are highly time consuming. In this line, this thesis proposes two new efficient implementations of tensor voting, both derived from an in depth analysis of this technique.Despite its adaptability, this thesis shows that the original formulation of tensor voting (hereafter, classical tensor voting) is not adequate for some applications, since the hypotheses from which it is based are not suitable for all applications. This is particularly certain for color image denoising. Thus, this thesis shows that, more than a method, tensor voting can be thought of as a methodology in which the encoding and voting process can be tailored for every specific application, while maintaining the tensor voting spirit.By following this reasoning, this thesis proposes a unified framework for both image denoising and robust edge detection.This framework is an extension of the classical tensor voting in which both color and edginess the likelihood of finding an edge at every pixel of the image are encoded through tensors, and where the voting process takes into account a set of plausible perceptual criteria related to the way the human visual system processes visual information. Recent advances in the perception of color have been essential for designing such a voting process.This new approach has been found effective, since it yields excellent results for both applications. In particular, the new method applied to image denoising has a better performance than other state of the art methods for real noise. This makes it more adequate for real applications, in which an image denoiser is indeed required. In addition, the method applied to edge detection yields more robust results than the state of the art techniques and has a competitive performance in recall, discriminability, precision, and false alarm rejection.Moreover, this thesis shows how the results of this new framework can be combined with other techniques to tackle the problem of robust color image segmentation. The tensors obtained by applying the new framework are utilized to classify pixels into likely homogeneous and likely inhomogeneous. Those pixels are then sequentially segmented through a variation of an efficient graph based image segmentation algorithm. Experiments show that the proposed segmentation algorithm yields better scores in three of the five applied evaluation metrics when compared to the state of the art techniques with a competitive computational cost.This thesis also proposes new evaluation techniques in the scope of image processing. First, two new metrics are proposed in the field of image denoising: one to measure how an algorithm is able to preserve edges, and the second to measure how a method is able not to introduce undesirable artifacts. Second, a new methodology for assessing edge detectors that avoids possible bias introduced by post processing is proposed. It consists of five new metrics for assessing recall, discriminability, precision, false alarm rejection and robustness. Finally, two new non parametric metrics are proposed for estimating the degree of over and undersegmentation yielded by image segmentation algorithms
    corecore