129 research outputs found

    Discriminative Training of Deep Fully-connected Continuous CRF with Task-specific Loss

    Full text link
    Recent work on deep conditional random fields (CRFs) has set new records on many vision tasks involving structured prediction. Here we propose a fully-connected deep continuous CRF model for both discrete and continuous labelling problems. We exemplify the usefulness of the proposed model on multi-class semantic labelling (discrete) and robust depth estimation (continuous). In our framework, we model both the unary and the pairwise potential functions as deep convolutional neural networks (CNNs), which are jointly learned in an end-to-end fashion. The proposed method retains the main advantage of continuously-valued CRFs, namely a closed-form solution for maximum a posteriori (MAP) inference. To better adapt to different tasks, instead of the commonly employed maximum-likelihood CRF parameter learning protocol, we propose task-specific loss functions for learning the CRF parameters. This enables direct optimization of the quality of the MAP estimates during learning. Specifically, we optimize the multi-class classification loss for the semantic labelling task and Tukey's biweight loss for the robust depth estimation problem. Experimental results on semantic labelling and robust depth estimation demonstrate that the proposed method compares favorably against both baseline and state-of-the-art methods. In particular, we show that although the proposed deep CRF model is continuously valued, when equipped with a task-specific loss it achieves impressive results even on discrete labelling tasks.
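A minimal NumPy sketch of the Tukey's biweight penalty referenced above, which caps the influence of outlying depth residuals; the tuning constant c = 4.685 and the mean reduction are common defaults assumed here, not values taken from the paper, and the CRF learning machinery itself is not shown.

```python
import numpy as np

def tukey_biweight_loss(residuals, c=4.685):
    """Tukey's biweight (bisquare) robust penalty, averaged over residuals.

    rho(r) = (c^2 / 6) * (1 - (1 - (r / c)^2)^3)   for |r| <= c
    rho(r) = c^2 / 6                               otherwise
    """
    r = np.asarray(residuals, dtype=np.float64)
    scaled = r / c
    inlier = np.abs(scaled) <= 1.0
    rho = np.full_like(r, c**2 / 6.0)              # outliers receive the capped penalty
    rho[inlier] = (c**2 / 6.0) * (1.0 - (1.0 - scaled[inlier]**2) ** 3)
    return rho.mean()

# The large residual is capped instead of dominating the loss.
print(tukey_biweight_loss([0.1, -0.3, 5.0]))
```

Because the penalty saturates at c^2 / 6, grossly wrong predictions stop dominating the gradient, which is what makes this kind of loss attractive for noisy depth targets.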

    Depth estimation and semantic segmentation from a single RGB image using a hybrid convolutional neural network

    Get PDF
    Semantic segmentation and depth estimation are two important tasks in computer vision, and many methods have been developed to tackle them. Commonly these two tasks are addressed independently, but recently the idea of merging them into a single framework has been studied, under the assumption that integrating two highly correlated tasks may allow each to benefit the other and improve estimation accuracy. In this paper, depth estimation and semantic segmentation are jointly addressed from a single RGB input image within a unified convolutional neural network. We analyze two different architectures to evaluate which features are more relevant when shared by the two tasks and which features should be kept separate to achieve mutual improvement. Our approaches are evaluated under two different scenarios designed to compare our results against single-task and multi-task methods. Qualitative and quantitative experiments demonstrate that our method outperforms state-of-the-art single-task approaches, while obtaining competitive results compared with other multi-task methods.
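To make the shared-versus-separate feature question concrete, the following is a minimal PyTorch sketch of a multi-task network with one shared encoder and two task-specific heads (depth regression and semantic segmentation); the layer sizes, class count and losses are placeholders for illustration, not the architectures evaluated in the paper.

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTaskNet(nn.Module):
    """Shared encoder feeding two task heads: per-pixel depth regression
    and semantic segmentation logits."""

    def __init__(self, num_classes=40):
        super().__init__()
        self.encoder = nn.Sequential(                  # features shared by both tasks
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.depth_head = nn.Conv2d(64, 1, 1)          # continuous depth map
        self.seg_head = nn.Conv2d(64, num_classes, 1)  # per-pixel class logits

    def forward(self, rgb):
        feats = self.encoder(rgb)
        return self.depth_head(feats), self.seg_head(feats)

# Joint training combines both losses on a single forward pass.
net = SharedEncoderMultiTaskNet()
rgb = torch.randn(1, 3, 240, 320)
depth_pred, seg_logits = net(rgb)
loss = nn.functional.l1_loss(depth_pred, torch.rand(1, 1, 240, 320)) \
     + nn.functional.cross_entropy(seg_logits, torch.randint(0, 40, (1, 240, 320)))
loss.backward()
```

Moving layers between the shared encoder and the task heads is the kind of design choice the abstract describes: deciding which features help both tasks when shared and which should stay task-specific.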

    2D+3D Indoor Scene Understanding from a Single Monocular Image

    Get PDF
    Scene understanding, as a broad field encompassing many subtopics, has gained great interest in recent years. Among these subtopics, indoor scene understanding, which has its own specific attributes and challenges compared to outdoor scene understanding, has drawn a lot of attention. It has potential applications in a wide variety of domains, such as robotic navigation, object grasping for personal robotics, and augmented reality. To our knowledge, existing research for indoor scenes typically makes use of depth sensors, such as the Kinect, which are, however, not always available. In this thesis, we focused on addressing indoor scene understanding tasks in the general case where only a monocular color image of the scene is available. Specifically, we first studied the problem of estimating a detailed depth map from a monocular image. Then, benefiting from deep-learning-based depth estimation, we tackled the higher-level tasks of 3D box proposal generation, and scene parsing with instance segmentation, semantic labeling and support relationship inference, from a monocular image. Our research on indoor scene understanding provides a comprehensive scene interpretation at various perspectives and scales.

    For monocular image depth estimation, previous approaches are limited in that they only reason about depth locally, on a single scale, and do not utilize the important information of geometric scene structures. Here, we developed a novel graphical model that reasons about detailed depth while leveraging geometric scene structures at multiple scales. For 3D box proposals, to the best of our knowledge, our approach constitutes the first attempt to reason about class-independent 3D box proposals from a single monocular image. To this end, we developed a novel integrated, differentiable framework that estimates depth, extracts a volumetric scene representation and generates 3D proposals. At the core of this framework lies a novel residual, differentiable truncated signed distance function module, which is able to handle the relatively low accuracy of the predicted depth map. For scene parsing, we tackled its three subtasks of instance segmentation, semantic labeling, and support relationship inference on instances. Existing work typically reasons about these subtasks independently, whereas we leverage the fact that they bear strong connections, which can facilitate addressing them if modeled properly. To this end, we developed an integrated graphical model that reasons about their mutual relationships.

    In summary, in this thesis we introduced novel and effective methodologies for each of three indoor scene understanding tasks, i.e., depth estimation, 3D box proposal generation, and scene parsing, and exploited the dependence of the latter two tasks on depth estimates. Evaluation on several benchmark datasets demonstrated the effectiveness of our algorithms and the benefits of utilizing depth estimates for higher-level tasks.
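For reference, the volumetric representation mentioned above can be illustrated with a plain truncated signed distance function (TSDF) computed from a predicted depth map. This is a simple, non-differentiable sketch meant only to show the representation; the residual, differentiable module from the thesis is not reproduced, and the camera intrinsics K, truncation distance and voxel grid layout are assumptions.

```python
import numpy as np

def tsdf_from_depth(depth, K, grid, trunc=0.1):
    """Truncated signed distance values for voxel centers `grid`
    (N x 3, camera coordinates, metres), given a depth map and camera
    intrinsics K. Positive values lie in front of the observed surface,
    negative values behind it, clipped to [-1, 1]."""
    x, y, z = grid[:, 0], grid[:, 1], grid[:, 2]
    u = np.round(K[0, 0] * x / z + K[0, 2]).astype(int)   # project voxel centers to pixels
    v = np.round(K[1, 1] * y / z + K[1, 2]).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sdf = np.full(len(grid), np.nan)                       # NaN = voxel not observed
    sdf[valid] = depth[v[valid], u[valid]] - z[valid]      # signed distance along the viewing ray
    return np.clip(sdf / trunc, -1.0, 1.0)
```

Because every voxel stores a bounded, signed distance rather than a hard occupancy bit, small errors in the predicted depth shift values smoothly instead of flipping voxels, which helps make the representation tolerant of imperfect depth estimates.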

    Probabilistic techniques in semantic mapping for mobile robotics

    Get PDF
    Semantic maps are representations of the world that allow a robot to understand not only the spatial aspects of its workspace, but also the meaning of its elements (objects, rooms, etc.) and how humans interact with them (e.g. functionalities, events and relations). To achieve this, a semantic map augments purely spatial representations, such as geometric or topological maps, with meta-information about the types of elements and relations that can be found in the workspace. This meta-information, called semantic or common-sense knowledge, is typically encoded in Knowledge Bases. An example of this kind of information could be: "fridges are large, box-shaped objects, usually located in kitchens, which can contain perishable food and medication". Encoding and handling this semantic knowledge allows the robot to reason about the information gathered from a given workspace, as well as to infer new information, in order to efficiently execute high-level tasks such as "Hey robot! Please take the medication to grandma".

    This thesis proposes the use of probabilistic techniques to build and maintain semantic maps, which offers three main advantages over traditional approaches: i) it handles uncertainty (coming from the robot's imprecise sensors and from the models employed), ii) it provides coherent representations of the environment by exploiting, from a holistic point of view, the contextual relations among the observed elements (e.g. fridges are usually found in kitchens), and iii) it yields confidence values that reflect how accurate the robot's understanding of its environment is.

    Specifically, the contributions presented can be grouped into two main topics. The first set of contributions addresses the problem of object and/or room recognition, since semantic mapping systems need reliable recognition algorithms to build valid representations. For this purpose, we explore the use of Probabilistic Graphical Models (PGMs) to exploit the contextual relations among objects and/or rooms while handling the uncertainty inherent to the recognition problem, and the use of Knowledge Bases to improve their performance in different ways, e.g., detecting incoherent results, providing prior information, reducing the complexity of probabilistic inference algorithms, generating synthetic training examples, enabling learning from past experiences, etc. The second group of contributions accommodates the probabilistic results coming from the developed recognition algorithms in a new semantic representation, the Multiversal Semantic Map (MvSmap). This map manages multiple interpretations of the robot's workspace, called universes, which are annotated with the probability of being the correct one according to the robot's current knowledge. Thus, this approach provides a grounded belief about the accuracy of the robot's understanding of its environment, which allows it to operate in a more efficient and coherent manner.

    The proposed probabilistic algorithms have been thoroughly tested and compared with other current and innovative approaches using state-of-the-art datasets. Additionally, this thesis also contributes two datasets, UMA-Offices and Robot@Home, which contain sensory information captured in different office and home environments, as well as two software tools, the Undirected Probabilistic Graphical Models in C++ (UPGMpp) library and the Object Labeling Toolkit (OLT), for working with Probabilistic Graphical Models and for processing datasets, respectively.
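As a small illustration of the uncertainty handling described above, the hypothetical example below maintains a categorical belief over an object's class as noisy recognition results arrive, using a plain Bayesian update; the class list, prior and confusion matrix are invented for illustration and stand in for the much richer PGM and MvSmap machinery of the thesis.

```python
import numpy as np

classes = ["fridge", "cabinet", "microwave"]
belief = np.array([1 / 3, 1 / 3, 1 / 3])       # uniform prior over the object's class
# confusion[i, j] = P(detector reports class j | true class is i)  (made-up numbers)
confusion = np.array([[0.7, 0.2, 0.1],
                      [0.2, 0.6, 0.2],
                      [0.1, 0.3, 0.6]])

for detection in ["fridge", "fridge", "microwave"]:
    j = classes.index(detection)
    belief = belief * confusion[:, j]          # Bayes rule: prior times likelihood
    belief = belief / belief.sum()             # renormalise

print(dict(zip(classes, belief.round(3))))     # belief concentrates on "fridge"
```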

    Segmentación multi-modal de imágenes RGB-D a partir de mapas de apariencia y de profundidad geométrica

    Get PDF
    Classical image segmentation algorithms exploit the detection of similarities and discontinuities of different visual cues to define and differentiate multiple regions of interest in images. However, due to the high variability and uncertainty of image data, producing accurate results is difficult. In other words, segmentation based on color alone is often insufficient for a large percentage of real-life scenes. This work presents a novel multi-modal segmentation strategy that integrates depth and appearance cues from RGB-D images by building a hierarchical region-based representation, i.e., a multi-modal segmentation tree (MM-tree). For this purpose, RGB-D image pairs are represented in a complementary fashion by different segmentation maps. Based on the color image, a color segmentation tree (C-tree) is created to obtain segmented and over-segmented maps. From the depth image, two independent segmentation maps are derived by computing planar and 3D edge primitives. An iterative region-merging process is then used to locally group the previously obtained maps into the MM-tree. Finally, the top emerging MM-tree level coherently integrates the available information from the depth and appearance maps. Experiments conducted on the NYU-Depth V2 RGB-D dataset demonstrate that our strategy is competitive with state-of-the-art segmentation methods. Specifically, on the test images, our method reached average scores of 0.56 in Segmentation Covering and 2.13 in Variation of Information.
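One of the two reported metrics, Variation of Information, compares two segmentations of the same image as partitions: VI = H(A) + H(B) - 2 I(A; B), where lower is better. The sketch below follows this standard definition and is not necessarily the exact evaluation code behind the reported 2.13 score.

```python
import numpy as np

def variation_of_information(seg_a, seg_b):
    """Variation of Information between two integer label maps of the
    same shape: VI = H(A) + H(B) - 2 I(A; B). Zero means the two
    partitions are identical up to relabelling."""
    a = np.asarray(seg_a).ravel()
    b = np.asarray(seg_b).ravel()
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)          # joint histogram of label pairs
    joint /= a.size
    pa = joint.sum(axis=1)                 # marginal over A's labels
    pb = joint.sum(axis=0)                 # marginal over B's labels
    nz = joint > 0
    h_a = -np.sum(pa[pa > 0] * np.log2(pa[pa > 0]))
    h_b = -np.sum(pb[pb > 0] * np.log2(pb[pb > 0]))
    mi = np.sum(joint[nz] * np.log2(joint[nz] / (pa[:, None] * pb[None, :])[nz]))
    return h_a + h_b - 2.0 * mi

# Identical partitions (even with permuted labels) give VI = 0.
print(variation_of_information(np.array([[0, 0], [1, 1]]),
                               np.array([[1, 1], [0, 0]])))
```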

    Enhancing semantic segmentation with detection priors and iterated graph cuts for robotics

    Get PDF
    To foster human–robot interaction, autonomous robots need to understand the environment in which they operate. In this context, one of the main challenges is semantic segmentation, together with the recognition of important objects, which can aid robots during exploration, as well as when planning new actions and interacting with the environment. In this study, we extend a multi-view semantic segmentation system based on 3D Entangled Forests (3DEF) by integrating and refining two object detectors, Mask R-CNN and You Only Look Once (YOLO), with Bayesian fusion and iterated graph cuts. The new system takes the best of its components, successfully exploiting both 2D and 3D data. Our experiments show that our approach is competitive with the state of the art and leads to accurate semantic segmentations.
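As a rough illustration of how per-pixel evidence from two sources can be combined, the sketch below fuses two class-probability maps under a naive conditional-independence assumption; it is not the specific 3DEF + Mask R-CNN/YOLO fusion of the paper, and the iterated graph-cut refinement is not shown.

```python
import numpy as np

def fuse_class_probabilities(p_segmenter, p_detector, prior=None):
    """Naive Bayesian fusion of per-pixel class distributions from two
    sources, assuming conditional independence given the true class:
        P(c | s, d)  proportional to  P(c | s) * P(c | d) / P(c)
    Both inputs are (H, W, C) arrays of class probabilities."""
    num_classes = p_segmenter.shape[-1]
    if prior is None:
        prior = np.full(num_classes, 1.0 / num_classes)   # uniform class prior
    fused = p_segmenter * p_detector / prior              # unnormalised posterior
    return fused / fused.sum(axis=-1, keepdims=True)

# Toy 1x1 "image" with 3 classes; the detector's stronger evidence wins here.
seg = np.array([[[0.6, 0.3, 0.1]]])
det = np.array([[[0.2, 0.7, 0.1]]])
print(fuse_class_probabilities(seg, det))
```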