4,945 research outputs found

    Deep Interactive Region Segmentation and Captioning

    With recent innovations in dense image captioning, it is now possible to describe every object of a scene with a caption, where objects are localized by bounding boxes. However, interpreting such an output is not trivial because many of the bounding boxes overlap. Furthermore, current captioning frameworks do not let the user apply personal preferences to exclude areas that are not of interest. In this paper, we propose a novel hybrid deep learning architecture for interactive region segmentation and captioning in which the user can specify an arbitrary region of the image to be processed. To this end, a dedicated Fully Convolutional Network (FCN), named Lyncean FCN (LFCN), is trained using our special training data to isolate the User Intention Region (UIR) as the output of an efficient segmentation. In parallel, a dense image captioning model is used to provide a wide variety of captions for that region. The UIR is then explained with the caption of the best-matching bounding box. To the best of our knowledge, this is the first work that provides such a comprehensive output. Our experiments show the superiority of the proposed approach over state-of-the-art interactive segmentation methods on several well-known datasets. In addition, replacing the bounding boxes with the result of the interactive segmentation leads to a better understanding of the dense image captioning output as well as improved object detection accuracy in terms of Intersection over Union (IoU). Comment: 17 pages, 9 figures
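    The abstract assigns the UIR the caption of the best-matching bounding box. Below is a minimal, illustrative sketch (not the paper's code) of one way such a match could be computed, assuming an IoU criterion over the tight box of the UIR mask; all function and variable names are hypothetical.

```python
# Select the dense-captioning box that best matches a User Intention Region
# (UIR) mask by Intersection over Union (IoU). Illustrative sketch only.
import numpy as np


def mask_to_box(mask):
    """Tight bounding box (x1, y1, x2, y2) of a binary UIR mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1 + 1) * max(0, iy2 - iy1 + 1)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / float(area_a + area_b - inter)


def caption_for_uir(uir_mask, dense_captions):
    """dense_captions: list of (box, caption) pairs from a dense captioning model."""
    uir_box = mask_to_box(uir_mask)
    best_box, best_caption = max(dense_captions, key=lambda bc: box_iou(uir_box, bc[0]))
    return best_caption
```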

    SceneFlowFields: Dense Interpolation of Sparse Scene Flow Correspondences

    While most scene flow methods use either variational optimization or a strong rigid motion assumption, we show for the first time that scene flow can also be estimated by dense interpolation of sparse matches. To this end, we find sparse matches across two stereo image pairs that are detected without any prior regularization, and we perform dense interpolation that preserves geometric and motion boundaries by using edge information. A few iterations of variational energy minimization are performed to refine our results, which are thoroughly evaluated on the KITTI benchmark and additionally compared to the state of the art on MPI Sintel. For application in an automotive context, we further show that an optional ego-motion model helps to boost performance and blends smoothly into our approach to produce a segmentation of the scene into static and dynamic parts. Comment: IEEE Winter Conference on Applications of Computer Vision (WACV), 201
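    As a rough illustration of the interpolation step, the sketch below densifies sparse scene flow matches by locally weighted averaging. The paper interpolates along edge-aware boundaries; this simplified stand-in uses plain Euclidean k-nearest-neighbour weighting instead, so all names and parameters here are assumptions.

```python
# Densify sparse scene flow matches by Gaussian-weighted k-NN interpolation.
# Simplified stand-in for the paper's edge-preserving interpolation.
import numpy as np
from scipy.spatial import cKDTree


def densify_scene_flow(seed_xy, seed_flow, height, width, k=16, sigma=10.0):
    """
    seed_xy:   (N, 2) pixel coordinates of the sparse matches
    seed_flow: (N, C) scene flow at those pixels (e.g. C=4: u, v, d0, d1)
    Returns a dense (height, width, C) scene flow field.
    """
    tree = cKDTree(seed_xy)
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs.ravel(), ys.ravel()], axis=1)

    # k nearest seeds per pixel, Gaussian-weighted by image distance.
    dist, idx = tree.query(pixels, k=k)
    weights = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True) + 1e-12

    dense = (weights[..., None] * seed_flow[idx]).sum(axis=1)
    return dense.reshape(height, width, -1)
```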

    Deep Multimodal Fusion Networks for Semantic Segmentation

    Deep Convolutional Neural Networks (CNNs) have emerged as the dominant approach to many problems in computer vision and perception. Most state-of-the-art approaches to these problems are designed to operate on RGB color images as input. However, CNNs are often deployed in settings where more information about the state of the environment is available. Systems that require real-time perception, such as self-driving cars or drones, are typically equipped with sensors that offer several complementary representations of the environment. CNNs designed to take advantage of features extracted from multiple sensor modalities therefore have the potential to increase the perception capability of autonomous systems.

    The work in this thesis extends the real-time CNN segmentation model ENet [39] to learn from a multimodal representation of the environment. Specifically, we investigate learning from disparity images generated by SGM [20] in conjunction with RGB color images from the Cityscapes dataset [10]. To do this, we create a network architecture called MM-ENet, composed of two symmetric feature extraction branches followed by a modality fusion network. We avoid depth encoding strategies such as HHA [15] due to their computational cost and instead operate on raw disparity images. We also constrain the resolution of training and testing images to be relatively small, downsampled by a factor of four with respect to the original Cityscapes resolution, because a deployed version of this system must also run SGM in real time, which could become a computational bottleneck at higher resolutions. To design the best model for this task, we train several architectures that differ in the operation used to combine features: channel-wise concatenation, element-wise multiplication, and element-wise addition. We evaluate all models using Intersection-over-Union (IoU) as the primary performance metric and instance-level IoU (iIoU) as a secondary metric.

    Compared to the baseline ENet model, we achieve comparable segmentation performance while also taking advantage of features that cannot be extracted from RGB images alone. The results show that, at this particular fusion location, element-wise multiplication is the best overall modality combination method. By observing feature activations at different points in the network, we show that depth information helps the network reason about object edges and boundaries that are less salient in color space, particularly for spatially small object classes such as persons. We also present results suggesting that even though each branch learned to extract unique features, these features can have complementary properties. By extending the ENet model to learn from multimodal data, we provide it with a richer representation of the environment; because this extension simply duplicates layers in the encoder to create two symmetric feature extraction branches, the network maintains real-time inference performance. Since the network is trained on lower-resolution images to remain within the constraints of an embedded system, its overall performance is competitive but below the state-of-the-art models reported on the Cityscapes leaderboard. When deployed in a high-performance system, such as an autonomous vehicle that can generate high-resolution disparity maps in real time, MM-ENet can take advantage of otherwise unused data modalities to improve overall semantic segmentation performance.
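    A minimal PyTorch sketch of the modality fusion step described above is given below; it shows the three fusion operators compared in the thesis (channel-wise concatenation, element-wise multiplication, element-wise addition). The channel counts and the 1x1 projection after concatenation are assumptions, not details from the thesis.

```python
# Fuse features from two symmetric ENet-style branches (RGB and disparity).
# Illustrative sketch; channel counts and the 1x1 projection are hypothetical.
import torch
import torch.nn as nn


class ModalityFusion(nn.Module):
    def __init__(self, channels, mode="multiply"):
        super().__init__()
        self.mode = mode
        # Concatenation doubles the channel count, so project back with a 1x1 conv.
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, disp_feat):
        if self.mode == "multiply":   # element-wise multiplication
            return rgb_feat * disp_feat
        if self.mode == "add":        # element-wise addition
            return rgb_feat + disp_feat
        if self.mode == "concat":     # channel-wise concatenation + projection
            return self.project(torch.cat([rgb_feat, disp_feat], dim=1))
        raise ValueError(f"unknown fusion mode: {self.mode}")


# Example: fuse feature maps from the RGB and disparity encoder branches.
rgb_feat = torch.randn(1, 128, 64, 128)
disp_feat = torch.randn(1, 128, 64, 128)
fused = ModalityFusion(128, mode="multiply")(rgb_feat, disp_feat)
```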

    ConceptFusion: Open-set Multimodal 3D Mapping

    Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps largely remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels or, in recent work, using text prompts. We address both issues with ConceptFusion, a scene representation that is (1) fundamentally open-set, enabling reasoning beyond a closed set of concepts, and (2) inherently multimodal, enabling a diverse range of queries to the 3D map, from language to images to audio to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today's foundation models, pre-trained on internet-scale data, to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning without any additional training or finetuning, and it retains long-tailed concepts better than supervised approaches, outperforming them by a margin of more than 40% in 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping. For more information, visit our project page https://concept-fusion.github.io or watch our 5-minute explainer video https://www.youtube.com/watch?v=rkXgws8fiD
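    To illustrate the kind of open-set querying described above, the sketch below scores the fused per-point features of a 3D map against a query embedding by cosine similarity. The feature extractor, feature dimensionality, and threshold are assumptions; any embedder producing vectors in the same space (text, image, or audio) could supply the query.

```python
# Query a 3D map whose points carry fused, pixel-aligned foundation-model
# features. Illustrative sketch only; names and threshold are hypothetical.
import numpy as np


def query_open_set_map(point_features, query_embedding, threshold=0.3):
    """
    point_features:  (N, D) per-point features fused into the 3D map
    query_embedding: (D,) embedding of a language, image, or audio query
    Returns a boolean mask over the N map points that match the query.
    """
    pf = point_features / np.linalg.norm(point_features, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    similarity = pf @ q            # cosine similarity per 3D point
    return similarity > threshold
```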