44,164 research outputs found

    Optical Flow in Mostly Rigid Scenes

    Full text link
    The optical flow of natural scenes is a combination of the motion of the observer and the independent motion of objects. Existing algorithms typically focus on either recovering motion and structure under the assumption of a purely static world or optical flow for general unconstrained scenes. We combine these approaches in an optical flow algorithm that estimates an explicit segmentation of moving objects from appearance and physical constraints. In static regions we take advantage of strong constraints to jointly estimate the camera motion and the 3D structure of the scene over multiple frames. This allows us to also regularize the structure instead of the motion. Our formulation uses a Plane+Parallax framework, which works even under small baselines, and reduces the motion estimation to a one-dimensional search problem, resulting in more accurate estimation. In moving regions the flow is treated as unconstrained, and computed with an existing optical flow method. The resulting Mostly-Rigid Flow (MR-Flow) method achieves state-of-the-art results on both the MPI-Sintel and KITTI-2015 benchmarks.Comment: 15 pages, 10 figures; accepted for publication at CVPR 201

    A Unified Framework for Mutual Improvement of SLAM and Semantic Segmentation

    Full text link
    This paper presents a novel framework for simultaneously implementing localization and segmentation, which are two of the most important vision-based tasks for robotics. While the goals and techniques used for them were considered to be different previously, we show that by making use of the intermediate results of the two modules, their performance can be enhanced at the same time. Our framework is able to handle both the instantaneous motion and long-term changes of instances in localization with the help of the segmentation result, which also benefits from the refined 3D pose information. We conduct experiments on various datasets, and prove that our framework works effectively on improving the precision and robustness of the two tasks and outperforms existing localization and segmentation algorithms.Comment: 7 pages, 5 figures.This work has been accepted by ICRA 2019. The demo video can be found at https://youtu.be/Bkt53dAehj

    Action Recognition in Videos: from Motion Capture Labs to the Web

    Full text link
    This paper presents a survey of human action recognition approaches based on visual data recorded from a single video camera. We propose an organizing framework which puts in evidence the evolution of the area, with techniques moving from heavily constrained motion capture scenarios towards more challenging, realistic, "in the wild" videos. The proposed organization is based on the representation used as input for the recognition task, emphasizing the hypothesis assumed and thus, the constraints imposed on the type of video that each technique is able to address. Expliciting the hypothesis and constraints makes the framework particularly useful to select a method, given an application. Another advantage of the proposed organization is that it allows categorizing newest approaches seamlessly with traditional ones, while providing an insightful perspective of the evolution of the action recognition task up to now. That perspective is the basis for the discussion in the end of the paper, where we also present the main open issues in the area.Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4 table

    DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments

    Full text link
    Simultaneous Localization and Mapping (SLAM) is considered to be a fundamental capability for intelligent mobile robots. Over the past decades, many impressed SLAM systems have been developed and achieved good performance under certain circumstances. However, some problems are still not well solved, for example, how to tackle the moving objects in the dynamic environments, how to make the robots truly understand the surroundings and accomplish advanced tasks. In this paper, a robust semantic visual SLAM towards dynamic environments named DS-SLAM is proposed. Five threads run in parallel in DS-SLAM: tracking, semantic segmentation, local mapping, loop closing, and dense semantic map creation. DS-SLAM combines semantic segmentation network with moving consistency check method to reduce the impact of dynamic objects, and thus the localization accuracy is highly improved in dynamic environments. Meanwhile, a dense semantic octo-tree map is produced, which could be employed for high-level tasks. We conduct experiments both on TUM RGB-D dataset and in the real-world environment. The results demonstrate the absolute trajectory accuracy in DS-SLAM can be improved by one order of magnitude compared with ORB-SLAM2. It is one of the state-of-the-art SLAM systems in high-dynamic environments. Now the code is available at our github: https://github.com/ivipsourcecode/DS-SLAMComment: 7 pages, accepted at the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2018). Now the code is available at our github: https://github.com/ivipsourcecode/DS-SLA

    Semantic Instance Annotation of Street Scenes by 3D to 2D Label Transfer

    Full text link
    Semantic annotations are vital for training models for object recognition, semantic segmentation or scene understanding. Unfortunately, pixelwise annotation of images at very large scale is labor-intensive and only little labeled data is available, particularly at instance level and for street scenes. In this paper, we propose to tackle this problem by lifting the semantic instance labeling task from 2D into 3D. Given reconstructions from stereo or laser data, we annotate static 3D scene elements with rough bounding primitives and develop a model which transfers this information into the image domain. We leverage our method to obtain 2D labels for a novel suburban video dataset which we have collected, resulting in 400k semantic and instance image annotations. A comparison of our method to state-of-the-art label transfer baselines reveals that 3D information enables more efficient annotation while at the same time resulting in improved accuracy and time-coherent labels.Comment: 10 pages in Conference on Computer Vision and Pattern Recognition (CVPR), 201

    Driven to Distraction: Self-Supervised Distractor Learning for Robust Monocular Visual Odometry in Urban Environments

    Full text link
    We present a self-supervised approach to ignoring "distractors" in camera images for the purposes of robustly estimating vehicle motion in cluttered urban environments. We leverage offline multi-session mapping approaches to automatically generate a per-pixel ephemerality mask and depth map for each input image, which we use to train a deep convolutional network. At run-time we use the predicted ephemerality and depth as an input to a monocular visual odometry (VO) pipeline, using either sparse features or dense photometric matching. Our approach yields metric-scale VO using only a single camera and can recover the correct egomotion even when 90% of the image is obscured by dynamic, independently moving objects. We evaluate our robust VO methods on more than 400km of driving from the Oxford RobotCar Dataset and demonstrate reduced odometry drift and significantly improved egomotion estimation in the presence of large moving vehicles in urban traffic.Comment: International Conference on Robotics and Automation (ICRA), 2018. Video summary: http://youtu.be/ebIrBn_nc-

    Lifting GIS Maps into Strong Geometric Context for Scene Understanding

    Full text link
    Contextual information can have a substantial impact on the performance of visual tasks such as semantic segmentation, object detection, and geometric estimation. Data stored in Geographic Information Systems (GIS) offers a rich source of contextual information that has been largely untapped by computer vision. We propose to leverage such information for scene understanding by combining GIS resources with large sets of unorganized photographs using Structure from Motion (SfM) techniques. We present a pipeline to quickly generate strong 3D geometric priors from 2D GIS data using SfM models aligned with minimal user input. Given an image resectioned against this model, we generate robust predictions of depth, surface normals, and semantic labels. We show that the precision of the predicted geometry is substantially more accurate other single-image depth estimation methods. We then demonstrate the utility of these contextual constraints for re-scoring pedestrian detections, and use these GIS contextual features alongside object detection score maps to improve a CRF-based semantic segmentation framework, boosting accuracy over baseline models
    corecore