
    Joint Optical Flow and Temporally Consistent Semantic Segmentation

    The importance and demands of visual scene understanding have been steadily increasing along with the active development of autonomous systems. Consequently, a large amount of research has been dedicated to semantic segmentation and dense motion estimation. In this paper, we propose a method for jointly estimating optical flow and temporally consistent semantic segmentation, closely connecting the two problem domains so that each leverages the other. Semantic segmentation provides information on plausible physical motion for its associated pixels, while accurate pixel-level temporal correspondences enhance the accuracy of semantic segmentation across time. We demonstrate the benefits of our approach on the KITTI benchmark, where we observe performance gains for both flow and segmentation. We achieve state-of-the-art optical flow results and outperform all published algorithms by a large margin on challenging, but crucial, dynamic objects.
    Comment: 14 pages, accepted for the CVRSUAD workshop at ECCV 2016
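    The coupling runs in both directions; as a minimal sketch of the segmentation side, the snippet below uses forward optical flow to carry frame-t labels to frame t+1 and scores how many labels survive the warp. It is a didactic illustration, not the authors' formulation: the nearest-neighbour splatting, the function names, and the consistency score are all assumptions.

```python
import numpy as np

def warp_labels(labels_t, flow):
    """Carry frame-t semantic labels to frame t+1 by following the flow.

    labels_t: (H, W) signed-integer label map at time t
    flow:     (H, W, 2) forward flow (dx, dy) from t to t+1
    Nearest-neighbour forward splatting keeps labels discrete;
    collisions are resolved arbitrarily (last write wins).
    """
    H, W = labels_t.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xt = np.clip(np.rint(xs + flow[..., 0]), 0, W - 1).astype(int)
    yt = np.clip(np.rint(ys + flow[..., 1]), 0, H - 1).astype(int)
    warped = np.full((H, W), -1, dtype=labels_t.dtype)  # -1 = no source pixel
    warped[yt, xt] = labels_t[ys, xs]
    return warped

def temporal_consistency(labels_t, labels_t1, flow):
    """Fraction of pixels whose label survives the flow-induced warp."""
    warped = warp_labels(labels_t, flow)
    valid = warped >= 0
    return np.mean(warped[valid] == labels_t1[valid])
```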

    Semantic Video CNNs through Representation Warping

    In this work, we propose a technique to convert CNN models for semantic segmentation of static images into CNNs for video data. We describe a warping method that can be used to augment existing architectures with very little extra computational cost. This module, called NetWarp, can be used with a range of network architectures. The main design principle is to use optical flow between adjacent frames to warp internal network representations across time. A key insight of this work is that fast optical flow methods can be combined with many different CNN architectures for improved performance and end-to-end training. Experiments validate that the proposed approach incurs only a small extra computational cost while improving performance when video streams are available. We achieve new state-of-the-art results on the CamVid and Cityscapes benchmark datasets and show consistent improvements over different baseline networks. Our code and models will be available at http://segmentation.is.tue.mpg.de
    Comment: ICCV 2017
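    The central operation, warping the previous frame's internal feature map to the current frame with optical flow and then combining the two, can be approximated with PyTorch's grid_sample. This is a hedged sketch of that general idea rather than the NetWarp module itself; the fixed blending weight alpha and the backward-warping convention are assumptions (the paper's module is learned end to end).

```python
import torch
import torch.nn.functional as F

def warp_features(feat_prev, flow):
    """Bilinearly warp a (N, C, H, W) feature map by a (N, 2, H, W) flow.

    flow holds per-pixel (dx, dy) displacements in pixel units, pointing
    from the current frame back to the previous one (backward warping).
    """
    n, _, h, w = feat_prev.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.to(feat_prev) + flow[:, 0]  # (N, H, W)
    ys = ys.to(feat_prev) + flow[:, 1]
    # Normalize to [-1, 1], the coordinate range grid_sample expects.
    grid = torch.stack((2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1), dim=-1)
    return F.grid_sample(feat_prev, grid, align_corners=True)

def netwarp_like(feat_cur, feat_prev, flow, alpha=0.5):
    """Fuse current features with flow-warped previous features.

    alpha is a hand-picked stand-in for the learned combination.
    """
    return alpha * feat_cur + (1 - alpha) * warp_features(feat_prev, flow)
```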

    Video Propagation Networks

    We propose a technique that propagates information forward through video data. The method is conceptually simple and can be applied to tasks that require the propagation of structured information, such as semantic labels, based on video content. We propose a 'Video Propagation Network' that processes video frames in an adaptive manner. The model is applied online: it propagates information forward without the need to access future frames. In particular, we combine two components: a temporal bilateral network for dense, video-adaptive filtering, followed by a spatial network that refines features and increases flexibility. We present experiments on video object segmentation and semantic video segmentation and show increased performance compared to the best previous task-specific methods, while having favorable runtime. Additionally, we demonstrate our approach on an example regression task of color propagation in a grayscale video.
    Comment: Appearing in Computer Vision and Pattern Recognition, 2017 (CVPR'17)
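    The temporal bilateral component can be illustrated with a naive, brute-force stand-in: each pixel at time t averages the previous frame's label distribution over a small window, weighted jointly by spatial proximity and color similarity, so the filtering adapts to video content. The learned, high-dimensional filtering of the actual Video Propagation Network is replaced here by fixed Gaussian weights; the window radius and both sigmas are arbitrary assumptions.

```python
import numpy as np

def bilateral_propagate(labels_prev, img_prev, img_cur,
                        radius=2, sigma_xy=2.0, sigma_rgb=0.1):
    """Propagate soft labels (H, W, K) from frame t-1 to frame t.

    Each target pixel averages the previous frame's label distribution
    over a small window, weighted by spatial distance and by RGB
    similarity to the current pixel. Images are float arrays in [0, 1].
    Brute force, for illustration only.
    """
    H, W, K = labels_prev.shape
    out = np.zeros_like(labels_prev)
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            ys, xs = np.mgrid[y0:y1, x0:x1]
            d_xy = (ys - y) ** 2 + (xs - x) ** 2
            d_rgb = np.sum((img_prev[y0:y1, x0:x1] - img_cur[y, x]) ** 2, axis=-1)
            w = np.exp(-d_xy / (2 * sigma_xy**2) - d_rgb / (2 * sigma_rgb**2))
            out[y, x] = (w[..., None] * labels_prev[y0:y1, x0:x1]).sum((0, 1)) / w.sum()
    return out
```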

    Perceptual Real-Time 2D-to-3D Conversion Using Cue Fusion

    We propose a system to infer binocular disparity from a monocular video stream in real time. Unlike classic reconstruction of physical depth in computer vision, we compute perceptually plausible disparity that is numerically inaccurate but yields a very similar overall depth impression, with a plausible overall layout, sharp edges, fine details, and agreement between luminance and disparity. We use several simple monocular cues to estimate disparity maps and confidence maps of low spatial and temporal resolution in real time. These are complemented by spatially varying, appearance-dependent, and class-specific disparity prior maps learned from example stereo images; scene classification selects this prior at runtime. Fusion of prior and cues is done by means of robust MAP inference on a dense spatio-temporal conditional random field with high spatial and temporal resolution. Modeling the estimates as normal distributions allows this fusion in constant-time, parallel per-pixel work. We compare our approach to previous 2D-to-3D conversion systems in terms of different metrics as well as a user study, and validate our notion of perceptually plausible disparity.
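    The constant-time claim rests on a standard identity: the product of two Gaussians is again a Gaussian with a precision-weighted mean, so fusing a cue estimate with a prior is O(1) per pixel and trivially parallel. A minimal per-pixel sketch, with names and the scalar example chosen for illustration:

```python
import numpy as np

def fuse_gaussian(mu_cue, var_cue, mu_prior, var_prior):
    """Per-pixel MAP fusion of two independent Gaussian disparity estimates.

    The product of N(mu_cue, var_cue) and N(mu_prior, var_prior) is a
    Gaussian with precision-weighted mean: an O(1) computation per pixel,
    which is what makes constant-time, parallel fusion possible. Works
    elementwise on arrays as well as on scalars.
    """
    p_cue, p_prior = 1.0 / var_cue, 1.0 / var_prior  # precisions
    var = 1.0 / (p_cue + p_prior)
    mu = var * (p_cue * mu_cue + p_prior * mu_prior)
    return mu, var

# Example: a confident cue (low variance) dominates a vague prior.
mu, var = fuse_gaussian(mu_cue=2.0, var_cue=0.1, mu_prior=0.0, var_prior=1.0)
print(mu, var)  # ~1.82, ~0.09: the fused estimate stays close to the cue
```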

Scale-Adaptive Video Understanding

    The recent rise of large-scale, diverse video data has ushered in a new era of high-level video understanding. It is increasingly critical for intelligent systems to extract semantics from videos. In this dissertation, we explore the use of supervoxel hierarchies as a video representation for high-level video understanding. Supervoxel hierarchies contain rich multiscale decompositions of video content, where various structures can be found at various levels. However, no single scale contains all the desired structures, so the scales for subsequent video analysis must be chosen adaptively. We therefore present a set of tools to manipulate scales in supervoxel hierarchies, including both scale generation and scale selection methods. In our scale generation work, we evaluate a set of seven supervoxel methods in the context of what we consider to be a good supervoxel for video representation. We address a key limitation that has traditionally prevented supervoxel scale generation on long videos by proposing an approximation framework for streaming hierarchical scale generation that can produce multiscale decompositions for arbitrarily long videos using constant memory. Subsequently, we present two scale selection methods that adaptively choose scales according to application needs. The first method flattens the entire supervoxel hierarchy into a single segmentation, overcoming the limitation induced by the trivial selection of a single scale; we show that the selection can be driven by various post hoc feature criteria. The second method combines the supervoxel hierarchy with a conditional random field for the task of labeling actors and actions in videos, formulating scale selection and video labeling in a joint framework. Experiments on a novel large-scale video dataset demonstrate the effectiveness of the explicit consideration of scale selection in video understanding. Aside from the computational methods, we present a visual psychophysical study to quantify how well the actor and action semantics of high-level video understanding are retained in supervoxel hierarchies. The findings suggest that some semantics are well retained in supervoxel hierarchies and can be used for further video analysis.
    PhD thesis, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/133202/1/cliangxu_1.pdf
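    The first scale selection method, flattening the hierarchy under a post hoc criterion, can be sketched as a greedy top-down traversal: keep a node if it scores well enough, otherwise recurse into its children, so different parts of the video settle at different scales. This is an illustrative reconstruction, not the dissertation's algorithm; the Supervoxel class, the score function, and the threshold rule are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Supervoxel:
    """A node in a supervoxel hierarchy; leaves are the finest scale."""
    voxels: set                 # spatio-temporal voxel ids covered by this node
    children: list = field(default_factory=list)

def flatten(node, score, threshold):
    """Return a single segmentation (list of nodes) mixing scales.

    Descend from the root; keep a node if it is homogeneous enough under
    `score` (higher = more homogeneous), otherwise recurse into children,
    so different parts of the video end up at different scales.
    """
    if not node.children or score(node) >= threshold:
        return [node]
    selected = []
    for child in node.children:
        selected.extend(flatten(child, score, threshold))
    return selected

# Hypothetical usage: score by inverse size, so large regions get refined.
root = Supervoxel(voxels=set(range(8)), children=[
    Supervoxel(voxels={0, 1, 2, 3}),
    Supervoxel(voxels={4, 5, 6, 7}),
])
segments = flatten(root, score=lambda n: 1.0 / len(n.voxels), threshold=0.2)
print([sorted(s.voxels) for s in segments])  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```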

Hybrid Metric-Topological and Semantic Mapping for Navigation in Large Environments

    Autonomous navigation is one of the most challenging tasks for mobile robots. It requires the ability to localize oneself or a target and to find the best path linking both positions while avoiding obstacles. Towards this goal, robots build a map of the environment that models its geometry or topology. However, building such a map of a large-scale environment is challenging due to the large amount of data to manage, and localization can become intractable. Additionally, an ever-changing environment leads to fast obsolescence of the map, which then becomes useless. As shown in this thesis, introducing semantics into those maps dramatically improves the navigation performance of robots in realistic environments. Scene parsing makes it possible to build extremely compact semantic models of the scene, which are used for fast relocalization via a graph-matching approach. These models are powerful tools for understanding the scene, and they are used to extend the map beyond the perceptual limits of the robot through reasoning. Statistical analysis of these models is used to build an embryo of common sense, which allows labeling errors to be detected and the map to be updated using algorithms designed to maintain a stable model of the world despite occlusions caused by dynamic objects. Finally, semantics is used to select the best route to a target position according to high-level criteria instead of metric constraints, enabling intelligent navigation.
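    As a hedged illustration of the graph-matching relocalization idea (not the thesis's actual algorithm), one can score a stored place by how many node classes and labeled adjacencies its semantic graph shares with the graph parsed from the current view:

```python
def class_counts(labels):
    """Histogram of semantic classes among a graph's nodes."""
    counts = {}
    for c in labels.values():
        counts[c] = counts.get(c, 0) + 1
    return counts

def labeled_edges(labels, edges):
    """Set of unordered class pairs that are adjacent, e.g. ('car', 'road')."""
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}

def graph_similarity(graph_a, graph_b):
    """Crude similarity between two semantic scene graphs.

    A graph is (labels, edges): labels maps node id -> class name, edges
    is a set of (id, id) adjacency pairs. Shared class occurrences plus
    shared labeled adjacencies stand in for proper graph matching.
    """
    (la, ea), (lb, eb) = graph_a, graph_b
    ca, cb = class_counts(la), class_counts(lb)
    node_score = sum(min(n, cb.get(c, 0)) for c, n in ca.items())
    edge_score = len(labeled_edges(la, ea) & labeled_edges(lb, eb))
    return node_score + edge_score

def relocalize(query_graph, stored_graphs):
    """Pick the stored place whose semantic graph best matches the query."""
    return max(stored_graphs, key=lambda g: graph_similarity(query_graph, g))
```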