
    Region-based spatial and temporal image segmentation

    This work discusses region-based representations for image and video sequence segmentation. It presents effective image segmentation techniques and demonstrates how they can be integrated into algorithms that address motion segmentation problems. The region-based representation offers a first level of abstraction and reduces the number of elements to process compared with the classical pixel-based representation. Motion segmentation is a fundamental technique for the analysis and understanding of image sequences of real scenes: it describes the sequence as sets of pixels moving coherently across the sequence with associated motions. This description is essential for identifying the objects in the scene and for more efficient manipulation of video sequences. This thesis presents a hybrid framework that combines spatial and motion information to segment moving objects in image sequences according to their motion. We formulate the problem as graph labelling over a region moving graph whose nodes correspond to coherently moving atomic regions. This is a flexible high-level representation which individualizes independently moving objects. Starting from an over-segmentation of the image, objects are formed by merging neighbouring regions according to their mutual spatial and temporal similarity, taking both spatial and motion information into account with the emphasis on the latter. The final segmentation is obtained by a spectral graph-cut approach. The initial phase of the moving-object segmentation reduces image noise without destroying the topological structure of the objects, using anisotropic bilateral filtering. An initial spatial partition into a set of homogeneous regions is obtained by the watershed transform. The motion vector of each region is estimated by a variational approach. Next, a region moving graph is constructed from a normalized combination of region similarities that considers the mean intensity of the regions, the gradient magnitude between regions, and the motion of the regions. The motion similarity measure between regions is based on human perceptual characteristics. Finally, a spectral graph-cut approach clusters and labels each moving region. The motion segmentation approach builds on a static image segmentation method proposed by the author of this dissertation. The main idea is to use atomic regions to guide a segmentation by intensity and gradient information through a similarity graph-based approach. This method produces simpler, less over-segmented results and compares favourably with state-of-the-art methods. To evaluate the segmentation results, a new evaluation metric is proposed that takes into account the way humans perceive visual information. By incorporating spatial and motion information simultaneously in a region-based framework, we obtain visually meaningful segmentation results. Experimental results of the proposed technique are given for different image sequences with and without camera motion and for still images; in the latter case a comparison with state-of-the-art approaches is made.
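
    A minimal sketch of this pipeline is given below, assuming two consecutive grayscale frames and the scikit-image/scikit-learn APIs. It is not the author's exact formulation: the standard bilateral filter stands in for the anisotropic bilateral filtering, TV-L1 optical flow stands in for the variational motion estimator, a fully connected affinity matrix replaces the region moving graph over neighbouring regions, and all function names, weights and marker counts are illustrative.

```python
import numpy as np
from skimage import filters, segmentation
from skimage.restoration import denoise_bilateral
from skimage.registration import optical_flow_tvl1
from sklearn.cluster import SpectralClustering


def segment_moving_regions(frame0, frame1, n_objects=3, n_markers=400):
    """Cluster watershed regions of frame0 into moving objects.

    frame0, frame1: consecutive grayscale frames as floats in [0, 1].
    """
    # 1. Edge-preserving smoothing (stand-in for anisotropic bilateral filtering):
    #    reduces noise without destroying the objects' topological structure.
    smooth = denoise_bilateral(frame0, sigma_color=0.05, sigma_spatial=3)

    # 2. Over-segmentation into homogeneous atomic regions:
    #    watershed transform on the gradient magnitude.
    gradient = filters.sobel(smooth)
    labels = segmentation.watershed(gradient, markers=n_markers)

    # 3. Dense motion field from a variational (TV-L1) optical flow,
    #    averaged per region to get one motion vector per atomic region.
    v, u = optical_flow_tvl1(frame0, frame1)
    region_ids = np.unique(labels)
    feats = np.array([
        [smooth[labels == r].mean(),      # mean intensity
         gradient[labels == r].mean(),    # mean gradient magnitude
         u[labels == r].mean(),           # mean horizontal motion
         v[labels == r].mean()]           # mean vertical motion
        for r in region_ids
    ])

    # 4. Normalized similarity graph over regions; the motion terms are
    #    weighted more heavily, echoing the emphasis on temporal information.
    d = (feats[:, None, :] - feats[None, :, :]) / (feats.std(axis=0) + 1e-8)
    w = np.exp(-(0.5 * d[..., 0]**2 + 0.5 * d[..., 1]**2
                 + 2.0 * (d[..., 2]**2 + d[..., 3]**2)))

    # 5. Spectral clustering on the affinity matrix labels each moving region.
    assign = SpectralClustering(n_clusters=n_objects,
                                affinity='precomputed').fit_predict(w)
    out = np.zeros_like(labels)
    for rid, cluster in zip(region_ids, assign):
        out[labels == rid] = cluster + 1
    return out
```

    Spectral clustering on a precomputed affinity is used here as a simple proxy for the spectral graph-cut step; the perceptually motivated motion similarity of the thesis is approximated by plain Gaussian weights.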

    Joint Video and Text Parsing for Understanding Events and Answering Queries

    We propose a framework for jointly parsing video and text to understand events and answer user queries. Our framework produces a parse graph that represents the compositional structures of spatial information (objects and scenes), temporal information (actions and events) and causal information (causalities between events and fluents) in the video and text. The knowledge representation of our framework is based on a spatial-temporal-causal And-Or graph (S/T/C-AOG), which jointly models possible hierarchical compositions of objects, scenes and events as well as their interactions and mutual contexts, and specifies the prior probability distribution of the parse graphs. We present a probabilistic generative model for joint parsing that captures the relations between the input video/text, their corresponding parse graphs and the joint parse graph. Based on this probabilistic model, we propose a joint parsing system consisting of three modules: video parsing, text parsing and joint inference. Video parsing and text parsing produce two parse graphs from the input video and text respectively. The joint inference module produces a joint parse graph by performing matching, deduction and revision on the video and text parse graphs. The proposed framework has the following objectives: first, we aim at deep semantic parsing of video and text that goes beyond traditional bag-of-words approaches; second, we perform parsing and reasoning across the spatial, temporal and causal dimensions based on the joint S/T/C-AOG representation; third, we show that deep joint parsing facilitates subsequent applications such as generating narrative text descriptions and answering queries in the form of who, what, when, where and why. We empirically evaluated our system by comparison against ground truth and by the accuracy of query answering, and obtained satisfactory results.
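
    As a toy illustration of the joint inference step, the sketch below merges a video parse graph and a text parse graph by matching and deduction. The representation is drastically simplified and entirely assumed: the Node encoding, the union-based merge and the absence of probabilities and causal fluents are illustrative, not the paper's S/T/C-AOG energy formulation.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

# One parse-graph node: a kind ("object", "scene", "event") plus a label.
# Fluents, time intervals and probabilities are omitted in this toy version.
Node = Tuple[str, str]


@dataclass
class ParseGraph:
    nodes: Set[Node] = field(default_factory=set)
    # Directed relations, e.g. ("agent", event-node, object-node).
    edges: Set[Tuple[str, Node, Node]] = field(default_factory=set)


def joint_parse(video_pg: ParseGraph, text_pg: ParseGraph) -> ParseGraph:
    """Merge two parse graphs by matching and deduction.

    Matching: nodes present in both graphs are kept once.
    Deduction: nodes and relations seen in only one modality are copied
    over, so each modality fills gaps in the other.
    Revision would resolve conflicting attributes; this toy merge has
    none to resolve.
    """
    joint = ParseGraph()
    joint.nodes = video_pg.nodes | text_pg.nodes
    joint.edges = video_pg.edges | text_pg.edges
    return joint


# Usage: the video detects a person; the text adds that the person enters.
video = ParseGraph(nodes={("object", "person")})
text = ParseGraph(
    nodes={("object", "person"), ("event", "enter")},
    edges={("agent", ("event", "enter"), ("object", "person"))},
)
print(joint_parse(video, text).edges)
```

    A query such as "who entered?" would then be answered by looking up the "agent" edge of the "enter" event in the joint graph, which is the intuition behind answering who/what/when/where/why queries from the joint parse.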