    Video Image Segmentation and Object Detection Using Markov Random Field Model

    In this dissertation, the problem of video object detection is addressed. Initially this is accomplished by the existing method of temporal segmentation. It has been observed that the Video Object Plane (VOP) generated by temporal segmentation has a strong limitation: for slow-moving video objects it either performs poorly or fails. Therefore, the problem of object detection is addressed for both slow-moving and fast-moving video objects. The object is detected by integrating spatial segmentation with temporal segmentation. In order to take the temporal pixel distribution into account, the spatial segmentation of frames is formulated in a spatio-temporal framework. A compound Markov Random Field (MRF) model is proposed to model the video sequence; it captures both the spatial and the temporal distributions. Besides accounting for pixel distributions in the temporal direction, compound MRF models are proposed to model edges in the temporal direction; this is named the edge-based model. Furthermore, the differences between successive frames are modelled by an MRF, called the change-based model, which enhances the performance of the proposed scheme. The spatial segmentation problem is formulated as a pixel labelling problem in the spatio-temporal framework, and the label estimation problem is posed under the Maximum a Posteriori (MAP) criterion. Segmentation is achieved in a supervised mode, with the model parameters selected by trial and error. The MAP estimates of the labels are obtained by a proposed hybrid algorithm that integrates a globally convergent and a locally convergent criterion. Temporal segmentation of frames is obtained without assuming the availability of a reference frame. The spatial and temporal segmentations are integrated to obtain the Video Object Plane (VOP) and hence object detection. In order to reduce the computational burden, an evolutionary-approach-based scheme is proposed: the first frame is segmented, and the segmentations of subsequent frames are derived from that of the first frame, at a much lower computational cost than the previous scheme. An entropy-based adaptive thresholding scheme is proposed to enhance the accuracy of temporal segmentation. Object detection is achieved by integrating the spatial and the improved temporal segmentation results.
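    A minimal, hypothetical sketch of the MAP labelling step may help make this concrete. The Python below runs plain Iterated Conditional Modes (the locally convergent half of such a hybrid) on a Potts-style MRF with one spatial and one temporal clique; the function name `icm_segment` and the weights `beta` and `gamma` are illustrative, and the edge-based and change-based cliques are omitted.

```python
# Sketch: MAP pixel labelling on a Potts MRF via ICM. Hypothetical
# simplification; a hybrid estimator would seed ICM with a globally
# convergent search (e.g., simulated annealing) and add the edge-based
# and change-based temporal cliques described in the abstract.
import numpy as np

def icm_segment(frame, prev_labels, K=4, beta=1.5, gamma=0.8, iters=10):
    """frame: (H, W) grayscale in [0, 1]; prev_labels: labels of the
    previous frame used as the temporal neighbour (None for frame 0)."""
    H, W = frame.shape
    # Initialise labels by uniform intensity quantisation.
    labels = np.minimum((frame * K).astype(int), K - 1)
    for _ in range(iters):
        means = np.array([frame[labels == k].mean() if np.any(labels == k)
                          else k / K for k in range(K)])
        for y in range(H):
            for x in range(W):
                best_k, best_e = labels[y, x], np.inf
                for k in range(K):
                    # Likelihood term: squared deviation from class mean.
                    e = (frame[y, x] - means[k]) ** 2
                    # Spatial Potts prior: penalise disagreeing 4-neighbours.
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and labels[ny, nx] != k:
                            e += beta
                    # Temporal prior: penalise disagreement with previous frame.
                    if prev_labels is not None and prev_labels[y, x] != k:
                        e += gamma
                    if e < best_e:
                        best_k, best_e = k, e
                labels[y, x] = best_k
    return labels
```

    ICM alone only reaches a local optimum, which is why a hybrid of a global and a local convergence criterion, as the dissertation proposes, is attractive.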

    Non-Parametric Probabilistic Image Segmentation

    We propose a simple probabilistic generative model for image segmentation. Like other probabilistic algorithms (such as EM on a Mixture of Gaussians), the proposed model is principled, provides both hard and probabilistic cluster assignments, and can naturally incorporate prior knowledge. While previous probabilistic approaches are restricted to parametric models of clusters (e.g., Gaussians), we eliminate this limitation. The suggested approach does not make strong assumptions on the shape of the clusters and can thus handle complex structures. Our experiments show that the suggested approach outperforms previous work on a variety of image segmentation tasks.
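    As a rough illustration of the non-parametric idea, the sketch below replaces each Gaussian in an EM-style loop with a leave-one-out kernel density estimate, so cluster shape is driven by the data rather than a parametric form. It is a generic construction under assumed names (`kde_em`, `bandwidth`), not the paper's exact model.

```python
# Sketch: EM-style soft clustering where each cluster is a kernel
# density estimate over the points it currently owns, instead of a
# Gaussian. Generic illustration only.
import numpy as np

def kde_em(X, K=2, bandwidth=0.5, iters=20, seed=0):
    """X: (N, D) feature vectors (e.g., per-pixel colour/position).
    Returns soft assignments R of shape (N, K)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    R = rng.dirichlet(np.ones(K), size=N)                 # random soft init
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. dists
    kern = np.exp(-sq / (2 * bandwidth ** 2))              # Gaussian kernel
    np.fill_diagonal(kern, 0.0)                            # leave-one-out
    for _ in range(iters):
        # Density of point i under cluster k: responsibility-weighted
        # kernel sum over all other points.
        dens = kern @ R                                    # (N, K)
        prior = R.mean(0)                                  # mixing weights
        R = dens * prior
        R /= R.sum(1, keepdims=True) + 1e-12
    return R
```

    A hard segmentation is then `R.argmax(1)`, and prior knowledge can be folded in by seeding `R` instead of initialising it randomly.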

    Unconstrained Scene Text and Video Text Recognition for Arabic Script

    Building robust recognizers for Arabic has always been challenging. We demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid architecture in recognizing Arabic text in videos and natural scenes. We outperform the previous state-of-the-art on two publicly available video text datasets, ALIF and ACTIV. For the scene text recognition task, we introduce a new Arabic scene text dataset and establish baseline results. For scripts like Arabic, a major challenge in developing robust recognizers is the lack of large quantities of annotated data. We overcome this by synthesising millions of Arabic text images from a large vocabulary of Arabic words and phrases. Our implementation is built on top of the model introduced in [37], which has proven quite effective for English scene text recognition. The model follows a segmentation-free, sequence-to-sequence transcription approach: the network transcribes a sequence of convolutional features from the input image to a sequence of target labels. This does away with the need to segment the input image into constituent characters/glyphs, which is often difficult for Arabic script. Further, the ability of RNNs to model contextual dependencies yields superior recognition results.
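    A minimal sketch of such a segmentation-free CNN-RNN transcriber, in the spirit of the CRNN family this work builds on, is given below in PyTorch; all layer sizes are illustrative assumptions, not the paper's configuration.

```python
# Sketch: convolutional features sliced into a horizontal sequence and
# transcribed by a bidirectional LSTM; training would pair this with
# CTC loss. Hyperparameters are illustrative, not the paper's.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_classes, img_h=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                # halve height only
        )
        feat_h = img_h // 8                      # height after the pools
        self.rnn = nn.LSTM(256 * feat_h, 256, bidirectional=True,
                           batch_first=True)
        self.fc = nn.Linear(512, n_classes + 1)  # +1 for the CTC blank

    def forward(self, x):                        # x: (B, 1, 32, W)
        f = self.cnn(x)                          # (B, 256, H', W')
        f = f.permute(0, 3, 1, 2).flatten(2)     # (B, W', 256 * H')
        out, _ = self.rnn(f)                     # one output per column
        return self.fc(out)                      # per-column class logits
```

    With `nn.CTCLoss` on the per-column logits, no character-level segmentation of the Arabic input image is ever needed.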

    Hierarchical Attention Network for Action Segmentation

    The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in video. Several attempts have been made to capture frame-level salient aspects through attention, but they lack the capacity to effectively map the temporal relationships between frames, as they capture only a limited span of temporal dependencies. To this end, we propose a complete end-to-end supervised learning approach that better learns relationships between actions over time, thus improving overall segmentation performance. The proposed hierarchical recurrent attention framework analyses the input video at multiple temporal scales, forming embeddings at the frame level and the segment level to perform fine-grained action segmentation. This yields a simple, lightweight, yet extremely effective architecture for segmenting continuous video streams, with multiple application domains. We evaluate our system on multiple challenging public benchmark datasets, including the MERL Shopping, 50 Salads, and Georgia Tech Egocentric datasets, and achieve state-of-the-art performance. The evaluated datasets encompass numerous video capture settings, including static overhead camera views and dynamic, egocentric head-mounted camera views, demonstrating the direct applicability of the proposed framework in a variety of settings.
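    The sketch below illustrates the general two-level idea only: frame-level attention pooling inside short windows followed by a segment-level recurrent pass. The layer sizes, window length `win`, and the simple learned-score attention are assumptions, not the published architecture.

```python
# Sketch: two-level (frame -> segment) recurrent attention for action
# segmentation. Purely illustrative.
import torch
import torch.nn as nn

class AttnPool(nn.Module):
    """Soft-attention pooling over a sequence of hidden states."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                        # h: (B, T, D)
        w = torch.softmax(self.score(h), dim=1)  # (B, T, 1) weights
        return (w * h).sum(1)                    # (B, D) summary

class HierarchicalSegmenter(nn.Module):
    def __init__(self, feat_dim, n_actions, win=16, dim=128):
        super().__init__()
        self.win = win                           # frames per local window
        self.frame_rnn = nn.GRU(feat_dim, dim, batch_first=True)
        self.frame_pool = AttnPool(dim)
        self.seg_rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, n_actions)

    def forward(self, x):                        # x: (B, T, feat_dim),
        B, T, _ = x.shape                        # T a multiple of win
        chunks = x.view(B, T // self.win, self.win, -1)
        # Frame level: encode and attention-pool each window.
        embs = []
        for i in range(chunks.size(1)):
            h, _ = self.frame_rnn(chunks[:, i])
            embs.append(self.frame_pool(h))
        # Segment level: longer-range dependencies across windows.
        seg, _ = self.seg_rnn(torch.stack(embs, dim=1))
        return self.head(seg)                    # per-window action logits
```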

    Object-based 2D-to-3D video conversion for effective stereoscopic content generation in 3D-TV applications

    Three-dimensional television (3D-TV) has gained increasing popularity in the broadcasting domain, as it enables enhanced viewing experiences in comparison to conventional two-dimensional (2D) TV. However, its application has been constrained by the lack of essential content, i.e., stereoscopic videos. To alleviate this content shortage, an economical and practical solution is to reuse the huge media resources available in monoscopic 2D and convert them to stereoscopic 3D. Although stereoscopic video can be generated from monoscopic sequences using depth measurements extracted from cues such as focus blur, motion, and size, the quality of the resulting video may be poor, as such measurements are usually arbitrarily defined and appear inconsistent with the real scenes. To help solve this problem, a novel method for object-based stereoscopic video generation is proposed, which features i) optical-flow based occlusion reasoning in determining depth ordinal, ii) object segmentation using improved region-growing from masks of determined depth layers, and iii) a hybrid depth estimation scheme using content-based matching (inside a small library of true stereo image pairs) and depth-ordinal based regularization. Comprehensive experiments have validated the effectiveness of our proposed 2D-to-3D conversion method in generating stereoscopic videos of consistent depth measurements for 3D-TV applications.
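    To illustrate the final stage of such a pipeline, the sketch below performs simple depth-image-based rendering: given a left view and an estimated depth map, it synthesises a right-eye view by disparity-proportional pixel shifts. The disparity range and hole-filling strategy are illustrative assumptions; the occlusion reasoning, region growing, and content-matching stages are not shown.

```python
# Sketch: synthesise a second view from one image plus a depth map by
# shifting pixels horizontally in proportion to disparity.
import numpy as np

def render_right_view(left, depth, max_disp=16):
    """left: (H, W, 3) image; depth: (H, W) in [0, 1], 1 = nearest.
    Returns a synthesised right-eye view with simple hole filling."""
    H, W, _ = left.shape
    right = np.zeros_like(left)
    filled = np.zeros((H, W), dtype=bool)
    disp = (depth * max_disp).astype(int)      # nearer pixels shift more
    # Paint far-to-near so near objects correctly occlude far ones.
    for d in range(max_disp + 1):
        ys, xs = np.where(disp == d)
        tx = xs - d                            # shift into the right view
        ok = tx >= 0
        right[ys[ok], tx[ok]] = left[ys[ok], xs[ok]]
        filled[ys[ok], tx[ok]] = True
    # Fill disocclusion holes by propagating the previous valid pixel.
    for y in range(H):
        for x in range(1, W):
            if not filled[y, x] and filled[y, x - 1]:
                right[y, x] = right[y, x - 1]
                filled[y, x] = True
    return right
```

    Rendering far-to-near is the key design choice: later (nearer) writes overwrite earlier (farther) ones, which is exactly the occlusion order a consistent depth ordinal should produce.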

    Action Recognition in Videos: from Motion Capture Labs to the Web

    This paper presents a survey of human action recognition approaches based on visual data recorded by a single video camera. We propose an organizing framework that highlights the evolution of the area, with techniques moving from heavily constrained motion-capture scenarios towards more challenging, realistic, "in the wild" videos. The proposed organization is based on the representation used as input for the recognition task, emphasizing the hypotheses assumed and thus the constraints imposed on the type of video each technique is able to address. Making the hypotheses and constraints explicit renders the framework particularly useful for selecting a method for a given application. Another advantage of the proposed organization is that it allows the newest approaches to be categorized seamlessly alongside traditional ones, while providing an insightful perspective on the evolution of the action recognition task up to now. That perspective is the basis for the discussion at the end of the paper, where we also present the main open issues in the area.