
    Object Detection by Spatio-Temporal Analysis and Tracking of the Detected Objects in a Video with Variable Background

    In this paper we propose a novel approach for detecting and tracking objects in videos with a variable background, i.e., videos captured by moving cameras without any additional sensor. In such videos, both the background and the foreground change in every frame of the image sequence, so modeling a single background with traditional background-modeling methods is infeasible, and detecting the actual moving objects becomes a challenging task. To detect the actual moving objects, spatio-temporal blobs are generated in each frame by spatio-temporal analysis of the image sequence using a three-dimensional Gabor filter. Individual blobs that are parts of one object are then merged using a Minimum Spanning Tree to form the moving object in the variable background. The height, width, and four-bin gray-value histogram of the object are computed as its features, and the object is tracked in each frame using these features to generate its trajectories through the video sequence. The problem of data association during tracking is solved as a Linear Assignment Problem, and occlusion is handled by applying a Kalman filter. The major advantage of our method over most existing tracking algorithms is that it requires neither initialization in the first frame nor training on sample data. The performance of the algorithm has been tested on benchmark videos with very satisfactory results, and is comparable or superior to that of several benchmark algorithms.
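    A minimal sketch of the data-association step named in this abstract, using SciPy's linear_sum_assignment to solve the Linear Assignment Problem over the stated features (height, width, four-bin gray-value histogram); the Euclidean feature-distance cost is an assumption, not necessarily the paper's choice:

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def associate(track_feats, det_feats):
            # Rows: existing tracks; columns: new detections. Each feature
            # vector is [height, width, h0, h1, h2, h3] (4-bin histogram).
            cost = np.linalg.norm(
                track_feats[:, None, :] - det_feats[None, :, :], axis=2)
            rows, cols = linear_sum_assignment(cost)  # optimal LAP matching
            return list(zip(rows, cols))

        tracks = np.array([[80, 30, 0.2, 0.3, 0.4, 0.1],
                           [60, 25, 0.5, 0.2, 0.2, 0.1]])
        dets = np.array([[82, 31, 0.2, 0.3, 0.4, 0.1],
                         [10, 10, 0.9, 0.0, 0.0, 0.1],
                         [59, 26, 0.5, 0.2, 0.2, 0.1]])
        print(associate(tracks, dets))  # [(0, 0), (1, 2)]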

    Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos

    Deep ConvNets have been shown to be effective for the task of human pose estimation from single images. However, several challenging issues arise in the video-based case, such as self-occlusion, motion blur, and uncommon poses with few or no examples in training data sets. Temporal information can provide additional cues about the location of body joints and help to alleviate these issues. In this paper, we propose a deep structured model to estimate a sequence of human poses in unconstrained videos. This model can be efficiently trained in an end-to-end manner and is capable of representing the appearance of body joints and their spatio-temporal relationships simultaneously. Domain knowledge about the human body is explicitly incorporated into the network, providing effective priors to regularize the skeletal structure and to enforce temporal consistency. The proposed end-to-end architecture is evaluated on two widely used benchmarks (the Penn Action dataset and the JHMDB dataset) for video-based pose estimation. Our approach significantly outperforms the existing state-of-the-art methods. Comment: Preliminary version to appear in CVPR 2017.
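    The temporal-consistency idea in this abstract can be illustrated with a small sketch; this is not the paper's structured model, just one hedged example of penalizing frame-to-frame jitter in predicted joint locations (tensor shapes are assumptions):

        import torch

        def temporal_consistency_loss(joints):
            # joints: (T, J, 2) tensor of per-frame 2D joint predictions.
            # Penalize large jumps between consecutive frames.
            return ((joints[1:] - joints[:-1]) ** 2).mean()

        poses = torch.randn(10, 14, 2, requires_grad=True)  # 10 frames, 14 joints
        loss = temporal_consistency_loss(poses)
        loss.backward()  # gradients flow back into the pose predictions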

    Real-Time Action Detection in Video Surveillance using Sub-Action Descriptor with Multi-CNN

    When we say a person is texting, can you tell whether the person is walking or sitting? Emphatically, no. To solve this incomplete-representation problem, this paper presents a sub-action descriptor for detailed action detection. The sub-action descriptor consists of three levels: the posture level, the locomotion level, and the gesture level. Together, the three levels give three sub-action categories for one action, addressing the representation problem. The proposed action detection model simultaneously localizes and recognizes the actions of multiple individuals in video surveillance using appearance-based temporal features with multiple CNNs. The proposed approach achieved a mean average precision (mAP) of 76.6% under frame-based and 83.5% under video-based measurement on the new large-scale ICVL video surveillance dataset, which the authors introduce and make available to the community with this paper. Extensive experiments on the benchmark KTH dataset demonstrate that the proposed approach achieves better performance, which in turn boosts action recognition performance over the state of the art. The action detection model runs at around 25 fps on the ICVL dataset and at more than 80 fps on the KTH dataset, which is suitable for real-time surveillance applications. Comment: 29 pages, 16 figures.
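    A minimal sketch of how the three-level sub-action descriptor could be represented; the label vocabularies here are invented for illustration and are not taken from the paper:

        from dataclasses import dataclass

        @dataclass
        class SubActionDescriptor:
            posture: str     # e.g. "standing", "sitting"
            locomotion: str  # e.g. "walking", "stationary"
            gesture: str     # e.g. "texting", "none"

            def action(self) -> str:
                # The triple of sub-action labels forms the full action.
                return f"{self.posture}/{self.locomotion}/{self.gesture}"

        # "Texting" no longer hides whether the person is walking or sitting:
        print(SubActionDescriptor("standing", "walking", "texting").action())
        print(SubActionDescriptor("sitting", "stationary", "texting").action())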

    Human Action Recognition and Prediction: A Survey

    Driven by rapid advances in computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are such tasks: action recognition infers human actions (present state) from complete action executions, while action prediction predicts human actions (future state) from incomplete action executions. These two tasks have recently become particularly prevalent topics because of their explosively emerging real-world applications, such as visual surveillance, autonomous driving, entertainment, and video retrieval. Much effort has been devoted over the last few decades to building robust and effective frameworks for action recognition and prediction. In this paper, we survey the complete state-of-the-art techniques in action recognition and prediction. Existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are also discussed systematically.

    cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey

    This paper presents the futuristic challenges discussed in the cvpaper.challenge. In 2015 and 2016, we thoroughly studied 1,600+ papers in several conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.

    A Survey on Content-Aware Video Analysis for Sports

    Sports data analysis is becoming increasingly large-scale, diversified, and shared, but difficulty persists in rapidly accessing the most crucial information. Previous surveys have focused on the methodologies of sports video analysis from the spatiotemporal viewpoint instead of a content-based viewpoint, and few of these studies have considered semantics. This study develops a deeper interpretation of content-aware sports video analysis by examining the insight offered by research into the structure of content under different scenarios. On the basis of this insight, we provide an overview of the themes particularly relevant to the research on content-aware systems for broadcast sports. Specifically, we focus on the video content analysis techniques applied in sportscasts over the past decade from the perspectives of fundamentals and general review, a content hierarchical model, and trends and challenges. Content-aware analysis methods are discussed with respect to object-, event-, and context-oriented groups. In each group, the gap between sensation and content excitement must be bridged using proper strategies. In this regard, a content-aware approach is required to determine user demands. Finally, the paper summarizes the future trends and challenges for sports video analysis. We believe that our findings can advance the field of research on content-aware video analysis for broadcast sports. Comment: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT).

    A Hierarchical Deep Temporal Model for Group Activity Recognition

    In group activity recognition, the temporal dynamics of the whole activity can be inferred from the dynamics of the individual people performing the activity. We build a deep model to capture these dynamics based on LSTM (long short-term memory) models. To make use of these observations, we present a 2-stage deep temporal model for the group activity recognition problem. In our model, one LSTM model is designed to represent the action dynamics of individual people in a sequence, and another LSTM model is designed to aggregate person-level information for whole-activity understanding. We evaluate our model on two datasets: the Collective Activity dataset and a new volleyball dataset. Experimental results demonstrate that our proposed model improves group activity recognition performance compared to baseline methods. Comment: Accepted to CVPR 2016.
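    A minimal PyTorch sketch of the 2-stage temporal model described above: a person-level LSTM encodes each individual's dynamics, and a group-level LSTM aggregates the pooled person representations; hidden sizes and max-pooling over people are assumptions, not the paper's exact design:

        import torch
        import torch.nn as nn

        class HierarchicalActivityModel(nn.Module):
            def __init__(self, feat_dim=512, hidden=256, n_classes=8):
                super().__init__()
                self.person_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
                self.group_lstm = nn.LSTM(hidden, hidden, batch_first=True)
                self.classifier = nn.Linear(hidden, n_classes)

            def forward(self, x):
                # x: (batch, people, time, feat_dim) per-person features.
                b, p, t, d = x.shape
                person_out, _ = self.person_lstm(x.reshape(b * p, t, d))
                person_out = person_out.reshape(b, p, t, -1)
                group_in = person_out.max(dim=1).values  # pool over people
                group_out, _ = self.group_lstm(group_in)
                return self.classifier(group_out[:, -1])  # last time step

        logits = HierarchicalActivityModel()(torch.randn(2, 12, 10, 512))
        print(logits.shape)  # torch.Size([2, 8])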

    A survey on trajectory clustering analysis

    This paper comprehensively surveys the development of trajectory clustering. Considering the critical role of trajectory data mining in modern intelligent systems for surveillance security, abnormal behavior detection, crowd behavior analysis, and traffic control, trajectory clustering has attracted growing attention. Existing trajectory clustering methods can be grouped into three categories: unsupervised, supervised, and semi-supervised algorithms. In spite of a certain level of development, the success of trajectory clustering remains limited by complex conditions such as application scenarios and data dimensionality. This paper provides a holistic understanding of and deep insight into trajectory clustering, and presents a comprehensive analysis of representative methods and promising future directions.

    H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions

    We present a unified framework for understanding 3D hand and object interactions in raw image sequences from egocentric RGB cameras. Given a single RGB image, our model jointly estimates the 3D hand and object poses, models their interactions, and recognizes the object and action classes with a single feed-forward pass through a neural network. We propose a single architecture that does not rely on external detection algorithms but rather is trained end-to-end on single images. We further merge and propagate information in the temporal domain to infer interactions between hand and object trajectories and to recognize actions. The complete model takes as input a sequence of frames and outputs per-frame 3D hand and object pose predictions along with estimates of the object and action categories for the entire sequence. We demonstrate state-of-the-art performance of our algorithm even in comparison to approaches that work on depth data and ground-truth annotations. Comment: CVPR 2019 (Oral).
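    An illustrative sketch (not the paper's architecture) of the single feed-forward idea: one shared backbone with separate heads for 3D hand pose, 3D object pose, object class, and action class; all layer sizes and head shapes here are assumptions:

        import torch
        import torch.nn as nn

        class UnifiedHandObjectNet(nn.Module):
            def __init__(self, n_objects=10, n_actions=20):
                super().__init__()
                self.backbone = nn.Sequential(
                    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
                self.hand_pose = nn.Linear(32, 21 * 3)   # 21 hand joints, 3D
                self.object_pose = nn.Linear(32, 8 * 3)  # 8 box corners, 3D
                self.object_cls = nn.Linear(32, n_objects)
                self.action_cls = nn.Linear(32, n_actions)

            def forward(self, img):
                f = self.backbone(img)  # one shared feed-forward pass
                return (self.hand_pose(f), self.object_pose(f),
                        self.object_cls(f), self.action_cls(f))

        outs = UnifiedHandObjectNet()(torch.randn(1, 3, 128, 128))
        print([tuple(o.shape) for o in outs])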

    An End-to-end 3D Convolutional Neural Network for Action Detection and Segmentation in Videos

    In this paper, we propose an end-to-end 3D CNN for action detection and segmentation in videos. The proposed architecture is a unified deep network that is able to recognize and localize actions based on 3D convolutional features. A video is first divided into equal-length clips, and for each clip a set of tube proposals is generated based on 3D CNN features. Finally, the tube proposals of different clips are linked together, and spatio-temporal action detection is performed using these linked video proposals. This top-down action detection approach explicitly relies on a set of good tube proposals to perform well, and training the bounding-box regression usually requires a large number of annotated samples. To remedy this, we further extend the 3D CNN to an encoder-decoder structure and formulate the localization problem as action segmentation. The foreground regions (i.e., action regions) of each frame are segmented first, and then the segmented foreground maps are used to generate the bounding boxes. This bottom-up approach effectively avoids tube proposal generation by leveraging the pixel-wise annotations of segmentation. The segmentation framework can also be readily applied to the general problem of video object segmentation. Extensive experiments on several video datasets demonstrate the superior performance of our approach for action detection and video object segmentation compared to the state of the art. Comment: arXiv admin note: substantial text overlap with arXiv:1703.1066
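    A minimal sketch of the bottom-up step described above: once a frame's foreground (action) region is segmented, the bounding box is recovered directly from the mask; the threshold value is an assumption:

        import numpy as np

        def mask_to_bbox(mask, thresh=0.5):
            # mask: (H, W) foreground probability map.
            ys, xs = np.where(mask > thresh)
            if len(xs) == 0:
                return None  # no foreground detected in this frame
            return xs.min(), ys.min(), xs.max(), ys.max()

        mask = np.zeros((64, 64))
        mask[10:30, 20:50] = 1.0  # a synthetic segmented actor
        print(mask_to_bbox(mask))  # (20, 10, 49, 29)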