2,570 research outputs found

    Video indexing and summarization using motion activity

    Get PDF
    In this dissertation, video-indexing techniques using low-level motion activity characteristics and their application to video summarization are presented. The MPEG-7 motion activity feature is defined as the subjective level of activity or motion in a video segment. First, a novel psychophysical and analytical framework for automatic measurement of motion activity in compliance with its subjective perception is developed. A psychophysically sound subjective ground truth for motion activity and a test-set of video clips is constructed for this purpose. A number of low-level, compressed domain motion vector based, known and novel descriptors are then described. It is shown that these descriptors successfully estimate the subjective level of motion activity of video clips. Furthermore, the individual strengths and limitations of the proposed descriptors are determined using a novel pair wise comparison framework. It is verified that the intensity of motion activity descriptor of the MPEG-7 standard is one of the best performers, while a novel descriptor proposed in this dissertation performs comparably or better. A new descriptor for the spatial distribution of motion activity in a scene is proposed. This descriptor is supplementary to the intensity of motion activity descriptor. The new descriptor is shown to have comparable query retrieval performance to the current spatial distribution of motion activity descriptor of the MPEG-7 standard. The insights obtained from the motion activity investigation are applied to video summarization. A novel approach to summarizing and skimming through video using motion activity is presented. The approach is based on allocation of playback time to video segments proportional to the motion activity of the segments. Low activity segments are played faster than high activity segments in such a way that a constant level of activity is maintained throughout the video. Since motion activity is a low-complexity descriptor, the proposed summarization techniques are extremely fast. The summarization techniques are successfully used on surveillance video, The proposed techniques can also be used as a preprocessing stage for more complex summarization and content analysis techniques, thus providing significant cost gains

    Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor

    Get PDF
    We investigate video classification via a two-stream convolutional neural network (CNN) design that directly ingests information extracted from compressed video bitstreams. Our approach begins with the observation that all modern video codecs divide the input frames into macroblocks (MBs). We demonstrate that selective access to MB motion vector (MV) information within compressed video bitstreams can also provide for selective, motion-adaptive, MB pixel decoding (a.k.a., MB texture decoding). This in turn allows for the derivation of spatio-temporal video activity regions at extremely high speed in comparison to conventional full-frame decoding followed by optical flow estimation. In order to evaluate the accuracy of a video classification framework based on such activity data, we independently train two CNN architectures on MB texture and MV correspondences and then fuse their scores to derive the final classification of each test video. Evaluation on two standard datasets shows that the proposed approach is competitive to the best two-stream video classification approaches found in the literature. At the same time: (i) a CPU-based realization of our MV extraction is over 977 times faster than GPU-based optical flow methods; (ii) selective decoding is up to 12 times faster than full-frame decoding; (iii) our proposed spatial and temporal CNNs perform inference at 5 to 49 times lower cloud computing cost than the fastest methods from the literature.Comment: Accepted in IEEE Transactions on Circuits and Systems for Video Technology. Extension of ICIP 2017 conference pape

    Indoor Activity Detection and Recognition for Sport Games Analysis

    Full text link
    Activity recognition in sport is an attractive field for computer vision research. Game, player and team analysis are of great interest and research topics within this field emerge with the goal of automated analysis. The very specific underlying rules of sports can be used as prior knowledge for the recognition task and present a constrained environment for evaluation. This paper describes recognition of single player activities in sport with special emphasis on volleyball. Starting from a per-frame player-centered activity recognition, we incorporate geometry and contextual information via an activity context descriptor that collects information about all player's activities over a certain timespan relative to the investigated player. The benefit of this context information on single player activity recognition is evaluated on our new real-life dataset presenting a total amount of almost 36k annotated frames containing 7 activity classes within 6 videos of professional volleyball games. Our incorporation of the contextual information improves the average player-centered classification performance of 77.56% by up to 18.35% on specific classes, proving that spatio-temporal context is an important clue for activity recognition.Comment: Part of the OAGM 2014 proceedings (arXiv:1404.3538

    Going Deeper into Action Recognition: A Survey

    Full text link
    Understanding human actions in visual data is tied to advances in complementary research areas including object recognition, human dynamics, domain adaptation and semantic segmentation. Over the last decade, human action analysis evolved from earlier schemes that are often limited to controlled environments to nowadays advanced solutions that can learn from millions of videos and apply to almost all daily activities. Given the broad range of applications from video surveillance to human-computer interaction, scientific milestones in action recognition are achieved more rapidly, eventually leading to the demise of what used to be good in a short time. This motivated us to provide a comprehensive review of the notable steps taken towards recognizing human actions. To this end, we start our discussion with the pioneering methods that use handcrafted representations, and then, navigate into the realm of deep learning based approaches. We aim to remain objective throughout this survey, touching upon encouraging improvements as well as inevitable fallbacks, in the hope of raising fresh questions and motivating new research directions for the reader

    A Neural Network Approach to Key Frame Extraction

    Get PDF
    We present a neural network based approach to key frame extraction in the compressed domain. The proposed method is an amalgamation of both the MPEG-7 descriptors namely motion intensity descriptor and spatial activity descriptor. Shot boundary detection and block motion estimation techniques are employed prior to the extraction of the descriptors. The motion intensity (“pace of action”) is obtained using a fuzzy system that classifies the motion intensity into five categories proportional to the intensity. The spatial activity matrix determines the spatial distribution of activity (“active regions”) in a frame. A neural network is used to pick those frames as key frames which have high intensity and maximum spatial activity at the center of the frame. Results are compared against two well-known key frame extraction techniques to demonstrate the advantage and robustness of the proposed approach. Results show that the neural network approach performs much better than selecting first frame of the shot as a key frame and selecting middle frame of the shot as a key frame methods

    Action Recognition in Videos: from Motion Capture Labs to the Web

    Full text link
    This paper presents a survey of human action recognition approaches based on visual data recorded from a single video camera. We propose an organizing framework which puts in evidence the evolution of the area, with techniques moving from heavily constrained motion capture scenarios towards more challenging, realistic, "in the wild" videos. The proposed organization is based on the representation used as input for the recognition task, emphasizing the hypothesis assumed and thus, the constraints imposed on the type of video that each technique is able to address. Expliciting the hypothesis and constraints makes the framework particularly useful to select a method, given an application. Another advantage of the proposed organization is that it allows categorizing newest approaches seamlessly with traditional ones, while providing an insightful perspective of the evolution of the action recognition task up to now. That perspective is the basis for the discussion in the end of the paper, where we also present the main open issues in the area.Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4 table

    Rate-Accuracy Trade-Off In Video Classification With Deep Convolutional Neural Networks

    Get PDF
    Advanced video classification systems decode video frames to derive the necessary texture and motion representations for ingestion and analysis by spatio-temporal deep convolutional neural networks (CNNs). However, when considering visual Internet-of-Things applications, surveillance systems and semantic crawlers of large video repositories, the video capture and the CNN-based semantic analysis parts do not tend to be co-located. This necessitates the transport of compressed video over networks and incurs significant overhead in bandwidth and energy consumption, thereby significantly undermining the deployment potential of such systems. In this paper, we investigate the trade-off between the encoding bitrate and the achievable accuracy of CNN-based video classification models that directly ingest AVC/H.264 and HEVC encoded videos. Instead of retaining entire compressed video bitstreams and applying complex optical flow calculations prior to CNN processing, we only retain motion vector and select texture information at significantly-reduced bitrates and apply no additional processing prior to CNN ingestion. Based on three CNN architectures and two action recognition datasets, we achieve 11%-94% saving in bitrate with marginal effect on classification accuracy. A model-based selection between multiple CNNs increases these savings further, to the point where, if up to 7% loss of accuracy can be tolerated, video classification can take place with as little as 3 kbps for the transport of the required compressed video information to the system implementing the CNN models

    SAIVT-QUT@TRECVid 2012: Interactive surveillance event detection

    Get PDF
    In this paper, we propose an approach which attempts to solve the problem of surveillance event detection, assuming that we know the definition of the events. To facilitate the discussion, we first define two concepts. The event of interest refers to the event that the user requests the system to detect; and the background activities are any other events in the video corpus. This is an unsolved problem due to many factors as listed below: 1) Occlusions and clustering: The surveillance scenes which are of significant interest at locations such as airports, railway stations, shopping centers are often crowded, where occlusions and clustering of people are frequently encountered. This significantly affects the feature extraction step, and for instance, trajectories generated by object tracking algorithms are usually not robust under such a situation. 2) The requirement for real time detection: The system should process the video fast enough in both of the feature extraction and the detection step to facilitate real time operation. 3) Massive size of the training data set: Suppose there is an event that lasts for 1 minute in a video with a frame rate of 25fps, the number of frames for this events is 60X25 = 1500. If we want to have a training data set with many positive instances of the event, the video is likely to be very large in size (i.e. hundreds of thousands of frames or more). How to handle such a large data set is a problem frequently encountered in this application. 4) Difficulty in separating the event of interest from background activities: The events of interest often co-exist with a set of background activities. Temporal groundtruth typically very ambiguous, as it does not distinguish the event of interest from a wide range of co-existing background activities. However, it is not practical to annotate the locations of the events in large amounts of video data. This problem becomes more serious in the detection of multi-agent interactions, since the location of these events can often not be constrained to within a bounding box. 5) Challenges in determining the temporal boundaries of the events: An event can occur at any arbitrary time with an arbitrary duration. The temporal segmentation of events is difficult and ambiguous, and also affected by other factors such as occlusions
    • 

    corecore