
    Simple and Complex Human Action Recognition in Constrained and Unconstrained Videos

    Human action recognition plays a crucial role in visual learning applications such as video understanding and surveillance, video retrieval, human-computer interaction, and autonomous driving systems. A variety of methodologies have been proposed for human action recognition based on low-level features combined with bag-of-visual-words models. However, much less research has addressed the combination of the pre-processing, encoding, and classification stages. This dissertation focuses on enhancing action recognition performance via ensemble learning, a hybrid classifier, hierarchical feature representation, and key-action perception methodologies. Action variation is one of the crucial challenges in video analysis and action recognition. We address this problem by proposing a hybrid classifier (HC) to discriminate actions that share similar motion characteristics, such as walking, running, and jogging. In addition, we show and prove that the fusion of various appearance-based and motion features can boost both simple and complex action recognition performance. The next part of the dissertation introduces the pooled-feature representation (PFR), which is derived from a double-phase encoding framework (DPE). Considering that a given unconstrained video is composed of a sequence of simple frames, the first phase of DPE generates temporal sub-volumes from the video and represents each of them individually with the proposed improved rank pooling (IRP) method. The second phase constructs a pool of features by fusing the vectors produced in the first phase. The pool is compressed and then encoded to produce a video-parts vector (VPV). The DPE framework allows the video representation to be distilled and new information to be extracted hierarchically. Compared with recent video encoding approaches, VPV preserves higher-level information through standard encoding of low-level features in two phases. Furthermore, the encoded vectors from both phases of DPE are fused, together with a compression stage, to develop the PFR.
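The IRP method above builds on the rank pooling idea: a temporal sub-volume is summarised by the parameters of a linear function fitted to rank its frames in time. A minimal numpy sketch of plain rank pooling follows; the dissertation's improved variant is not specified in the abstract, so `rank_pool` and the least-squares relaxation here are illustrative assumptions, not the author's exact method:

```python
import numpy as np

def rank_pool(frames):
    """Plain rank pooling: summarise a sequence of frame feature
    vectors by the parameters w of a linear function whose response
    w @ v_t increases with time t (least-squares relaxation)."""
    T, _ = frames.shape
    # time-varying mean of the features, a common smoothing step
    # applied before fitting the ranker
    V = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]
    t = np.arange(1, T + 1, dtype=float)
    # fit V @ w ~ t; w encodes the temporal evolution of the sub-volume
    w, *_ = np.linalg.lstsq(V, t, rcond=None)
    return w

rng = np.random.default_rng(0)
sub_volume = rng.normal(size=(30, 16))   # 30 frames, 16-D features each
descriptor = rank_pool(sub_volume)
print(descriptor.shape)                  # (16,)
```

In a DPE-style pipeline, one such fixed-length descriptor per sub-volume would then be pooled and encoded in the second phase.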

    Compressed Video Action Recognition

    Training robust deep video representations has proven to be much more challenging than learning deep image representations. This is in part due to the enormous size of raw video streams and their high temporal redundancy; the true and interesting signal is often drowned in too much irrelevant data. Motivated by the fact that video compression (using H.264, HEVC, etc.) can reduce this superfluous information by up to two orders of magnitude, we propose to train a deep network directly on the compressed video. This representation has a higher information density, and we found the training to be easier. In addition, the signals in a compressed video provide free, albeit noisy, motion information, and we propose novel techniques to use them effectively. Our approach is about 4.6 times faster than Res3D and 2.7 times faster than ResNet-152. On the task of action recognition, our approach outperforms all the other methods on the UCF-101, HMDB-51, and Charades datasets. Comment: CVPR 2018 (selected for spotlight presentation).
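The "free" motion information comes from the motion vectors already stored in the bitstream. Because P-frames reference their predecessors, such vectors are typically traced back (accumulated) through the group of pictures so that each frame's vector points into the reference frame before being fed to a network. A rough numpy sketch of that accumulation; the dense per-pixel vector field and the `accumulate_mvs` helper are illustrative assumptions, and real codec motion vectors are block-based and sub-pixel:

```python
import numpy as np

def accumulate_mvs(mv_seq):
    """Trace per-frame motion vectors back through the group of
    pictures, so every frame's vector points into the reference
    (I-)frame rather than just the previous frame."""
    T, H, W, _ = mv_seq.shape
    acc = np.zeros_like(mv_seq)
    acc[0] = mv_seq[0]
    for t in range(1, T):
        for y in range(H):
            for x in range(W):
                dy, dx = mv_seq[t, y, x]
                # position this block was predicted from in frame t-1
                py = int(np.clip(y + dy, 0, H - 1))
                px = int(np.clip(x + dx, 0, W - 1))
                acc[t, y, x] = mv_seq[t, y, x] + acc[t - 1, py, px]
    return acc

mv = np.zeros((3, 4, 4, 2))
mv[..., 0] = 1.0              # every block moves one row per frame
acc = accumulate_mvs(mv)
print(acc[2, 0, 0])           # motion compounds across the three frames
```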

    Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor

    We investigate video classification via a two-stream convolutional neural network (CNN) design that directly ingests information extracted from compressed video bitstreams. Our approach begins with the observation that all modern video codecs divide the input frames into macroblocks (MBs). We demonstrate that selective access to MB motion vector (MV) information within compressed video bitstreams can also provide for selective, motion-adaptive, MB pixel decoding (a.k.a. MB texture decoding). This in turn allows for the derivation of spatio-temporal video activity regions at extremely high speed in comparison to conventional full-frame decoding followed by optical flow estimation. In order to evaluate the accuracy of a video classification framework based on such activity data, we independently train two CNN architectures on MB texture and MV correspondences and then fuse their scores to derive the final classification of each test video. Evaluation on two standard datasets shows that the proposed approach is competitive with the best two-stream video classification approaches found in the literature. At the same time: (i) a CPU-based realization of our MV extraction is over 977 times faster than GPU-based optical flow methods; (ii) selective decoding is up to 12 times faster than full-frame decoding; (iii) our proposed spatial and temporal CNNs perform inference at 5 to 49 times lower cloud computing cost than the fastest methods from the literature. Comment: Accepted in IEEE Transactions on Circuits and Systems for Video Technology. Extension of ICIP 2017 conference paper.
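Late fusion of the two independently trained streams can be as simple as averaging softmax-normalised class scores. A minimal sketch under that assumption; the function names and the equal fusion weight `w=0.5` are illustrative choices, and the paper's exact fusion scheme may differ:

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over the last axis."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_scores(texture_scores, mv_scores, w=0.5):
    """Late fusion: normalise each stream's class scores with a
    softmax, then take their weighted average."""
    return w * softmax(texture_scores) + (1 - w) * softmax(mv_scores)

tex = np.array([2.0, 0.5, 0.1])   # MB-texture (spatial) CNN logits
mot = np.array([0.3, 2.5, 0.2])   # motion-vector (temporal) CNN logits
probs = fuse_scores(tex, mot)
print(int(probs.argmax()))        # class chosen after fusion
```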

    Stereoscopic video quality assessment using binocular energy

    Stereoscopic imaging is becoming increasingly popular. However, to ensure the best quality of experience, there is a need to develop more robust and accurate objective metrics for stereoscopic content quality assessment. Existing stereoscopic image and video metrics are either extensions of conventional 2D metrics (with added depth or disparity information) or are based on relatively simple perceptual models. Consequently, they tend to lack the accuracy and robustness required for stereoscopic content quality assessment. This paper introduces full-reference stereoscopic image and video quality metrics based on a Human Visual System (HVS) model incorporating important physiological findings on binocular vision. The proposed approach is based on the following three contributions. First, it introduces a novel HVS model extending previous models to include the phenomena of binocular suppression and recurrent excitation. Second, an image quality metric based on the novel HVS model is proposed. Finally, an optimised temporal pooling strategy is introduced to extend the metric to the video domain. Both image and video quality metrics are obtained via a training procedure that establishes a relationship between subjective scores and objective measures of the HVS model. The metrics are evaluated using publicly available stereoscopic image/video databases as well as a new stereoscopic video database. An extensive experimental evaluation demonstrates the robustness of the proposed quality metrics, indicating a considerable improvement over the state of the art, with average correlations with subjective scores of 0.86 for the proposed stereoscopic image metric and 0.89 and 0.91 for the proposed stereoscopic video metrics.
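Temporal pooling turns per-frame scores into a single video-level score. The paper's optimised strategy is trained against subjective data; as a generic baseline, a Minkowski mean over per-frame distortion scores illustrates why pooling matters (the `minkowski_pool` name and the example scores are assumptions):

```python
import numpy as np

def minkowski_pool(frame_distortions, p=2.0):
    """Minkowski temporal pooling of per-frame distortion scores;
    p > 1 weights the worst (most distorted) frames more heavily
    than a plain average, while p = 1 recovers the mean."""
    d = np.asarray(frame_distortions, dtype=float)
    return float(np.mean(d ** p) ** (1.0 / p))

scores = [0.1, 0.1, 0.9, 0.1]                   # one badly distorted frame
print(round(minkowski_pool(scores, p=1.0), 3))  # plain mean: 0.3
print(round(minkowski_pool(scores, p=4.0), 3))  # pulled toward the worst frame
```

Because viewers tend to remember the worst moments of a video, a pooled score that over-weights them usually correlates better with subjective ratings than a flat average.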

    Video coding for compression and content-based functionality

    The lifetime of this research project has seen two dramatic developments in the area of digital video coding. The first has been the progress of compression research, leading to a factor-of-two improvement over existing standards, much wider deployment possibilities, and the development of the new international ITU-T Recommendation H.263. The second has been a radical change in the approach to video content production, with the introduction of the content-based coding concept and the addition of scene composition information to the encoded bit-stream. Content-based coding is central to the latest international standards efforts from the ISO/IEC MPEG working group. This thesis reports on extensions to existing compression techniques exploiting a priori knowledge about scene content. Existing, standardised, block-based compression coding techniques were extended with work on arithmetic entropy coding and intra-block prediction; these now form part of the H.263 and MPEG-4 specifications, respectively. Object-based coding techniques were developed within a collaborative simulation model, known as SIMOC, then extended with ideas on grid motion vector modelling and vector accuracy confidence estimation. An improved confidence measure for encouraging motion smoothness is proposed. Object-based coding ideas, together with those from other model- and layer-based coding approaches, influenced the development of content-based coding within MPEG-4. This standard made considerable progress in the newly adopted content-based video coding field, defining normative techniques for arbitrary shape and texture coding. The means to generate this information for the content to be coded, the analysis problem, was intentionally not specified. Further research work in this area concentrated on video segmentation and analysis techniques to exploit the benefits of content-based coding for generic frame-based video. The work reported here introduces the use of a clustering algorithm on raw data features to provide an initial segmentation of video data and subsequent tracking of those image regions through video sequences. Collaborative video analysis frameworks from COST 211quat and MPEG-4, combining results from many other segmentation schemes, are also introduced.
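The clustering-based initial segmentation mentioned above can be sketched with a minimal k-means over per-pixel feature vectors (e.g. colour plus position); this is a generic illustration under those assumptions, not the thesis's exact algorithm:

```python
import numpy as np

def kmeans_segment(features, k=3, iters=10):
    """Minimal k-means over per-pixel feature vectors, producing an
    initial region labelling that later stages could refine and track."""
    features = np.asarray(features, dtype=float)
    # deterministic spread of initial centres along the sample axis
    centres = features[np.linspace(0, len(features) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assign each pixel to its nearest centre, then refit centres
        d = np.linalg.norm(features[:, None] - centres[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = features[labels == j].mean(axis=0)
    return labels

pixels = np.vstack([np.zeros((5, 2)), 10 * np.ones((5, 2))])  # two flat regions
labels = kmeans_segment(pixels, k=2)
print(labels)   # the two regions receive distinct labels
```

Region labels produced this way per frame can then be matched across frames to track the segmented regions through a sequence.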

    Novelty detection and context dependent processing of sky-compass cues in the brain of the desert locust Schistocerca gregaria

    NERVOUS SYSTEMS facilitate purposeful interactions between animals and their environment, based on perceptual powers, cognition and higher motor control. Through goal-directed behavior, the animal aims to increase its advantage and minimize risk. For instance, the migratory desert locust should profit from being fast in finding a fresh habitat, thus minimizing the investment of bodily resources in locomotion as well as the risk of starvation or capture by a predator en route. Efficient solutions to this and similar tasks – be it finding your way to work, the daily foraging of worker bees or the seasonal long-range migration of monarch butterflies – strongly depend on spatial orientation in local or global frames of reference. Local settings may include visual landmarks at stable positions that can be mapped onto egocentric space and learned for orientation, e.g. to remember a short route to a source of benefit (e.g. food) that is distant or visually less salient than the landmarks. Compass signals can mediate orientation in a global frame of reference (allothetic orientation), e.g. for locomotion in a particular compass direction or merely to ensure motion along a straight line. Whilst spatial orientation is a prerequisite for carrying out the plan in such tasks, animal survival in general depends on the ability to respond adequately to the unexpected, i.e. to unpredicted events such as the approach of a predator or mate. The process of identifying relevant events in the outside world that are not predictable from preceding events is termed novelty detection. Yet the definition of 'novelty' is highly contextual: depending on the current situation and goal, some changes may be irrelevant and remain 'undetected'. The present thesis describes neuronal representations of a compass stimulus, correlates of novelty detection, and interactions between the two in the minute brain of an insect, the migratory desert locust Schistocerca gregaria.
    Experiments were carried out on tethered locusts with legs and wings removed; more precisely, the subjects were adult males in the gregarious phase (see phase theory, Uvarov 1966), which migrates in swarms across territories in North Africa and the Middle East. The author performed electrophysiological recordings from single neurons in the locust brain while either the compass stimulus (Chapter I), events in the visual scenery (Chapter II), or combinations of both (Chapter III) were presented to the animal. Injections of a tracer through the recording electrode, visualized by means of fluorescent-dye coupling, allowed the allocation of cellular morphologies to previously described types of neuron or the characterization of novel cell types, respectively. Recordings focused on cells of the central complex, a higher integration area in the insect brain that has been shown to be involved in the visually mediated control of goal-directed locomotion. The experiments delivered insights into how representations of the compass cue are modulated in a manner suited to their integration in the control of goal-directed locomotion. In particular, an interaction between compass signaling and novelty detection was found, corresponding to a process in which input in one sensory domain (object vision) modulates the processing of concurrent input to a different exteroceptive sensory system (the compass sense). In addition to deepening the understanding of the compass network in the locust brain, the results reveal fundamental parallels to higher context-dependent processing of sensory information by the vertebrate cortex, both with respect to spatial cues and novelty detection.

    No-reference image and video quality assessment: a classification and review of recent approaches

