    Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events

    As a vital topic in media content interpretation, video anomaly detection (VAD) has made fruitful progress via deep neural networks (DNNs). However, existing methods usually follow a reconstruction or frame prediction routine. They suffer from two gaps: (1) They cannot localize video activities in a manner that is both precise and comprehensive. (2) They lack sufficient ability to utilize high-level semantics and temporal context information. Inspired by the frequently used cloze test in language study, we propose a brand-new VAD solution named Video Event Completion (VEC) to bridge the gaps above: First, we propose a novel pipeline to achieve both precise and comprehensive enclosure of video activities. Appearance and motion are exploited as mutually complementary cues to localize regions of interest (RoIs). A normalized spatio-temporal cube (STC) is built from each RoI as a video event, which lays the foundation of VEC and serves as a basic processing unit. Second, we encourage the DNN to capture high-level semantics by solving a visual cloze test. To build such a visual cloze test, a certain patch of the STC is erased to yield an incomplete event (IE). The DNN learns to restore the original video event from the IE by inferring the missing patch. Third, to incorporate richer motion dynamics, another DNN is trained to infer the erased patch's optical flow. Finally, two ensemble strategies using different types of IE and modalities are proposed to boost VAD performance, so as to fully exploit the temporal context and modality information for VAD. VEC can consistently outperform state-of-the-art methods by a notable margin (typically 1.5%-5% AUROC) on commonly used VAD benchmarks. Our code and results can be verified at github.com/yuguangnudt/VEC_VAD. Comment: To be published as an oral paper in Proceedings of the 28th ACM International Conference on Multimedia (ACM MM '20). 9 pages, 7 figures
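The step of erasing one patch from a spatio-temporal cube to obtain an incomplete event can be sketched as follows. This is a minimal illustration only, not the authors' implementation (which is at the repository named in the abstract): the function name `make_incomplete_event`, the `(T, H, W)` array layout, and the sizes used are all assumptions.

```python
import numpy as np

def make_incomplete_event(stc, t):
    """Erase the patch at temporal index t from a spatio-temporal cube.

    stc: array of shape (T, H, W), i.e. T patches from consecutive frames
         (an assumed layout for this sketch).
    Returns (incomplete_event, erased_patch).
    """
    stc = np.asarray(stc, dtype=np.float32)
    erased_patch = stc[t].copy()                  # what a network would have to restore
    incomplete_event = np.delete(stc, t, axis=0)  # the remaining T-1 patches
    return incomplete_event, erased_patch

rng = np.random.default_rng(0)
stc = rng.random((5, 32, 32))  # hypothetical sizes: 5 patches of 32x32 pixels
ie, target = make_incomplete_event(stc, t=2)
print(ie.shape, target.shape)  # (4, 32, 32) (32, 32)
```

In this sketch a completion network would take `ie` as input and be trained to predict `target`; varying `t` gives the different types of incomplete event that the abstract mentions for ensembling.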

    Streaming and User Behaviour in Omnidirectional Videos

    Omnidirectional videos (ODVs) have gone beyond the passive paradigm of traditional video, offering higher degrees of immersion and interaction. The revolutionary novelty of this technology is the possibility for users to interact with the surrounding environment, and to feel a sense of engagement and presence in a virtual space. Users are clearly the main driving force of immersive applications and consequently the services need to be properly tailored to them. In this context, this chapter highlights the importance of the new role of users in ODV streaming applications, and thus the need for understanding their behaviour while navigating within ODVs. A comprehensive overview of the research efforts aimed at advancing ODV streaming systems is also presented. In particular, the state-of-the-art solutions under examination in this chapter are distinguished in terms of system-centric and user-centric streaming approaches: the former is a fairly straightforward extension of well-established solutions for the 2D video pipeline, while the latter benefits from understanding users' behaviour and enables more personalised ODV streaming

    Probabilistic Models of Motor Production

    N. Bernstein defined the ability of the central nervous system (CNS) to control many degrees of freedom of a physical body with all its redundancy and flexibility as the main problem in motor control. He pointed out that man-made mechanisms usually have one, sometimes two degrees of freedom (DOF); when the number of DOF increases further, it becomes prohibitively hard to control them. The brain, however, seems to perform such control effortlessly. He suggested a way the brain might deal with it: when a motor skill is being acquired, the brain artificially limits the degrees of freedom, leaving only one or two. As the skill level increases, the brain gradually "frees" the previously fixed DOF, applying control when needed and in directions which have to be corrected, eventually arriving at a control scheme where all the DOF are "free". This approach of reducing the dimensionality of motor control remains relevant even today. One of the possible solutions to Bernstein's problem is the hypothesis of motor primitives (MPs): small building blocks that constitute complex movements and facilitate motor learning and task completion. Just like in the visual system, having a homogeneous hierarchical architecture built of similar computational elements may be beneficial. When studying an object as complicated as the brain, it is important to define at which level of detail one works and which questions one aims to answer. David Marr suggested three levels of analysis: 1. computational, analysing which problem the system solves; 2. algorithmic, questioning which representation the system uses and which computations it performs; 3. implementational, finding how such computations are performed by neurons in the brain. In this thesis we stay at the first two levels, seeking the basic representation of motor output. In this work we present a new model of motor primitives that comprises multiple interacting latent dynamical systems, and give it a full Bayesian treatment. 
Modelling within the Bayesian framework, in my opinion, must become the new standard in hypothesis testing in neuroscience. Only the Bayesian framework gives us guarantees when dealing with the inevitable plethora of hidden variables and uncertainty. The special type of coupling of dynamical systems we proposed, based on the Product of Experts, has many natural interpretations in the Bayesian framework. If the dynamical systems run in parallel, it yields Bayesian cue integration. If they are organized hierarchically due to serial coupling, we get hierarchical priors over the dynamics. If one of the dynamical systems represents sensory state, we arrive at sensory-motor primitives. The compact representation that follows from the variational treatment allows learning of a motor primitives library. Once the primitives are learned separately, a combined motion can be represented as a matrix of coupling values. We performed a set of experiments to compare different models of motor primitives. In a series of 2-alternative forced choice (2AFC) experiments, participants discriminated between natural and synthesised movements, thus running a graphics Turing test. When available, the Bayesian model score predicted the naturalness of the perceived movements. For simple movements, like walking, Bayesian model comparison and psychophysics tests indicate that one dynamical system is sufficient to describe the data. For more complex movements, like walking and waving, motion can be better represented as a set of coupled dynamical systems. We also experimentally confirmed that Bayesian treatment of model learning on motion data is superior to the simple point estimate of latent parameters. Experiments with non-periodic movements show that they do not benefit from more complex latent dynamics, despite having high kinematic complexity. By having fully Bayesian models, we could quantitatively disentangle the influence of motion dynamics and pose on the perception of naturalness. 
We confirmed that rich and correct dynamics is more important than the kinematic representation. There are numerous further directions of research. In the models we devised for multiple parts, even though the latent dynamics was factorized into a set of interacting systems, the kinematic parts were completely independent. Thus, interaction between the kinematic parts could be mediated only by the latent dynamics interactions. A more flexible model would allow a dense interaction on the kinematic level too. Another important problem relates to the representation of time in Markov chains. Discrete-time Markov chains form an approximation to continuous dynamics. As the time step is assumed to be fixed, we face the problem of time step selection. Time is also not an explicit parameter in Markov chains. This prohibits explicit optimization of time as a parameter and reasoning (inference) about it. For example, in optimal control boundary conditions are usually set at exact time points, which is not an ecological scenario, where time is usually a parameter of optimization. Making time an explicit parameter in dynamics may alleviate this
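The link between a Product of Experts and Bayesian cue integration can be made concrete in the simplest case. The sketch below multiplies one-dimensional Gaussian experts, which gives a Gaussian whose precision is the sum of the experts' precisions and whose mean is the precision-weighted average. This is a textbook identity used for illustration only; it is not the coupled latent dynamical system model of the thesis, and the function name is an assumption.

```python
def product_of_gaussian_experts(means, variances):
    """Combine 1-D Gaussian experts N(mean_i, variance_i) by multiplication.

    The normalised product is Gaussian, with precision equal to the sum of
    the experts' precisions and mean equal to the precision-weighted mean.
    Returns (combined_mean, combined_variance).
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return mean, 1.0 / total_precision

# Two equally reliable cues: the estimate lies halfway and is more certain.
mean, var = product_of_gaussian_experts([0.0, 2.0], [1.0, 1.0])
print(mean, var)  # 1.0 0.5
```

The combined variance is never larger than the smallest expert variance, which is why integrating parallel cues in this way sharpens the estimate.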

    Simple and Complex Human Action Recognition in Constrained and Unconstrained Videos

    Human action recognition plays a crucial role in visual learning applications such as video understanding and surveillance, video retrieval, human-computer interaction, and autonomous driving systems. A variety of methodologies have been proposed for human action recognition by developing low-level features together with bag-of-visual-words models. However, much less research has been performed on the combination of pre-processing, encoding and classification stages. This dissertation focuses on enhancing action recognition performance via ensemble learning, hybrid classifier, hierarchical feature representation, and key action perception methodologies. Action variation is one of the crucial challenges in video analysis and action recognition. We address this problem by proposing the hybrid classifier (HC) to discriminate actions which contain similar forms of motion features such as walking, running, and jogging. Aside from that, we show and prove that the fusion of various appearance-based and motion features can boost simple and complex action recognition performance. The next part of the dissertation introduces pooled-feature representation (PFR), which is derived from a double phase encoding framework (DPE). Considering that a given unconstrained video is composed of a sequence of simple frames, the first phase of DPE generates temporal sub-volumes from the video and represents them individually by employing the proposed improved rank pooling (IRP) method. The second phase constructs the pool of features by fusing the represented vectors from the first phase. The pool is compressed and then encoded to provide the video-parts vector (VPV). The DPE framework allows distilling the video representation and hierarchically extracting new information. Compared with recent video encoding approaches, VPV can preserve the higher-level information through standard encoding of low-level features in two phases. 
Furthermore, the encoded vectors from both phases of DPE are fused along with a compression stage to develop PFR

    Feature Extraction and Recognition for Human Action Recognition

    Human action recognition is the task of automatically labelling videos that contain human motion. Traditional human action recognition algorithms use RGB videos as input, and it is a challenging task because of the large intra-class variations of actions, cluttered background, possible camera movement, and illumination variations. Recently, the introduction of cost-effective depth cameras has provided a new possibility to address difficult issues. However, it also brings new challenges such as noisy depth maps and time alignment. In this dissertation, effective and computationally efficient feature extraction and recognition algorithms are proposed for human action recognition. At the feature extraction step, two novel spatial-temporal feature descriptors are proposed which can be combined with local feature detectors. The first proposed descriptor is the Shape and Motion Local Ternary Pattern (SMltp) descriptor, which can dramatically reduce the number of features generated by dense sampling without sacrificing accuracy. In addition, the Center-Symmetric Motion Local Ternary Pattern (CS-Mltp) descriptor is proposed, which describes spatial and temporal gradient-like features. Both descriptors (SMltp and CS-Mltp) take advantage of the Local Binary Pattern (LBP) texture operator in terms of tolerance to illumination change, robustness in homogeneous regions and computational efficiency. For better feature representation, this dissertation presents a new Dictionary Learning (DL) method to learn an overcomplete set of representative vectors (atoms) so that any input feature can be approximated by a linear combination of these atoms with minimum reconstruction error. Instead of simultaneously learning one overcomplete dictionary for all classes, we learn class-specific sub-dictionaries to increase the discrimination. 
In addition, the group sparsity and the geometry constraint are added to the learning process to further increase the discriminative power, so that features are well reconstructed by atoms from the same class and features from the same class with high similarity will be forced to have similar coefficients. To evaluate the proposed algorithms, three applications including single view action recognition, distributed multi-view action recognition, and RGB-D action recognition have been explored. Experimental results on benchmark datasets and comparative analyses with the state-of-the-art methods show the effectiveness and merits of the proposed algorithms
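For readers unfamiliar with the operators this abstract builds on, the sketch below computes the basic Local Binary Pattern code and the basic Local Ternary Pattern values for the centre pixel of a 3x3 patch. It shows only the generic textbook operators, not the SMltp or CS-Mltp descriptors proposed in the dissertation; the function names, the neighbour ordering, and the threshold value are assumptions made for the example.

```python
import numpy as np

def _neighbours(patch):
    """The 8 neighbours of the centre, clockwise from the top-left corner."""
    p = np.asarray(patch)
    return [p[0, 0], p[0, 1], p[0, 2], p[1, 2],
            p[2, 2], p[2, 1], p[2, 0], p[1, 0]]

def lbp_code(patch):
    """Basic LBP: one bit per neighbour, set when neighbour >= centre."""
    center = np.asarray(patch)[1, 1]
    bits = [1 if n >= center else 0 for n in _neighbours(patch)]
    return sum(bit << i for i, bit in enumerate(bits))

def ltp_values(patch, t):
    """Basic LTP: +1 if neighbour >= centre + t, -1 if <= centre - t, else 0."""
    center = np.asarray(patch)[1, 1]
    return [1 if n >= center + t else -1 if n <= center - t else 0
            for n in _neighbours(patch)]

patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 6, 8]]
print(lbp_code(patch))         # 58
print(ltp_values(patch, t=2))  # [0, 1, -1, 0, 1, 0, -1, -1]
```

Adding the same constant to every pixel leaves both outputs unchanged, which is the tolerance to illumination change that the abstract refers to. The ternary dead zone of width `t` additionally suppresses small fluctuations in near-uniform regions.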