
    Appearance-and-Relation Networks for Video Classification

    Spatiotemporal feature learning in videos is a fundamental problem in computer vision. This paper presents a new architecture, termed Appearance-and-Relation Network (ARTNet), to learn video representations in an end-to-end manner. ARTNets are constructed by stacking multiple generic building blocks, called SMART blocks, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner. Specifically, SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling. The appearance branch is implemented as a linear combination of pixels or filter responses in each frame, while the relation branch is based on multiplicative interactions between pixels or filter responses across multiple frames. We perform experiments on three action recognition benchmarks (Kinetics, UCF101, and HMDB51), demonstrating that SMART blocks obtain an evident improvement over 3D convolutions for spatiotemporal feature learning. Under the same training setting, ARTNets outperform the existing state-of-the-art methods on all three datasets. Comment: CVPR18 camera-ready version. Code & models available at https://github.com/wanglimin/ARTNe
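
    The contrast between the two branches can be illustrated with a toy sketch. It assumes the simplest possible reading of the abstract: the appearance response is a weighted sum within one frame, and the relation response squares a weighted sum taken across frames, so that expanding the square produces cross-frame products. The function names, weights, and the choice of squaring are illustrative assumptions, not the ARTNet implementation.

```python
def appearance_response(frame, weights):
    """Linear combination of the values within a single frame."""
    return sum(w * x for w, x in zip(weights, frame))


def relation_response(frames, weights_per_frame):
    """Square of a linear combination taken across several frames.

    Expanding the square yields products between values from different
    frames, which is one simple form of multiplicative interaction
    (an assumption made for this sketch).
    """
    total = sum(
        appearance_response(frame, weights)
        for frame, weights in zip(frames, weights_per_frame)
    )
    return total * total


# Two tiny "frames" with two pixel values each (hypothetical data).
frame_t0 = [1.0, 2.0]
frame_t1 = [3.0, 4.0]
w0 = [0.5, 0.5]
w1 = [1.0, -1.0]

appearance = appearance_response(frame_t0, w0)                 # 1.5
relation = relation_response([frame_t0, frame_t1], [w0, w1])   # (1.5 - 1.0) ** 2 = 0.25
```

    The appearance value depends on one frame only, whereas the relation value changes whenever either frame changes, because the squared sum contains terms that multiply values from both frames.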

    Learning Deep Representations of Appearance and Motion for Anomalous Event Detection

    We present a novel unsupervised deep learning framework for anomalous event detection in complex video scenes. While most existing works merely use hand-crafted appearance and motion features, we propose Appearance and Motion DeepNet (AMDN), which utilizes deep neural networks to automatically learn feature representations. To exploit the complementary information of both appearance and motion patterns, we introduce a novel double fusion framework that combines the benefits of traditional early fusion and late fusion strategies. Specifically, stacked denoising autoencoders are proposed to separately learn both appearance and motion features as well as a joint representation (early fusion). Based on the learned representations, multiple one-class SVM models are used to predict the anomaly scores of each input, which are then integrated with a late fusion strategy for final anomaly detection. We evaluate the proposed method on two publicly available video surveillance datasets, showing competitive performance with respect to state-of-the-art approaches. Comment: Oral paper in BMVC 201
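
    The late fusion step described above can be sketched as a weighted combination of per-model anomaly scores. This is a minimal sketch under stated assumptions: the three scores stand in for the outputs of one-class SVMs trained on appearance, motion, and joint representations, and the weights and threshold are hypothetical values chosen for illustration rather than anything reported in the paper.

```python
def late_fusion(scores, weights):
    """Weighted sum of per-model anomaly scores (weights are hypothetical)."""
    if len(scores) != len(weights):
        raise ValueError("one weight per score is required")
    return sum(w * s for w, s in zip(weights, scores))


# Placeholder scores for the appearance, motion, and joint models.
scores = [0.5, 1.0, 0.25]
weights = [0.25, 0.25, 0.5]
threshold = 0.4  # assumed decision threshold

fused = late_fusion(scores, weights)   # 0.125 + 0.25 + 0.125 = 0.5
is_anomalous = fused > threshold
```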

    On the role of injection in kinetic approaches to nonlinear particle acceleration at non-relativistic shock waves

    The dynamical reaction of the particles accelerated at a shock front by the first order Fermi process can be determined within kinetic models that account for both the hydrodynamics of the shocked fluid and the transport of the accelerated particles. These models predict the appearance of multiple solutions, all physically allowed. We discuss here the role of injection in selecting the real solution, in the framework of a simple phenomenological recipe, which is a variation of what is sometimes referred to as thermal leakage. In this context we show that multiple solutions basically disappear and, when they are present, they are limited to rather peculiar values of the parameters. We also provide a quantitative calculation of the efficiency of particle acceleration at cosmic ray modified shocks, and we identify the fraction of energy which is advected downstream and that of particles escaping the system from upstream infinity at the maximum momentum. The consequences of efficient particle acceleration for shock heating are also discussed.

    Query generation from multiple media examples

    This paper exploits a unified media document representation called feature terms for query generation from multiple media examples, e.g. images. A feature term refers to a value interval of a media feature. A media document is therefore represented by a frequency vector of feature term occurrences. This approach (1) facilitates feature accumulation from multiple examples; (2) enables the exploration of text-based retrieval models for multimedia retrieval. Three statistical criteria, minimised chi-squared, minimised AC/DC rate and maximised entropy, are proposed to extract feature terms from a given media document collection. Two textual ranking functions, KL divergence and a BM25-like retrieval model, are adapted to estimate media document relevance. Experiments on the Corel photo collection and the TRECVid 2006 collection show the effectiveness of feature term based queries in image and video retrieval.
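
    The feature term representation can be sketched as follows. The sketch assumes fixed, hand-picked interval boundaries for a single scalar feature; the paper instead selects intervals with statistical criteria, which are not reproduced here. All names and values are hypothetical.

```python
from bisect import bisect_right


def to_feature_terms(values, boundaries):
    """Map each feature value to the index of the interval it falls in."""
    return [bisect_right(boundaries, v) for v in values]


def term_frequencies(values, boundaries):
    """Frequency vector counting how often each feature term occurs."""
    counts = [0] * (len(boundaries) + 1)
    for term in to_feature_terms(values, boundaries):
        counts[term] += 1
    return counts


def accumulate(examples, boundaries):
    """Build one query vector by summing the vectors of several examples."""
    total = [0] * (len(boundaries) + 1)
    for values in examples:
        for i, count in enumerate(term_frequencies(values, boundaries)):
            total[i] += count
    return total


# Three assumed boundaries give four intervals, hence four feature terms.
boundaries = [0.25, 0.5, 0.75]
example_a = [0.1, 0.3, 0.9]   # feature values from one example image
example_b = [0.6, 0.8]        # feature values from another example

query = accumulate([example_a, example_b], boundaries)
```

    Because every document becomes a term-frequency vector, the accumulated query can be scored against documents with text retrieval functions such as those named in the abstract.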

    Adaptive tracking via multiple appearance models and multiple linear searches

    We introduce a unified tracker (FMCMC-MM) which adapts to changes in target appearance by combining two popular generative models, templates and histograms, and maintaining multiple instances of each in an appearance pool. It enhances prediction by utilising multiple linear searches. These search directions are sparse estimates of motion direction derived from local features stored in a feature pool. Given only an initial template representation of the target, the proposed tracker can learn appearance changes in a supervised manner and generate appropriate target motions without knowing the target movement in advance. During tracking, it automatically switches between models in response to variations in target appearance, exploiting the strengths of each model component. New models are added automatically as necessary. The effectiveness of the approach is demonstrated using a variety of challenging video sequences. Results show that this framework outperforms existing appearance-based tracking frameworks.