Appearance-and-Relation Networks for Video Classification
Spatiotemporal feature learning in videos is a fundamental problem in
computer vision. This paper presents a new architecture, termed the
Appearance-and-Relation Network (ARTNet), to learn video representation in an
end-to-end manner. ARTNets are constructed by stacking multiple generic
building blocks, called SMART, whose goal is to simultaneously model
appearance and relation from RGB input in a separate and explicit manner.
Specifically, SMART blocks decouple the spatiotemporal learning module into an
appearance branch for spatial modeling and a relation branch for temporal
modeling. The appearance branch is implemented based on the linear combination
of pixels or filter responses in each frame, while the relation branch is
designed based on the multiplicative interactions between pixels or filter
responses across multiple frames. We perform experiments on three action
recognition benchmarks: Kinetics, UCF101, and HMDB51, demonstrating that SMART
blocks obtain an evident improvement over 3D convolutions for spatiotemporal
feature learning. Under the same training setting, ARTNets achieve superior
performance on these three datasets to the existing state-of-the-art methods.
Comment: CVPR18 camera-ready version. Code & models available at
https://github.com/wanglimin/ARTNe
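The contrast between the two branches can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy weights, and the choice of squaring a summed response are assumptions made for illustration. Squaring is used because expanding (a + b)^2 produces the cross term 2ab, which is one simple way to obtain multiplicative interactions across frames.

```python
def appearance_response(frame, weights):
    """Linear combination of the values in a single frame (spatial modeling)."""
    return sum(w * p for w, p in zip(weights, frame))


def relation_response(frames, weights_per_frame):
    """Square of the summed per-frame linear responses.

    Assumption for illustration: expanding the square yields products of
    responses from different frames, i.e. multiplicative interactions.
    """
    total = sum(
        appearance_response(frame, weights)
        for frame, weights in zip(frames, weights_per_frame)
    )
    return total * total


# Toy two-pixel "frames" and hypothetical filter weights.
frame_t = [1.0, 2.0]
frame_t1 = [3.0, 4.0]
w_t = [0.5, -1.0]
w_t1 = [1.0, 0.25]

appearance = appearance_response(frame_t, w_t)
relation = relation_response([frame_t, frame_t1], [w_t, w_t1])
print(appearance, relation)
```

The appearance value depends on one frame only, while the relation value changes whenever either frame changes, which is the separation the abstract describes.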
Learning Deep Representations of Appearance and Motion for Anomalous Event Detection
We present a novel unsupervised deep learning framework for anomalous event
detection in complex video scenes. While most existing works merely use
hand-crafted appearance and motion features, we propose Appearance and Motion
DeepNet (AMDN) which utilizes deep neural networks to automatically learn
feature representations. To exploit the complementary information of both
appearance and motion patterns, we introduce a novel double fusion framework,
combining both the benefits of traditional early fusion and late fusion
strategies. Specifically, stacked denoising autoencoders are proposed to
separately learn both appearance and motion features as well as a joint
representation (early fusion). Based on the learned representations, multiple
one-class SVM models are used to predict the anomaly scores of each input,
which are then integrated with a late fusion strategy for final anomaly
detection. We evaluate the proposed method on two publicly available video
surveillance datasets, showing competitive performance with respect to
state-of-the-art approaches.
Comment: Oral paper in BMVC 201
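The late fusion step can be sketched as follows. The abstract does not state the fusion rule, so the weighted average, the weights, and the threshold below are assumptions chosen for illustration, and the per-model scores are made-up numbers standing in for one-class SVM outputs.

```python
def late_fusion(scores, weights):
    """Combine per-model anomaly scores with a weighted average.

    Assumption: a weighted average is used here only as a simple example
    of late fusion; the abstract does not specify the rule.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


def is_anomalous(scores, weights, threshold):
    """Flag an input when its fused score exceeds a hypothetical threshold."""
    return late_fusion(scores, weights) > threshold


# Hypothetical scores from the appearance, motion, and joint models.
scores = [0.2, 0.8, 0.5]
weights = [1.0, 1.0, 2.0]
print(late_fusion(scores, weights), is_anomalous(scores, weights, threshold=0.4))
```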
On the role of injection in kinetic approaches to nonlinear particle acceleration at non-relativistic shock waves
The dynamical reaction of the particles accelerated at a shock front by the
first order Fermi process can be determined within kinetic models that account
for both the hydrodynamics of the shocked fluid and the transport of the
accelerated particles. These models predict the appearance of multiple
solutions, all physically allowed. We discuss here the role of injection in
selecting the real solution, in the framework of a simple phenomenological
recipe, which is a variation of what is sometimes referred to as thermal
leakage. In this context we show that multiple solutions basically disappear
and when they are present they are limited to rather peculiar values of the
parameters. We also provide a quantitative calculation of the efficiency of
particle acceleration at cosmic ray modified shocks and we identify the
fraction of energy which is advected downstream and that of particles escaping
the system from upstream infinity at the maximum momentum. The consequences of
efficient particle acceleration for shock heating are also discussed.
Query generation from multiple media examples
This paper exploits a unified media document representation called feature terms for query generation from multiple media examples, e.g. images. A feature term refers to a value interval of a media feature. A media document is therefore represented by a frequency vector of feature term occurrences. This approach (1) facilitates feature accumulation from multiple examples; (2) enables the exploration of text-based retrieval models for multimedia retrieval. Three statistical criteria, minimised chi-squared, minimised AC/DC rate and maximised entropy, are proposed to extract feature terms from a given media document collection. Two textual ranking functions, KL divergence and a BM25-like retrieval model, are adapted to estimate media document relevance. Experiments on the Corel photo collection and the TRECVid 2006 collection show the effectiveness of feature-term-based queries in image and video retrieval.
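A minimal sketch of the feature term idea, under stated assumptions: the interval boundaries below are hand-picked rather than derived from the statistical criteria named in the abstract, and the feature names, the `name#index` term format, and the helper functions are hypothetical.

```python
from collections import Counter


def to_feature_terms(features, bin_edges):
    """Map each feature value to a term naming the interval that contains it."""
    terms = []
    for name, value in features.items():
        # The interval index is the number of boundaries at or below the value.
        index = sum(1 for edge in bin_edges[name] if value >= edge)
        terms.append(f"{name}#{index}")
    return terms


def build_query(examples, bin_edges):
    """Accumulate feature term frequencies over several media examples."""
    query = Counter()
    for example in examples:
        query.update(to_feature_terms(example, bin_edges))
    return query


# Hypothetical interval boundaries and two example images.
bin_edges = {"hue": [0.25, 0.5, 0.75], "brightness": [0.5]}
examples = [
    {"hue": 0.1, "brightness": 0.9},
    {"hue": 0.3, "brightness": 0.7},
]
print(build_query(examples, bin_edges))
```

Because the query is a term-frequency vector, it can be passed to a text-style ranking function in the same way as a bag of words.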
Adaptive tracking via multiple appearance models and multiple linear searches
We introduce a unified tracker (FMCMC-MM) which adapts to changes in target appearance by combining two popular generative models, templates and histograms, maintaining multiple instances of each in an appearance pool, and enhances prediction by utilising multiple linear searches. These search directions are sparse estimates of motion direction derived from local features stored in a feature pool. Given only an initial template representation of the target, the proposed tracker can learn appearance changes in a supervised manner and generate appropriate target motions without knowing the target movement in advance. During tracking, it automatically switches between models in response to variations in target appearance, exploiting the strengths of each model component. New models are added automatically as necessary. The effectiveness of the approach is demonstrated using a variety of challenging video sequences. Results show that this framework outperforms existing appearance-based tracking frameworks.