Two-Stream Convolutional Networks for Action Recognition in Videos
We investigate architectures of discriminatively trained deep Convolutional
Networks (ConvNets) for action recognition in video. The challenge is to
capture the complementary information on appearance from still frames and
motion between frames. We also aim to generalise the best performing
hand-crafted features within a data-driven learning framework.
Our contribution is three-fold. First, we propose a two-stream ConvNet
architecture which incorporates spatial and temporal networks. Second, we
demonstrate that a ConvNet trained on multi-frame dense optical flow is able to
achieve very good performance in spite of limited training data. Finally, we
show that multi-task learning, applied to two different action classification
datasets, can be used to increase the amount of training data and improve the
performance on both.
Our architecture is trained and evaluated on the standard video actions
benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of
the art. It also exceeds by a large margin previous attempts to use deep nets
for video classification.
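To make the architecture concrete, here is a minimal PyTorch sketch of the two-stream idea. The tiny backbone, channel counts, and softmax-score averaging are illustrative stand-ins, not the paper's exact ConvNet configuration; only the overall structure (an RGB spatial stream plus a stacked-optical-flow temporal stream, fused late) follows the abstract.

```python
import torch
import torch.nn as nn

def small_convnet(in_ch: int, num_classes: int) -> nn.Sequential:
    # Toy stand-in for each stream's ConvNet backbone.
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, kernel_size=7, stride=2), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, num_classes),
    )

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes: int = 101, flow_stack: int = 10):
        super().__init__()
        self.spatial = small_convnet(3, num_classes)                # one RGB frame
        self.temporal = small_convnet(2 * flow_stack, num_classes)  # stacked x/y flow fields

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # Late fusion by averaging the per-stream class posteriors.
        return (self.spatial(rgb).softmax(-1) + self.temporal(flow).softmax(-1)) / 2

net = TwoStreamNet()
scores = net(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))  # (2, 101)
```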
Cortical spatio-temporal dimensionality reduction for visual grouping
The visual system of many mammals, including humans, is able to integrate
the geometric information of visual stimuli and to perform cognitive tasks
already at the first stages of cortical processing. This is thought to be
the result of a combination of mechanisms, which include feature extraction at
single cell level and geometric processing by means of cells connectivity. We
present a geometric model of such connectivities in the space of detected
features associated with spatio-temporal visual stimuli, and show how they can be
used to obtain low-level object segmentation. The main idea is that of defining
a spectral clustering procedure with anisotropic affinities over datasets
consisting of embeddings of the visual stimuli into higher dimensional spaces.
The neural plausibility of the proposed arguments will be discussed.
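A minimal NumPy/SciPy sketch of the clustering step described here, under the assumption that the anisotropic affinity can be modelled as a Gaussian kernel with per-dimension bandwidths over lifted feature points (e.g. position, orientation, time); the paper's actual cortical connectivity kernel is more structured.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def anisotropic_affinity(X: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # X: (n, d) stimuli lifted into feature space, e.g. (x, y, theta, t);
    # scales: (d,) per-dimension bandwidths that encode the anisotropy.
    D = (X[:, None, :] - X[None, :, :]) / scales
    return np.exp(-0.5 * (D ** 2).sum(-1))

def spectral_cluster(X: np.ndarray, scales: np.ndarray, k: int) -> np.ndarray:
    W = anisotropic_affinity(X, scales)
    d = W.sum(axis=1)
    # Symmetric normalised Laplacian L = I - D^{-1/2} W D^{-1/2}.
    L = np.eye(len(X)) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
    _, vecs = np.linalg.eigh(L)           # eigenvectors, ascending eigenvalues
    emb = vecs[:, :k]                     # k smallest as a low-dim embedding
    emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12
    _, labels = kmeans2(emb, k, minit='++', seed=0)
    return labels                         # one segment label per stimulus point
```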
Delving Deeper into Convolutional Networks for Learning Video Representations
We propose an approach to learn spatio-temporal features in videos from
intermediate visual representations we call "percepts" using
Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts
extracted from all levels of a deep convolutional network trained on
the large ImageNet dataset. While high-level percepts contain highly
discriminative information, they tend to have low spatial resolution.
Low-level percepts, on the other hand, preserve a higher spatial resolution
from which we can model finer motion patterns. Using low-level percepts, however,
can lead to high-dimensional video representations. To mitigate this effect and
control the number of model parameters, we introduce a variant of the GRU model
that leverages the convolution operations to enforce sparse connectivity of the
model units and share parameters across the input spatial locations.
We empirically validate our approach on both Human Action Recognition and
Video Captioning tasks. In particular, we achieve results equivalent to
the state of the art on the YouTube2Text dataset using a simpler text-decoder
model and without extra 3D CNN features.
Comment: ICLR 201
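A minimal PyTorch sketch of such a convolutional GRU cell, with illustrative channel sizes and kernel width: the fully connected gate transformations of a standard GRU are replaced by convolutions, which yields the sparse connectivity and spatially shared parameters the abstract describes.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)  # update z, reset r
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)       # candidate state

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_new = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_new      # gated update of the 2D hidden map

# Run the cell over a sequence of percept maps of shape (T, B, C, H, W).
cell = ConvGRUCell(in_ch=64, hid_ch=32)
x = torch.randn(5, 2, 64, 14, 14)
h = torch.zeros(2, 32, 14, 14)
for t in range(x.size(0)):
    h = cell(x[t], h)
```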
Visualisation of Origins, Destinations and Flows with OD Maps
We present a new technique for the visual exploration of origins (O) and destinations (D) arranged in geographic space. Previous attempts to map the flows between origins and destinations have suffered from problems of occlusion, usually requiring some form of generalisation, such as aggregation or flow density estimation, before they can be visualised. This can lead to loss of detail or the introduction of arbitrary artefacts in the visual representation. Here, we propose mapping OD vectors as cells rather than lines, comparable with the process of constructing OD matrices, but unlike the OD matrix, we preserve the spatial layout of all origin and destination locations by constructing a gridded two-level spatial treemap. The result is a set of spatially ordered small multiples upon which any arbitrary geographic data may be projected. Using a hash grid spatial data structure, we explore the characteristics of the technique through a software prototype that allows interactive query and visualisation of 10^5 to 10^6 simulated and recorded OD vectors. The technique is illustrated using US county-to-county migration and commuting statistics.
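A minimal NumPy sketch of the nested layout idea, assuming a regular grid in place of the paper's gridded treemap construction: each origin cell contains a miniature destination grid, so both spatial layouts are preserved in a single small-multiples image.

```python
import numpy as np

def od_map(origins, dests, counts, rows=4, cols=4):
    # origins, dests: (n, 2) arrays of (row, col) grid cells; counts: (n,) flow volumes.
    img = np.zeros((rows * rows, cols * cols))
    for (orow, ocol), (drow, dcol), c in zip(origins, dests, counts):
        # Outer position = origin cell, inner position = destination cell.
        img[orow * rows + drow, ocol * cols + dcol] += c
    return img

# Two example flows on a 4x4 geographic grid.
flows = od_map(np.array([[0, 1], [2, 3]]),
               np.array([[1, 1], [0, 2]]),
               np.array([10, 3]))
```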
Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification
Recently, substantial research effort has focused on how to apply CNNs or
RNNs to better extract temporal patterns from videos, so as to improve the
accuracy of video classification. In this paper, however, we show that temporal
information, especially longer-term patterns, may not be necessary to achieve
competitive results on common video classification datasets. We investigate the
potential of purely attention-based local feature integration. Accounting for
the characteristics of such features in video classification, we propose a
local feature integration framework based on attention clusters, and introduce
a shifting operation to capture more diverse signals. We carefully analyze and
compare the effect of different attention mechanisms, cluster sizes, and the
use of the shifting operation, and also investigate the combination of
attention clusters for multimodal integration. We demonstrate the effectiveness
of our framework on three real-world video classification datasets. Our model
achieves competitive results across all of these. In particular, on the
large-scale Kinetics dataset, our framework obtains excellent single-model
accuracies of 79.4% top-1 and 94.0% top-5 on the validation set. The attention
clusters are the backbone of our winning solution at the ActivityNet Kinetics
Challenge 2017. Code and models will be
released soon.
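A minimal PyTorch sketch of attention clusters with the shifting operation, under the assumption that shifting amounts to a learnable per-cluster scale and bias followed by L2 normalisation of each cluster's output; the feature dimension and cluster count are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCluster(nn.Module):
    def __init__(self, dim: int, n_clusters: int):
        super().__init__()
        self.attn = nn.Linear(dim, n_clusters)        # one attention head per cluster
        self.alpha = nn.Parameter(torch.ones(n_clusters, 1))
        self.beta = nn.Parameter(torch.zeros(n_clusters, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) unordered local features; no temporal modelling.
        a = self.attn(x).softmax(dim=1)               # (B, N, K) weights over the N features
        pooled = torch.einsum('bnk,bnd->bkd', a, x)   # one aggregated vector per cluster
        shifted = F.normalize(self.alpha * pooled + self.beta, dim=-1)  # shift, then L2-normalise
        return shifted.flatten(1)                     # concatenate the K cluster outputs

feats = torch.randn(2, 25, 128)                       # e.g. 25 frame-level features
out = AttentionCluster(128, n_clusters=8)(feats)      # (2, 8 * 128)
```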
Directional Dense-Trajectory-based Patterns for Dynamic Texture Recognition
Representation of dynamic textures (DTs), well known as sequences of moving textures, is a challenging problem in video analysis due to the disorientation of motion features. Analyzing DTs to make them "understandable" plays an important role in different applications of computer vision. In this paper, an efficient approach for DT description is proposed by addressing the following novel concepts. First, beneficial properties of dense trajectories are exploited for the first time to efficiently describe DTs instead of the whole video. Second, two substantial extensions of the Local Vector Pattern operator are introduced to form a completed model based on complemented components, enhancing its performance in encoding directional features of motion points in a trajectory. Finally, we present a new framework, called Directional Dense Trajectory Patterns, which takes advantage of directional beams of dense trajectories along with the spatio-temporal features of their motion points in order to construct dense-trajectory-based descriptors with more robustness. Evaluations of DT recognition on different benchmark datasets (i.e., UCLA, DynTex, and DynTex++) have verified the effectiveness of our proposal.
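For context, the dense-trajectory substrate on which the descriptor is built can be sketched with OpenCV as below. This is a simplified version of standard dense-trajectory tracking (sampling a point grid and displacing it through the optical flow field); the paper's directional Local Vector Pattern encoding on top is not reproduced.

```python
import cv2
import numpy as np

def dense_trajectories(frames, step=8, length=15):
    # frames: list of grayscale images; sample a point every `step` pixels in
    # the first frame and follow each point through the flow field.
    h, w = frames[0].shape
    pts = np.stack(np.meshgrid(np.arange(0, w, step), np.arange(0, h, step)),
                   axis=-1).reshape(-1, 2).astype(float)       # (n, 2) as (x, y)
    tracks = [pts.copy()]
    for prev, cur in zip(frames, frames[1:length + 1]):
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        rows = np.clip(pts[:, 1].astype(int), 0, h - 1)
        cols = np.clip(pts[:, 0].astype(int), 0, w - 1)
        pts = pts + flow[rows, cols]       # displace each point by its local flow
        tracks.append(pts.copy())
    return np.stack(tracks, axis=1)        # (n_points, n_steps, 2) trajectories
```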
Modeling geometric-temporal context with directional pyramid co-occurrence for action recognition
In this paper, we present a new geometric-temporal representation for visual action recognition based on local spatio-temporal features. First, we propose a modified covariance descriptor under the log-Euclidean Riemannian metric to represent the spatio-temporal cuboids detected in the video sequences. Compared with previously proposed covariance descriptors, our descriptor can be measured and clustered in Euclidean space. Second, to capture the geometric-temporal contextual information, we construct a directional pyramid co-occurrence matrix (DPCM) to describe the spatio-temporal distribution of the vector-quantized local feature descriptors extracted from a video. DPCM characterizes the co-occurrence statistics of local features as well as the spatio-temporal positional relationships among the concurrent features. These statistics provide strong descriptive power for action recognition. To use DPCM for action recognition, we propose a directional pyramid co-occurrence matching kernel to measure the similarity of videos. The proposed method achieves state-of-the-art performance and improves on the recognition accuracy of bag-of-visual-words (BOVW) models by a large margin on six public data sets. For example, on the KTH data set it achieves 98.78% accuracy, while the BOVW approach only achieves 88.06%. On both the Weizmann and UCF CIL data sets, the highest possible accuracy of 100% is achieved.
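A minimal NumPy/SciPy sketch of the log-Euclidean mapping behind the first contribution: the covariance matrix of per-point features inside a cuboid is pushed through the matrix logarithm, after which ordinary Euclidean distance (and hence standard clustering) applies. The feature choice and regularisation are illustrative, not the paper's exact descriptor.

```python
import numpy as np
from scipy.linalg import logm

def log_euclidean_covariance(features: np.ndarray) -> np.ndarray:
    # features: (n, d) raw features sampled inside one spatio-temporal cuboid,
    # e.g. intensity, gradients, and optical flow components at each point.
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    log_cov = logm(cov).real                     # SPD matrix, so the log is real
    iu = np.triu_indices(log_cov.shape[0])
    return log_cov[iu]                           # vectorised upper triangle

desc = log_euclidean_covariance(np.random.randn(200, 7))   # d(d+1)/2 = 28 dims
```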