Search CORE

122 research outputs found

Realtime Video Classification Using Dense HOF/HOG

Author: Duta I. C.
Rostamzadeh N.
Sebe N.
Uijlings J. R. R.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2014
Field of study

ABSTRACT The current state-of-the-art in Video Classification is based on Bag-of-Words using local visual descriptors. Most commonly these are Histogram of Oriented Gradient (HOG) and Histogram of Optical Flow (HOF) descriptors. While such system is very powerful for classification, it is also computationally expensive. This paper addresses the problem of computational efficiency. Specifically: (1) We propose several speed-ups for densely sampled HOG and HOF descriptors and release Matlab code. (2) We investigate the trade-off between accuracy and computational efficiency of descriptors in terms of frame sampling rate and type of Optical Flow method. (3) We investigate the trade-off between accuracy and computational efficiency for the video representation, using either a k-means or hierarchical k-means based visual vocabulary, a Random Forest based vocabulary or the Fisher kernel

CiteSeerX

Crossref

Edinburgh Research Explorer

Activity Recognition based on a Magnitude-Orientation Stream Network

Author: Caetano Carlos
de Melo Victor H. C.
Santos Jefersson A. dos
Schwartz William Robson
Publication venue
Publication date: 22/08/2017
Field of study

The temporal component of videos provides an important clue for activity recognition, as a number of activities can be reliably recognized based on the motion information. In view of that, this work proposes a novel temporal stream for two-stream convolutional networks based on images computed from the optical flow magnitude and orientation, named Magnitude-Orientation Stream (MOS), to learn the motion in a better and richer manner. Our method applies simple nonlinear transformations on the vertical and horizontal components of the optical flow to generate input images for the temporal stream. Experimental results, carried on two well-known datasets (HMDB51 and UCF101), demonstrate that using our proposed temporal stream as input to existing neural network architectures can improve their performance for activity recognition. Results demonstrate that our temporal stream provides complementary information able to improve the classical two-stream methods, indicating the suitability of our approach to be used as a temporal video representation.Comment: 8 pages, SIBGRAPI 201

arXiv.org e-Print Archive

Crossref

Efficient and Effective Solutions for Video Classification

Author: Duta Ionut Cosmin
Publication venue: University of Trento
Publication date: 24/11/2017
Field of study

The aim of this PhD thesis is to make a step forward towards teaching computers to understand videos in a similar way as humans do. In this work we tackle the video classification and/or action recognition tasks. This thesis was completed in a period of transition, the research community moving from traditional approaches (such as hand-crafted descriptor extraction) to deep learning. Therefore, this thesis captures this transition period, however, unlike image classification, where the state-of-the-art results are dominated by deep learning approaches, for video classification the deep learning approaches are not so dominant. As a matter of fact, most of the current state-of-the-art results in video classification are based on a hybrid approach where the hand-crafted descriptors are combined with deep features to obtain the best performance. This is due to several factors, such as the fact that video is a more complex data as compared to an image, therefore, more difficult to model and also that the video datasets are not large enough to train deep models with effective results. The pipeline for video classification can be broken down into three main steps: feature extraction, encoding and classification. While for the classification part, the existing techniques are more mature, for feature extraction and encoding there is still a significant room for improvement. In addition to these main steps, the framework contains some pre/post processing techniques, such as feature dimensionality reduction, feature decorrelation (for instance using Principal Component Analysis - PCA) and normalization, which can influence considerably the performance of the pipeline. One of the bottlenecks of the video classification pipeline is represented by the feature extraction step, where most of the approaches are extremely computationally demanding, what makes them not suitable for real-time applications. In this thesis, we tackle this issue, propose different speed-ups to improve the computational cost and introduce a new descriptor that can capture motion information from a video without the need of computing optical flow (which is very expensive to compute). Another important component for video classification is represented by the feature encoding step, which builds the final video representation that serves as input to a classifier. During the PhD, we proposed several improvements over the standard approaches for feature encoding. We also propose a new feature encoding approach for deep feature encoding. To summarize, the main contributions of this thesis are as follows3: (1) We propose several speed-ups for descriptor extraction, providing a version for the standard video descriptors that can run in real-time. We also investigate the trade-off between accuracy and computational efficiency; (2) We provide a new descriptor for extracting information from a video, which is very efficient to compute, being able to extract motion information without the need of extracting the optical flow; (3) We investigate different improvements over the standard encoding approaches for boosting the performance of the video classification pipeline.;(4) We propose a new feature encoding approach specifically designed for encoding local deep features, providing a more robust video representation

Unitn-eprints PhD

Ordered Pooling of Optical Flow Sequences for Action Recognition

Author: Cherian Anoop
Porikli Fatih
Wang Jue
Publication venue
Publication date: 06/04/2017
Field of study

Training of Convolutional Neural Networks (CNNs) on long video sequences is computationally expensive due to the substantial memory requirements and the massive number of parameters that deep architectures demand. Early fusion of video frames is thus a standard technique, in which several consecutive frames are first agglomerated into a compact representation, and then fed into the CNN as an input sample. For this purpose, a summarization approach that represents a set of consecutive RGB frames by a single dynamic image to capture pixel dynamics is proposed recently. In this paper, we introduce a novel ordered representation of consecutive optical flow frames as an alternative and argue that this representation captures the action dynamics more effectively than RGB frames. We provide intuitions on why such a representation is better for action recognition. We validate our claims on standard benchmark datasets and demonstrate that using summaries of flow images lead to significant improvements over RGB frames while achieving accuracy comparable to the state-of-the-art on UCF101 and HMDB datasets.Comment: Accepted in WACV 201

arXiv.org e-Print Archive

Crossref

Online Action Detection

Author: De Geest Roeland
Gavves Efstratios
Ghodrati Amir
Li Zhenyang
Snoek Cees
Tuytelaars Tinne
Publication venue
Publication date: 01/01/2016
Field of study

In online action detection, the goal is to detect the start of an action in a video stream as soon as it happens. For instance, if a child is chasing a ball, an autonomous car should recognize what is going on and respond immediately. This is a very challenging problem for four reasons. First, only partial actions are observed. Second, there is a large variability in negative data. Third, the start of the action is unknown, so it is unclear over what time window the information should be integrated. Finally, in real world data, large within-class variability exists. This problem has been addressed before, but only to some extent. Our contributions to online action detection are threefold. First, we introduce a realistic dataset composed of 27 episodes from 6 popular TV series. The dataset spans over 16 hours of footage annotated with 30 action classes, totaling 6,231 action instances. Second, we analyze and compare various baseline methods, showing this is a challenging problem for which none of the methods provides a good solution. Third, we analyze the change in performance when there is a variation in viewpoint, occlusion, truncation, etc. We introduce an evaluation protocol for fair comparison. The dataset, the baselines and the models will all be made publicly available to encourage (much needed) further research on online action detection on realistic data.Comment: Project page: http://homes.esat.kuleuven.be/~rdegeest/OnlineActionDetection.htm

arXiv.org e-Print Archive

Crossref

UvA-DARE

International Migration, Integration and Social Cohesion online publications