A robust and efficient video representation for action recognition
This paper introduces a state-of-the-art video representation and applies it
to efficient action recognition and detection. We first propose to improve the
popular dense trajectory features by explicit camera motion estimation. More
specifically, we extract feature point matches between frames using SURF
descriptors and dense optical flow. The matches are used to estimate a
homography with RANSAC. To improve the robustness of homography estimation, a
human detector is employed to remove outlier matches from the human body as
human motion is not constrained by the camera. Trajectories consistent with the
homography are considered to be due to camera motion and are thus removed. We also
use the homography to cancel out camera motion from the optical flow. This
results in significant improvement on motion-based HOF and MBH descriptors. We
further explore the recent Fisher vector as an alternative feature encoding
approach to the standard bag-of-words histogram, and consider different ways to
include spatial layout information in these encodings. We present a large and
varied set of evaluations, considering (i) classification of short basic
actions on six datasets, (ii) localization of such actions in feature-length
movies, and (iii) large-scale recognition of complex events. We find that our
improved trajectory features significantly outperform previous dense
trajectories, and that Fisher vectors are superior to bag-of-words encodings
for video recognition tasks. In all three tasks, we show substantial
improvements over the state-of-the-art results.
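As a rough sketch of the camera motion compensation step described above, the pipeline can be expressed in OpenCV as follows. This is a hedged illustration, not the authors' code: ORB stands in for the patented SURF descriptor, the dense-flow matches are omitted, and `human_boxes` is a hypothetical stand-in for the human detector's output.
```python
import cv2
import numpy as np

def outside_boxes(pts, boxes):
    # Keep only points outside detected human bounding boxes, since human
    # motion does not obey the camera's homography.
    keep = np.ones(len(pts), dtype=bool)
    for (x, y, w, h) in boxes:
        inside = (pts[:, 0] >= x) & (pts[:, 0] <= x + w) & \
                 (pts[:, 1] >= y) & (pts[:, 1] <= y + h)
        keep &= ~inside
    return keep

def camera_compensated_flow(prev_gray, curr_gray, human_boxes=()):
    # ORB used here instead of SURF (assumption: SURF is often unavailable).
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    keep = outside_boxes(pts1, human_boxes)
    # Robust homography between frames: RANSAC inliers model camera motion.
    H, _ = cv2.findHomography(pts1[keep], pts2[keep], cv2.RANSAC, 3.0)
    # Warp the previous frame so residual flow reflects object motion only.
    warped = cv2.warpPerspective(prev_gray, H, prev_gray.shape[::-1])
    return cv2.calcOpticalFlowFarneback(warped, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```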
Fully-Coupled Two-Stream Spatiotemporal Networks for Extremely Low Resolution Action Recognition
A major emerging challenge is how to protect people's privacy as cameras and
computer vision are increasingly integrated into our daily lives, including in
smart devices inside homes. A potential solution is to capture and record just
the minimum amount of information needed to perform a task of interest. In this
paper, we propose a fully-coupled two-stream spatiotemporal architecture for
reliable human action recognition on extremely low resolution (e.g., 12x16
pixel) videos. We provide an efficient method to extract spatial and temporal
features and to aggregate them into a robust feature representation for an
entire action video sequence. We also consider how to incorporate high
resolution videos during training in order to build better low resolution
action recognition models. We evaluate on two publicly-available datasets,
showing significant improvements over the state-of-the-art.
Comment: 9 pages, 5 figures, published in WACV 2018
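The general two-stream idea behind this kind of model can be sketched in PyTorch as below. This is a minimal sketch of a generic two-stream network on tiny inputs; the paper's specific coupling mechanism, layer sizes, and high-resolution training scheme are not reproduced.
```python
import torch
import torch.nn as nn

class TinyStream(nn.Module):
    # Small conv encoder sized for very low resolution inputs
    # (assumption: the actual paper architecture differs).
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, x):
        return self.net(x)

class TwoStream(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.rgb = TinyStream(3)    # appearance stream
        self.flow = TinyStream(2)   # motion stream (x/y flow channels)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, rgb_clip, flow_clip):
        # Clips: (B, C, T, H, W). Average per-frame features over time to
        # aggregate the whole sequence into one clip-level representation.
        f_rgb = torch.stack([self.rgb(f) for f in rgb_clip.unbind(2)]).mean(0)
        f_flow = torch.stack([self.flow(f) for f in flow_clip.unbind(2)]).mean(0)
        return self.fc(torch.cat([f_rgb, f_flow], dim=1))

# e.g. logits = TwoStream(10)(torch.randn(4, 3, 16, 12, 16),
#                             torch.randn(4, 2, 16, 12, 16))
```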
Designing Motion Representation in Videos
Motion representation plays a vital role in vision-based human action recognition in videos. Generally, the information in a video can be divided into spatial information and temporal information. While the spatial information is easily described by RGB images, the design of the motion representation remains a challenging problem. In order to design a motion representation that is efficient and effective, we design the feature according to two principles. First, to guarantee robustness, the temporal information should be highly related to informative modalities, e.g., the optical flow. Second, only basic operations should be applied, so that the computational cost of extracting the temporal information remains affordable. Based on these principles, we introduce a novel compact motion representation for video action recognition, named Optical Flow guided Feature (OFF), which enables the network to distil temporal information through a fast and robust approach. The OFF is derived from the definition of optical flow and is orthogonal to the optical flow. The derivation also provides theoretical support for using the difference between two frames. By directly calculating pixel-wise spatiotemporal gradients of the deep feature maps, the OFF can be embedded in any existing CNN-based video action recognition framework at only a slight additional cost, enabling the CNN to extract spatiotemporal information. This simple but powerful idea is validated by experimental results. The network with OFF, fed only RGB inputs, achieves a competitive accuracy of 93.3% on UCF-101, comparable with the result obtained by two streams (RGB and optical flow) but 15 times faster. Experimental results also show that OFF is complementary to other motion modalities such as optical flow. When the proposed method is plugged into the state-of-the-art video action recognition framework, it achieves 96.0% and 74.2% accuracy on UCF-101 and HMDB-51, respectively.
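Since OFF is defined through the spatial and temporal gradients of feature maps, the core computation can be sketched as below. Sobel kernels are used for the spatial gradients; the paper's layer placement and normalization details are omitted.
```python
import torch
import torch.nn.functional as F

def off(feat_t, feat_t1):
    # feat_t, feat_t1: (B, C, H, W) feature maps from two consecutive frames.
    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    C = feat_t.shape[1]
    kx = sobel_x.view(1, 1, 3, 3).repeat(C, 1, 1, 1)
    ky = sobel_x.t().reshape(1, 1, 3, 3).repeat(C, 1, 1, 1)
    fx = F.conv2d(feat_t, kx, padding=1, groups=C)  # spatial gradient dF/dx
    fy = F.conv2d(feat_t, ky, padding=1, groups=C)  # spatial gradient dF/dy
    ft = feat_t1 - feat_t                           # temporal gradient dF/dt
    return torch.cat([fx, fy, ft], dim=1)           # the OFF channels
```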
Resource efficient action recognition in videos
This thesis traces an innovative journey in the domain of real-world action recognition, focusing in particular on memory- and data-efficient systems. It begins by introducing a novel approach for smart frame selection, which significantly reduces computational costs in video classification. It further optimizes the action recognition process by addressing the challenges of training time and memory consumption in video transformers, laying a strong foundation for memory-efficient action recognition.
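The frame selection idea can be illustrated with a generic top-k scheme. This is a hedged sketch: the thesis's actual selection policy is learned, and `scorer` here is a hypothetical lightweight relevance network.
```python
import torch

def select_frames(frames, scorer, k=8):
    # frames: (T, C, H, W); scorer maps one frame to a scalar relevance score.
    with torch.no_grad():
        scores = torch.stack([scorer(f.unsqueeze(0)).squeeze() for f in frames])
    idx = scores.topk(k).indices.sort().values  # keep temporal order
    return frames[idx]  # only these frames go to the heavy classifier
```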
The thesis then delves into zero-shot learning, focusing on the flaws of the currently existing protocol and establishing a new split for true zero-shot action recognition, ensuring zero overlap between unseen test classes and training or pre-training classes. Building on this, a unique cluster-based representation, optimized using reinforcement learning, is proposed for zero-shot action recognition. Crucially, we show that joint visual-semantic representation learning is essential for improved performance. We also experiment with feature-generation approaches for zero-shot action recognition, introducing a synthetic sample selection methodology that extends the utility of zero-shot learning to both images and videos and selects high-quality samples for synthetic data augmentation. This form of data valuation is then incorporated into our novel video data augmentation approach, where we generate video composites by mixing the foregrounds and backgrounds of different videos. The data valuation helps us choose good composites at a reduced overall cost. Finally, we propose the creation of a meaningful semantic space for action labels. We create a textual description dataset for each action class and propose a novel feature-generating approach to maximise the benefits of this semantic space. The research contributes significantly to the field, potentially paving the way for more efficient, resource-friendly, and robust video processing and understanding techniques.
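The central constraint of the zero-shot split described above, zero overlap between unseen test classes and anything seen during training or pre-training, can be expressed in a few lines. This is a sketch with hypothetical inputs (class names and the split size are placeholders).
```python
import random

def true_zero_shot_split(all_classes, pretrain_classes, n_unseen=20, seed=0):
    # Unseen test classes are drawn only from classes that never appear in
    # the training or pre-training data, so the overlap is zero by design.
    candidates = sorted(set(all_classes) - set(pretrain_classes))
    rng = random.Random(seed)
    unseen = rng.sample(candidates, n_unseen)
    seen = [c for c in all_classes if c not in set(unseen)]
    assert not set(unseen) & set(pretrain_classes)  # the protocol's guarantee
    return seen, unseen
```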
Act-VIT: A Representationally Robust Attention Architecture for Skeleton Based Action Recognition Using Vision Transformer
Skeleton-based action recognition receives the attention of many researchers
as it is robust to viewpoint and illumination changes, and its processing is
much more efficient than that of raw video frames. With the emergence of deep learning
models, it has become very popular to represent the skeleton data in
pseudo-image form and apply Convolutional Neural Networks for action
recognition. Thereafter, studies concentrated on finding effective methods for
forming pseudo-images. Recently, attention networks, more specifically
transformers, have provided promising results in various vision problems. In
this study, the effectiveness of vision transformers for skeleton-based action
recognition is examined and their robustness to the pseudo-image representation
scheme is investigated. To this end, a three-level architecture, Act-VIT, is
proposed, which forms a set of pseudo-images, applies a classifier to each
representation, and combines their results to find the final action class. The
classifiers of Act-VIT are first realized by CNNs and then by ViTs, and their
performances are compared. Experimental studies reveal that the vision
transformer is less sensitive to the initial pseudo-image representation
than the CNN. Nevertheless, even with the vision transformer, the
recognition performance can be further improved by a consensus of classifiers.
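The consensus step can be sketched as follows. This is a minimal PyTorch sketch: `skeleton_to_pseudo_image` shows one common joint-by-time mapping, and the branch classifiers stand in for the CNN or ViT backbones; Act-VIT's exact three-level design is not reproduced.
```python
import torch
import torch.nn as nn

def skeleton_to_pseudo_image(seq):
    # seq: (T, J, 3) joint coordinates over time; treat (x, y, z) as the
    # channels of a J x T image - one common pseudo-image mapping.
    return seq.permute(2, 1, 0)

class ConsensusEnsemble(nn.Module):
    # Each branch sees a differently formed pseudo-image; the branches stand
    # in for the CNN or ViT classifiers compared in the paper.
    def __init__(self, branches):
        super().__init__()
        self.branches = nn.ModuleList(branches)

    def forward(self, pseudo_images):
        # Consensus: average the class probabilities of all branches.
        probs = [b(x).softmax(dim=1) for b, x in zip(self.branches, pseudo_images)]
        return torch.stack(probs).mean(0)
```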
Large-scale interactive exploratory visual search
Large-scale visual search has been one of the challenging issues in the era of big data. It demands techniques that are not only highly effective and efficient but also allow users to conveniently express their information needs and refine their intents. In this thesis, we focus on developing an exploratory framework for large-scale visual search. We also develop a number of enabling techniques, including compact visual content representation for scalable search, near-duplicate video shot detection, and action-based event detection. We propose a novel scheme for extremely low bit rate visual search, which sends compressed visual words consisting of vocabulary tree histograms and descriptor orientations rather than raw descriptors. Compact representation of video data is achieved by identifying keyframes of a video, which can also help users comprehend visual content efficiently. We propose a novel Bag-of-Importance model for static video summarization. Near-duplicate detection is one of the key issues for large-scale visual search, since there exist a large number of nearly identical images and videos. We propose an improved near-duplicate video shot detection approach for more effective shot representation. Event detection has been one of the solutions for bridging the semantic gap in visual search. We focus in particular on human-action-centred event detection and propose an enhanced sparse coding scheme to model human actions. Our proposed approach is able to significantly reduce computational cost while achieving recognition accuracy highly comparable to state-of-the-art methods. Finally, we propose an integrated solution addressing the prime challenges raised by large-scale interactive visual search. The proposed system is also one of the first attempts at exploratory visual search, providing users with more robust results to support their exploration.
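The visual-word representation underlying the compressed search scheme above can be illustrated with a flat k-means vocabulary. This is a sketch only: the thesis uses a vocabulary tree rather than flat k-means, and the descriptor-orientation component is omitted.
```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocab(descriptors, k=256):
    # descriptors: (N, D) local features sampled from the database.
    return KMeans(n_clusters=k, n_init=4, random_state=0).fit(descriptors)

def bow_histogram(descriptors, vocab):
    # Quantize each descriptor to its nearest visual word, then count.
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float32)
    return hist / (hist.sum() + 1e-8)  # L1-normalised visual-word histogram
```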
Improved Multi-resolution Analysis of the Motion Patterns in Video for Human Action Classification
The automatic recognition of human actions in video is of great interest in many applications
such as automated surveillance, content-based video summarization, video search, and indexing.
The problem is challenging due to the wide variation in the motion patterns of a given
action, such as walking, across different subjects, and the low variation between similar
actions such as running and jogging.
This thesis has three contributions in a discriminative bottom-up framework to improve the
multi-resolution analysis of the motion patterns in video for better recognition of human actions.
The first contribution of this thesis is the introduction of a novel approach for robust local
motion feature detection in video. To this end, four multi-resolution, temporally causal,
asymmetric filters are introduced: log Gaussian, scale-derivative Gaussian, Poisson, and
asymmetric sinc. The performance of these filters is compared with the widely used
multi-resolution Gabor filter in a common framework for detection of local salient motions. The
features obtained from the asymmetric filtering are more precise and more robust under geometric
deformations such as view change or affine transformations. Moreover, they provide higher
classification accuracy when they are used with a standard bag-of-words representation of actions
and a single discriminative classifier. The experimental results show that the asymmetric
sinc performs best, the Poisson and the scale-derivative Gaussian perform better than the log
Gaussian, and the log Gaussian in turn performs better than the symmetric temporal Gabor filter.
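As an illustration of causal, asymmetric temporal filtering at multiple scales, here is a sketch using the Poisson kernel, one of the four filters named above; the exact parameterizations used in the thesis may differ.
```python
import math
import numpy as np

def poisson_kernel(lam, length=32):
    # Causal, asymmetric temporal kernel: h[n] = exp(-lam) * lam**n / n!
    h = np.array([math.exp(-lam) * lam**n / math.factorial(n)
                  for n in range(length)])
    return h / h.sum()

def temporal_filter_bank(signal, lams=(2.0, 4.0, 8.0)):
    # signal: (T,) intensity of one pixel over time; each scale responds to
    # motion at a different temporal resolution, using only past frames.
    return np.stack([np.convolve(signal, poisson_kernel(l))[:len(signal)]
                     for l in lams])
```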
The second contribution of this thesis is the introduction of an efficient action representation.
The observation is that salient features at different spatial and temporal scales characterize
different motion information, so a multi-resolution analysis of the motion characteristics
should be representative of different actions. The resulting multi-resolution action signature
provides a more discriminative video representation.
The third contribution of this thesis concerns the classification of different human actions. To this
end, an ensemble of classifiers in a multiple classifier system (MCS) framework with a parallel
topology is utilized. This framework can fully benefit from the multi-resolution characteristics
of the motion patterns in human actions. The classifier combination concept of the MCS
is then extended to address two problems in the configuration of a recognition
framework, namely the choice of distance metric for comparing action representations and
the size of the codebook by which an action is represented. This application of MCS at multiple
stages of the recognition pipeline yields a multi-stage MCS framework which outperforms
existing methods that use a single classifier.
Based on the experimental results of the local feature detection and the action classification,
the multi-stage MCS framework, which uses the multi-scale features obtained from the temporal
asymmetric sinc filtering, is recommended for the task of human action recognition in video.
Sparse and low rank approximations for action recognition
Action recognition is a crucial area of research in computer vision, with a wide range of
applications in surveillance, patient-monitoring systems, video indexing, Human-Computer
Interaction, and many more. These applications require automated action recognition,
and robust classification methods are still sought after despite influential
research in this field over the past decade. Data resources have grown
tremendously owing to advances in the digital revolution, far beyond
the meagre resources of the past. The main limitation on a system
dealing with video data is the computational burden due to large feature dimensions
and data redundancy. Sparse and low rank approximation methods have recently
evolved, aiming at concise and meaningful representations of data. This thesis
explores the application of sparse and low rank approximation methods in the
context of video data classification, with the following contributions.
1. An approach for solving the problem of action and gesture classification is
proposed within the sparse representation domain, effectively dealing with
large feature dimensions.
2. A low rank matrix completion approach is proposed to jointly classify more
than one action.
3. Deep features are proposed for robust classification of multiple actions
within a matrix completion framework which can handle data deficiencies.
This thesis starts with the applicability of sparse representation-based
classification methods to the problem of action and gesture recognition. Random projection
is used to reduce the dimensionality of the features; these are referred
to as compressed features in this thesis. The dictionary formed with compressed
features has proved to be efficient for the classification task, achieving results
comparable to the state of the art.
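This pipeline, random projection followed by sparse representation-based classification via class-wise reconstruction residuals, can be sketched as below. scikit-learn's OMP stands in for whichever sparse solver the thesis uses, and the variable names are placeholders.
```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_classify(D, labels, x, n_nonzero=30):
    # D: (n_atoms, dim) compressed training features, one atom per sample.
    # Code x over the dictionary, then pick the class whose atoms
    # reconstruct x with the smallest residual.
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero,
                                    fit_intercept=False).fit(D.T, x)
    alpha = omp.coef_
    best, best_res = None, np.inf
    for c in np.unique(labels):
        res = np.linalg.norm(x - D.T @ np.where(labels == c, alpha, 0.0))
        if res < best_res:
            best, best_res = c, res
    return best

# Compress high-dimensional action features before building the dictionary.
# X_train, y_train, x_test are placeholders for the actual feature data:
# rp = GaussianRandomProjection(n_components=256, random_state=0)
# D = rp.fit_transform(X_train)
# pred = src_classify(D, np.asarray(y_train), rp.transform(x_test[None])[0])
```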
Next, this thesis addresses the more promising problem of simultaneous
classification of multiple actions. This is treated as a matrix completion problem in a
transductive setting. Matrix completion methods can be considered a generic
extension of sparse representation methods from a compressed sensing point
of view. The features and corresponding labels of the training and test data are
concatenated and placed as columns of a matrix, so that the unknown test labels
become the missing entries of that matrix. This is solved using rank minimization
techniques, based on the assumption that the underlying complete matrix is
low rank. This approach has achieved results better than the state of the art on datasets of varying complexity.
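The rank minimization step can be sketched with an iterative singular-value soft-thresholding scheme. This is a generic SoftImpute-style solver; the thesis's exact optimizer and label encoding are assumptions.
```python
import numpy as np

def soft_impute(M, observed, lam=1.0, n_iters=200):
    # M holds features and labels stacked as rows, samples as columns;
    # observed marks known entries (all features plus training labels).
    X = np.where(observed, M, 0.0)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X_low = (U * np.maximum(s - lam, 0.0)) @ Vt  # shrink singular values
        X = np.where(observed, M, X_low)  # re-impose the observed entries
    return X  # recovered test labels sit in the previously missing entries
```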
This thesis then extends the matrix completion framework for joint classification
of actions to handle missing features in addition to missing test labels. In
this context, deep features from a convolutional neural network are proposed.
A convolutional neural network is trained on the training data, and features for
both the training and test data are extracted from the trained network. The performance
of the deep features has proved promising when compared to state-of-the-art
hand-crafted features.