Simple and Complex Human Action Recognition in Constrained and Unconstrained Videos
Human action recognition plays a crucial role in visual learning applications such as video understanding and surveillance, video retrieval, human-computer interaction, and autonomous driving systems. A variety of methodologies have been proposed for human action recognition through the development of low-level features along with bag-of-visual-words models. However, much less research has been performed on the combination of the pre-processing, encoding, and classification stages. This dissertation focuses on enhancing action recognition performance via ensemble learning, hybrid classification, hierarchical feature representation, and key action perception methodologies. Action variation is one of the crucial challenges in video analysis and action recognition. We address this problem by proposing a hybrid classifier (HC) to discriminate between actions that contain similar forms of motion features, such as walking, running, and jogging. Aside from that, we show and prove that the fusion of various appearance-based and motion features can boost simple and complex action recognition performance. The next part of the dissertation introduces the pooled-feature representation (PFR), which is derived from a double-phase encoding framework (DPE). Considering that a given unconstrained video is composed of a sequence of simple frames, the first phase of DPE generates temporal sub-volumes from the video and represents them individually by employing the proposed improved rank pooling (IRP) method. The second phase constructs the pool of features by fusing the represented vectors from the first phase. The pool is compressed and then encoded to provide the video-parts vector (VPV). The DPE framework allows distilling the video representation and hierarchically extracting new information. Compared with recent video encoding approaches, VPV can preserve higher-level information through standard encoding of low-level features in two phases. Furthermore, the encoded vectors from both phases of DPE are fused, along with a compression stage, to develop the PFR.
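The abstract does not spell out the IRP formulation, but standard rank pooling, which IRP builds on, represents a temporal sub-volume by the parameters of a linear function trained to preserve the temporal order of its frames. Below is a minimal sketch of that idea and of DPE's first phase, using ridge regression as the ranking surrogate; the function names, the smoothing step, and the window size are our illustrative choices, not the dissertation's.

```python
import numpy as np
from sklearn.linear_model import Ridge

def rank_pool(frame_features):
    """Rank pooling sketch: encode a sub-volume as the weight vector of a
    linear model trained to predict each frame's temporal index.
    frame_features: (T, D) array, one row per frame, in temporal order."""
    T = frame_features.shape[0]
    # Time-varying mean smooths frame-level noise before learning the ranking.
    smoothed = np.cumsum(frame_features, axis=0) / np.arange(1, T + 1)[:, None]
    targets = np.arange(1, T + 1, dtype=float)  # the temporal order to preserve
    model = Ridge(alpha=1.0, fit_intercept=False).fit(smoothed, targets)
    return model.coef_  # (D,) descriptor for the whole sub-volume

def dpe_phase_one(video_features, window=16):
    """Phase 1 of a DPE-like pipeline: rank-pool consecutive sub-volumes.
    Phase 2 (fusing/encoding the pooled vectors into a VPV) is omitted.
    Assumes len(video_features) >= window."""
    starts = range(0, len(video_features) - window + 1, window)
    return np.stack([rank_pool(video_features[s:s + window]) for s in starts])
```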
Deep Structured Models For Group Activity Recognition
This paper presents a deep neural-network-based hierarchical graphical model
for individual and group activity recognition in surveillance scenes. Deep
networks are used to recognize the actions of individual people in a scene.
Next, a neural-network-based hierarchical graphical model refines the predicted
labels for each class by considering dependencies between the classes. This
refinement step mimics a message-passing step similar to inference in a
probabilistic graphical model. We show that this approach can be effective in
group activity recognition, with the deep graphical model improving recognition
rates over baseline methods.
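The abstract gives only the high-level structure of the refinement step; as an illustration of what a learned, message-passing-style refinement can look like, the sketch below updates each person's action logits with a message pooled from the other people in the scene. The module, its layer sizes, and the mean-pooling choice are our own illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class RefinementStep(nn.Module):
    """One neural message-passing round: refine per-person action logits
    using a message pooled from all other people in the scene."""
    def __init__(self, num_classes, hidden=64):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(num_classes, hidden), nn.ReLU())
        self.update = nn.Linear(num_classes + hidden, num_classes)

    def forward(self, logits):           # logits: (num_people, num_classes)
        messages = self.msg(logits)      # one message per person
        total = messages.sum(dim=0, keepdim=True)
        n = logits.shape[0]
        incoming = (total - messages) / max(n - 1, 1)  # mean over the others
        return self.update(torch.cat([logits, incoming], dim=-1))
```

Stacking a few such steps mimics several rounds of inference in a probabilistic graphical model, which is the analogy the paper draws.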
Crowded Scene Analysis: A Survey
Automated scene analysis has been a topic of great interest in computer
vision and cognitive science. Recently, with the growth of crowd phenomena in
the real world, crowded scene analysis has attracted much attention. However,
the visual occlusions and ambiguities in crowded scenes, as well as the complex
behaviors and scene semantics, make the analysis a challenging task. In the
past few years, an increasing number of works on crowded scene analysis have
been reported, covering different aspects including crowd motion pattern
learning, crowd behavior and activity analysis, and anomaly detection in
crowds. This paper surveys the state-of-the-art techniques on this topic. We
first provide the background knowledge and the available features related to
crowded scenes. Then, existing models, popular algorithms, evaluation
protocols, as well as system performance are provided corresponding to
different aspects of crowded scene analysis. We also outline the available
datasets for performance evaluation. Finally, some research problems and
promising future directions are presented with discussions.
Comment: 20 pages, in IEEE Transactions on Circuits and Systems for Video Technology, 201
Joint Video and Text Parsing for Understanding Events and Answering Queries
We propose a framework for parsing video and text jointly for understanding
events and answering user queries. Our framework produces a parse graph that
represents the compositional structures of spatial information (objects and
scenes), temporal information (actions and events) and causal information
(causalities between events and fluents) in the video and text. The knowledge
representation of our framework is based on a spatial-temporal-causal And-Or
graph (S/T/C-AOG), which jointly models possible hierarchical compositions of
objects, scenes and events as well as their interactions and mutual contexts,
and specifies the prior probabilistic distribution of the parse graphs. We
present a probabilistic generative model for joint parsing that captures the
relations between the input video/text, their corresponding parse graphs and
the joint parse graph. Based on the probabilistic model, we propose a joint
parsing system consisting of three modules: video parsing, text parsing and
joint inference. Video parsing and text parsing produce two parse graphs from
the input video and text respectively. The joint inference module produces a
joint parse graph by performing matching, deduction and revision on the video
and text parse graphs. The proposed framework has the following objectives:
Firstly, we aim at deep semantic parsing of video and text that goes beyond the
traditional bag-of-words approaches; Secondly, we perform parsing and reasoning
across the spatial, temporal and causal dimensions based on the joint S/T/C-AOG
representation; Thirdly, we show that deep joint parsing facilitates subsequent
applications such as generating narrative text descriptions and answering
queries in the forms of who, what, when, where and why. We empirically
evaluated our system based on comparison against ground-truth as well as
accuracy of query answering and obtained satisfactory results.
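The abstract names the generative model but not its form. One factorization consistent with the description, in which the joint parse graph generates the two modality-specific parse graphs and these in turn generate the observed video and text, is sketched below; treat it as a reading aid under our assumptions rather than the paper's exact equation.

```latex
% Assumed factorization: pg_jnt generates pg_vid and pg_txt,
% which generate the observed video v and text t.
p(v, t, pg_{vid}, pg_{txt}, pg_{jnt})
  = p(pg_{jnt}) \,
    p(pg_{vid} \mid pg_{jnt}) \, p(pg_{txt} \mid pg_{jnt}) \,
    p(v \mid pg_{vid}) \, p(t \mid pg_{txt})
```

Joint inference then amounts to searching for the parse graphs that maximize this probability given the observed video and text, which is what the matching, deduction, and revision operations approximate.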
A Hierarchical Deep Temporal Model for Group Activity Recognition
In group activity recognition, the temporal dynamics of the whole activity can be inferred based on the dynamics of the individual people representing the activity. We build a deep model to capture these dynamics based on LSTM (long short-term memory) models. To make use of these observations, we present a 2-stage deep temporal model for the group activity recognition problem. In our model, one LSTM model is designed to represent action dynamics of individual people in a sequence and another LSTM model is designed to aggregate human-level information for whole activity understanding. We evaluate our model over two datasets: the collective activity dataset and a new volleyball dataset. Experimental results demonstrate that our proposed model improves group activity recognition performance compared to baseline methods.
Comment: Accepted to CVPR 201
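As a rough sketch of a 2-stage temporal architecture of this kind (layer sizes, the max-pooling step, and the single-label output are our illustrative choices, not the paper's exact design):

```python
import torch
import torch.nn as nn

class TwoStageGroupActivityModel(nn.Module):
    """Stage 1: a person-level LSTM encodes each individual's feature track.
    Stage 2: person states are pooled per frame and a group-level LSTM
    aggregates them over time for whole-activity classification."""
    def __init__(self, feat_dim, hidden=128, num_activities=8):
        super().__init__()
        self.person_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.group_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_activities)

    def forward(self, tracks):                 # tracks: (people, time, feat_dim)
        person_states, _ = self.person_lstm(tracks)   # (people, time, hidden)
        pooled = person_states.max(dim=0).values      # pool over people per frame
        _, (h, _) = self.group_lstm(pooled.unsqueeze(0))
        return self.classifier(h[-1])          # (1, num_activities) activity logits
```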
Collective behavior recognition using compact descriptors
This paper presents a novel hierarchical approach for collective behavior
recognition based solely on ground-plane trajectories. In the first layer of
our classifier, we introduce a novel feature called Personal Interaction
Descriptor (PID), which combines the spatial distribution of a pair of
pedestrians within a temporal window with a pyramidal representation of the
relative speed to detect pairwise interactions. These interactions are then
combined with higher level features related to the mean speed and shape formed
by the pedestrians in the scene, generating a Collective Behavior Descriptor
(CBD) that is used to identify collective behaviors in a second stage. In both
layers, Random Forests were used as classifiers, since they allow features of
different natures to be combined seamlessly. Our experimental results indicate
that the proposed method achieves results on par with state of the art
techniques with a better balance of class errors. Moreover, we show that our
method can generalize well across different camera setups through cross-dataset
experiments.
Comment: 8 pages, 6 figures. Paper submitted to Pattern Recognition Letters
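The PID and CBD feature computations are paper-specific, but the two-layer classifier structure the abstract describes can be sketched as follows; the feature shapes, the interaction histogram, and the function names are our assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Layer 1 classifies pairwise interactions from PID features; layer 2
# classifies the collective behavior from a CBD built on top of them.
# Both forests are assumed to have been fit on labeled training data,
# with interaction labels encoded as integers 0..K-1.
pair_clf = RandomForestClassifier(n_estimators=100)
scene_clf = RandomForestClassifier(n_estimators=100)

def classify_scene(pid_features, mean_speed, shape_features):
    """pid_features: (num_pairs, d) PID descriptors, one per pedestrian pair."""
    interactions = pair_clf.predict(pid_features)
    # CBD sketch: histogram of detected pairwise interactions plus
    # scene-level cues (mean speed, shape of the pedestrian group).
    hist = np.bincount(interactions, minlength=pair_clf.n_classes_)
    cbd = np.concatenate([hist, [mean_speed], shape_features])
    return scene_clf.predict(cbd.reshape(1, -1))[0]
```

Random Forests fit this layering well because the CBD mixes features of different natures (counts, speeds, shape statistics) without requiring a common scale.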
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
Energy-based Models for Video Anomaly Detection
Automated detection of abnormalities in data has been an active research area in recent years because of its diverse applications in practice, including video surveillance, industrial damage detection, and network intrusion detection. However, building an effective anomaly detection system is a non-trivial task, since it requires tackling several challenging issues: the shortage of annotated data, the inability to define anomalous objects explicitly, and the expensive cost of the feature engineering procedure. Unlike existing approaches, which only partially solve these problems, we develop a unique framework to cope with all of them simultaneously. Instead of handling the ambiguous definition of anomalous objects, we propose to work with regular patterns, whose unlabeled data are abundant and usually easy to collect in practice. This allows our system to be trained in a completely unsupervised procedure and liberates us from the need for costly data annotation. By learning a generative model that captures the normality distribution in the data, we can isolate abnormal data points that result in low normality scores (high abnormality scores). Moreover, by leveraging the power of generative networks, i.e., energy-based models, we are also able to learn the feature representation automatically, rather than relying on hand-crafted features that have dominated anomaly detection research for many decades. We demonstrate our proposal on the specific application of video anomaly detection, and the experimental results indicate that our method performs better than baselines and is comparable with state-of-the-art methods on many benchmark video anomaly detection datasets.
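The paper's energy-based models are richer than what scikit-learn ships, but the scoring logic the abstract describes can be illustrated with a restricted Boltzmann machine as a stand-in: fit a generative model on regular (normal-only) data, then flag inputs whose normality score is low.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# Stand-in energy-based model; the paper's architecture is not reproduced here.
rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20)

def fit_on_normal(normal_patches):
    """normal_patches: (N, D) features in [0, 1] from regular video only."""
    rbm.fit(normal_patches)

def abnormality_scores(patches):
    """Higher score = more abnormal; threshold these to flag anomalies."""
    normality = rbm.score_samples(patches)  # pseudo log-likelihood per patch
    return -normality
```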
Vision and Challenges for Knowledge Centric Networking (KCN)
In the creation of a smart future information society, Internet of Things
(IoT) and Content Centric Networking (CCN) break two key barriers for both the
front-end sensing and back-end networking. However, a key piece of research is still missing from current network traffic control and system management: the knowledge that penetrates both sensing and networking to glue them together holistically. In this paper, we envision leveraging emerging machine learning and deep learning techniques to create aspects of knowledge that facilitate these designs. In particular, we can extract knowledge from collected data to facilitate reduced data volume, enhanced system intelligence and interactivity, improved service quality, and communication with better controllability and lower cost. We name this knowledge-oriented traffic control and network management paradigm Knowledge Centric Networking (KCN). This paper presents the KCN rationale, its benefits, related work, and research opportunities.
RGB-D-based Action Recognition Datasets: A Survey
Human action recognition from RGB-D (Red, Green, Blue and Depth) data has
attracted increasing attention since the first work reported in 2010. Over this
period, many benchmark datasets have been created to facilitate the development
and evaluation of new algorithms. This raises the question of which dataset to
select and how to use it in providing a fair and objective comparative
evaluation against state-of-the-art methods. To address this issue, this paper
provides a comprehensive review of the most commonly used action recognition
related RGB-D video datasets, including 27 single-view datasets, 10 multi-view
datasets, and 7 multi-person datasets. The detailed information and analysis of
these datasets is a useful resource in guiding insightful selection of datasets
for future research. In addition, the issues with current algorithm evaluation
vis-à-vis the limitations of the available datasets and evaluation protocols are also highlighted, resulting in a number of recommendations for the collection of new datasets and the use of evaluation protocols.