8,322 research outputs found
Advances in Human Action Recognition: A Survey
Human action recognition has been an important topic in computer vision due
to its many applications such as video surveillance, human-machine interaction
and video retrieval. One core problem behind these applications is
automatically recognizing low-level actions and high-level activities of
interest. The former is usually the basis for the latter. This survey gives an
overview of the most recent advances in human action recognition during the
past several years, following a well-formed taxonomy proposed by a previous
survey. From this state-of-the-art survey, researchers can view a panorama of
progress in this area to guide future research.
ADS-ME: Anomaly Detection System for Micro-expression Spotting
Micro-expressions (MEs) are infrequent and uncontrollable facial events that
can reveal emotional deception and tend to appear in high-stakes environments.
This paper proposes an algorithm for spatiotemporal ME spotting. Since MEs are
unusual events, we treat them as abnormal patterns that diverge from expected
Normal Facial Behaviour (NFBs) patterns. NFBs correspond to facial muscle
activations, eye blink/gaze events and mouth opening/closing movements that are
all facial deformations but not MEs. We propose a probabilistic model to
estimate the probability density function that models the spatiotemporal
distributions of NFB patterns. To rank the outputs, we compute the negative
log-likelihood and develop an adaptive thresholding technique to distinguish
MEs from NFBs. Since we work only with NFB data, the main challenge is to
capture intrinsic spatiotemporal features, so we design a recurrent
convolutional autoencoder for feature representation. Finally, we show that our
system is superior to previous work on ME spotting.
Comment: 35 pages, 9 figures, 3 tables
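As a rough illustration of the spotting step described in this abstract, the
sketch below fits a density model to NFB features and flags frames whose
negative log-likelihood exceeds an adaptive threshold. The Gaussian mixture,
the feature extractor, and the mean-plus-k-sigma threshold rule are simplified
stand-ins, not the authors' exact components.

# Minimal sketch of anomaly-based micro-expression spotting, assuming
# per-frame feature vectors have already been extracted (the paper uses a
# recurrent convolutional autoencoder; any encoder works for this sketch).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_nfb_density(nfb_features, n_components=8):
    # Model the spatiotemporal distribution of NFB feature vectors.
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(nfb_features)
    return gmm

def spot_me_candidates(gmm, test_features, k=2.0):
    # High negative log-likelihood = patterns that diverge from NFBs.
    nll = -gmm.score_samples(test_features)
    # Adaptive threshold: score statistics stand in for the paper's
    # thresholding technique, whose exact form is not given in the abstract.
    threshold = nll.mean() + k * nll.std()
    return np.where(nll > threshold)[0], nll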
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT).
Moments in Time Dataset: one million videos for event understanding
We present the Moments in Time Dataset, a large-scale human-annotated
collection of one million short videos corresponding to dynamic events
unfolding within three seconds. Modeling the spatial-audio-temporal dynamics
even for actions occurring in 3-second videos poses many challenges: meaningful
events do not include only people, but also objects, animals, and natural
phenomena; visual and auditory events can be symmetrical in time ("opening" is
"closing" in reverse), and either transient or sustained. We describe the
annotation process of our dataset (each video is tagged with one action or
activity label among 339 different classes), analyze its scale and diversity in
comparison to other large-scale video datasets for action recognition, and
report results of several baseline models addressing separately, and jointly,
three modalities: spatial, temporal and auditory. The Moments in Time dataset,
designed to have a large coverage and diversity of events in both visual and
auditory modalities, can serve as a new challenge to develop models that scale
to the level of complexity and abstract reasoning that a human processes on a
daily basis.
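For orientation, a hypothetical record layout for one sample is sketched below;
the field names are illustrative and do not reflect the dataset's actual
distribution format.

# Hypothetical per-clip record for Moments in Time; names are illustrative.
from dataclasses import dataclass

@dataclass
class MomentsClip:
    video_path: str  # a short clip roughly three seconds long
    label: str       # exactly one of the 339 action/activity classes
    has_audio: bool  # auditory as well as visual events are annotated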
Learning Representative Temporal Features for Action Recognition
In this paper, a novel video classification methodology is presented that
aims to recognize different categories of third-person videos efficiently. The
idea is to decompose the 3-dimensional video input into two spatial dimensions
plus one temporal dimension. First, optical flow images are described by
well-established pre-trained networks that process the 2D spatial frames.
Motion in the video is then tracked by aligning the optical flow elements over
time, which yields a multi-channel time series. The main focus of the proposed
method is to classify the resulting time series efficiently. To this end, the
machine learns temporal features along the time dimension by training a
multi-channel one-dimensional Convolutional Neural Network (1D-CNN).
Because CNNs represent input data hierarchically, high-level features are
obtained by further processing the features of lower-level layers. As a result,
long-term temporal features are extracted from short-term ones, and it is these
long-term features that play the key role in recognizing the ongoing action.
Moreover, an advantage of the proposed method over most deep-learning-based
approaches is that representative temporal features are learned along only one
dimension, which reduces the number of learnable parameters significantly;
hence, our method is trainable even on small datasets. The proposed method
reaches state-of-the-art results on two public datasets, UCF11 and jHMDB, and
competitive results on HMDB51.
Comment: 14 pages
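A minimal PyTorch sketch of the core idea follows: optical-flow descriptors
aligned over time form a (channels, time) series, and temporal features are
learned along the single time dimension with stacked 1D convolutions. The
layer sizes and kernel widths are illustrative assumptions, not the paper's
architecture.

import torch
import torch.nn as nn

class TemporalCNN(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),  # short-term
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),  # longer-term
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):  # x: (batch, channels, time)
        h = self.features(x).squeeze(-1)
        return self.classifier(h)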
Human Action Recognition and Prediction: A Survey
Derived from rapid advances in computer vision and machine learning, video
analysis tasks have been moving from inferring the present state to predicting
the future state. Vision-based action recognition and prediction from videos
are such tasks, where action recognition is to infer human actions (present
state) based upon complete action executions, and action prediction to predict
human actions (future state) based upon incomplete action executions. These two
tasks have become particularly prevalent topics recently because of their
rapidly emerging real-world applications, such as visual surveillance,
autonomous driving, entertainment, and video retrieval. Many attempts have been
made over the last few decades to build a robust and effective framework for
action recognition and prediction. In this paper, we survey the
state-of-the-art techniques in action recognition and prediction. Existing
models, popular algorithms, technical difficulties, popular action databases,
evaluation protocols, and promising future directions are discussed
systematically.
Video In Sentences Out
We present a system that produces sentential descriptions of video: who did
what to whom, and where and how they did it. Action class is rendered as a
verb, participant objects as noun phrases, properties of those objects as
adjectival modifiers in those noun phrases, spatial relations between those
participants as prepositional phrases, and characteristics of the event as
prepositional-phrase adjuncts and adverbial modifiers. Extracting the
information needed to render these linguistic entities requires an approach to
event recognition that recovers object tracks, track-to-role assignments, and
changing body posture.
Comment: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI 2012).
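The rendering step can be pictured with the toy function below, which assumes
the event-recognition stage has already produced a verb, role fillers, and
modifiers; the signature and example fillers are invented for illustration and
are not the system's output format.

def render_sentence(verb, agent, patient=None, adjectives=(), pp_adjuncts=()):
    # Verb from the action class, noun phrases from participant objects,
    # adjectival modifiers from object properties, and adjuncts at the end.
    np_agent = "the " + " ".join(list(adjectives) + [agent])
    parts = [np_agent, verb]
    if patient:
        parts.append("the " + patient)
    parts.extend(pp_adjuncts)  # e.g. prepositional-phrase adjuncts
    return " ".join(parts).capitalize() + "."

print(render_sentence("carried", "person", "box",
                      adjectives=["tall"], pp_adjuncts=["across the room"]))
# -> The tall person carried the box across the room.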
Uncertainty-aware audiovisual activity recognition using deep Bayesian variational inference
Deep neural networks (DNNs) provide state-of-the-art results for a multitude
of applications, but the approaches using DNNs for multimodal audiovisual
applications do not consider predictive uncertainty associated with individual
modalities. Bayesian deep learning methods provide principled confidence and
quantify predictive uncertainty. Our contribution in this work is to propose an
uncertainty-aware multimodal Bayesian fusion framework for activity
recognition. We demonstrate a novel approach that combines deterministic and
variational layers to scale Bayesian DNNs to deeper architectures. Our
experiments using in- and out-of-distribution samples selected from a subset of the
Moments-in-Time (MiT) dataset show a more reliable confidence measure as
compared to the non-Bayesian baseline and the Monte Carlo dropout (MC dropout)
approximate Bayesian inference. We also demonstrate that the uncertainty
estimates obtained from the proposed framework can identify out-of-distribution
data on the UCF101 and MiT datasets. In the multimodal setting, the proposed
framework improves precision-recall AUC by 10.2% on the MiT subset compared
with the non-Bayesian baseline.
Comment: Accepted at ICCV 2019 for oral presentation.
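One common way to read predictive uncertainty off such a stochastic network,
whether the stochasticity comes from variational layers or MC dropout, is
sketched below: average the softmax outputs over several stochastic forward
passes and use the predictive entropy as the confidence signal. This is a
generic recipe, not the paper's exact fusion framework; model stands for any
network whose forward pass remains stochastic.

import torch

def predictive_entropy(model, x, num_samples=20):
    model.train()  # keep dropout/variational sampling active at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(num_samples)]).mean(0)
    # Higher entropy suggests ambiguous or out-of-distribution inputs.
    return -(probs * probs.clamp_min(1e-12).log()).sum(-1)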
Thoughts on a Recursive Classifier Graph: a Multiclass Network for Deep Object Recognition
We propose a general multi-class visual recognition model, termed the
Classifier Graph, which aims to generalize and integrate ideas from many of
today's successful hierarchical recognition approaches. Our graph-based model
has the advantage of enabling rich interactions between classes from different
levels of interpretation and abstraction. The proposed multi-class system is
efficiently learned using step-by-step updates. The structure consists of
simple logistic linear layers with inputs from features that are automatically
selected from a large pool. Each newly learned classifier becomes a potential
new feature. Thus, our feature pool can consist of both the initial manually
designed features and learned classifiers from previous steps (graph
nodes), each copied many times at different scales and locations. In this
manner we can learn and grow both a deep, complex graph of classifiers and a
rich pool of features at different levels of abstraction and interpretation.
Our proposed graph of classifiers becomes a multi-class system with a recursive
structure, suitable for deep detection and recognition of several classes
simultaneously.
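The growth rule can be illustrated with the toy loop below: each newly trained
logistic classifier's output is appended to the feature pool available to
later classifiers. Feature selection and the multi-scale copying of nodes are
omitted, so this is a sketch of the recursion only, not the full model.

import numpy as np
from sklearn.linear_model import LogisticRegression

def grow_classifier_graph(X, y_list):
    # X: (n_samples, n_features) initial hand-designed features;
    # y_list: one binary label vector per class, learned step by step.
    pool, nodes = X, []
    for y in y_list:
        clf = LogisticRegression(max_iter=1000).fit(pool, y)
        nodes.append(clf)
        # The new node's predicted probability becomes a new feature column.
        pool = np.hstack([pool, clf.predict_proba(pool)[:, 1:2]])
    return nodes, pool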
Generalized Zero-Shot Learning for Action Recognition with Web-Scale Video Data
Action recognition in surveillance video makes our lives safer by detecting
criminal events or predicting violent emergencies. However, efficient action
recognition is not free of difficulty. First, there are so many action classes
in daily life that we cannot pre-define all of them beforehand. Moreover, it is
very hard to collect real-world videos of certain actions, such as stealing and
street fighting, due to legal restrictions and privacy protection. These
challenges leave existing data-driven recognition methods unable to attain the
desired performance. Zero-shot learning has the potential to address these
issues, since it can perform classification without positive examples.
Nevertheless, current zero-shot
learning algorithms have been studied under the unreasonable setting where seen
classes are absent during the testing phase. Motivated by this, we study the
task of action recognition in surveillance video under a more realistic
generalized zero-shot setting, where the testing data contains both seen and
unseen classes. To the best of our knowledge, this is the first work to study
video action recognition under the generalized zero-shot setting. We first
perform extensive empirical studies of several existing zero-shot learning
approaches under this new setting on web-scale video data. Our experimental
results demonstrate that, under the generalized setting, typical zero-shot
learning methods are no longer effective on the dataset we use. We then propose a
method for action recognition by deploying generalized zero-shot learning,
which transfers the knowledge of web video to detect the anomalous actions in
surveillance videos. To verify the effectiveness of our proposed method, we
further construct a new surveillance video dataset consisting of nine action
classes related to public safety.
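The generalized zero-shot evaluation setting can be summarized by the sketch
below, where test scores are computed against the union of seen and unseen
classes through a shared semantic space. The embedding matrices and the
cosine-style compatibility function are generic stand-ins, not the authors'
model.

import numpy as np

def gzsl_predict(video_embeds, class_protos, class_names):
    # class_protos holds semantic prototypes for BOTH seen and unseen classes,
    # so seen classes compete with unseen ones at test time.
    sims = video_embeds @ class_protos.T
    sims /= (np.linalg.norm(video_embeds, axis=1, keepdims=True)
             * np.linalg.norm(class_protos, axis=1))
    return [class_names[i] for i in sims.argmax(axis=1)]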