Active Learning for Online Recognition of Human Activities from Streaming Videos
Recognising human activities from streaming videos poses unique challenges to
learning algorithms: predictive models need to be scalable, incrementally
trainable, and must remain bounded in size even when the data stream is
arbitrarily long. Furthermore, as parameter tuning is problematic in a
streaming setting, suitable approaches should be parameterless, and make no
assumptions on what class labels may occur in the stream. We present here an
approach to the recognition of human actions from streaming data which meets
all these requirements by: (1) incrementally learning a model which adaptively
covers the feature space with simple local classifiers; (2) employing an active
learning strategy to reduce annotation requests; (3) achieving promising
accuracy within a fixed model size. Extensive experiments on standard
benchmarks show that our approach is competitive with state-of-the-art
non-incremental methods, and outperforms the existing active incremental
baselines.
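The idea of covering the feature space with simple local classifiers under a fixed model budget can be illustrated with a toy sketch. Everything here (the nearest-prototype rule, the radius and confidence thresholds, the class names) is an illustrative assumption, not the paper's algorithm: a label is requested only when the incoming sample falls in an uncovered region or its local prediction is ambiguous.

```python
import numpy as np

class LocalActiveLearner:
    """Toy bounded-size active learner: a fixed budget of local
    prototypes covers the feature space; annotation is requested only
    for uncovered or low-confidence samples (illustrative sketch)."""

    def __init__(self, max_prototypes=100, radius=1.0, conf_threshold=0.6):
        self.max_prototypes = max_prototypes
        self.radius = radius
        self.conf_threshold = conf_threshold
        self.centers = []   # prototype feature vectors
        self.labels = []    # majority label per prototype
        self.counts = []    # per-prototype label histograms

    def _nearest(self, x):
        if not self.centers:
            return None, np.inf
        d = [np.linalg.norm(x - c) for c in self.centers]
        i = int(np.argmin(d))
        return i, d[i]

    def predict_or_query(self, x):
        """Return (prediction, needs_label)."""
        i, dist = self._nearest(x)
        if i is None or dist > self.radius:
            return None, True  # uncovered region: ask the annotator
        hist = self.counts[i]
        conf = max(hist.values()) / sum(hist.values())
        return self.labels[i], conf < self.conf_threshold

    def update(self, x, y):
        """Incorporate a labeled sample while keeping the model bounded."""
        i, dist = self._nearest(x)
        if i is not None and dist <= self.radius:
            self.counts[i][y] = self.counts[i].get(y, 0) + 1
            self.labels[i] = max(self.counts[i], key=self.counts[i].get)
        elif len(self.centers) < self.max_prototypes:
            self.centers.append(np.asarray(x, dtype=float))
            self.counts.append({y: 1})
            self.labels.append(y)
```

Because prototypes are only added up to `max_prototypes`, the model stays bounded in size no matter how long the stream runs.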
Leaving Some Stones Unturned: Dynamic Feature Prioritization for Activity Detection in Streaming Video
Current approaches for activity recognition often ignore constraints on
computational resources: 1) they rely on extensive feature computation to
obtain rich descriptors on all frames, and 2) they assume batch-mode access to
the entire test video at once. We propose a new active approach to activity
recognition that prioritizes "what to compute when" in order to make timely
predictions. The main idea is to learn a policy that dynamically schedules the
sequence of features to compute on selected frames of a given test video. In
contrast to traditional static feature selection, our approach continually
re-prioritizes computation based on the accumulated history of observations and
accounts for the transience of those observations in ongoing video. We develop
variants to handle both the batch and streaming settings. On two challenging
datasets, our method provides significantly better accuracy than alternative
techniques for a wide range of computational budgets.
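A crude, non-learned stand-in for "what to compute when" is a greedy cost-aware scheduler. The sketch below is purely illustrative (the candidate names, utilities, and costs are invented, and the paper's policy re-estimates utilities dynamically from accumulated observations rather than using fixed scores): features are chosen by estimated utility per unit cost until the budget is exhausted.

```python
import heapq

def schedule_features(candidates, budget):
    """Greedy sketch of cost-aware feature scheduling.
    candidates: list of (name, utility, cost) triples.
    Picks features by utility-per-cost until the budget runs out."""
    heap = [(-u / c, c, name) for name, u, c in candidates]
    heapq.heapify(heap)  # best utility/cost ratio first
    spent, chosen = 0.0, []
    while heap:
        _, c, name = heapq.heappop(heap)
        if spent + c <= budget:
            spent += c
            chosen.append(name)
    return chosen
```

In the dynamic setting of the paper, the utilities would be recomputed after every observation, so the priority order changes as the video unfolds.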
Context-Aware Query Selection for Active Learning in Event Recognition
Activity recognition is a challenging problem with many practical
applications. In addition to the visual features, recent approaches have
benefited from the use of context, e.g., inter-relationships among the
activities and objects. However, these approaches require data to be labeled
and entirely available beforehand, and they are not designed to be updated
continuously, which makes them unsuitable for surveillance applications. In
contrast, we
propose a continuous-learning framework for context-aware activity recognition
from unlabeled video, which has two distinct advantages over existing methods.
First, it employs a novel active-learning technique that not only exploits the
informativeness of the individual activities but also utilizes their contextual
information during query selection; this leads to significant reduction in
expensive manual annotation effort. Second, the learned models can be adapted
online as more data is available. We formulate a conditional random field model
that encodes the context and devise an information-theoretic approach that
utilizes entropy and mutual information of the nodes to compute the set of most
informative queries, which are labeled by a human. These labels are combined
with graphical inference techniques for incremental updates. We provide a
theoretical formulation of the active learning framework with an analytic
solution. Experiments on six challenging datasets demonstrate that our
framework achieves superior performance with significantly less manual
labeling.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI).
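The information-theoretic quantities the query-selection step relies on, entropy and mutual information, are standard and easy to state concretely. The sketch below (illustrative helper names; it works on plain distribution tables rather than the paper's conditional random field) ranks unlabeled nodes by marginal entropy, most uncertain first.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (in nats)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mutual_information(joint):
    """I(X;Y) from a joint distribution table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

def rank_queries(marginals):
    """Rank unlabeled nodes by marginal entropy, most uncertain first.
    marginals: dict mapping node id -> class distribution."""
    return sorted(marginals, key=lambda n: entropy(marginals[n]), reverse=True)
```

In the paper's framework, contextual links between activities additionally enter the score via mutual information between nodes, so that labeling one node can resolve uncertainty in its neighbors.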
Energy-based Models for Video Anomaly Detection
Automated detection of abnormalities in data has been an active research area
in recent years because of its diverse practical applications, including video
surveillance, industrial damage detection, and network intrusion detection.
However, building an effective anomaly detection system is a non-trivial task,
since it must tackle the shortage of annotated data, the inability to define
anomalous objects explicitly, and the expense of feature engineering. Unlike
existing approaches, which only partially address these problems, we develop a
unified framework that copes with all of them simultaneously. Rather than
wrestling with an ambiguous definition of anomalous objects, we propose to
work with regular patterns, whose
unlabeled data is abundant and usually easy to collect in practice. This allows
our system to be trained in a completely unsupervised manner, freeing us from
the need for costly data annotation. By learning a generative model that
captures the normality distribution of the data, we can isolate abnormal data
points as those with low normality scores (high abnormality scores). Moreover,
by leveraging the power of generative networks, i.e., energy-based models, we
are also able to learn feature representations automatically rather than
relying on the hand-crafted features that have dominated anomaly detection
research over many decades. We demonstrate our proposal on the specific
application of video anomaly detection and the experimental results indicate
that our method outperforms baselines and is comparable with state-of-the-art
methods on many benchmark video anomaly detection datasets.
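The core scoring idea, learn the distribution of normal patterns and flag points the model reconstructs poorly, can be sketched without the full energy-based machinery. Below, PCA stands in for the learned generative model (an assumption for illustration only), and the abnormality score is simply the reconstruction error.

```python
import numpy as np

def fit_normal_subspace(X, k=2):
    """Fit a linear model of 'normal' data: mean + top-k principal
    directions (a PCA stand-in for a learned generative model)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def abnormality_score(x, mu, basis):
    """Reconstruction error of x under the normal model.
    High score = low normality = likely anomaly."""
    z = (x - mu) @ basis.T          # project onto the normal subspace
    recon = mu + z @ basis          # reconstruct from the projection
    return float(np.linalg.norm(x - recon))
```

Points that lie in the region the model has learned reconstruct well and score near zero; points off that region cannot be reconstructed and score high.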
pROST : A Smoothed Lp-norm Robust Online Subspace Tracking Method for Realtime Background Subtraction in Video
An increasing number of methods for background subtraction use Robust PCA to
identify sparse foreground objects. While many algorithms use the L1-norm as a
convex relaxation of the ideal sparsifying function, we approach the problem
with a smoothed Lp-norm and present pROST, a method for robust online subspace
tracking. The algorithm is based on alternating minimization on manifolds.
Implemented on a graphics processing unit it achieves realtime performance.
Experimental results on a state-of-the-art benchmark for background subtraction
on real-world video data indicate that the method succeeds at a broad variety
of background subtraction scenarios, and it outperforms competing approaches
when video quality is deteriorated by camera jitter.
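The smoothed Lp-norm in question is typically of the form sum((x_i^2 + mu)^(p/2)), which is differentiable everywhere, unlike |x|, while penalizing sparsity more aggressively than the L1 norm for p < 1. A minimal sketch (the smoothing parameter name `mu` is conventional, not taken from the paper):

```python
import numpy as np

def smoothed_lp(x, p=0.5, mu=1e-6):
    """Smoothed Lp sparsifying surrogate: sum((x_i^2 + mu)^(p/2)).
    For p < 1 it is closer to the L0 'norm' than L1 is, so it
    prefers genuinely sparse vectors; mu keeps it differentiable."""
    x = np.asarray(x, dtype=float)
    return float(np.sum((x**2 + mu) ** (p / 2)))
```

The test below shows the key property: for two vectors with the same L1 mass, the smoothed Lp cost is lower for the sparse one, which is exactly what a sparse-foreground model wants.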
Continuous Adaptation of Multi-Camera Person Identification Models through Sparse Non-redundant Representative Selection
The problem of image-based person identification/recognition is to provide an
identity to the image of an individual based on learned models that describe
his/her appearance. Most traditional person identification systems rely on
learning a static model on tediously labeled training data. Though manual
labeling is an indispensable part of a supervised framework, for a large-scale
identification system labeling a huge amount of data is a significant overhead.
For large multi-sensor data as typically encountered in camera networks,
labeling a lot of samples does not always mean more information, as redundant
images are labeled several times. In this work, we propose a convex
optimization based iterative framework that progressively and judiciously
chooses a sparse but informative set of samples for labeling, with minimal
overlap with previously labeled images. We also use a structure preserving
sparse reconstruction based classifier to reduce the training burden typically
seen in discriminative classifiers. The two stage approach leads to a novel
framework for online update of the classifiers involving only the incorporation
of new labeled data rather than any expensive training phase. We demonstrate
the effectiveness of our approach on multi-camera person re-identification
datasets, showing the feasibility of learning online classification models in
multi-camera big-data applications. Using three benchmark datasets, we
validate our approach and show that our framework achieves superior
performance with significantly less manual labeling.
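The goal of picking a sparse, non-redundant set of samples to label can be approximated by a much simpler greedy farthest-point heuristic, sketched below. This is a stand-in for the paper's convex-optimization formulation, not the method itself; the function name and parameters are illustrative.

```python
import numpy as np

def select_representatives(X, k, labeled=None):
    """Greedy farthest-point selection: pick up to k samples that are
    far from each other and from already-labeled samples, so that
    redundant (near-duplicate) images are not labeled twice."""
    X = np.asarray(X, dtype=float)
    pool = list(range(len(X)))
    anchors = [np.asarray(v, dtype=float) for v in (labeled or [])]
    chosen = []
    if not anchors:  # no labeled history: seed with the first sample
        anchors, chosen = [X[0]], [0]
        pool.remove(0)
    while pool and len(chosen) < k:
        # candidate's distance to its nearest already-chosen anchor
        dists = [min(np.linalg.norm(X[i] - a) for a in anchors) for i in pool]
        best = pool[int(np.argmax(dists))]
        chosen.append(best)
        anchors.append(X[best])
        pool.remove(best)
    return chosen
```

Near-duplicates of an already-selected sample have a small nearest-anchor distance and are therefore skipped, which is the "minimal overlap with previously labeled images" property in miniature.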
NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning
Video learning is an important task in computer vision and has experienced
increasing interest in recent years. Since even a small number of videos
easily comprises several million frames, methods that do not rely on
frame-level annotation are of special importance. In this work, we propose a
novel learning algorithm with a Viterbi-based loss that allows for online and
incremental learning of weakly annotated video data. We moreover show that
explicit context and length modeling leads to huge improvements in video
segmentation and labeling tasks, and include these models in our framework. On
several action segmentation benchmarks, we obtain an improvement of up to 10%
compared to current state-of-the-art methods.
Comment: CVPR 2018
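The Viterbi decoding such a loss builds on can be shown in miniature: given frame-wise class log-probabilities and an ordered transcript of actions (the weak annotation), dynamic programming finds the best monotone segmentation. This sketch is a generic alignment routine, not the paper's exact loss; names and the contiguous-segment assumption are illustrative.

```python
import numpy as np

def align_transcript(frame_logprobs, transcript):
    """Viterbi alignment of an ordered action transcript to frames.
    frame_logprobs: (T, C) frame-wise class log-probabilities.
    transcript: ordered class indices; each occupies one contiguous
    segment, in order (assumes T >= len(transcript)).
    Returns one class index per frame."""
    T = frame_logprobs.shape[0]
    S = len(transcript)
    dp = np.full((T, S), -np.inf)       # best score at (frame, segment)
    back = np.zeros((T, S), dtype=int)  # segment we came from
    dp[0, 0] = frame_logprobs[0, transcript[0]]
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]                            # same segment
            move = dp[t - 1, s - 1] if s > 0 else -np.inf  # next segment
            if stay >= move:
                dp[t, s], back[t, s] = stay, s
            else:
                dp[t, s], back[t, s] = move, s - 1
            dp[t, s] += frame_logprobs[t, transcript[s]]
    # backtrack from the final segment to a frame-level labeling
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return [transcript[s] for s in path]
```

The decoded frame labels then serve as pseudo ground truth for the frame-level loss, which is what makes learning from transcript-only supervision possible.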
Detection of Unknown Anomalies in Streaming Videos with Generative Energy-based Boltzmann Models
Abnormal event detection is one of the important objectives in research and
practical applications of video surveillance. However, there are still three
challenging problems for most anomaly detection systems in practical settings:
limited labeled data, ambiguous definition of "abnormal" and expensive feature
engineering steps. This paper introduces a unified detection framework to
handle these challenges using energy-based models, which are powerful tools for
unsupervised representation learning. Our proposed models are first trained on
unlabeled raw pixels of image frames from an input video rather than on
hand-crafted visual features, and then identify the locations of abnormal
objects based on the errors between the input video and its reconstruction
produced by the models. To handle video streams, we develop an online version of
our framework, wherein the model parameters are updated incrementally with the
image frames arriving on the fly. Our experiments show that our detectors,
using Restricted Boltzmann Machines (RBMs) and Deep Boltzmann Machines (DBMs)
as core modules, achieve superior anomaly detection performance to unsupervised
baselines and obtain accuracy comparable with state-of-the-art approaches
when evaluated at the pixel level. More importantly, we discover that our
system trained with DBMs is able to simultaneously perform scene clustering and
scene reconstruction. This capacity not only distinguishes our method from
other existing detectors but also offers a unique tool to investigate and
understand how the model works.
Comment: This manuscript is under consideration at Pattern Recognition Letters.
Time Perception Machine: Temporal Point Processes for the When, Where and What of Activity Prediction
Numerous powerful point process models have been developed to understand
temporal patterns in sequential data from fields such as health-care,
electronic commerce, social networks, and natural disaster forecasting. In this
paper, we develop novel models for learning the temporal distribution of human
activities in streaming data (e.g., videos and person trajectories). We propose
an integrated framework of neural networks and temporal point processes for
predicting when the next activity will happen. Because point processes are
limited to taking event frames as input, we propose a simple yet effective
mechanism to extract features at frames of interest while also preserving the
rich information in the remaining frames. We evaluate our model on two
challenging datasets. The results show that our model outperforms traditional
statistical point process approaches significantly, demonstrating its
effectiveness in capturing the underlying temporal dynamics as well as the
correlation within sequential activities. Furthermore, we also extend our model
to a joint estimation framework for predicting the timing, spatial location,
and category of the activity simultaneously, to answer the when, where, and
what of activity prediction.
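The "when will the next activity happen" question is what a temporal point process answers through its conditional intensity. As a concrete stand-in for the paper's neural intensity, the classical Hawkes process below has a base rate plus exponentially decaying excitation from past events, and Ogata's thinning algorithm simulates the next event time (parameter values are illustrative assumptions).

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Hawkes conditional intensity: base rate mu plus exponentially
    decaying excitation alpha*exp(-beta*dt) from each past event."""
    past = np.asarray([s for s in history if s < t], dtype=float)
    return float(mu + alpha * np.exp(-beta * (t - past)).sum())

def sample_next_event(history, t0, mu=0.2, alpha=0.8, beta=1.0, seed=0):
    """Ogata's thinning: simulate the time of the next event after t0.
    Between events the intensity only decays, so evaluating it with all
    events up to the current t gives a valid thinning upper bound."""
    rng = np.random.default_rng(seed)
    t = t0
    while True:
        past = np.asarray([s for s in history if s <= t], dtype=float)
        lam_bar = mu + alpha * np.exp(-beta * (t - past)).sum()
        t += rng.exponential(1.0 / lam_bar)           # propose a jump
        lam = hawkes_intensity(t, history, mu, alpha, beta)
        if rng.uniform() <= lam / lam_bar:            # accept with prob lam/lam_bar
            return t
```

The paper replaces the fixed exponential kernel with a neural network conditioned on visual features, but the prediction mechanism, sample or integrate the intensity forward in time, is the same.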
LIBSVX: A Supervoxel Library and Benchmark for Early Video Processing
Supervoxel segmentation has strong potential to be incorporated into early
video analysis as superpixel segmentation has in image analysis. However, there
are many plausible supervoxel methods and little understanding as to when and
where each is most appropriate. Indeed, we are not aware of a single
comparative study on supervoxel segmentation. To that end, we study seven
supervoxel algorithms, including both off-line and streaming methods, in the
context of what we consider to be a good supervoxel: namely, spatiotemporal
uniformity, object/region boundary detection, region compression and parsimony.
For the evaluation we propose a comprehensive suite of seven quality metrics to
measure these desirable supervoxel characteristics. In addition, we evaluate
the methods in a supervoxel classification task as a proxy for subsequent
high-level uses of the supervoxels in video analysis. We use six existing
benchmark video datasets with a variety of content-types and dense human
annotations. Our findings have led us to conclusive evidence that the
hierarchical graph-based (GBH), segmentation by weighted aggregation (SWA) and
temporal superpixels (TSP) methods are the top-performers among the seven
methods. They all perform well in terms of segmentation accuracy, but vary in
regard to the other desiderata: GBH captures object boundaries best; SWA has
the best potential for region compression; and TSP achieves the best
undersegmentation error.
Comment: In Review at International Journal of Computer Vision.
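One of the quality metrics named above, undersegmentation error, is easy to state concretely. The 2-D toy version below (supervoxels would add a time axis; the function name is illustrative) measures how much extra area the segments overlapping a ground-truth region "leak" beyond that region, normalized by the region's area; lower is better.

```python
import numpy as np

def undersegmentation_error(seg, gt_region):
    """2-D toy undersegmentation error: total area of all segments that
    touch the ground-truth region, minus the region's own area,
    divided by the region's area. 0 means segment boundaries align
    perfectly with the region; larger values mean leakage."""
    ids = np.unique(seg[gt_region])          # segments touching the region
    covered = np.isin(seg, ids).sum()        # their total area
    return float(covered - gt_region.sum()) / float(gt_region.sum())
```

A segmentation whose boundaries coincide with object boundaries scores 0; a segment straddling the object boundary drags its whole area into the numerator.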