5,111 research outputs found
Detecting events and key actors in multi-person videos
Multi-person event recognition is a challenging task, often with many people
active in the scene but only a small subset contributing to an actual event. In
this paper, we propose a model which learns to detect events in such videos
while automatically "attending" to the people responsible for the event. Our
model does not use explicit annotations regarding who or where those people are
during training and testing. In particular, we track people in videos and use a
recurrent neural network (RNN) to represent the track features. We learn
time-varying attention weights to combine these features at each time-instant.
The attended features are then processed using another RNN for event
detection/classification. Since most video datasets with multiple people are
restricted to a small number of videos, we also collected a new basketball
dataset comprising 257 basketball games with 14K event annotations
corresponding to 11 event classes. Our model outperforms state-of-the-art
methods for both event classification and detection on this new dataset.
Additionally, we show that the attention mechanism is able to consistently
localize the relevant players.Comment: Accepted for publication in CVPR'1
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and allows to leverage weakly labeled data.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas are described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight on how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking and promising avenues for research
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Existing audio-visual event localization (AVE) handles manually trimmed
videos with only a single instance in each of them. However, this setting is
unrealistic as natural videos often contain numerous audio-visual events with
different categories. To better adapt to real-life applications, in this paper
we focus on the task of dense-localizing audio-visual events, which aims to
jointly localize and recognize all audio-visual events occurring in an
untrimmed video. The problem is challenging as it requires fine-grained
audio-visual scene and context understanding. To tackle this problem, we
introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains
10K untrimmed videos with over 30K audio-visual events. Each video has 2.8
audio-visual events on average, and the events are usually related to each
other and might co-occur as in real-life scenes. Next, we formulate the task
using a new learning-based framework, which is capable of fully integrating
audio and visual modalities to localize audio-visual events with various
lengths and capture dependencies between them in a single pass. Extensive
experiments demonstrate the effectiveness of our method as well as the
significance of multi-scale cross-modal perception and dependency modeling for
this task.Comment: Accepted by CVPR202
Activity understanding and unusual event detection in surveillance videos
PhDComputer scientists have made ceaseless efforts to replicate cognitive video understanding abilities
of human brains onto autonomous vision systems. As video surveillance cameras become
ubiquitous, there is a surge in studies on automated activity understanding and unusual event detection
in surveillance videos. Nevertheless, video content analysis in public scenes remained a
formidable challenge due to intrinsic difficulties such as severe inter-object occlusion in crowded
scene and poor quality of recorded surveillance footage. Moreover, it is nontrivial to achieve
robust detection of unusual events, which are rare, ambiguous, and easily confused with noise.
This thesis proposes solutions for resolving ambiguous visual observations and overcoming unreliability
of conventional activity analysis methods by exploiting multi-camera visual context
and human feedback.
The thesis first demonstrates the importance of learning visual context for establishing reliable
reasoning on observed activity in a camera network. In the proposed approach, a new Cross
Canonical Correlation Analysis (xCCA) is formulated to discover and quantify time delayed pairwise
correlations of regional activities observed within and across multiple camera views. This
thesis shows that learning time delayed pairwise activity correlations offers valuable contextual
information for (1) spatial and temporal topology inference of a camera network, (2) robust person
re-identification, and (3) accurate activity-based video temporal segmentation. Crucially, in
contrast to conventional methods, the proposed approach does not rely on either intra-camera or
inter-camera object tracking; it can thus be applied to low-quality surveillance videos featuring
severe inter-object occlusions.
Second, to detect global unusual event across multiple disjoint cameras, this thesis extends
visual context learning from pairwise relationship to global time delayed dependency between
regional activities. Specifically, a Time Delayed Probabilistic Graphical Model (TD-PGM) is
proposed to model the multi-camera activities and their dependencies. Subtle global unusual
events are detected and localised using the model as context-incoherent patterns across multiple
camera views. In the model, different nodes represent activities in different decomposed re3
gions from different camera views, and the directed links between nodes encoding time delayed
dependencies between activities observed within and across camera views. In order to learn optimised
time delayed dependencies in a TD-PGM, a novel two-stage structure learning approach
is formulated by combining both constraint-based and scored-searching based structure learning
methods.
Third, to cope with visual context changes over time, this two-stage structure learning approach
is extended to permit tractable incremental update of both TD-PGM parameters and its
structure. As opposed to most existing studies that assume static model once learned, the proposed
incremental learning allows a model to adapt itself to reflect the changes in the current
visual context, such as subtle behaviour drift over time or removal/addition of cameras. Importantly,
the incremental structure learning is achieved without either exhaustive search in a large
graph structure space or storing all past observations in memory, making the proposed solution
memory and time efficient.
Forth, an active learning approach is presented to incorporate human feedback for on-line
unusual event detection. Contrary to most existing unsupervised methods that perform passive
mining for unusual events, the proposed approach automatically requests supervision for critical
points to resolve ambiguities of interest, leading to more robust detection of subtle unusual
events. The active learning strategy is formulated as a stream-based solution, i.e. it makes decision
on-the-fly on whether to request label for each unlabelled sample observed in sequence.
It selects adaptively two active learning criteria, namely likelihood criterion and uncertainty criterion
to achieve (1) discovery of unknown event classes and (2) refinement of classification
boundary.
The effectiveness of the proposed approaches is validated using videos captured from busy
public scenes such as underground stations and traffic intersections
- β¦