Temporal Extension of Scale Pyramid and Spatial Pyramid Matching for Action Recognition
Historically, researchers in the field have spent a great deal of effort to
create image representations that have scale invariance and retain spatial
location information. This paper proposes to encode equivalent temporal
characteristics in video representations for action recognition. To achieve
temporal scale invariance, we develop a method called temporal scale pyramid
(TSP). To encode temporal information, we present and compare two methods
called temporal extension descriptor (TED) and temporal division pyramid (TDP).
Our goal is to offer solutions for matching complex actions that exhibit large
variation in velocity and appearance, which most current action representations
fail to capture. Experimental results on four benchmark datasets, UCF50,
HMDB51, Hollywood2 and Olympic Sports, support our approach and significantly
outperform state-of-the-art methods. Most notably, we achieve 65.0% mean
accuracy and 68.2% mean average precision on the challenging HMDB51 and
Hollywood2 datasets, absolute improvements over the state of the art of 7.8%
and 3.9%, respectively.
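The temporal division idea parallels spatial pyramid matching applied along the time axis: split the video into progressively finer temporal segments, pool per-frame descriptors within each segment, and concatenate. A minimal sketch of that general pooling scheme (not the authors' TSP/TDP implementation; the function name and pooling choices are illustrative):

```python
import numpy as np

def temporal_division_pyramid(frame_descriptors, levels=3):
    """Pool per-frame descriptors over a temporal pyramid.

    Level l splits the video into 2**l equal segments; each segment's
    descriptors are average-pooled and all pools are concatenated, so
    coarse levels keep global statistics while fine levels retain
    temporal ordering (analogous to spatial pyramid matching in time).
    """
    T, D = frame_descriptors.shape
    pooled = []
    for level in range(levels):
        n_segments = 2 ** level
        for seg in np.array_split(frame_descriptors, n_segments, axis=0):
            pooled.append(seg.mean(axis=0))
    return np.concatenate(pooled)  # length (1 + 2 + ... + 2**(levels-1)) * D

# Example: 30 frames of 8-dim descriptors -> (1 + 2 + 4) * 8 = 56-dim vector
video = np.random.rand(30, 8)
representation = temporal_division_pyramid(video, levels=3)
```

The concatenated vector can then be fed to any standard classifier; matching two videos with such representations implicitly compares them at several temporal granularities.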
End-to-end Flow Correlation Tracking with Spatial-temporal Attention
Discriminative correlation filters (DCF) with deep convolutional features
have achieved favorable performance in recent tracking benchmarks. However,
most existing DCF trackers consider only the appearance features of the current
frame and hardly benefit from motion and inter-frame information. This lack of
temporal information degrades tracking performance under challenges such
as partial occlusion and deformation. In this work, we focus on exploiting
the rich flow information in consecutive frames to improve both the feature
representation and the tracking accuracy. First, the individual components,
including optical flow estimation, feature extraction, aggregation and
correlation filter tracking, are formulated as special layers in a network. To
the best of our knowledge, this is the first work to jointly train the flow and
tracking tasks in a deep learning framework. Then, historical feature maps at
predefined intervals are warped and aggregated with the current ones under the
guidance of flow. For adaptive aggregation, we propose a novel spatial-temporal
attention mechanism. Extensive experiments are performed on four challenging
tracking datasets: OTB2013, OTB2015, VOT2015 and VOT2016, and the proposed
method achieves superior results on these benchmarks.
Comment: Accepted in CVPR 201
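Flow-guided aggregation of historical feature maps can be sketched in two steps: backward-warp each past feature map to the current frame using the flow field, then combine the aligned maps with softmax-normalized weights (the attention idea in its simplest form). This is a generic NumPy illustration under those assumptions, not the authors' network layers; `warp_features` and `aggregate` are hypothetical names:

```python
import numpy as np

def warp_features(feat, flow):
    """Backward-warp a feature map (H, W, C) by a flow field (H, W, 2).

    For each target location p, sample feat at p + flow(p) with bilinear
    interpolation, so a historical frame's features are aligned to the
    current frame before aggregation.
    """
    H, W, C = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x = np.clip(xs + flow[..., 0], 0, W - 1)
    src_y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(src_x).astype(int), np.floor(src_y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = (src_x - x0)[..., None], (src_y - y0)[..., None]
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0]
            + wy * wx * feat[y1, x1])

def aggregate(current, warped_history, weights):
    """Softmax-weighted sum of the current and flow-warped feature maps."""
    stack = np.stack([current] + warped_history)   # (K+1, H, W, C)
    w = np.exp(weights) / np.exp(weights).sum()    # normalize over maps
    return (w[:, None, None, None] * stack).sum(axis=0)
```

In the paper the aggregation weights come from a learned spatial-temporal attention module rather than fixed scalars; the sketch only shows the warping and weighted-sum mechanics.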
Search Tracker: Human-derived object tracking in-the-wild through large-scale search and retrieval
Humans use context and scene knowledge to easily localize moving objects in
conditions of complex illumination changes, scene clutter and occlusions. In
this paper, we present a method to leverage human knowledge in the form of
annotated video libraries in a novel search and retrieval based setting to
track objects in unseen video sequences. For every video sequence, a document
that represents motion information is generated. Documents of the unseen video
are queried against the library at multiple scales to find videos with similar
motion characteristics. This provides us with coarse localization of objects in
the unseen video. We further adapt these retrieved object locations to the new
video using an efficient warping scheme. The proposed method is validated on
in-the-wild video surveillance datasets where we outperform state-of-the-art
appearance-based trackers. We also introduce a new challenging dataset with
complex object appearance changes.
Comment: Under review with the IEEE Transactions on Circuits and Systems for Video Technology
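The search-and-retrieval framing treats each clip's motion as a "document" and ranks library clips by similarity to the query document. A minimal sketch of that general idea, assuming motion directions quantized into a bag-of-words histogram and cosine-similarity ranking (all names illustrative, not the paper's exact document construction):

```python
import numpy as np

def motion_document(flow_angles, n_words=8):
    """Quantize per-pixel motion directions into a bag-of-words histogram,
    giving a fixed-length 'document' that summarizes a clip's motion."""
    bins = (flow_angles % (2 * np.pi)) / (2 * np.pi / n_words)
    hist = np.bincount(bins.astype(int).ravel(), minlength=n_words)
    return hist / max(hist.sum(), 1)

def query_library(query_doc, library_docs):
    """Rank library clips by cosine similarity to the query document."""
    sims = [np.dot(query_doc, d)
            / (np.linalg.norm(query_doc) * np.linalg.norm(d) + 1e-12)
            for d in library_docs]
    return np.argsort(sims)[::-1]  # best match first
```

The top-ranked library clips then supply coarse object locations, which the paper further adapts to the unseen video with a warping scheme.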
Machine Understanding of Human Behavior
A widely accepted prediction is that computing will move to the background, weaving itself into the fabric of our everyday living spaces and projecting the human user into the foreground. If this prediction is to come true, then next-generation computing, which we will call human computing, should be about anticipatory user interfaces that are human-centered, built for humans based on human models. They should transcend the traditional keyboard and mouse to include natural, human-like interactive functions, including understanding and emulating certain human behaviors such as affective and social signaling. This article discusses a number of components of human behavior, how they might be integrated into computers, and how far we are from realizing the front end of human computing, that is, how far we are from enabling computers to understand human behavior.
Hierarchical Attention Network for Action Segmentation
The temporal segmentation of events is an essential task and a precursor for
the automatic recognition of human actions in video. Several attempts have
been made to capture frame-level salient aspects through attention, but they
lack the capacity to effectively map the temporal relationships between
frames, as they capture only a limited span of temporal dependencies. To this
end, we propose a complete end-to-end supervised learning approach that can
better learn relationships between actions over time, thus improving the
overall segmentation performance. The proposed hierarchical recurrent attention
framework analyses the input video at multiple temporal scales to form
embeddings at the frame and segment levels, and performs fine-grained action
segmentation. This yields a simple, lightweight, yet extremely effective
architecture for segmenting continuous video streams with multiple
application domains. We evaluate our system on multiple challenging public
benchmark datasets, including the MERL Shopping, 50 Salads and Georgia Tech
Egocentric datasets, where it achieves state-of-the-art performance. These
datasets encompass numerous video capture settings, including
static overhead camera views and dynamic, egocentric head-mounted camera
views, demonstrating the direct applicability of the proposed framework in a
variety of settings.
Comment: Published in Pattern Recognition Letters
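Hierarchical attention over frames and segments can be reduced to two applications of the same pooling primitive: score each element with a learned vector, softmax the scores, and take the weighted sum. A minimal sketch of that two-level scheme, assuming fixed-length segments and dot-product scoring (the recurrent components and training of the paper are omitted; all names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(embeddings, w):
    """Score each embedding with vector w, softmax the scores, and return
    the attention-weighted sum as a single pooled embedding."""
    scores = embeddings @ w          # (T,)
    alpha = softmax(scores)          # attention weights, sum to 1
    return alpha @ embeddings        # (D,)

def hierarchical_embed(frame_emb, seg_len, w_frame, w_seg):
    """Two-level pooling: frames -> segment embeddings -> clip embedding."""
    segments = [attention_pool(frame_emb[i:i + seg_len], w_frame)
                for i in range(0, len(frame_emb), seg_len)]
    return attention_pool(np.stack(segments), w_seg)
```

The frame-level pass captures salient moments within a segment, while the segment-level pass relates segments to one another, which is the multi-scale behavior the abstract describes.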