Fast Tracking via Spatio-Temporal Context Learning
In this paper, we present a simple yet fast and robust algorithm which
exploits the spatio-temporal context for visual tracking. Our approach
formulates the spatio-temporal relationships between the object of interest and
its local context based on a Bayesian framework, which models the statistical
correlation between the low-level features (i.e., image intensity and position)
from the target and its surrounding regions. The tracking problem is posed as
computing a confidence map and obtaining the best target location by
maximizing an object location likelihood function. The Fast Fourier Transform
is adopted for fast learning and detection in this work. Implemented in MATLAB
without code optimization, the proposed tracker runs at 350 frames per second
on an i7 machine. Extensive experimental results show that the proposed
algorithm performs favorably against state-of-the-art methods in terms of
efficiency, accuracy, and robustness.
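The frequency-domain formulation is what makes the tracker fast: a dense spatial correlation over the local context becomes an elementwise product after an FFT. A minimal sketch of this idea (in Python rather than the authors' MATLAB; `stc_filter` stands in for the learned spatio-temporal context model and is a hypothetical name, not the paper's notation):

```python
import numpy as np

def confidence_map(patch, stc_filter):
    """Dense response via frequency-domain multiplication.

    Elementwise products in the Fourier domain replace a costly
    spatial convolution, which is the source of the speed-up.
    """
    F_patch = np.fft.fft2(patch)
    F_filter = np.fft.fft2(stc_filter, s=patch.shape)
    return np.real(np.fft.ifft2(F_patch * F_filter))

def locate_target(patch, stc_filter):
    """Return the (row, col) maximizing the object-location likelihood."""
    response = confidence_map(patch, stc_filter)
    return np.unravel_index(np.argmax(response), response.shape)
```

Note that the FFT implies circular convolution, so in practice the patch is windowed to suppress boundary effects.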
Tracking Persons-of-Interest via Unsupervised Representation Adaptation
Multi-face tracking in unconstrained videos is a challenging problem as faces
of one person often appear drastically different in multiple shots due to
significant variations in scale, pose, expression, illumination, and make-up.
Existing multi-target tracking methods often use low-level features which are
not sufficiently discriminative for identifying faces with such large
appearance variations. In this paper, we tackle this problem by learning
discriminative, video-specific face representations using convolutional neural
networks (CNNs). Unlike existing CNN-based approaches which are only trained on
large-scale face image datasets offline, we use the contextual constraints to
generate a large number of training samples for a given video, and further
adapt the pre-trained face CNN to specific videos using discovered training
samples. Using these training samples, we optimize the embedding space so that
the Euclidean distances correspond to a measure of semantic face similarity via
minimizing a triplet loss function. With the learned discriminative features,
we apply a hierarchical clustering algorithm to link tracklets across
multiple shots to generate trajectories. We extensively evaluate the proposed
algorithm on two sets of TV sitcoms and YouTube music videos, analyze the
contribution of each component, and demonstrate significant performance
improvement over existing techniques.
Comment: Project page: http://vllab1.ucmerced.edu/~szhang/FaceTracking
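The triplet objective used to shape the embedding space can be sketched in a few lines. This is a generic hinge-form triplet loss on squared Euclidean distances; the margin value is chosen arbitrarily for illustration and is not taken from the paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Encourage d(anchor, positive) + margin <= d(anchor, negative),
    so same-identity faces land closer together than different-identity faces."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)
```

During adaptation, triplets are mined from the video-specific training samples (e.g. two detections in one tracklet as anchor/positive, a detection from a co-occurring tracklet as negative).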
Robust Visual Tracking using Multi-Frame Multi-Feature Joint Modeling
It remains a huge challenge to design effective and efficient trackers under
complex scenarios, including occlusions, illumination changes and pose
variations. To cope with this problem, a promising solution is to integrate the
temporal consistency across consecutive frames and multiple feature cues in a
unified model. Motivated by this idea, we propose a novel correlation
filter-based tracker in this work, in which the temporal relatedness is
reconciled under a multi-task learning framework and the multiple feature cues
are modeled using a multi-view learning approach. We demonstrate that the
resulting regression model can be efficiently learned by exploiting the
structure of a blockwise diagonal matrix. A fast blockwise diagonal matrix
inversion algorithm is developed for efficient online tracking. Meanwhile, we
incorporate an adaptive scale estimation mechanism to improve stability under
scale variation. We implement our tracker using two types of
features and test it on two benchmark datasets. Experimental results
demonstrate the superiority of our proposed approach when compared with other
state-of-the-art trackers.
Project homepage: http://bmal.hust.edu.cn/project/KMF2JMTtracking.html
Comment: This paper has been accepted by IEEE Transactions on Circuits and
Systems for Video Technology. The MATLAB code of our method is available from
our project homepage.
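The efficiency gain from the blockwise diagonal structure comes from inverting each block independently rather than the full matrix: for n blocks of size k the cost drops from O((nk)^3) to O(n k^3). A sketch of this principle (not the paper's actual solver, which applies it inside the correlation-filter learning):

```python
import numpy as np

def blockwise_diag_inverse(blocks):
    """Invert a block-diagonal matrix by inverting each diagonal block
    independently; the result is again block-diagonal."""
    return [np.linalg.inv(B) for B in blocks]

def assemble(blocks):
    """Place square blocks on the diagonal of a dense matrix (for checking)."""
    n = sum(B.shape[0] for B in blocks)
    M = np.zeros((n, n))
    i = 0
    for B in blocks:
        k = B.shape[0]
        M[i:i+k, i:i+k] = B
        i += k
    return M
```

Assembling the inverted blocks reproduces the inverse of the full matrix, which is what makes the fast online update possible.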
Context-Aware Deep Spatio-Temporal Network for Hand Pose Estimation from Depth Images
As a fundamental and challenging problem in computer vision, hand pose
estimation aims to estimate the hand joint locations from depth images.
Typically, the problem is modeled as learning a mapping function from images to
hand joint coordinates in a data-driven manner. In this paper, we propose
Context-Aware Deep Spatio-Temporal Network (CADSTN), a novel method to jointly
model the spatio-temporal properties for hand pose estimation. Our proposed
network is able to learn the representations of the spatial information and the
temporal structure from the image sequences. Moreover, by adopting an adaptive
fusion method, the model is capable of dynamically weighting different
predictions to emphasize sufficient context. Our method is evaluated on two
common benchmarks, and the experimental results demonstrate that our approach
achieves the best or second-best performance against state-of-the-art methods
and runs at 60 fps.
Comment: IEEE Transactions on Cybernetics
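The adaptive fusion step can be illustrated as a confidence-weighted average of the per-branch pose predictions. The softmax weighting below is an assumption for illustration, not necessarily the exact scheme in CADSTN:

```python
import numpy as np

def adaptive_fusion(predictions, confidences):
    """Fuse per-branch joint predictions with softmax weights so that
    more confident branches contribute more to the final pose."""
    confidences = np.asarray(confidences, dtype=float)
    w = np.exp(confidences - confidences.max())  # stable softmax
    w /= w.sum()
    preds = np.stack(predictions)            # (n_branches, n_joints, 3)
    return np.tensordot(w, preds, axes=1)    # (n_joints, 3)
```

With equal confidences this reduces to a plain average; a strongly confident branch effectively overrides the others.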
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents the futuristic challenges discussed in the
cvpaper.challenge. In 2015 and 2016, we thoroughly studied 1,600+ papers in
several conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
An Experimental Survey on Correlation Filter-based Tracking
In recent years, Correlation Filter-based Trackers (CFTs) have attracted
increasing interest in the field of visual object tracking, and have achieved
extremely compelling results in different competitions and benchmarks. In this
paper, our goal is to review the developments of CFTs with extensive
experimental results. Eleven trackers are surveyed in our work, based on which a
general framework is summarized. Furthermore, we investigate different training
schemes for correlation filters, and also discuss various effective
improvements that have been made recently. Comprehensive experiments have been
conducted to evaluate the effectiveness and efficiency of the surveyed CFTs,
and comparisons have been made with other competing trackers. The experimental
results have shown that state-of-the-art performance, in terms of robustness, speed
and accuracy, can be achieved by several recent CFTs, such as MUSTer and SAMF.
We find that further improvements for correlation filter-based tracking can be
made on estimating scales, applying part-based tracking strategy and
cooperating with long-term tracking methods.
Comment: 13 pages, 25 figures
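The training scheme shared by the surveyed CFTs is, at its core, a closed-form ridge regression in the Fourier domain. The single-channel MOSSE-style case is sketched below; multi-channel and kernelized variants (as in SAMF or MUSTer) differ in detail:

```python
import numpy as np

def train_filter(patch, target, lam=1e-4):
    """Closed-form correlation filter: H = G * conj(F) / (|F|^2 + lam).
    `target` is the desired response, e.g. a peak centered on the object;
    `lam` is the ridge regularizer that prevents division by near-zero energy."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def respond(patch, H):
    """Response map for a new patch; the peak gives the target location."""
    return np.real(np.fft.ifft2(np.fft.fft2(patch) * H))
```

Applied to its own training patch, the filter reproduces the desired response almost exactly, which is the sanity check used below.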
Human Action Recognition and Prediction: A Survey
Derived from rapid advances in computer vision and machine learning, video
analysis tasks have been moving from inferring the present state to predicting
the future state. Vision-based action recognition and prediction from videos
are such tasks, where action recognition is to infer human actions (present
state) based upon complete action executions, and action prediction to predict
human actions (future state) based upon incomplete action executions. These two
tasks have become particularly prevalent topics recently because of their
rapidly emerging real-world applications, such as visual surveillance,
autonomous driving, entertainment, and video retrieval. Many efforts have been
devoted over the last few decades to building a robust and effective framework
for action recognition and prediction. In this paper, we survey the
state-of-the-art techniques in action recognition and prediction. Existing
models, popular algorithms, technical difficulties, popular action databases,
evaluation protocols, and promising future directions are also provided with
systematic discussions.
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems
for Video Technology (TCSVT).
Crowded Scene Analysis: A Survey
Automated scene analysis has been a topic of great interest in computer
vision and cognitive science. Recently, with the growth of crowd phenomena in
the real world, crowded scene analysis has attracted much attention. However,
the visual occlusions and ambiguities in crowded scenes, as well as the complex
behaviors and scene semantics, make the analysis a challenging task. In the
past few years, an increasing number of works on crowded scene analysis have
been reported, covering different aspects including crowd motion pattern
learning, crowd behavior and activity analysis, and anomaly detection in
crowds. This paper surveys the state-of-the-art techniques on this topic. We
first provide the background knowledge and the available features related to
crowded scenes. Then, existing models, popular algorithms, evaluation
protocols, as well as system performance are provided corresponding to
different aspects of crowded scene analysis. We also outline the available
datasets for performance evaluation. Finally, some research problems and
promising future directions are presented with discussions.
Comment: 20 pages, in IEEE Transactions on Circuits and Systems for Video
Technology, 201
Are You Imitating Me? Unsupervised Sparse Modeling for Group Activity Analysis from a Single Video
A framework for unsupervised group activity analysis from a single video is
here presented. Our working hypothesis is that human actions lie on a union of
low-dimensional subspaces, and thus can be efficiently modeled as sparse linear
combinations of atoms from a learned dictionary representing the action's
primitives. Contrary to prior art, and with the primary goal of spatio-temporal
action grouping, in this work only a single video segment is available for
both unsupervised learning and analysis, without any prior training information.
After extracting simple features at a single spatio-temporal scale, we learn a
dictionary for each individual in the video during each short time lapse. These
dictionaries allow us to compare the individuals' actions by producing an
affinity matrix which contains sufficient discriminative information about the
actions in the scene leading to grouping with simple and efficient tools. With
diverse publicly available real videos, we demonstrate the effectiveness of the
proposed framework and its robustness to cluttered backgrounds, changes of
human appearance, and action variability.
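The dictionary-based affinity can be sketched as cross-reconstruction: individual i's dictionary should explain individual j's features well only if their actions are similar. For brevity, least-squares codes stand in below for the sparse codes the paper actually uses, so this shows the idea rather than the authors' exact model:

```python
import numpy as np

def reconstruction_error(D, X):
    """Error of representing feature columns X with dictionary D
    (least-squares codes stand in for true sparse codes here)."""
    codes, *_ = np.linalg.lstsq(D, X, rcond=None)
    return np.linalg.norm(X - D @ codes)

def affinity_matrix(dictionaries, feature_sets):
    """A[i, j]: how well individual i's dictionary explains individual j's
    features; similar actions yield low cross-reconstruction error."""
    E = np.array([[reconstruction_error(D, X)
                   for X in feature_sets] for D in dictionaries])
    return np.exp(-E)  # convert errors to similarities in (0, 1]
```

Grouping then reduces to clustering the rows/columns of this affinity matrix with standard tools.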