Research on Object Tracking Technology for Orderless and Blurred Movement under Complex Scenes
University of Technology Sydney, Faculty of Engineering and Information Technology. Visual tracking is widely used in anomalous-behaviour detection, self-driving and virtual reality. Recent research has reported that classic methods, including Tracking-Learning-Detection, the particle filter and mean shift, are surpassed by deep learning in accuracy and by correlation filtering in speed. However, correlation filtering suffers from boundary effects. Conventional correlation filtering fixes the size of its detection window: when the window captures only part of the target because of a large, sudden scale variation, the filter fails to locate the tracked target. When the target shakes violently, motion blur and orderless movement accompany it; the conventional correlation filter stays locked on the target's previous position, the target leaves its field of view, and the filter drifts or fails to track. This thesis therefore studies single-object tracking under complex scenes with the attributes of motion blur, orderless motion and scale variation. The main research innovations are as follows.
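The correlation-filtering pipeline discussed above can be summarized in a few lines: a template is correlated with a fixed-size search window in the frequency domain, and the response peak gives the new target position. The sketch below is illustrative, not the thesis's implementation, and shows both the core step and the fixed-window assumption:

```python
import numpy as np

def correlation_response(template, window):
    """Correlate a fixed-size template with an equally sized search window
    in the frequency domain, via the convolution theorem."""
    T = np.fft.fft2(template)
    W = np.fft.fft2(window)
    return np.real(np.fft.ifft2(np.conj(T) * W))

def locate_peak(response):
    """The peak of the response map gives the predicted target displacement."""
    return np.unravel_index(np.argmax(response), response.shape)

# The detection window has a fixed size: if the target moves further than
# the window covers between two frames, the peak no longer reflects it.
rng = np.random.default_rng(0)
template = rng.random((64, 64))
shifted = np.roll(template, shift=(5, 7), axis=(0, 1))
peak = locate_peak(correlation_response(template, shifted))  # (5, 7)
```

A displacement the window can still contain is recovered exactly; the failure modes described above arise when it cannot.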
(1) An approach for handling orderless movements is designed within a generative-discriminative tracking model. To cope with uncertain orderless movements, a coarse-to-fine tracking framework is adopted, and a spatio-temporal correlation is learned for detection in subsequent frames. Experiments on public databases with orderless-motion attributes validate the robustness of the proposed approach.
(2) A template-matching method is proposed for tracking objects under motion blur. An effective target motion model supplies supplementary appearance features, and a robust similarity measure handles the outliers caused by motion blur. Our approach outperforms competing approaches on a public benchmark database with motion blur.
(3) An ensemble framework is designed to tackle scale variations. The scale of the target is estimated with Gaussian particle filtering, and a high-confidence strategy validates the reliability of the tracking results. With either hand-crafted or CNN features, our approach outperforms correlation-filtering and deep-learning methods on databases with scale variations.
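The thesis does not give the form of the robust similarity measure in contribution (2); one common way to suppress blur-induced outliers in template matching is to truncate per-pixel differences, sketched here purely as an illustration (the threshold `tau` is an assumed parameter):

```python
import numpy as np

def robust_similarity(template, patch, tau=0.5):
    """Score a candidate patch against the template with per-pixel absolute
    differences clipped at tau, so a few blur-induced outlier pixels cannot
    dominate the match. Returns 1.0 for a perfect match and 0.0 when every
    pixel differs by at least tau."""
    diff = np.minimum(np.abs(template - patch), tau)
    return 1.0 - diff.mean() / tau

template = np.zeros((10, 10))
outliers = template.copy()
outliers[0, :5] = 10.0           # five blurred/corrupted pixels
wrong = np.full((10, 10), 10.0)  # a completely different patch
```

Under this score, a patch with a handful of large outliers still matches well, while a wholly different patch scores zero.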
To sum up, this thesis addresses boundary effects, model drift, fixed search windows and the easily disturbed hand-crafted features of objects. Different trackers are proposed for tracking single objects with orderless movements, motion blur and scale variations. As future work, our methods can be extended with neural networks to further improve single-object tracking models.
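The Gaussian-particle-filtering scale estimation of contribution (3) can be sketched as a single diffuse-reweight-average step; the similarity function and all parameters below are toy assumptions, not the thesis's model:

```python
import numpy as np

def scale_step(particles, similarity, sigma=0.05, rng=None):
    """One update of a particle-filter scale estimate: diffuse the scale
    particles with Gaussian noise, reweight them by an appearance
    similarity, and return the new particles with the weighted-mean scale."""
    rng = rng or np.random.default_rng()
    particles = particles + rng.normal(0.0, sigma, size=particles.shape)
    weights = np.array([similarity(s) for s in particles])
    weights = weights / weights.sum()
    return particles, float(np.sum(particles * weights))

# Toy appearance model: matching quality peaks when the scale is 1.2
similarity = lambda s: np.exp(-((s - 1.2) ** 2) / 0.02)
particles = np.linspace(0.8, 1.6, 200)
particles, estimate = scale_step(particles, similarity,
                                 rng=np.random.default_rng(1))
```

A high-confidence strategy, as the abstract describes, would accept `estimate` only when the peak weight is decisive enough.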
Unsupervised Object Discovery and Tracking in Video Collections
This paper addresses the problem of automatically localizing dominant objects
as spatio-temporal tubes in a noisy collection of videos with minimal or even
no supervision. We formulate the problem as a combination of two complementary
processes: discovery and tracking. The first one establishes correspondences
between prominent regions across videos, and the second one associates
successive similar object regions within the same video. Interestingly, our
algorithm also discovers the implicit topology of frames associated with
instances of the same object class across different videos, a role normally
left to supervisory information in the form of class labels in conventional
image and video understanding methods. Indeed, as demonstrated by our
experiments, our method can handle video collections featuring multiple object
classes, and substantially outperforms the state of the art in colocalization,
even though it tackles a broader problem with much less supervision.
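The second process, associating successive similar object regions within the same video, can be illustrated in a much simplified form (not the paper's actual algorithm) as greedy IoU linking of per-frame region boxes:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(frames, iou_thresh=0.3):
    """Greedily link each track's last region to the best-overlapping
    unused region in the next frame, forming spatio-temporal tubes."""
    tracks = [[box] for box in frames[0]]
    for boxes in frames[1:]:
        used = set()
        for track in tracks:
            best, best_iou = None, iou_thresh
            for j, box in enumerate(boxes):
                if j not in used and iou(track[-1], box) > best_iou:
                    best, best_iou = j, iou(track[-1], box)
            if best is not None:
                track.append(boxes[best])
                used.add(best)
    return tracks

frames = [[(0, 0, 10, 10)], [(1, 1, 11, 11)], [(2, 2, 12, 12)]]
tracks = associate(frames)  # one track following the drifting box
```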
Learning to track for spatio-temporal action localization
We propose an effective approach for spatio-temporal action localization in
realistic videos. The approach first detects proposals at the frame-level and
scores them with a combination of static and motion CNN features. It then
tracks high-scoring proposals throughout the video using a
tracking-by-detection approach. Our tracker relies simultaneously on
instance-level and class-level detectors. The tracks are scored using a
spatio-temporal motion histogram, a descriptor at the track level, in
combination with the CNN features. Finally, we perform temporal localization of
the action using a sliding-window approach at the track level. We present
experimental results for spatio-temporal localization on the UCF-Sports, J-HMDB
and UCF-101 action localization datasets, where our approach outperforms the
state of the art by margins of 15%, 7% and 12% in mAP, respectively.
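In its simplest form, the track-level sliding-window localization amounts to scanning windows of several candidate lengths over the per-frame track scores; the lengths, stride, and scores below are illustrative assumptions:

```python
def temporal_localize(scores, window_lengths=(10, 20, 40), stride=5):
    """Slide windows of several candidate lengths along the per-frame track
    scores and return the (start, end) span with the highest mean score."""
    best_score, best_span = float("-inf"), (0, len(scores))
    for length in window_lengths:
        for start in range(0, max(1, len(scores) - length + 1), stride):
            window = scores[start:start + length]
            mean = sum(window) / len(window)
            if mean > best_score:
                best_score, best_span = mean, (start, start + length)
    return best_span, best_score

scores = [0.1] * 30 + [0.9] * 20 + [0.1] * 30  # action spans frames 30-49
span, score = temporal_localize(scores)
```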
Click Carving: Segmenting Objects in Video with Point Clicks
We present a novel form of interactive video object segmentation in which a few
clicks by the user help the system produce a full spatio-temporal segmentation
of the object of interest. Whereas conventional interactive pipelines take the
user's initialization as a starting point, we show the value in the system
taking the lead even in initialization. In particular, for a given video frame,
the system precomputes a ranked list of thousands of possible segmentation
hypotheses (also referred to as object region proposals) using image and motion
cues. Then, the user looks at the top ranked proposals, and clicks on the
object boundary to carve away erroneous ones. This process iterates (typically
2-3 times), and each time the system revises the top ranked proposal set, until
the user is satisfied with a resulting segmentation mask. Finally, the mask is
propagated across the video to produce a spatio-temporal object tube. On three
challenging datasets, we provide extensive comparisons with both existing work
and simpler alternative methods. In all, the proposed Click Carving approach
strikes an excellent balance of accuracy and human effort. It outperforms all
similarly fast methods, and is competitive or better than those requiring 2 to
12 times the effort.
Comment: A preliminary version of the material in this document was filed as University of Texas technical report no. UT AI16-0
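A single carving iteration can be sketched as follows: given ranked proposal masks and a user click placed on the true object boundary, proposals whose boundary does not pass near the click are discarded. This is an illustrative simplification, not the authors' implementation:

```python
import numpy as np

def carve(proposals, click, tol=3.0):
    """Keep only the proposal masks whose boundary passes within `tol`
    pixels of the user's click on the true object boundary; the others
    are carved away."""
    cy, cx = click
    kept = []
    for mask in proposals:
        m = mask.astype(bool)
        # interior = mask pixels whose four neighbours are also inside the mask
        interior = np.zeros_like(m)
        interior[1:-1, 1:-1] = (m[1:-1, 1:-1]
                                & m[:-2, 1:-1] & m[2:, 1:-1]
                                & m[1:-1, :-2] & m[1:-1, 2:])
        by, bx = np.nonzero(m & ~interior)  # boundary pixels of the mask
        if by.size and np.hypot(by - cy, bx - cx).min() <= tol:
            kept.append(mask)
    return kept

good = np.zeros((10, 10)); good[2:8, 2:8] = 1  # boundary passes (7, 7)
bad = np.zeros((10, 10)); bad[0:4, 0:4] = 1    # boundary far from (7, 7)
kept = carve([good, bad], click=(7, 7))
```

Iterating this with fresh clicks shrinks the ranked proposal set, as the abstract describes.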
DART: Distribution Aware Retinal Transform for Event-based Cameras
We introduce a generic visual descriptor, termed as distribution aware
retinal transform (DART), that encodes the structural context using log-polar
grids for event cameras. The DART descriptor is applied to four different
problems, namely object classification, tracking, detection and feature
matching: (1) The DART features are directly employed as local descriptors in a
bag-of-features classification framework and testing is carried out on four
standard event-based object datasets (N-MNIST, MNIST-DVS, CIFAR10-DVS,
NCaltech-101). (2) Extending the classification system, tracking is
demonstrated using two key novelties: (i) For overcoming the low-sample problem
for the one-shot learning of a binary classifier, statistical bootstrapping is
leveraged with online learning; (ii) To achieve tracker robustness, the scale
and rotation equivariance property of the DART descriptors is exploited for the
one-shot learning. (3) To solve the long-term object tracking problem, an
object detector is designed using the principle of cluster majority voting. The
detection scheme is then combined with the tracker to result in a high
intersection-over-union score with augmented ground truth annotations on the
publicly available event camera dataset. (4) Finally, the event context encoded
by DART greatly simplifies the feature correspondence problem, especially for
spatio-temporal slices far apart in time, which has not been explicitly tackled
in the event-based vision domain.
Comment: 12 pages, revision submitted to TPAMI in Nov 201
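The log-polar structure underlying DART can be illustrated with a plain log-polar histogram of event coordinates; this toy sketch only shows the binning idea and is not the published descriptor:

```python
import numpy as np

def log_polar_histogram(events, center, n_rho=4, n_theta=8, r_max=32.0):
    """Bin (x, y) event coordinates into a log-polar grid around `center`:
    logarithmic radial rings give fine resolution near the centre and
    coarse context further out."""
    offsets = np.asarray(events, dtype=float) - np.asarray(center, dtype=float)
    r = np.hypot(offsets[:, 0], offsets[:, 1])
    theta = np.arctan2(offsets[:, 1], offsets[:, 0])
    keep = (r > 0) & (r <= r_max)  # ignore the centre pixel and far outliers
    r, theta = r[keep], theta[keep]
    rho_bin = np.clip((n_rho * np.log1p(r) / np.log1p(r_max)).astype(int),
                      0, n_rho - 1)
    theta_bin = np.clip(((theta + np.pi) / (2 * np.pi) * n_theta).astype(int),
                        0, n_theta - 1)
    hist = np.zeros((n_rho, n_theta))
    np.add.at(hist, (rho_bin, theta_bin), 1.0)
    return hist

# One event just right of the centre lands in the innermost ring
hist = log_polar_histogram([(33.0, 32.0)], center=(32.0, 32.0))
```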
Efficient and effective human action recognition in video through motion boundary description with a compact set of trajectories
Human action recognition (HAR) is at the core of human-computer interaction and video scene understanding. However, achieving effective HAR in an unconstrained environment is still a challenging task. To that end, trajectory-based video representations are currently widely used. Despite the promising levels of effectiveness achieved by these approaches, problems regarding computational complexity and the presence of redundant trajectories still need to be addressed in a satisfactory way. In this paper, we propose a method for trajectory rejection, reducing the number of redundant trajectories without degrading the effectiveness of HAR. Furthermore, to realize efficient optical flow estimation prior to trajectory extraction, we integrate a method for dynamic frame skipping. Experiments with four publicly available human action datasets show that the proposed approach outperforms state-of-the-art HAR approaches in terms of effectiveness, while simultaneously mitigating the computational complexity.
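The dynamic frame-skipping idea, computing optical flow only where the scene actually moves, can be sketched as follows (the threshold and skip limit are assumed parameters, not the paper's values):

```python
def frames_to_process(motion_magnitudes, threshold=0.5, max_skip=3):
    """Select the frame indices on which to run optical flow: skip up to
    `max_skip` consecutive frames while the estimated inter-frame motion
    stays below `threshold`."""
    selected, skipped = [], 0
    for i, motion in enumerate(motion_magnitudes):
        if motion >= threshold or skipped >= max_skip:
            selected.append(i)
            skipped = 0
        else:
            skipped += 1
    return selected

# Still stretch in the middle: frames 1-3 are skipped, 4 is forced
# by the skip limit, and the moving frames are always processed.
picked = frames_to_process([1, 0, 0, 0, 0, 1])  # [0, 4, 5]
```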