Visual object tracking
University of Technology Sydney. Faculty of Engineering and Information Technology. Visual object tracking is a critical task in many computer-vision-related applications, such as surveillance and robotics. If the tracking target is provided in the first frame of a video, the tracker will predict the location and the shape of the target in the following frames. Despite the significant research effort that has been dedicated to this area for several years, this field remains challenging due to a number of issues, such as occlusion, shape variation and drifting, all of which adversely affect the performance of a tracking algorithm.
This research focuses on incorporating the spatial and temporal context to tackle the challenging issues related to developing robust trackers. The spatial context is what surrounds a given object and the temporal context is what has been observed in the recent past at the same location. In particular, by considering the relationship between the target and its surroundings, the spatial context information helps the tracker to better distinguish the target from the background, especially when it suffers from scale change, shape variation, occlusion, and background clutter. Meanwhile, the temporal contextual cues are beneficial for building a stable appearance representation for the target, which enables the tracker to be robust against occlusion and drifting.
In this regard, we attempt to develop effective methods that take advantage of the spatial and temporal context to improve the tracking algorithms. Our proposed methods can benefit three kinds of mainstream tracking frameworks, namely the template-based generative tracking framework, the pixel-wise tracking framework and the tracking-by-detection framework. For the template-based generative tracking framework, a novel template-based tracker is proposed that enhances the existing appearance model of the target by introducing mask templates. In particular, mask templates store the temporal context represented by the frame difference in various time scales, and other templates encode the spatial context. Then, using pixel-wise analytic tools, which provide richer detail and are naturally suited to tracking tasks, a finer and more accurate tracker is proposed. It makes use of two convolutional neural networks to capture both the spatial and temporal context. Lastly, for a visual tracker with a tracking-by-detection strategy, we propose an effective and efficient module that can improve the quality of the candidate windows sampled to identify the target. By utilizing the context around the object, our proposed module is able to refine the location and dimension of each candidate window, thus helping the tracker better focus on the target object.
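The idea of storing temporal context as frame differences at several time scales can be illustrated with a minimal sketch. This is not the thesis's mask-template construction, which the abstract does not specify; the function name `frame_difference_masks`, the lag values, and the threshold are all assumptions made for illustration. Frames are taken to be greyscale images stored as nested lists of integers.

```python
def frame_difference_masks(frames, lags=(1, 2, 4), threshold=20):
    """Compare the newest frame with frames `lag` steps back.

    frames: list of greyscale images (lists of rows of ints), oldest first.
    Returns a dict {lag: binary mask}; a mask pixel is 1 where the absolute
    intensity change exceeds `threshold` (a hypothetical value).
    """
    current = frames[-1]
    masks = {}
    for lag in lags:
        if lag >= len(frames):
            continue  # not enough history for this time scale
        past = frames[-1 - lag]
        masks[lag] = [
            [1 if abs(c - p) > threshold else 0 for c, p in zip(row_c, row_p)]
            for row_c, row_p in zip(current, past)
        ]
    return masks


if __name__ == "__main__":
    f0 = [[0, 0], [0, 0]]
    f1 = [[0, 100], [0, 0]]
    f2 = [[0, 100], [100, 0]]
    for lag, mask in frame_difference_masks([f0, f1, f2]).items():
        print(lag, mask)
```

Short lags respond only to the most recent motion, while longer lags accumulate changes over a wider window, which is one plausible reading of "various time scales".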
Fast Visual Tracking Using Spatial Temporal Background Context Learning
Visual tracking has gained much prominence among researchers in recent years owing to its wide variety of everyday applications. These include counting cars on a highway, analyzing crowd intensity at a concert or on a football ground, and following the movements of a single person with a surveillance camera. Various techniques have been proposed and implemented in this research domain, and researchers have analyzed various parameters. Still, this area has a lot to offer. Two approaches are commonly deployed in visual tracking: discriminative tracking and generative tracking. Discriminative tracking requires a pre-trained model learned from data and treats object recognition as a binary classification problem. Generative tracking, on the other hand, makes use of the previous states so that the next state can be predicted. In this paper, a novel tracker based on the generative tracking method is proposed, called the Illumination Invariant Spatio Temporal Tracker (IISTC). The proposed technique takes into account the nearby surrounding regions and performs context learning so that the state of the object under consideration and its surrounding regions can be estimated in the next frame. The learning model is deployed in both the spatial domain and the temporal domain. The spatial part of the tracker takes into consideration the nearby pixels in a frame, while the temporal model takes account of the possible change of object location. The proposed tracker was tested on a set of 50 images against four other state-of-the-art trackers. Experimental results reveal that the proposed tracker performs reasonably well compared with the other trackers. The proposed visual tracker is efficient with respect to both computational power and accuracy.
The proposed tracker requires only four fast Fourier transform computations, which makes it reasonably fast. It performs exceptionally well when there is a sudden change in background illumination.
Adaptive visual sampling
PhD. Various visual tasks may be analysed in the context of sampling from the visual field. In visual
psychophysics, human visual sampling strategies have often been shown at a high-level to
be driven by various information and resource related factors such as the limited capacity of
the human cognitive system, the quality of information gathered, its relevance in context and
the associated efficiency of recovering it. At a lower-level, we interpret many computer vision
tasks to be rooted in similar notions of contextually-relevant, dynamic sampling strategies
which are geared towards the filtering of pixel samples to perform reliable object association. In
the context of object tracking, the reliability of such endeavours is fundamentally rooted in the
continuing relevance of object models used for such filtering, a requirement complicated by real-world
conditions such as dynamic lighting that inconveniently and frequently cause their rapid
obsolescence. In the context of recognition, performance can be hindered by the lack of learned
context-dependent strategies that satisfactorily filter out samples that are irrelevant or blunt the
potency of models used for discrimination. In this thesis we interpret the problems of visual
tracking and recognition in terms of dynamic spatial and featural sampling strategies and, in this
vein, present three frameworks that build on previous methods to provide a more flexible and
effective approach.
Firstly, we propose an adaptive spatial sampling strategy framework to maintain statistical object
models for real-time robust tracking under changing lighting conditions. We employ colour
features in experiments to demonstrate its effectiveness. The framework consists of five parts:
(a) Gaussian mixture models for semi-parametric modelling of the colour distributions of multi-colour
objects; (b) a constructive algorithm that uses cross-validation for automatically determining
the number of components for a Gaussian mixture given a sample set of object colours; (c) a
sampling strategy for performing fast tracking using colour models; (d) a Bayesian formulation
enabling models of object and the environment to be employed together in filtering samples by
discrimination; and (e) a selectively-adaptive mechanism to enable colour models to cope with
changing conditions and permit more robust tracking.
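Part (d), in which object and environment models are used together to filter samples, can be sketched as a two-class Bayes rule over mixture likelihoods. The following is a minimal illustration under stated assumptions, not the thesis's formulation: it uses a one-dimensional feature (for example hue) instead of full colour vectors, the mixture parameters are hand-set and not learned, and the names `mixture_pdf`, `object_posterior` and `prior_object` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, components):
    """Density of a Gaussian mixture.

    components: list of (weight, mean, variance) tuples whose weights sum to 1.
    """
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def object_posterior(x, object_model, background_model, prior_object=0.5):
    """P(object | x) from Bayes' rule with object and background mixtures."""
    p_obj = mixture_pdf(x, object_model) * prior_object
    p_bg = mixture_pdf(x, background_model) * (1.0 - prior_object)
    total = p_obj + p_bg
    # Fall back to the prior if both likelihoods underflow to zero.
    return p_obj / total if total > 0.0 else prior_object


if __name__ == "__main__":
    # Hypothetical models: object colours near 0, background colours near 4.
    obj = [(1.0, 0.0, 1.0)]
    bg = [(1.0, 4.0, 1.0)]
    for x in (0.0, 2.0, 4.0):
        print(x, object_posterior(x, obj, bg))
```

A pixel sample would be kept for tracking when its posterior exceeds some threshold; with equal priors, a sample midway between the two models gets a posterior of one half.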
Secondly, we extend the concept to an adaptive spatial and featural sampling strategy to deal
with very difficult conditions such as small target objects in cluttered environments undergoing
severe lighting fluctuations and extreme occlusions. This builds on previous work on dynamic
feature selection during tracking by reducing redundancy in features selected at each stage as
well as more naturally balancing short-term and long-term evidence, the latter to facilitate model
rigidity under sharp, temporary changes such as occlusion whilst permitting model flexibility
under slower, long-term changes such as varying lighting conditions. This framework consists of
two parts: (a) Attribute-based Feature Ranking (AFR), which combines two attribute measures,
discriminability and independence from other features; and (b) Multiple Selectively-adaptive Feature
Models (MSFM) which involves maintaining a dynamic feature reference of target object
appearance. We call this framework Adaptive Multi-feature Association (AMA).
Finally, we present an adaptive spatial and featural sampling strategy that extends established
Local Binary Pattern (LBP) methods and overcomes many severe limitations of the traditional
approach such as limited spatial support, restricted sample sets and ad hoc joint and disjoint statistical
distributions that may fail to capture important structure. Our framework enables more
compact, descriptive LBP-type models to be constructed which may be employed in conjunction
with many existing LBP techniques to improve their performance without modification. The
framework consists of two parts: (a) a new LBP-type model known as Multiscale Selected Local
Binary Features (MSLBF); and (b) a novel binary feature selection algorithm called Binary Histogram
Intersection Minimisation (BHIM) which is shown to be more powerful than established
methods used for binary feature selection such as Conditional Mutual Information Maximisation
(CMIM) and AdaBoost.
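For readers unfamiliar with the traditional approach that this framework extends, the basic 8-neighbour LBP code and the histogram intersection measure can be sketched as follows. This shows only the standard baseline, not MSLBF or BHIM, whose details the abstract does not give. The neighbour ordering, the `>=` comparison convention, and the function names are assumptions chosen for illustration.

```python
def lbp_code(image, r, c):
    """Basic 8-neighbour LBP code for the interior pixel at (r, c).

    Each neighbour whose intensity is >= the centre sets one bit.
    The clockwise ordering starting at the top-left is an arbitrary choice.
    """
    centre = image[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Normalised 256-bin histogram of LBP codes over interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity of two normalised histograms; 1.0 means identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


if __name__ == "__main__":
    flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
    corner = [[9, 1, 1], [1, 5, 1], [1, 1, 1]]
    print(lbp_code(flat, 1, 1), lbp_code(corner, 1, 1))
    print(histogram_intersection(lbp_histogram(flat), lbp_histogram(corner)))
```

The fixed 3x3 neighbourhood used here is an example of the limited spatial support that the abstract lists among the limitations of the traditional approach.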
In Defense of Clip-based Video Relation Detection
Video Visual Relation Detection (VidVRD) aims to detect visual relationship
triplets in videos using spatial bounding boxes and temporal boundaries.
Existing VidVRD methods can be broadly categorized into bottom-up and top-down
paradigms, depending on their approach to classifying relations. Bottom-up
methods follow a clip-based approach where they classify relations of short
clip tubelet pairs and then merge them into long video relations. On the other
hand, top-down methods directly classify long video tubelet pairs. While recent
video-based methods utilizing video tubelets have shown promising results, we
argue that the effective modeling of spatial and temporal context plays a more
significant role than the choice between clip tubelets and video tubelets. This
motivates us to revisit the clip-based paradigm and explore the key success
factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM)
that enriches the object-based spatial context and relation-based temporal
context based on clips. We demonstrate that using clip tubelets can achieve
superior performance compared to most video-based methods. Additionally, using
clip tubelets offers more flexibility in model designs and helps alleviate the
limitations associated with video tubelets, such as the challenging long-term
object tracking problem and the loss of temporal information in long-term
tubelet feature compression. Extensive experiments conducted on two challenging
VidVRD benchmarks validate that our HCM achieves a new state-of-the-art
performance, highlighting the effectiveness of incorporating advanced spatial
and temporal context modeling within the clip-based paradigm.
A Deep-structured Conditional Random Field Model for Object Silhouette Tracking
In this work, we introduce a deep-structured conditional random field
(DS-CRF) model for the purpose of state-based object silhouette tracking. The
proposed DS-CRF model consists of a series of state layers, where each state
layer spatially characterizes the object silhouette at a particular point in
time. The interactions between adjacent state layers are established by
inter-layer connectivity dynamically determined based on inter-frame optical
flow. By incorporating both spatial and temporal context in a dynamic fashion
within such a deep-structured probabilistic graphical model, the proposed
DS-CRF model allows us to develop a framework that can accurately and
efficiently track object silhouettes that can change greatly over time, as well
as under different situations such as occlusion and multiple targets within the
scene. Experimental results using video surveillance datasets containing
different scenarios such as occlusion and multiple targets showed that the
proposed DS-CRF approach provides strong object silhouette tracking performance
when compared to baseline methods such as mean-shift tracking, as well as
state-of-the-art methods such as context tracking and boosted particle
filtering.
Comment: 17 page
Perceptual Context in Cognitive Hierarchies
Cognition does not only depend on bottom-up sensor feature abstraction, but
also relies on contextual information being passed top-down. Context is higher
level information that helps to predict belief states at lower levels. The main
contribution of this paper is to provide a formalisation of perceptual context
and its integration into a new process model for cognitive hierarchies. Several
simple instantiations of a cognitive hierarchy are used to illustrate the role
of context. Notably, we demonstrate the use of context in a novel approach to
visually track the pose of rigid objects with just a 2D camera.