86 research outputs found
SELF-ADAPTING PARALLEL FRAMEWORK FOR LONG-TERM OBJECT TRACKING
Object tracking is a crucial field in computer vision with many applications in human-computer interaction, security and surveillance, video communication and compression, augmented reality, traffic control, etc. Many implementations exist in practice, and recent methods emphasize tracking objects adaptively by learning the object's appearance and rediscovering it when it becomes untraceable, so that the problem of object absence (due to occlusion, clutter, or blurring) is resolved. Most of these algorithms place a high computational burden on the processing units and need powerful CPUs to attain real-time tracking and high-bitrate video processing. Such units may handle no more than a single video source, making them unsuitable for large-scale deployments with multiple sources or higher-resolution video. In this thesis, we choose one popular algorithm, TLD (Tracking-Learning-Detection), study the core components that impede its performance, and implement those components in a parallel computational environment such as multi-core CPUs and GPUs, also known as heterogeneous computing. OpenCL is used as the development platform to produce parallel kernels for the algorithm. The goals are to create an effective heterogeneous computing environment using current computer technologies, to give real-time applications an alternative implementation methodology, and to circumvent upcoming hardware limitations in terms of cost, power, and speedup. We are able to bring true parallel speedup to the existing implementations, which greatly improves the frame rate for long-term object tracking and, with some modification of algorithm parameters, provides more accurate object tracking. According to the experiments, the developed kernels achieve a range of performance improvements: reduction-based kernels reach a maximum speedup of 78X; window-based kernels reach speedups from a few hundred to 2000X; and the optical flow tracking kernel reaches a maximum of 5.7X. Global speedup is highly dependent on the hardware specifications, especially memory transfers. With a medium-sized input, the self-adapting parallel framework obtains a fast learning curve and converges to an average 1.6X speedup over the original implementation. Lastly, for future programming convenience, an OpenCL-based library is built to facilitate OpenCL programming on parallel hardware devices, hide the complexity of building and compiling OpenCL kernels, and provide a C-based latency measurement tool compatible with several operating systems.
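The reduction-based kernels mentioned in this abstract follow the classic tree-reduction pattern, used for quantities such as the patch variance that TLD's variance filter computes. The following is a minimal Python sketch of that pattern, a sequential simulation of what each OpenCL work-group does in log2(n) combining steps; the function name and the variance example are illustrative, not the thesis's actual code:

```python
def tree_reduce(values, op=lambda a, b: a + b):
    """Simulate a work-group tree reduction: log2(n) steps, each combining
    pairs at a growing stride, as an OpenCL kernel would in local memory."""
    data = list(values)
    stride = 1
    while stride < len(data):
        for i in range(0, len(data) - stride, 2 * stride):
            data[i] = op(data[i], data[i + stride])
        stride *= 2
    return data[0]

# Patch variance = E[x^2] - E[x]^2, built from two reductions.
patch = [3, 1, 4, 1, 5, 9, 2, 6]
n = len(patch)
s = tree_reduce(patch)
s2 = tree_reduce([v * v for v in patch])
variance = s2 / n - (s / n) ** 2
```

On a GPU the pairwise combines within one stride run in parallel, which is where the reported reduction-kernel speedups come from.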
Multi-Template Temporal Siamese Network for Long-Term Object Tracking
Siamese networks are among the most popular visual object tracking methods for their high speed and high tracking accuracy, as long as the target is well identified. However, most Siamese-network-based trackers use the first frame as the ground truth of an object and fail when the target's appearance changes significantly in subsequent frames. They also have difficulty distinguishing the target from similar objects in the frame. We propose two ideas to solve both problems. The first is a bag of dynamic templates containing diverse, similar, and recent target features, continuously updated with diverse target appearances. The second is to let a network learn the path history and project a potential future target location in the next frame. This tracker achieves state-of-the-art performance on the long-term tracking dataset UAV20L, improving the success rate by a large margin of 15% (65.4 vs 56.6) compared to the state-of-the-art method HiFT. The official Python code of this paper is publicly available.
Long-term object tracking using region proposals
In this thesis we address the problem of tracking an arbitrary object in a sequence of images. We propose a long-term tracker based on Siamese convolutional neural networks. For detection, we use a template with which we compute cross correlation at every point of the search image to find the best-matching region. The template is initialized on the first frame, where we crop the image so that it contains only the tracked object and feed it to the convolutional neural network. After each localization the tracker detects whether tracking has failed. We propose two online methods of updating the visual model: one updates the template and the other fine-tunes the parameters of the network. We carried out two analyses, measuring the long-term tracking performance of modifications of our tracker on the LTB35 dataset. The first analysis determines a good setting for generating region proposals; the second tests the proposed methods for updating the visual model. We find that without updating the visual model our tracker achieves an F-measure of 0.34; with template updating, 0.22; with fine-tuning, 0.38; and with both methods, 0.20. Finally, we compared our tracker with the trackers submitted to the VOT-LT2018 challenge, placing 11th with fine-tuning and 12th without fine-tuning or template updating.
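The detection step described in this abstract, sliding a template over the search image and scoring every placement by cross correlation, can be sketched in plain Python. This is a toy single-channel version on raw intensities; a real Siamese tracker correlates deep feature maps and normalizes the scores:

```python
def cross_correlate(search, template):
    """Score every valid placement of `template` inside `search` by raw
    cross correlation and return the best top-left corner and its score."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    best_score, best_pos = float("-inf"), (0, 0)
    for y in range(sh - th + 1):
        for x in range(sw - tw + 1):
            score = sum(search[y + i][x + j] * template[i][j]
                        for i in range(th) for j in range(tw))
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score

# A bright 2x2 pattern embedded at (1, 1) of a dark search image.
search = [[0, 0, 0, 0],
          [0, 9, 8, 0],
          [0, 7, 9, 0],
          [0, 0, 0, 0]]
template = [[9, 8],
            [7, 9]]
```

In practice this exhaustive scan is computed as a single convolution of the template over the search region, which is what makes Siamese trackers fast.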
DART: Distribution Aware Retinal Transform for Event-based Cameras
We introduce a generic visual descriptor, termed the distribution aware
retinal transform (DART), that encodes the structural context using log-polar
grids for event cameras. The DART descriptor is applied to four different
problems, namely object classification, tracking, detection and feature
matching: (1) The DART features are directly employed as local descriptors in a
bag-of-features classification framework and testing is carried out on four
standard event-based object datasets (N-MNIST, MNIST-DVS, CIFAR10-DVS,
NCaltech-101). (2) Extending the classification system, tracking is
demonstrated using two key novelties: (i) For overcoming the low-sample problem
for the one-shot learning of a binary classifier, statistical bootstrapping is
leveraged with online learning; (ii) To achieve tracker robustness, the scale
and rotation equivariance property of the DART descriptors is exploited for the
one-shot learning. (3) To solve the long-term object tracking problem, an
object detector is designed using the principle of cluster majority voting. The
detection scheme is then combined with the tracker to result in a high
intersection-over-union score with augmented ground truth annotations on the
publicly available event camera dataset. (4) Finally, the event context encoded
by DART greatly simplifies the feature correspondence problem, especially for
spatio-temporal slices far apart in time, which has not been explicitly tackled
in the event-based vision domain. Comment: 12 pages, revision submitted to TPAMI in Nov 201
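The log-polar grid at the heart of DART can be illustrated with a small helper that assigns an event location, relative to a descriptor centre, to a (ring, wedge) cell. The ring/wedge counts, maximum radius, and binning formula below are illustrative assumptions, not the paper's actual parameters:

```python
import math

def log_polar_bin(x, y, cx, cy, n_rings=4, n_wedges=8, r_max=16.0):
    """Map an event at (x, y), relative to centre (cx, cy), to a
    (ring, wedge) cell of a log-polar grid. Rings are spaced
    logarithmically, so the grid is finer near the centre."""
    dx, dy = x - cx, y - cy
    r = math.hypot(dx, dy)
    theta = math.atan2(dy, dx) % (2 * math.pi)
    # Ring i covers radii in [r_max**(i/n_rings), r_max**((i+1)/n_rings)).
    ring = 0 if r < 1.0 else min(n_rings - 1,
                                 int(n_rings * math.log(r) / math.log(r_max)))
    wedge = min(n_wedges - 1, int(n_wedges * theta / (2 * math.pi)))
    return ring, wedge
```

Because rotating or scaling the event cloud shifts activity along the wedge or ring axis of such a grid, descriptors built on it can be made equivariant to scale and rotation, the property the tracker exploits.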
Visual motion tracking and sensor fusion for kite power systems
An estimation approach is presented for kite power systems with ground-based actuation and generation. Line-based estimation of the kite state, including position and heading, limits the achievable cycle efficiency of such airborne wind energy systems due to significant estimation delay and line sag. We propose a filtering scheme to fuse onboard inertial measurements with ground-based line data for ground-based systems in pumping operation. Estimates are computed using an extended Kalman filtering scheme with a sensor-driven kinematic process model which propagates and corrects for inertial sensor biases. We further propose a visual motion tracking approach to extract estimates of the kite position from ground-based video streams. The approach combines accurate object detection with fast motion tracking to ensure long-term object tracking in real time. We present experimental results of the visual motion tracking and inertial sensor fusion on a ground-based kite power system in pumping operation and compare both methods to an existing estimation scheme based on line measurements.
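The predict/correct cycle behind the extended Kalman filtering scheme described above can be shown in a deliberately tiny 1-D scalar form; the actual system state (position, heading, sensor biases) and the process and measurement models are far richer, and the noise values here are illustrative:

```python
def kf_step(x, p, u, z, q=0.01, r=0.25):
    """One predict/correct cycle of a scalar Kalman filter.
    x, p : previous state estimate and its variance
    u    : motion input from the kinematic process model (e.g. rate * dt)
    z    : position measurement (e.g. line-based or visual fix)
    q, r : process and measurement noise variances (illustrative values)
    """
    # Predict: propagate the state with the sensor-driven process model.
    x_pred = x + u
    p_pred = p + q
    # Correct: blend in the measurement, weighted by the Kalman gain.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1 - k) * p_pred
    return x_new, p_new

# Fuse a drifting inertial prediction with noisy absolute fixes.
x, p = 0.0, 1.0
for u, z in [(0.5, 0.6), (0.5, 1.1), (0.5, 1.55)]:
    x, p = kf_step(x, p, u, z)
```

The variance `p` shrinks with each fused measurement, which is how combining the fast inertial stream with the delayed ground-based data reduces estimation delay.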
Memory Based Online Learning of Deep Representations from Video Streams
We present a novel online unsupervised method for face identity learning from
video streams. The method exploits deep face descriptors together with a memory
based learning mechanism that takes advantage of the temporal coherence of
visual data. Specifically, we introduce a discriminative feature matching solution based on Reverse Nearest Neighbour and a feature forgetting strategy that detects redundant features and discards them appropriately as time progresses. It is shown that the proposed learning procedure is asymptotically stable and can be effectively used in relevant applications such as multiple face identification and tracking from unconstrained video streams. Experimental results show that, compared with offline approaches exploiting future information, the proposed method achieves comparable results in multiple face tracking and better performance in face identification. Code will be publicly available. Comment: arXiv admin note: text overlap with arXiv:1708.0361
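The Reverse Nearest Neighbour idea named in this abstract flips the usual query direction: a match is kept only if it also holds from the other side. The sketch below is a simplified mutual nearest-neighbour variant on toy 1-D values; the paper's actual descriptors are high-dimensional deep face features, and the names here are placeholders:

```python
def nearest(query, pool):
    """Index of the item in `pool` closest to `query` (1-D toy distance)."""
    return min(range(len(pool)), key=lambda i: abs(pool[i] - query))

def mutual_matches(stored, incoming):
    """Keep a (stored, incoming) pair only if each side is the other's
    nearest neighbour, a simplified stand-in for the reverse
    nearest-neighbour test used to decide which features match."""
    matches = []
    for i in range(len(stored)):
        j = nearest(stored[i], incoming)
        if nearest(incoming[j], stored) == i:
            matches.append((i, j))
    return matches

# Toy 1-D "descriptors": two stable identities and one distractor.
a = [0.0, 5.0, 9.0]
b = [0.2, 4.7, 20.0]
```

The asymmetric third item pairs with nothing under the mutual test, which is the kind of ambiguous match a one-directional nearest-neighbour rule would wrongly accept.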
Surveillance with UAV Videos
Unmanned aerial vehicles (UAVs) and drones are now accessible to everyone and are widely used in civilian and military fields. In military applications, UAVs can be used in border surveillance to detect or track any moving object/target. The challenges of processing UAV images are the unpredictable background motion due to camera movement and the small target sizes. In this chapter, a brief literature review of moving object detection and long-term object tracking is given. Publicly available datasets in the literature are introduced. General approaches and success rates of the proposed methods are evaluated, and ways of combining deep learning-based solutions with classical methods are discussed. In addition to the methods in the literature for moving object detection, possible solution approaches for the challenges are also shared.
e-TLD: Event-based Framework for Dynamic Object Tracking
This paper presents a long-term object tracking framework with a moving event
camera under general tracking conditions. A first of its kind for these
revolutionary cameras, the tracking framework uses a discriminative
representation for the object with online learning, and detects and re-tracks
the object when it comes back into the field-of-view. One of the key novelties
is the use of an event-based local sliding window technique that tracks
reliably in scenes with cluttered and textured background. In addition,
Bayesian bootstrapping is used to assist real-time processing and boost the
discriminative power of the object representation. On the other hand, when the
object re-enters the field-of-view of the camera, a data-driven, global sliding
window detector locates the object for subsequent tracking. Extensive
experiments demonstrate the ability of the proposed framework to track and
detect arbitrary objects of various shapes and sizes, including dynamic objects
such as a human. This is a significant improvement compared to earlier works
that simply track objects as long as they are visible under simpler background
settings. Using the ground truth locations for five different objects under
three motion settings, namely translation, rotation and 6-DOF, quantitative
measurement is reported for the event-based tracking framework with critical
insights on various performance issues. Finally, real-time implementation in
C++ highlights tracking ability under scale, rotation, view-point and occlusion
scenarios in a lab setting. Comment: 11 pages, 10 figures
In Defense of Clip-based Video Relation Detection
Video Visual Relation Detection (VidVRD) aims to detect visual relationship
triplets in videos using spatial bounding boxes and temporal boundaries.
Existing VidVRD methods can be broadly categorized into bottom-up and top-down
paradigms, depending on their approach to classifying relations. Bottom-up
methods follow a clip-based approach where they classify relations of short
clip tubelet pairs and then merge them into long video relations. On the other
hand, top-down methods directly classify long video tubelet pairs. While recent
video-based methods utilizing video tubelets have shown promising results, we
argue that the effective modeling of spatial and temporal context plays a more
significant role than the choice between clip tubelets and video tubelets. This
motivates us to revisit the clip-based paradigm and explore the key success
factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM)
that enriches the object-based spatial context and relation-based temporal
context based on clips. We demonstrate that using clip tubelets can achieve
superior performance compared to most video-based methods. Additionally, using
clip tubelets offers more flexibility in model designs and helps alleviate the
limitations associated with video tubelets, such as the challenging long-term
object tracking problem and the loss of temporal information in long-term
tubelet feature compression. Extensive experiments conducted on two challenging
VidVRD benchmarks validate that our HCM achieves a new state-of-the-art
performance, highlighting the effectiveness of incorporating advanced spatial
and temporal context modeling within the clip-based paradigm.