Deep Feature Learning and Adaptation for Computer Vision
We are living in times when a deep learning revolution is taking place. In general, deep learning models have a backbone that extracts features from the input data, followed by task-specific layers, e.g. for classification. This dissertation proposes various deep feature extraction and adaptation methods to improve task-specific learning, such as visual re-identification, tracking, and domain adaptation. The vehicle re-identification (VRID) task requires identifying a given vehicle among a set of vehicles under variations in viewpoint, illumination, partial occlusion, and background clutter. We propose a novel local graph aggregation module for feature extraction to improve VRID performance. We also utilize a class-balanced loss to compensate for the unbalanced class distribution in the training dataset. Overall, our framework achieves state-of-the-art (SOTA) performance on multiple VRID benchmarks. We further extend our VRID method to visual object tracking under occlusion. We motivate visual object tracking from aerial platforms by benchmarking tracking methods on aerial datasets. Our study reveals that current techniques have limited ability to re-identify objects that become fully occluded or leave the field of view, while Siamese-network-based trackers perform well compared to others in overall tracking performance. Building on our VRID work, we propose Siam-ReID, a novel tracking method that combines a Siamese network with our VRID technique. In another approach, we propose SiamGauss, a novel Siamese network with a Gaussian head for improved confuser suppression and real-time performance. Our approaches achieve SOTA performance on aerial visual object tracking datasets. A related area of research is deep-learning-based domain adaptation. We propose continual unsupervised domain adaptation, a novel paradigm for domain adaptation in data-constrained environments.
We show that existing works fail to generalize when the target-domain data are acquired in small batches. We propose a buffer that stores samples previously seen by the network, together with a novel loss function, to improve the performance of continual domain adaptation. We further extend our continual unsupervised domain adaptation research to gradually varying domains. Our method outperforms several SOTA methods even though they have the entire target-domain data available during adaptation.
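The buffer idea can be sketched as a reservoir-sampled replay memory: each small incoming target-domain batch is mixed with samples drawn from the buffer before the adaptation step. This is a minimal illustration only; the `ReplayBuffer` class, its capacity, and the uniform reservoir-sampling policy are assumptions, and the dissertation's actual sampling strategy and loss function are not reproduced here.

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size buffer of previously seen target-domain samples.
    Reservoir sampling keeps an (approximately) uniform sample of the stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0                      # total samples observed so far
        self.rng = np.random.default_rng(seed)

    def add(self, x):
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(x)
        else:
            # replace a stored sample with probability capacity / seen
            j = self.rng.integers(0, self.seen)
            if j < self.capacity:
                self.samples[j] = x

    def sample(self, k):
        k = min(k, len(self.samples))
        idx = self.rng.choice(len(self.samples), size=k, replace=False)
        return [self.samples[i] for i in idx]

# Usage: mix each small incoming target batch with replayed samples.
buf = ReplayBuffer(capacity=4)
for step in range(10):
    batch = np.full((2,), float(step))     # stand-in for 2 target images
    for x in batch:
        buf.add(x)
    replay = buf.sample(2)
    # an adaptation step would train on np.concatenate([batch, replay])
```

The buffer prevents the network from seeing only the most recent small batch, which is the failure mode described above for existing methods.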
Visual object tracking in dynamic scenes
Visual object tracking is a fundamental task in the field of computer vision. It
is widely used in numerous applications, including, but not limited to, video surveillance,
image understanding, robotics, and human-computer interaction. In essence, visual object
tracking is the problem of estimating the state/trajectory of the object of interest over time.
Unlike other tasks, such as object detection, where the classes/categories are defined
beforehand, the only information available about the object of interest is given in the first frame.
Even though Deep Learning (DL) has revolutionised most computer vision tasks, visual
object tracking still poses several challenges. The visual object tracking task is stochastic in
nature: no prior knowledge about the object of interest is available during training
or testing/inference. Moreover, visual object tracking is a class-agnostic task, as opposed to
object detection and segmentation tasks. The main objective of this thesis is to develop and
advance visual object trackers using novel deep learning framework designs and mathematical
formulations.
To take advantage of different trackers, a novel framework is developed to track moving
objects based on a composite framework and a reporter mechanism. The composite framework
has built-in trackers and user-defined trackers to track the object of interest. The framework
contains a module to calculate the robustness of each tracker, and a reporter mechanism serves
as a recovery mechanism when the trackers fail to locate the object of interest.
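The selection logic of such a composite framework can be sketched as follows. The peak-to-mean robustness score, the threshold, and the `reporter` recovery callable are all illustrative assumptions; the thesis's actual robustness measure and reporter design may differ.

```python
import numpy as np

def composite_track(trackers, frame, reporter, threshold=2.0):
    """Run every built-in/user-defined tracker on the frame, score each by a
    simple peak-to-mean robustness measure on its response map, and return
    the box from the most robust tracker. If all scores fall below
    `threshold`, the reporter is invoked as a recovery mechanism."""
    scored = []
    for track in trackers:
        box, response = track(frame)
        score = float(np.max(response) / (np.mean(response) + 1e-8))
        scored.append((score, box))
    best_score, best_box = max(scored, key=lambda s: s[0])
    if best_score < threshold:
        return reporter(frame)   # recovery, e.g. a global re-detection pass
    return best_box

# Toy demo with two stub trackers (box, response-map) and a stub reporter.
sharp = lambda f: ((10, 10, 20, 20), np.eye(5) * 5.0)  # confident peak
flat = lambda f: ((0, 0, 5, 5), np.ones((5, 5)))       # no clear peak
recover = lambda f: (0, 0, 0, 0)
box = composite_track([sharp, flat], frame=None, reporter=recover)
```

Here the confident tracker wins because its response map has a strong peak relative to its mean, while a flat response map signals an unreliable tracker.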
Different trackers may fail to track the object of interest; thus, a more robust framework
based on the Siamese network architecture, namely DensSiam, is proposed. It uses the concept of dense layers and connects each dense layer in the network to all layers in a feed-forward fashion
with a similarity-learning function. DensSiam also includes a Self-Attention mechanism to
force the network to pay more attention to non-local features during offline training.
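The non-local self-attention mechanism can be sketched in a few lines: each spatial position attends to every other position, so the output mixes in non-local features. This is a generic minimal sketch with shared query/key/value projections for brevity; DensSiam's actual attention module and projection weights are not reproduced here.

```python
import numpy as np

def self_attention(x):
    """Non-local self-attention over flattened spatial positions.
    x: (N, C) feature map with N = H*W positions. Each output position is a
    softmax-weighted sum over all positions, added back residually."""
    q, k, v = x, x, x                              # shared projections (sketch)
    scores = q @ k.T / np.sqrt(x.shape[1])         # (N, N) pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over positions
    return x + attn @ v                            # residual connection

feat = np.random.default_rng(0).normal(size=(16, 8))  # a 4x4 map, 8 channels
out = self_attention(feat)
```

Because every row of the attention matrix spans all positions, gradients during offline training encourage the network to weight distant (non-local) features, which is the effect described above.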
Generally, Siamese trackers do not fully utilize semantic and objectness information from
pre-trained networks that have been trained on an image classification task. To solve this
problem, a novel architecture design, dubbed DomainSiam, is proposed to learn a Domain-Aware
network that fully utilizes semantic and objectness information while producing a class-agnostic
tracker using a ridge regression network. Moreover, to reduce the sparsity problem, we solve the
ridge regression problem with a differentiable weighted-dynamic loss function.
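For reference, plain weighted ridge regression has the closed-form solution sketched below. This is the textbook formulation only; DomainSiam's differentiable weighted-dynamic loss is a learned variant of this idea and is not reproduced here, and the sample weights `w` shown are illustrative.

```python
import numpy as np

def weighted_ridge(X, y, w, lam=1.0):
    """Closed-form weighted ridge regression:
    argmin_beta  sum_i w_i (y_i - x_i . beta)^2 + lam * ||beta||^2
    solved via the normal equations (X^T W X + lam I) beta = X^T W y."""
    W = np.diag(w)
    d = X.shape[1]
    return np.linalg.solve(X.T @ W @ X + lam * np.eye(d), X.T @ W @ y)

# Synthetic sanity check: recover known coefficients from noisy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ beta_true + 0.01 * rng.normal(size=50)
beta = weighted_ridge(X, y, w=np.ones(50), lam=1e-3)
```

The regularizer `lam` shrinks the solution and fights sparsity/ill-conditioning; the per-sample weights `w` are where a dynamic, learned weighting scheme would enter.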
Siamese trackers are fast and run in real time; however, they lack high accuracy.
To overcome this challenge, a novel dynamic policy-gradient Agent-Environment architecture
with a Siamese network (DP-Siam) is proposed to train the tracker to increase the accuracy and
the expected average overlap while running in real time. DP-Siam is trained offline with
reinforcement learning to produce a continuous action that predicts the optimal object location.
One of the common design blocks in most object trackers in the literature is the backbone
network, which is trained in the feature space. To design a backbone network that maps from
the feature space to another space (i.e., the joint-nullspace) that is more suitable for object
tracking and classification, a novel framework is proposed. The new framework, called
NullSpaceNet, has a clear interpretation of the feature representation, and the features in this
space are more separable. This contrasts with the feature space, where objects from a specific
class are grouped into one category but the representation is insensitive to intra-class
variations. Furthermore, the NullSpaceNet backbone is used to learn a tracker, dubbed
NullSpaceRDAR, with a regularized discriminative joint-nullspace backbone network that is
specifically designed for object tracking and encourages the network to represent the
target-specific information of the object of interest in the joint-nullspace. In the regularized
discriminative joint-nullspace, features from the same target are collapsed into one point and
features from different targets are collapsed into different points. Consequently, the
joint-nullspace forces the network to be sensitive to variations of the object from the same
class (intra-class variations). Moreover, a dynamic adaptive loss function is proposed to select
the suitable loss function from a super-set family of losses based on the training data, making
NullSpaceRDAR more robust to different challenges.
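The idea of selecting a loss from a super-set family based on the training data can be illustrated with a simple hand-crafted rule: heavy-tailed residual distributions favour robust losses. The kurtosis-based selection rule and thresholds below are hypothetical, chosen only to illustrate the concept; NullSpaceRDAR's actual dynamic adaptive criterion is part of the learned framework and may be entirely different.

```python
import numpy as np

# A small super-set family of candidate losses over residuals r.
LOSSES = {
    "l2": lambda r: 0.5 * r**2,
    "l1": lambda r: np.abs(r),
    "huber": lambda r: np.where(np.abs(r) <= 1.0,
                                0.5 * r**2, np.abs(r) - 0.5),
}

def select_loss(residuals):
    """Hypothetical selection rule: use the (uncorrected) kurtosis of the
    training residuals to pick a loss. Heavier tails -> more robust loss."""
    kurt = np.mean(residuals**4) / (np.mean(residuals**2) ** 2 + 1e-12)
    if kurt > 6.0:
        return "l1"       # very heavy tails: fully robust loss
    if kurt > 3.5:
        return "huber"    # moderately heavy tails
    return "l2"           # near-Gaussian residuals

# Residuals dominated by one large outlier -> heavy-tailed -> robust loss.
r = np.array([0.1] * 20 + [8.0])
name = select_loss(r)
loss = LOSSES[name](r).mean()
```

The point of the sketch is only the mechanism: the data's statistics, not a fixed design choice, decide which member of the loss family is applied.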
Hard Negative Samples Emphasis Tracker without Anchors
Trackers based on Siamese networks have shown tremendous success because of
their balance between accuracy and speed. Nevertheless, as tracking scenarios
become more and more sophisticated, most existing Siamese-based approaches
ignore the problem of distinguishing the tracking target
from hard negative samples during the tracking phase. The features learned by these
networks lack discrimination, which significantly weakens the robustness of
Siamese-based trackers and leads to suboptimal performance. To address this
issue, we propose a simple yet efficient hard negative samples emphasis method,
which constrains the Siamese network to learn features that are aware of hard
negative samples and enhance the discrimination of embedding features. Through
a distance constraint, we shorten the distance between the exemplar vector
and positive vectors while enlarging the distance between the exemplar vector
and hard negative vectors. Furthermore, we explore a novel anchor-free tracking
framework in a per-pixel prediction fashion, which can significantly reduce the
number of hyper-parameters and simplify the tracking process by taking full
advantage of the convolutional neural network representation. Extensive
experiments on six standard benchmark datasets demonstrate that the proposed
method performs favorably against state-of-the-art approaches.
Comment: accepted by ACM Multimedia Conference, 202
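The distance constraint described above can be sketched as a triplet-style objective: pull positive embeddings toward the exemplar and push hard negative embeddings at least a margin away. The function name, the squared-hinge form, and the margin value are assumptions standing in for the paper's exact constraint.

```python
import numpy as np

def hard_negative_emphasis_loss(exemplar, positives, hard_negatives,
                                margin=1.0):
    """Triplet-style sketch of the distance constraint:
    - pull: mean squared distance from the exemplar to positive vectors;
    - push: penalize hard negatives closer than `margin` to the exemplar."""
    d_pos = np.linalg.norm(positives - exemplar, axis=1)
    d_neg = np.linalg.norm(hard_negatives - exemplar, axis=1)
    pull = np.mean(d_pos**2)
    push = np.mean(np.maximum(0.0, margin - d_neg) ** 2)
    return pull + push

# Toy embeddings: positives near the exemplar, hard negatives far away.
ex = np.zeros(4)
pos = np.full((3, 4), 0.1)
neg = np.full((3, 4), 2.0)
loss = hard_negative_emphasis_loss(ex, pos, neg)
```

With negatives already beyond the margin the push term vanishes, so the gradient acts only to tighten the positive cluster; negatives inside the margin would be pushed out, which is the hard-negative emphasis effect.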
Adversarial Attacks on Video Object Segmentation with Hard Region Discovery
Video object segmentation has been applied to various computer vision tasks,
such as video editing, autonomous driving, and human-robot interaction.
However, methods based on deep neural networks are vulnerable to
adversarial examples, i.e., inputs perturbed by almost
human-imperceptible noise with which the adversary (i.e., attacker) fools
the segmentation model into making incorrect pixel-level predictions. This
raises security issues in highly demanding tasks, because small perturbations
of the input video create potential attack risks. Though adversarial
examples have been studied extensively for classification, they are rarely studied in
video object segmentation. Existing related methods in computer vision either
require prior knowledge of the categories or cannot be directly applied, due to
their special design for certain tasks, and fail to consider pixel-wise region
attacks. Hence, this work develops an object-agnostic adversary that has
adversarial impacts on video object segmentation (VOS) by attacking the first frame via hard region discovery.
Particularly, the gradients from the segmentation model are exploited to
discover the easily confused region, in which it is difficult to identify the
pixel-wise objects from the background in a frame. This provides a hardness map
that helps to generate perturbations with a stronger adversarial power for
attacking the first frame. Empirical studies on three benchmarks indicate that
our attacker significantly degrades the performance of several state-of-the-art
video object segmentation models.
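The hardness-map idea can be sketched with an FGSM-style perturbation concentrated on the hardest pixels. The function below is a hypothetical stand-in for the paper's attacker: in practice `grad` would come from backpropagating the segmentation loss to the first frame, and the quantile cutoff and step size are illustrative assumptions.

```python
import numpy as np

def attack_first_frame(frame, grad, epsilon=0.03, top_frac=0.2):
    """Sketch of a first-frame attack via hard region discovery:
    - treat the input-gradient magnitude as a per-pixel hardness map;
    - keep only the hardest `top_frac` of pixels;
    - apply a sign-gradient (FGSM-style) step of size `epsilon` there."""
    hardness = np.abs(grad)
    thresh = np.quantile(hardness, 1.0 - top_frac)   # hardest 20% cutoff
    mask = (hardness >= thresh).astype(float)
    perturbed = frame + epsilon * np.sign(grad) * mask
    return np.clip(perturbed, 0.0, 1.0)             # stay a valid image

# Toy demo on a random "frame" and a random stand-in gradient.
rng = np.random.default_rng(0)
frame = rng.uniform(size=(8, 8))
grad = rng.normal(size=(8, 8))
adv = attack_first_frame(frame, grad)
```

Restricting the perturbation to the hardest region keeps the attack budget small and nearly imperceptible while targeting exactly the pixels the model already struggles to separate from the background.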