46 research outputs found

    Deep Feature Learning and Adaptation for Computer Vision

    We are living through a deep learning revolution. In general, deep learning models have a backbone that extracts features from the input data, followed by task-specific layers, e.g. for classification. This dissertation proposes various deep feature extraction and adaptation methods to improve task-specific learning, such as visual re-identification, tracking, and domain adaptation. The vehicle re-identification (VRID) task requires identifying a given vehicle among a set of vehicles under variations in viewpoint, illumination, partial occlusion, and background clutter. We propose a novel local graph aggregation module for feature extraction to improve VRID performance. We also utilize a class-balanced loss to compensate for the unbalanced class distribution in the training dataset. Overall, our framework achieves state-of-the-art (SOTA) performance on multiple VRID benchmarks. We further extend our VRID method to visual object tracking under occlusion. We motivate visual object tracking from aerial platforms by benchmarking tracking methods on aerial datasets. Our study reveals that current techniques have limited capabilities to re-identify objects that have been fully occluded or out of view. Siamese-network-based trackers perform well compared to others in overall tracking performance. We utilize our VRID work in visual object tracking and propose Siam-ReID, a novel tracking method using a Siamese network and VRID technique. In another approach, we propose SiamGauss, a novel Siamese network with a Gaussian Head for improved confuser suppression and real-time performance. Our approach achieves SOTA performance on aerial visual object tracking datasets. A related area of research is developing deep-learning-based domain adaptation techniques. We propose continual unsupervised domain adaptation, a novel paradigm for domain adaptation in data-constrained environments. We show that existing works fail to generalize when the target domain data are acquired in small batches. We propose to use a buffer to store samples that were previously seen by the network and a novel loss function to improve the performance of continual domain adaptation. We further extend our continual unsupervised domain adaptation research to gradually varying domains. Our method outperforms several SOTA methods even though they have the entire domain data available during adaptation.
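
The abstract does not say which class-balanced loss the dissertation uses. As a rough illustration only, the sketch below computes per-class weights with one common formulation, the "effective number of samples" heuristic, where a class with n samples gets a weight proportional to (1 - beta) / (1 - beta^n). The function name, the `beta` default, and the vehicle labels are assumptions made for this example.

```python
from collections import Counter


def class_balanced_weights(labels, beta=0.999):
    """Per-class weights from the 'effective number of samples' heuristic.

    A class with n samples gets a raw weight (1 - beta) / (1 - beta ** n).
    Weights are then rescaled so that they sum to the number of classes.
    This is one possible class-balanced scheme, not necessarily the one
    used in the dissertation.
    """
    counts = Counter(labels)
    raw = {c: (1.0 - beta) / (1.0 - beta ** n) for c, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}


# Hypothetical vehicle identities: "car_a" appears far more often than "car_b".
labels = ["car_a"] * 90 + ["car_b"] * 10
weights = class_balanced_weights(labels)
```

In a training loop, each sample's loss term would typically be multiplied by the weight of its class, so that rare vehicle identities contribute more per sample than frequent ones.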

    Visual object tracking in dynamic scenes

    Visual object tracking is a fundamental task in the field of computer vision. It is widely used in numerous applications, including but not limited to video surveillance, image understanding, robotics, and human-computer interaction. In essence, visual object tracking is the problem of estimating the state or trajectory of the object of interest over time. Unlike tasks such as object detection, where the classes/categories are defined beforehand, the only available information about the object of interest comes from the first frame. Even though Deep Learning (DL) has revolutionised most computer vision tasks, visual object tracking still poses several challenges. The visual object tracking task is stochastic in nature: no prior knowledge about the object of interest is available during training or testing/inference. Moreover, visual object tracking is a class-agnostic task, as opposed to object detection and segmentation. The main objective of this thesis is to develop and advance visual object trackers using novel deep learning framework designs and mathematical formulations. To take advantage of different trackers, a novel framework is developed to track moving objects based on a composite framework and a reporter mechanism. The composite framework has built-in trackers and user-defined trackers to track the object of interest. It contains a module that calculates the robustness of each tracker, and the reporter mechanism serves as a recovery mechanism if the trackers fail to locate the object of interest. Because different trackers may still fail to track the object of interest, a more robust framework based on a Siamese network architecture, namely DensSiam, is proposed. DensSiam uses the concept of dense layers, connecting each dense layer in the network to all layers in a feed-forward fashion, together with a similarity-learning function. DensSiam also includes a Self-Attention mechanism to force the network to pay more attention to non-local features during offline training. Generally, Siamese trackers do not fully utilize semantic and objectness information from pre-trained networks that have been trained on an image classification task. To solve this problem, a novel Domain-Aware architecture, dubbed DomainSiam, is proposed that fully utilizes semantic and objectness information while producing a class-agnostic track using a ridge regression network. Moreover, to reduce the sparsity problem, we solve the ridge regression problem with a differentiable weighted-dynamic loss function. Siamese trackers are fast and run in real time; however, they lack high accuracy. To overcome this challenge, a novel dynamic policy gradient Agent-Environment architecture with a Siamese network (DP-Siam) is proposed to train the tracker to increase the accuracy and the expected average overlap while running in real time. DP-Siam is trained offline with reinforcement learning to produce a continuous action that predicts the optimal object location. One of the common design blocks in most object trackers in the literature is the backbone network, which is trained in the feature space. To design a backbone network that maps from the feature space to another space (i.e., the joint-nullspace) and is more suitable for object tracking and classification, a novel framework is proposed. The new framework, called NullSpaceNet, has a clear interpretation for the feature representation, and the features in this space are more separable. NullSpaceNet is utilized in object tracking by regularizing the discriminative joint-nullspace backbone network. The novel tracker, called NullSpaceRDAR, encourages the network to have a representation of the target-specific information for the object of interest in the joint-nullspace. This is in contrast to the feature space, where objects from a specific class are categorized into one category, which makes it insensitive to intra-class variations. Furthermore, the NullSpaceNet backbone is used to learn this tracker with a regularized discriminative joint-nullspace backbone network that is specifically designed for object tracking. In the regularized discriminative joint-nullspace, features from the same target are collapsed into one point, and features from different targets are collapsed into different points. Consequently, the joint-nullspace forces the network to be sensitive to the variations of the object from the same class (intra-class variations). Moreover, a dynamic adaptive loss function is proposed to select the suitable loss function from a super-set family of losses based on the training data, to make NullSpaceRDAR more robust to different challenges.
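
For readers unfamiliar with the ridge regression step mentioned for DomainSiam, the sketch below solves the standard ridge objective, min_w ||Xw - y||^2 + lam * ||w||^2, through its closed form (X^T X + lam * I) w = X^T y. It is plain textbook ridge regression in pure Python; the thesis's ridge regression network and its weighted-dynamic loss are not described in the abstract and are not reproduced here. The names `ridge_fit`, `X`, `y`, and `lam` are assumptions for this example.

```python
def ridge_fit(X, y, lam=1.0):
    """Solve (X^T X + lam * I) w = X^T y by Gauss-Jordan elimination.

    X is a list of feature rows, y a list of targets, lam >= 0 the
    regularization strength. For lam > 0 the system matrix is symmetric
    positive definite, so every pivot is nonzero.
    """
    d = len(X[0])
    # Build the augmented system [A | b] with A = X^T X + lam * I, b = X^T y.
    aug = []
    for i in range(d):
        row = [sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
               for j in range(d)]
        row.append(sum(r[i] * t for r, t in zip(X, y)))
        aug.append(row)
    # Reduce A to the identity; the last column then holds the solution.
    for i in range(d):
        pivot = aug[i][i]
        aug[i] = [v / pivot for v in aug[i]]
        for k in range(d):
            if k != i:
                factor = aug[k][i]
                aug[k] = [vk - factor * vi for vk, vi in zip(aug[k], aug[i])]
    return [aug[i][d] for i in range(d)]


# Toy data: with an identity design matrix, lam = 1 halves each target.
w = ridge_fit([[1.0, 0.0], [0.0, 1.0]], [2.0, 4.0], lam=1.0)
```

Larger values of `lam` shrink the weights toward zero, which is the regularizing effect that makes ridge regression attractive when only a few samples of the target are available.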

    Hard Negative Samples Emphasis Tracker without Anchors

    Trackers based on Siamese networks have shown tremendous success because of their balance between accuracy and speed. Nevertheless, as tracking scenarios become more and more sophisticated, most existing Siamese-based approaches ignore the problem of distinguishing the tracking target from hard negative samples in the tracking phase. The features learned by these networks lack discrimination, which significantly weakens the robustness of Siamese-based trackers and leads to suboptimal performance. To address this issue, we propose a simple yet efficient hard negative samples emphasis method, which constrains the Siamese network to learn features that are aware of hard negative samples and enhances the discrimination of embedding features. Through a distance constraint, we shorten the distance between the exemplar vector and positive vectors while enlarging the distance between the exemplar vector and hard negative vectors. Furthermore, we explore a novel anchor-free tracking framework in a per-pixel prediction fashion, which can significantly reduce the number of hyper-parameters and simplify the tracking process by taking full advantage of the representation of the convolutional neural network. Extensive experiments on six standard benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art approaches. Comment: accepted by ACM Multimedia Conference, 202
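
The distance constraint described above resembles a triplet-style margin loss. The abstract does not give the exact formulation, so the sketch below is only one plausible reading: it pulls positives toward the exemplar and pushes the hardest (closest) negative away by at least a margin. The function names, the Euclidean distance, and the `margin` default are all assumptions for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hard_negative_margin_loss(exemplar, positives, negatives, margin=1.0):
    """Hinge loss pairing each positive with the hardest negative.

    The hardest negative is the one closest to the exemplar. The loss is
    zero once every positive is closer to the exemplar than that negative
    by at least `margin`.
    """
    hardest_neg = min(euclidean(exemplar, n) for n in negatives)
    losses = [max(0.0, euclidean(exemplar, p) - hardest_neg + margin)
              for p in positives]
    return sum(losses) / len(losses)


# Toy embeddings: one positive at distance 1, negatives at distances 5 and 1.5.
exemplar = [0.0, 0.0]
positives = [[0.0, 1.0]]
loss_with_hard = hard_negative_margin_loss(
    exemplar, positives, [[3.0, 4.0], [0.0, 1.5]])
loss_easy_only = hard_negative_margin_loss(exemplar, positives, [[3.0, 4.0]])
```

With only the distant negative the loss vanishes, while the nearby distractor produces a positive loss. That difference is the sense in which such a constraint emphasizes hard negatives.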

    Adversarial Attacks on Video Object Segmentation with Hard Region Discovery

    Video object segmentation has been applied to various computer vision tasks, such as video editing, autonomous driving, and human-robot interaction. However, methods based on deep neural networks are vulnerable to adversarial examples, which are inputs attacked by almost human-imperceptible perturbations, and the adversary (i.e., attacker) will fool the segmentation model into making incorrect pixel-level predictions. This raises security issues in highly demanding tasks because small perturbations to the input video result in potential attack risks. Though adversarial examples have been extensively used for classification, they are rarely studied in video object segmentation. Existing related methods in computer vision either require prior knowledge of categories or cannot be directly applied due to their special design for certain tasks, failing to consider the pixel-wise region attack. Hence, this work develops an object-agnostic adversary that has adversarial impacts on video object segmentation (VOS) by attacking the first frame via hard region discovery. In particular, the gradients from the segmentation model are exploited to discover the easily confused region, in which it is difficult to distinguish object pixels from the background in a frame. This provides a hardness map that helps to generate perturbations with stronger adversarial power for attacking the first frame. Empirical studies on three benchmarks indicate that our attacker significantly degrades the performance of several state-of-the-art video object segmentation models.
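
To make the idea of a gradient-derived hardness map more concrete, here is a deliberately simplified sketch. It assumes the hardness of a pixel is its normalized gradient magnitude and that the perturbation is a sign-gradient step scaled per pixel by that hardness. Both choices are assumptions for illustration; the paper's actual hard region discovery and perturbation generation are not specified in the abstract. The gradient is passed in as a plain nested list rather than computed from a real segmentation model.

```python
def hardness_map(grad):
    """Normalize gradient magnitudes to [0, 1]; larger means a harder pixel."""
    peak = max(abs(g) for row in grad for g in row)
    if peak == 0.0:
        return [[0.0 for _ in row] for row in grad]
    return [[abs(g) / peak for g in row] for row in grad]


def sign(g):
    """Return -1, 0, or 1 according to the sign of g."""
    return (g > 0) - (g < 0)


def perturb_first_frame(frame, grad, eps=8.0 / 255.0):
    """Sign-gradient step on the first frame, weighted by the hardness map.

    `frame` holds pixel intensities in [0, 1] and `grad` holds the loss
    gradient with respect to each pixel (hypothetical inputs). The result
    is clipped back to the valid intensity range.
    """
    hard = hardness_map(grad)
    return [[min(1.0, max(0.0, x + eps * h * sign(g)))
             for x, g, h in zip(frow, grow, hrow)]
            for frow, grow, hrow in zip(frame, grad, hard)]


# Toy 1x2 "frame": the first pixel has the larger gradient, so it moves more.
adv = perturb_first_frame([[0.5, 0.5]], [[2.0, -1.0]], eps=0.1)
```

In practice the gradient would come from backpropagating a segmentation loss through the model, and only the first frame would be perturbed, in line with the first-frame attack described in the abstract.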