
    Scalable Deep Learning Architecture Design.

    PhD Thesis. The past decade has witnessed rapid development in deep learning research, enabling remarkable progress on a wide spectrum of computer vision tasks such as object recognition, segmentation, and detection. One generic mechanism for deep learning in computer vision is to design optimal deep neural architectures for given tasks, so as to learn compact, rich and expressive features from data collected by artificial visual sensors. Nonetheless, deep neural architecture design for computer vision tasks remains challenging due to the inherent complexity and uncertainty of visual tasks. One cannot guarantee that a network designed for one task will work well for new tasks, especially when scalability (model size, learning capacity and efficiency, and domain adaptation to new data) is considered. Unfortunately, there are no theoretical principles to guide deep neural architecture design, so researchers have to rely on their own expertise and experience in an ad hoc manner. This thesis investigates approaches to designing deep neural architectures for several tasks by considering the underlying task characteristics, yielding more efficient and powerful deep models. More specifically, this thesis develops new methods for addressing four problems, as follows:

Chapter 3: The first problem is harmonious attention network design for scalable person re-identification (re-id). Existing person re-id deep learning methods rely heavily on large and computationally expensive convolutional neural networks. They are therefore not scalable to large-scale re-id deployment scenarios that require processing large amounts of surveillance video data, due to the lengthy inference process and high computing costs. In this chapter, we address this limitation by jointly learning re-id attention selection. Specifically, we formulate a novel Harmonious Attention Network (HAN) framework to jointly learn soft pixel attention and hard regional attention alongside simultaneous deep feature representation learning, enabling more discriminative re-id matching by efficient networks with more scalable inference. Extensive evaluations validate the cost-effectiveness superiority of the proposed HAN approach for person re-id against a wide variety of state-of-the-art methods on large benchmark datasets.

Chapter 4: The second problem is hierarchical distillation network design for scalable person search. Existing person search methods typically focus on improving person detection accuracy, ignoring model inference efficiency, which is fundamentally important for real-world applications. In this chapter, we address this limitation by investigating the scalability of person search in terms of both model accuracy and inference efficiency simultaneously. Specifically, we formulate a Hierarchical Distillation Learning (HDL) approach. With HDL, we aim to comprehensively distil the knowledge of a teacher model with strong learning capability to a lightweight student model with weak learning capability. To facilitate the HDL process, we design a simple and powerful teacher model for joint learning of person detection and person re-identification matching in unconstrained scene images. Extensive experiments show the modelling advantages and cost-effectiveness superiority of HDL over state-of-the-art person search methods on large person search benchmarks.
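As a rough illustration of the soft pixel attention idea in Chapter 3, the sketch below reweights a convolutional feature map with a learned per-pixel mask. The module layout is an assumption for illustration only, not the actual HAN design, which also learns hard regional attention jointly with the re-id features.

```python
# Minimal soft pixel-attention block (illustrative only, not the HAN module).
import torch
import torch.nn as nn

class SoftPixelAttention(nn.Module):
    """Predicts a per-pixel saliency mask and reweights the feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 1, kernel_size=1),
            nn.Sigmoid(),              # attention values in [0, 1]
        )

    def forward(self, x):              # x: (B, C, H, W) CNN feature map
        a = self.mask(x)               # (B, 1, H, W) soft pixel attention
        return x * a                   # suppress background, keep person pixels

feats = torch.randn(2, 64, 32, 16)     # toy re-id feature map
out = SoftPixelAttention(64)(feats)
print(out.shape)                       # torch.Size([2, 64, 32, 16])
```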
Chapter 5: The third problem is neural graph embedding for scalable neural architecture search. Existing neural architecture search (NAS) methods often operate directly in discrete or continuous spaces, ignoring the graphical topology knowledge of neural networks. This leads to suboptimal search performance and efficiency, given that neural networks are essentially directed acyclic graphs (DAGs). In this chapter, we address this limitation by introducing a novel idea of neural graph embedding (NGE). Specifically, we represent the building block (i.e. the cell) of a neural network as a neural DAG, and learn it by leveraging a Graph Convolutional Network to propagate and model the intrinsic topology information of network architectures. This results in a generic neural network representation that can be integrated with different existing NAS frameworks. Extensive experiments show the superiority of NGE over state-of-the-art methods on image classification and semantic segmentation.

Chapter 6: The last problem is scalable neural operator search. Existing NAS methods explore a limited feature-transformation-only search space, ignoring other advanced feature operations such as feature self-calibration by attention and dynamic convolutions. This prevents NAS algorithms from discovering more optimal network architectures. We address this limitation by additionally exploiting feature self-calibration operations, resulting in a heterogeneous search space. To overcome the challenges of operation heterogeneity and a significantly larger search space, we formulate a neural operator search (NOS) method. NOS presents a novel heterogeneous residual block for integrating the heterogeneous operations in a unified structure, and an attention-guided search strategy for facilitating the search process over a vast space. Extensive experiments show that NOS can search novel cell architectures with highly competitive performance on the CIFAR and ImageNet benchmarks.

The final chapter presents concluding remarks and discusses potential areas for future research and extensions.
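To make the neural graph embedding idea of Chapter 5 more concrete, the sketch below runs a single graph-convolution step over a toy cell DAG. The adjacency matrix, node features and layer sizes are illustrative assumptions, not the NGE configuration used in the thesis.

```python
# Toy sketch: embed a NAS cell (a small DAG) with one graph-convolution step.
import torch
import torch.nn as nn
import torch.nn.functional as F

def gcn_layer(A: torch.Tensor, H: torch.Tensor, W: nn.Linear) -> torch.Tensor:
    """One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + torch.eye(A.size(0))              # add self-loops
    d = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return F.relu(A_norm @ W(H))

# 4-node cell DAG: node 0 -> {1, 2}, nodes 1, 2 -> 3 (symmetrised for the GCN)
A = torch.tensor([[0., 1., 1., 0.],
                  [1., 0., 0., 1.],
                  [1., 0., 0., 1.],
                  [0., 1., 1., 0.]])
H = torch.randn(4, 8)                                 # per-node operation embeddings
emb = gcn_layer(A, H, nn.Linear(8, 16)).mean(dim=0)   # pooled cell embedding
print(emb.shape)                                      # torch.Size([16])
```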

    Pose-Guided Semantic Person Re-Identification in Surveillance Data


    Person Re-identification and Tracking in Video Surveillance

    Video surveillance is one of the most essential topics in the computer vision field. With the rapid and continuous increase in the use of surveillance cameras to capture portrait information in scenes, it has become a very important technology for security and criminal investigations. A video surveillance system comprises many key technologies, including object recognition, object localization, object re-identification and object tracking, by which the system can identify and follow the movements of objects and persons. In recent years, person re-identification and visual object tracking have become hot research directions in computer vision. A re-identification system aims to recognize and identify a target with the required attributes, while a tracking system aims to follow and predict the movement of the target after it has been identified. Researchers have used deep learning and computer vision techniques to significantly improve the performance of person re-identification. However, person re-identification remains challenging due to complex application environments involving lighting variations, complex background transformations, low-resolution images, occlusions, and similar dressing of different pedestrians. The challenge also comes from unavailable bounding boxes for pedestrians and the need to search for a person over whole gallery images. To address these critical issues in modern person identification applications, we propose an algorithm that can accurately localize persons by learning to minimize intra-person feature variations. We build our model upon the state-of-the-art object detection framework, i.e. Faster R-CNN, so that high-quality region proposals for pedestrians can be produced in an online manner. In addition, to relieve the negative effects caused by varying visual appearances of the same individual, we introduce a novel center loss that increases the intra-class compactness of feature representations. The center loss encourages persons with the same identity to have similar feature characteristics.

Besides the localization of a single person, we explore the more general visual object tracking problem. The main task of visual object tracking is to predict the location and size of the tracking target accurately and reliably in subsequent frames, given the target at the beginning of the sequence. A visual object tracking algorithm with high accuracy, good stability and fast inference speed is necessary. In this thesis, we study the model updating problem for two kinds of mainstream tracking algorithms and improve their robustness and accuracy. Firstly, we extend the siamese tracker with a model updating mechanism to improve its tracking robustness. A siamese tracker uses a deep convolutional neural network to extract features and compares the features of the new frame with the target features from the first frame; the candidate region with the highest similarity score is taken as the tracking result. However, such trackers are not robust against large target variation due to the no-update matching strategy used throughout the tracking process. To combat this defect, we propose an ensemble siamese tracker, where the final similarity score is also affected by the similarity with tracking results in recent frames instead of solely considering the first frame. Tracking results in recent frames are used to adapt the model to continuous target change. Meanwhile, we combine an adaptive candidate sampling strategy and a large-displacement optical flow method to further improve its performance.
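The following sketch illustrates the ensemble matching idea described above: the final score blends similarity to the first-frame template with similarity to templates from recent tracking results. The cosine similarity and the fixed blending weight are simplifying assumptions; the actual ensemble siamese tracker is more involved.

```python
# Minimal ensemble similarity sketch (illustrative blending scheme).
import torch
import torch.nn.functional as F

def ensemble_similarity(candidate, first_template, recent_templates, alpha=0.6):
    """Blend similarity to the first-frame template with the mean similarity
    to templates taken from recent tracking results."""
    s_first = F.cosine_similarity(candidate, first_template, dim=0)
    if len(recent_templates) == 0:
        return s_first                           # no history yet
    s_recent = torch.stack(
        [F.cosine_similarity(candidate, t, dim=0) for t in recent_templates]
    ).mean()
    return alpha * s_first + (1.0 - alpha) * s_recent

cands = [torch.randn(128) for _ in range(5)]     # candidate region features
first = torch.randn(128)                         # first-frame target feature
recent = [torch.randn(128) for _ in range(3)]    # features of recent results
scores = [ensemble_similarity(c, first, recent).item() for c in cands]
best = max(range(len(cands)), key=lambda i: scores[i])
print(best, scores[best])                        # index and score of best candidate
```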
Secondly, we investigate the classic correlation filter based tracking algorithm and propose a better model selection strategy based on reinforcement learning. The correlation filter has proven to be a useful tool for a number of approaches in visual tracking, particularly for seeking a good balance between tracking accuracy and speed. However, correlation filter based models are susceptible to wrong updates stemming from inaccurate tracking results. To date, little effort has been devoted to handling the correlation filter update problem. In our approach, we update and maintain multiple correlation filter models in parallel, and use deep reinforcement learning to select an optimal correlation filter model among them. To make the decision process efficient, we propose a decision-net to deal with target appearance modeling, which is trained on hundreds of challenging videos using proximal policy optimization and a lightweight learning network. An exhaustive evaluation of the proposed approach on the OTB100 and OTB2013 benchmarks shows the effectiveness of our approach.
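A minimal sketch of the model selection idea follows: a small decision network scores several correlation filter models maintained in parallel, and the highest-scoring one is used for the current frame. The state encoding and network shape are assumptions for illustration; this is not the PPO-trained decision-net described above.

```python
# Toy decision network for choosing among K parallel correlation-filter models.
import torch
import torch.nn as nn

class DecisionNet(nn.Module):
    """Scores K candidate filter models given an appearance feature."""
    def __init__(self, feat_dim: int, num_models: int):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(feat_dim, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, num_models),   # one logit per candidate filter model
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.policy(state), dim=-1)

state = torch.randn(1, 128)              # appearance feature of the current frame
probs = DecisionNet(feat_dim=128, num_models=4)(state)
chosen = probs.argmax(dim=-1)            # index of the filter model to trust
print(probs, chosen)
```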

    Knowledge Transfer in Object Recognition.

    PhD Thesis. Object recognition is a fundamental and long-standing problem in computer vision. Since the latest resurgence of deep learning, thousands of techniques have been proposed and brought to commercial products to facilitate people's daily lives. Although remarkable achievements in object recognition have been witnessed, existing machine learning approaches remain far from the human vision system, especially in learning new concepts and in Knowledge Transfer (KT) across scenarios. One main reason is that current learning approaches address isolated tasks by independently training predefined models, without considering any knowledge learned from previous tasks or models. In contrast, humans have an inherent ability to transfer knowledge acquired from earlier tasks or people to new scenarios. Therefore, to scale object recognition to realistic deployment, effective KT schemes are required. This thesis studies several aspects of KT for scaling object recognition systems. Specifically, to facilitate the KT process, several mechanisms on fine-grained and coarse-grained object recognition tasks are analyzed and studied, including 1) cross-class KT in person re-identification (re-id); 2) cross-domain KT in person re-identification; 3) cross-model KT in image classification; 4) cross-task KT in image classification. In summary, four types of knowledge transfer schemes are discussed as follows:

Chapter 3: Cross-class KT in person re-identification, one of the representative fine-grained object recognition tasks, is investigated first. Person identity classes in re-id are completely disjoint between training and testing (a zero-shot learning problem), resulting in a high demand for cross-class KT. To this end, existing person re-id approaches aim to derive a feature representation for pairwise similarity based matching and ranking that is able to generalise to the test classes. However, current person re-id methods assume the provision of accurately cropped person bounding boxes at a common resolution, ignoring the impact of background noise and varying image scale on cross-class KT. This is more severe in practice, when person bounding boxes must be detected automatically over a very large number of images and/or videos (unconstrained scene images). To address these challenges, this chapter provides two novel approaches, aiming to promote cross-class KT and boost re-id performance. 1) This chapter alleviates the problem of inaccurate person bounding boxes by developing a joint learning deep model that optimises person re-id attention selection within any auto-detected person bounding boxes by reinforcement learning of background clutter minimisation. Specifically, this chapter formulates a novel unified re-id architecture called Identity DiscriminativE Attention reinforcement Learning (IDEAL) to accurately select re-id attention in auto-detected bounding boxes for optimising re-id performance. 2) This chapter addresses the multi-scale problem by proposing a Cross-Level Semantic Alignment (CLSA) deep learning approach capable of learning more discriminative identity feature representations in a unified end-to-end model. This is realised by exploiting the in-network feature pyramid structure of a deep neural network, enhanced by a novel cross pyramid-level semantic alignment loss function. Extensive experiments show the modelling advantages and performance superiority of both IDEAL and CLSA over state-of-the-art re-id methods on widely used benchmarking datasets.
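As a hedged illustration of a cross-level alignment objective, the sketch below pulls class posteriors predicted from lower pyramid levels towards those of the top level via a temperature-softened KL divergence; the exact CLSA loss used in the thesis may differ.

```python
# Sketch of aligning lower pyramid-level predictions with the top level.
import torch
import torch.nn.functional as F

def cross_level_alignment_loss(low_level_logits, top_level_logits, T=2.0):
    """KL-align each lower-level identity prediction with the top-level one."""
    target = F.softmax(top_level_logits / T, dim=1).detach()  # reference posterior
    loss = 0.0
    for logits in low_level_logits:
        log_p = F.log_softmax(logits / T, dim=1)
        loss = loss + F.kl_div(log_p, target, reduction="batchmean") * (T * T)
    return loss / max(len(low_level_logits), 1)

top = torch.randn(8, 751)                       # top pyramid level, 751 identities
lows = [torch.randn(8, 751) for _ in range(2)]  # two lower pyramid levels
print(cross_level_alignment_loss(lows, top))
```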
Chapter 4: This chapter addresses the problem of cross-domain KT in unsupervised domain adaptation for person re-id. Specifically, this chapter considers cross-domain KT in two settings: 1) unsupervised domain adaptation, a “train once, run once” pattern that transfers knowledge from a source domain to a specific target domain, so that the model can be applied to that target domain only; 2) universal re-id, a “train once, run everywhere” pattern that transfers knowledge from a source domain to any target domain, and is therefore deployable on any re-id domain. This chapter first develops a novel Hierarchical Unsupervised Domain Adaptation (HUDA) method for unsupervised domain adaptation in re-id. It can automatically transfer the labelled information of an existing dataset (a source domain) to an unlabelled target domain for unsupervised person re-id. Specifically, HUDA is designed to jointly model global distribution alignment and local instance alignment in a two-level hierarchy for discovering transferable source knowledge in unsupervised domain adaptation. Crucially, this approach aims to overcome the under-constrained learning problem of existing unsupervised domain adaptation methods, which lack a local instance alignment constraint. The consequence is more effective cross-domain KT from the labelled source domain to the unlabelled target domain. This chapter further addresses the limitation of the “train once, run once” pattern of existing domain adaptation person re-id approaches by presenting a novel “train once, run everywhere” pattern. The conventional “train once, run once” pattern is unscalable to the large number of target domains typically encountered in real-world deployments, due to the requirement of training a separate model for each target domain, as in supervised learning methods. To mitigate this weakness, a novel Universal Model Learning (UML) approach is formulated to enable domain-generic person re-id using only limited training data from a “single” seed domain. Specifically, UML trains a universal re-id model to discriminate between a set of transformed person identity classes. Each such class is formed by applying a variety of random appearance transformations to the images of that class, where the transformations simulate the camera viewing conditions of arbitrary domains, making the model domain-generic.

Chapter 5: The third problem considered in this thesis is cross-model KT in coarse-grained object recognition. This chapter discusses knowledge distillation in image classification. Knowledge distillation is an effective approach to transfer knowledge from a large teacher neural network to a small student (target) network in order to satisfy low-memory and fast-inference requirements. Whilst able to create stronger target networks than the vanilla, teacher-free learning strategy, this scheme additionally requires training a large teacher model at considerable computational cost, and involves complex multi-stage training.
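For reference, a generic (Hinton-style) teacher-student distillation loss is sketched below, blending a temperature-softened KL term with the usual cross-entropy; it is shown only to make the baseline scheme concrete, not the SRDL or ONE variants proposed in this chapter.

```python
# Generic teacher-student distillation loss (baseline scheme, not SRDL/ONE).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend soft-target KL (teacher knowledge) with the usual hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

s = torch.randn(16, 10)                 # student logits
t = torch.randn(16, 10)                 # teacher logits (frozen in practice)
y = torch.randint(0, 10, (16,))
print(distillation_loss(s, t, y))
```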
This chapter first presents a Self-Referenced Deep Learning (SRDL) strategy to accelerate the training process. Unlike both vanilla optimisation and knowledge distillation, SRDL distils the knowledge discovered by the in-training target model back to itself to regularise the subsequent learning procedure, thereby eliminating the need to train a large teacher model. Secondly, an On-the-fly Native Ensemble (ONE) learning strategy for one-stage knowledge distillation is proposed to remove the complex multi-stage training. Specifically, ONE trains only a single multi-branch network while simultaneously establishing a strong teacher on-the-fly to enhance the learning of the target network.

Chapter 6: Fourth, this thesis studies cross-task KT in coarse-grained object recognition. This chapter focuses on the few-shot classification problem, which aims to train models capable of recognising new, previously unseen categories of a novel task using only limited training samples. Existing metric learning approaches constitute a highly popular strategy, learning discriminative representations such that images from different classes are well separated in an embedding space. The commonly held assumption that each class is summarised by a single, global representation (referred to as a prototype) that is then used as a reference to infer class labels brings significant drawbacks. This formulation fails to capture the complex multi-modal latent distributions that often exist in real-world problems, and yields models that are highly sensitive to prototype quality. To address these limitations, this chapter proposes a novel Mixture of Prototypes (MP) approach that learns multi-modal class representations and can be integrated into existing metric based methods. MP models class prototypes as a group of feature representations carefully designed to be highly diverse and to maximise ensembling performance. Furthermore, this thesis investigates the benefit of incorporating unlabelled data in cross-task KT, focusing on the problem of Semi-Supervised Few-shot Learning (SS-FSL). Recent SS-FSL work has relied on popular Semi-Supervised Learning (SSL) concepts involving iterative pseudo-labelling, yet often yields models that are susceptible to error propagation and sensitive to initialisation. To address this limitation, this chapter introduces a novel prototype-based approach (Fewmatch) for SS-FSL that exploits model Consistency Regularization (CR) in a robust manner and promotes cross-task knowledge transfer from unlabelled data. Fewmatch exploits unlabelled data via a Dynamic Prototype Refinement (DPR) approach, where novel class prototypes are alternately refined 1) explicitly, using unlabelled data with high-confidence class predictions, and 2) implicitly, by model fine-tuning with a data-selective model CR loss. DPR affords CR convergence, with the explicit refinement providing an increasingly strong initialisation, and alleviates the issue of error propagation due to the application of CR.

Chapter 7 draws conclusions and suggests future work extending the ideas and methods developed in this thesis.
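As a toy illustration of the multi-prototype matching idea behind MP in Chapter 6, the sketch below keeps several prototypes per class and scores a query against the closest one; the min-distance aggregation is an illustrative assumption, and MP's diversity-driven ensembling design is richer.

```python
# Toy multi-prototype class matching (illustrative, not the MP method itself).
import torch

def multi_prototype_logits(query: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """query: (D,); prototypes: (num_classes, K, D). Returns (num_classes,) scores."""
    d = ((prototypes - query) ** 2).sum(dim=-1)    # (num_classes, K) squared distances
    return -d.min(dim=-1).values                   # score = negative closest distance

query = torch.randn(64)                            # embedded query image
protos = torch.randn(5, 3, 64)                     # 5 novel classes, 3 prototypes each
pred = multi_prototype_logits(query, protos).argmax()
print(pred)                                        # predicted class index
```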