Scalable Deep Learning Architecture Design.
PhD Thesis.
The past decade has witnessed rapid development in deep learning research, which has enabled
remarkable progress on a wide spectrum of computer vision tasks, such as object recognition,
segmentation, and detection. One generic mechanism for deep learning on computer vision is
to design optimal deep neural architectures for given tasks, so as to learn compact, rich and expressive
features for data collected by artificial visual sensors. Nonetheless, deep artificial neural
architecture design for computer vision tasks remains challenging due to the inherent visual task
complexity and uncertainty. One cannot guarantee that a network designed for one task
will work well for new tasks, especially when scalability is considered (model
size, learning capacity and efficiency, and domain adaptation to new data). Unfortunately,
there are no theoretical principles to guide deep neural architecture design, so
researchers must rely on their own ad hoc expertise and experience. This thesis investigates
approaches to designing deep neural architectures for several tasks by considering the underlying
task characteristics for more efficient and powerful deep models. More specifically, this thesis
develops new methods for addressing four different problems as follows:
Chapter 3 The first problem is harmonious attention network design for scalable person re-identification
(re-id). Existing deep learning re-id methods rely heavily
on the utilisation of large and computationally expensive convolutional neural networks. They
are therefore not scalable to large-scale re-id deployment scenarios, which require processing
large amounts of surveillance video data, owing to their lengthy inference and high computing
costs. In this chapter, we address this limitation by jointly learning re-id attention selection.
Specifically, we formulate a novel Harmonious Attention Network (HAN) framework to jointly
learn soft pixel attention and hard regional attention alongside simultaneous deep feature representation
learning, particularly enabling more discriminative re-id matching by efficient networks
with more scalable inference. Extensive evaluations validate the cost-effectiveness superiority of
the proposed HAN approach for person re-id against a wide variety of state-of-the-art methods
on large benchmark datasets.
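As an illustration only (the HAN module itself is more elaborate), the two attention types it combines can be sketched in numpy: a soft pixel-level sigmoid mask that reweights features, and a hard regional selection that crops a discrete sub-region. The shapes and the learned projection `w` here are assumptions for the sketch.

```python
import numpy as np

def soft_pixel_attention(feat, w):
    """Soft pixel attention: a sigmoid saliency map reweights every spatial
    position. feat: (C, H, W) feature map; w: (C,) projection (illustrative)."""
    score = np.tensordot(w, feat, axes=([0], [0]))   # (H, W) saliency scores
    mask = 1.0 / (1.0 + np.exp(-score))              # sigmoid gate in [0, 1]
    return feat * mask                               # broadcast over channels

def hard_regional_attention(feat, top, left, h, w):
    """Hard regional attention: select a discrete sub-region of the map."""
    return feat[:, top:top + h, left:left + w]

feat = np.random.rand(8, 16, 16)
attended = soft_pixel_attention(feat, np.random.rand(8))
region = hard_regional_attention(attended, 4, 4, 8, 8)
```

In the actual framework both attentions are learned jointly with the feature representation; this sketch only shows the mechanics of each selection type.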
Chapter 4 The second problem is hierarchical distillation network design for scalable person
search. Existing person search methods typically focus on improving person detection accuracy.
This ignores model inference efficiency, which is nonetheless fundamental for
real-world applications. In this chapter, we address this limitation by investigating the scalability
problem of person search involving both model accuracy and inference efficiency simultaneously.
Specifically, we formulate a Hierarchical Distillation Learning (HDL) approach. With
HDL, we aim to comprehensively distil the knowledge of a strong teacher model into a
lightweight student model with much weaker learning capability. To facilitate
the HDL process, we design a simple and powerful teacher model for joint learning of person
detection and person re-identification matching in unconstrained scene images. Extensive experiments
show the modelling advantages and cost-effectiveness superiority of HDL over the
state-of-the-art person search methods on large person search benchmarks.
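Hierarchical distillation builds on the standard teacher-student objective. The minimal sketch below shows the basic ingredient, a Hinton-style distillation loss where the student matches the teacher's temperature-softened outputs via KL divergence; the temperature value is an illustrative assumption, not the thesis setting.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2 as is
    conventional so gradients keep a comparable magnitude across temperatures."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return float(kl.mean() * T * T)

s = np.random.randn(4, 10)   # student logits, batch of 4, 10 classes
t = np.random.randn(4, 10)   # teacher logits
loss = distillation_loss(s, t)
```

HDL distils hierarchically (e.g. at multiple levels of the joint detection/re-id model), which this single-loss sketch does not capture.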
Chapter 5 The third problem is neural graph embedding for scalable neural architecture
search. Existing neural architecture search (NAS) methods often operate in discrete or continuous
spaces directly, which ignores the graphical topology knowledge of neural networks. This
leads to suboptimal search performance and efficiency, given that neural networks are essentially
directed acyclic graphs (DAGs). In this chapter, we address this limitation by introducing a novel
idea of neural graph embedding (NGE). Specifically, we represent the building block (i.e. the
cell) of neural networks with a neural DAG, and learn it by leveraging a Graph Convolutional
Network to propagate and model the intrinsic topology information of network architectures.
This results in a generic neural network representation integrable with different existing NAS
frameworks. Extensive experiments show the superiority of NGE over the state-of-the-art methods
on image classification and semantic segmentation.
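The core operation of propagating topology information over a cell DAG can be sketched with one graph-convolution step. This is a generic GCN layer over a toy cell adjacency, with illustrative shapes; the actual NGE embedding and its NAS integration are more involved.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution step: H = ReLU(D^{-1} (A + I) X W).
    A: (n, n) adjacency of the cell DAG; X: (n, d) node (operation) features;
    W: (d, d') learned weights. Self-loops keep each node's own features."""
    A_hat = A + np.eye(A.shape[0])
    D_inv = 1.0 / A_hat.sum(axis=1, keepdims=True)   # row-normalisation
    return np.maximum(D_inv * (A_hat @ X) @ W, 0.0)  # ReLU activation

# Toy cell: 4 operation nodes with edges 0->1, 0->2, 1->3, 2->3
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (1, 3), (2, 3)]:
    A[i, j] = 1.0
X = np.random.rand(4, 6)                  # per-node operation features
H = gcn_layer(A, X, np.random.rand(6, 6))
embedding = H.mean(axis=0)                # pooled embedding of the whole cell
```

The pooled vector is the kind of generic architecture representation that can then be consumed by different NAS frameworks.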
Chapter 6 The last problem is scalable neural operator search. Existing neural architecture
search (NAS) methods explore a limited feature-transformation-only search space, ignoring other
advanced feature operations such as feature self-calibration by attention and dynamic convolutions.
This prevents NAS algorithms from discovering more optimal network architectures.
We address this limitation by additionally exploiting feature self-calibration operations, resulting
in a heterogeneous search space. To overcome the challenges of operation heterogeneity and
significantly larger search space, we formulate a neural operator search (NOS) method. NOS
presents a novel heterogeneous residual block for integrating the heterogeneous operations in a
unified structure, and an attention guided search strategy for facilitating the search process over a
vast space. Extensive experiments show that NOS can search novel cell architectures with highly
competitive performance on the CIFAR and ImageNet benchmarks.
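Searching over heterogeneous operations can be sketched with a DARTS-style differentiable mix: a softmax over architecture parameters weights each candidate operation, so transformation and self-calibration candidates can coexist in one search space. The two toy operations below are assumptions; the actual heterogeneous residual block in NOS is structured differently.

```python
import numpy as np

def mixed_op(x, alphas, ops):
    """Differentiable mixture of candidate operations:
    output = sum_i softmax(alphas)_i * op_i(x)."""
    w = np.exp(alphas - alphas.max())
    w /= w.sum()
    return sum(wi * op(x) for wi, op in zip(w, ops))

x = np.random.rand(8, 4, 4)   # (C, H, W) feature map
ops = [
    lambda t: t,                                        # plain transformation (identity)
    lambda t: t * (1.0 / (1.0 + np.exp(-t.mean(0)))),   # toy self-calibration gate
]
y = mixed_op(x, np.zeros(2), ops)   # uniform mixture for the sketch
```

When one architecture weight dominates, the mixture collapses toward that single operation, which is how a discrete architecture is read off after the search.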
Chapter 7 includes concluding remarks and discusses potential areas for future research and
extensions.
Person Re-identification and Tracking in Video Surveillance
Video surveillance is one of the most essential topics in the computer vision field. With the rapid and continuous increase in the use of surveillance cameras to capture people in scenes, it has become a very important tool for security and criminal investigation. Video surveillance systems involve many key technologies, including object recognition, object localization, object re-identification, and object tracking, by which the system can identify objects and persons and follow their movements. In recent years, person re-identification and visual object tracking have become hot research directions in the computer vision field. A re-identification system aims to recognize and identify a target with the required attributes, while a tracking system aims to follow and predict the movement of the target after it has been identified. Researchers have used deep learning and computer vision technologies to significantly improve the performance of person re-identification. However, the study of person re-identification remains challenging due to complex application environments such as lighting variations, complex background transformations, low-resolution images, occlusions, and similar dressing among different pedestrians. The challenge of this task also comes from the unavailability of bounding boxes for pedestrians and the need to search for a person over whole gallery images. To address these critical issues in modern person identification applications, we propose an algorithm that can accurately localize persons by learning to minimize intra-person feature variations. We build our model upon a state-of-the-art object detection framework, i.e., Faster R-CNN, so that high-quality region proposals for pedestrians can be produced in an online manner.
In addition, to relieve the negative effects caused by the varying visual appearance of the same individual, we introduce a novel center loss that increases the intra-class compactness of feature representations. The center loss encourages persons with the same identity to have similar feature characteristics. Beyond the localization of a single person, we explore the more general visual object tracking problem. The main task of visual object tracking is to predict the location and size of a tracking target accurately and reliably in subsequent image frames, given the target at the beginning of the sequence. A visual object tracking algorithm with high accuracy, good stability, and fast inference speed is necessary. In this thesis, we study the model-updating problem for two kinds of mainstream tracking algorithms and improve their robustness and accuracy. Firstly, we extend the Siamese tracker with a model-updating mechanism to improve its tracking robustness. A Siamese tracker uses a deep convolutional neural network to extract features and compares the features of each new frame with the target features from the first frame; the candidate region with the highest similarity score is taken as the tracking result. However, such trackers are not robust to large target variations because of their no-update matching strategy throughout the tracking process. To combat this defect, we propose an ensemble Siamese tracker, in which the final similarity score is also influenced by the similarity with tracking results in recent frames, rather than considering the first frame alone. Tracking results in recent frames are used to adjust the model to continuous target changes. Meanwhile, we combine an adaptive candidate sampling strategy with a large-displacement optical flow method to further improve performance.
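The center loss idea above can be sketched in a few lines of numpy: each class keeps a feature centre, the loss penalises the distance of each feature to its class centre, and the centres drift toward the current class means. The update rule and step size here are generic assumptions, not the thesis implementation.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Mean squared distance of each feature to its class centre
    (encourages intra-class compactness)."""
    diffs = features - centers[labels]
    return float(0.5 * (diffs ** 2).sum(axis=1).mean())

def update_centers(features, labels, centers, alpha=0.5):
    """Move each class centre a step toward the mean of its current features."""
    new = centers.copy()
    for c in np.unique(labels):
        new[c] += alpha * (features[labels == c].mean(axis=0) - centers[c])
    return new

feats = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]])
labels = np.array([0, 0, 1])
centers = np.zeros((2, 2))
l0 = center_loss(feats, labels, centers)
centers = update_centers(feats, labels, centers, alpha=1.0)
l1 = center_loss(feats, labels, centers)   # lower after the centre update
```

In training, the center loss is combined with a classification loss; on its own it would collapse all features of a class to one point.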
Secondly, we investigate the classic correlation filter based tracking algorithm and propose a better model selection strategy via reinforcement learning. Correlation filters have proven to be a useful tool in visual tracking, particularly for striking a good balance between tracking accuracy and speed. However, correlation filter based models are susceptible to wrong updates stemming from inaccurate tracking results, and to date little effort has been devoted to the correlation filter update problem. In our approach, we update and maintain multiple correlation filter models in parallel and use deep reinforcement learning to select the optimal model among them. To make the decision process efficient, we propose a decision-net for target appearance modeling, trained on hundreds of challenging videos using proximal policy optimization and a lightweight learning network. An exhaustive evaluation of the proposed approach on the OTB100 and OTB2013 benchmarks shows the effectiveness of our approach.
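A correlation filter admits a closed-form solution in the Fourier domain, which is what makes these trackers fast. The MOSSE-style sketch below illustrates the general idea under simplifying assumptions (single-channel patch, Gaussian target response); it is not the thesis model, which additionally maintains multiple filters and selects among them.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_filter(patch, target_response, lam=1e-2):
    """Closed-form correlation filter in the Fourier domain (MOSSE-style):
    H = (G * conj(F)) / (F * conj(F) + lam), with regulariser lam."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target_response)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def respond(filt, patch):
    """Correlate a new patch with the filter; the response peak locates the target."""
    return np.real(np.fft.ifft2(filt * np.fft.fft2(patch)))

# Toy self-check: train on a patch with a Gaussian response centred at (16, 16)
patch = rng.random((32, 32))
yy, xx = np.mgrid[:32, :32]
g = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / (2 * 2.0 ** 2))
H = train_filter(patch, g)
peak = np.unravel_index(respond(H, patch).argmax(), (32, 32))
```

On the training patch the response peak falls back on the Gaussian centre, which is the sanity check a tracker update would preserve.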
Knowledge Transfer in Object Recognition.
PhD Thesis.
Abstract
Object recognition is a fundamental and long-standing problem in computer vision. Since
the latest resurgence of deep learning, thousands of techniques have been proposed and brought
to commercial products to facilitate people’s daily life. Although remarkable achievements in
object recognition have been witnessed, existing machine learning approaches remain far
from the human visual system, especially in learning new concepts and Knowledge Transfer (KT)
across scenarios. One main reason is that current learning approaches address isolated tasks
by independently training predefined models, without considering any knowledge learned from
previous tasks or models. In contrast, humans have an inherent ability to transfer the knowledge
acquired from earlier tasks or people to new scenarios. Therefore, to scale object recognition
to realistic deployments, effective KT schemes are required.
This thesis studies several aspects of KT for scaling object recognition systems. Specifically,
to facilitate the KT process, several mechanisms on fine-grained and coarse-grained object recognition
tasks are analyzed and studied, including 1) cross-class KT on person re-identification (reid);
2) cross-domain KT on person re-identification; 3) cross-model KT on image classification;
4) cross-task KT on image classification. In summary, four types of knowledge transfer schemes
are discussed as follows:
Chapter 3 Cross-class KT in person re-identification, a representative fine-grained object
recognition task, is investigated first. Person identity classes in person
re-id are totally disjoint between training and testing (a zero-shot learning problem), resulting
in a high demand for cross-class KT. To this end, existing person re-id approaches aim
to derive a feature representation for pairwise similarity based matching and ranking, which is
able to generalise to the test data. However, current person re-id methods assume accurately
cropped person bounding boxes of uniform resolution, ignoring the
impact of background noise and varying image scales on cross-class KT. This is more severe
in practice, when person bounding boxes must be detected automatically from a very large
number of images and/or videos (unconstrained scene images). To address these challenges,
this chapter provides two novel approaches, aiming to promote cross-class KT and boost
re-id performance. 1) This chapter alleviates inaccurate person bounding boxes by developing a
joint learning deep model that optimises person re-id attention selection within any auto-detected
person bounding boxes by reinforcement learning of background clutter minimisation. Specifically,
this chapter formulates a novel unified re-id architecture called Identity DiscriminativE
Attention reinforcement Learning (IDEAL) to accurately select re-id attention in auto-detected
bounding boxes for optimising re-id performance. 2) This chapter addresses the multi-scale problem
by proposing a Cross-Level Semantic Alignment (CLSA) deep learning approach capable of
learning more discriminative identity feature representations in a unified end-to-end model. This
is realised by exploiting the in-network feature pyramid structure of a deep neural network enhanced
by a novel cross pyramid-level semantic alignment loss function. Extensive experiments
show the modelling advantages and performance superiority of both IDEAL and CLSA over the
state-of-the-art re-id methods on widely used benchmarking datasets.
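The cross-level alignment idea in CLSA can be sketched generically: predictions from lower pyramid levels are pulled toward the most semantic (top) level via a KL term, so all levels agree on identity. The exact thesis loss may differ; this is an illustrative form.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_level_alignment_loss(level_logits):
    """Sum of KL(top level || lower level) over all lower pyramid levels.
    level_logits: list of (batch, n_ids) arrays, ordered low to high level."""
    target = softmax(level_logits[-1])   # top level acts as the in-network teacher
    loss = 0.0
    for logits in level_logits[:-1]:
        p = softmax(logits)
        loss += float(np.sum(target * (np.log(target) - np.log(p)), axis=-1).mean())
    return loss

levels = [np.random.randn(4, 100) for _ in range(3)]  # 3 pyramid levels, 100 ids
l = cross_level_alignment_loss(levels)
```

Because the target is computed inside the same network, the alignment trains end-to-end without a separate teacher model.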
Chapter 4 In this chapter, we address the problem of cross-domain KT in unsupervised
domain adaptation for person re-id. Specifically, this chapter considers cross-domain KT as
follows: 1) Unsupervised domain adaptation: a “train once, run once” pattern, transferring knowledge
from a source domain to a specific target domain, with the model restricted to
that target domain only; 2) Universal re-id: a “train once, run everywhere” pattern, transferring
knowledge from a source domain to any target domain, and therefore deployable to any
re-id domain. This chapter first develops a novel Hierarchical Unsupervised Domain
Adaptation (HUDA) method for unsupervised domain adaptation for re-id. It can automatically
transfer labelled information of an existing dataset (a source domain) to an unlabelled target
domain for unsupervised person re-id. Specifically, HUDA is designed to jointly model global
distribution alignment and local instance alignment in a two-level hierarchy for discovering transferable
source knowledge in unsupervised domain adaptation. Crucially, this approach aims to
overcome the under-constrained learning problem of existing unsupervised domain adaptation
methods, which lack the local instance alignment constraint. The consequence is more effective
cross-domain KT from the labelled source domain to the unlabelled target domain. This
chapter further addresses the limitation of “train once, run once” for existing domain adaptation
person re-id approaches by presenting a novel “train once, run everywhere” pattern. This
conventional “train once, run once” pattern is unscalable to the large number of target domains
typically encountered in real-world deployments, due to the requirement of training a separate
model for each target domain, as supervised learning methods do. To mitigate this weakness, a novel
“Universal Model Learning” (UML) approach is formulated to enable domain-generic person
re-id using only limited training data of a “single” seed domain. Specifically, UML trains a universal
re-id model to discriminate between a set of transformed person identity classes. Each
such class is formed by applying a variety of random appearance transformations to the images
of that class, where the transformations simulate the camera viewing conditions of arbitrary domains,
making the model domain-generic.
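Random appearance transformations of this kind can be sketched as follows; the specific jitters (brightness, contrast, per-channel colour) and their ranges are illustrative assumptions standing in for whatever transformation set the thesis uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_appearance_transform(img):
    """One random appearance transformation: jitter brightness, contrast and
    per-channel colour to mimic the viewing conditions of an unseen camera.
    img: (H, W, C) float image in [0, 1]."""
    contrast = rng.uniform(0.7, 1.3)
    brightness = rng.uniform(-0.1, 0.1)
    colour = rng.uniform(0.8, 1.2, size=(1, 1, img.shape[2]))
    out = (img - 0.5) * contrast + 0.5 + brightness
    return np.clip(out * colour, 0.0, 1.0)

def expand_identity_class(images, n_views=4):
    """Form a transformed identity class: several random views per image."""
    return [random_appearance_transform(im) for im in images for _ in range(n_views)]

imgs = [np.random.rand(8, 8, 3)]
views = expand_identity_class(imgs)   # 4 transformed views of one identity
```

Training the model to keep all such views in one identity class is what pushes the representation toward domain-generic invariance.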
Chapter 5 The third problem considered in this thesis is cross-model KT in coarse-grained
object recognition. This chapter discusses knowledge distillation in image classification. Knowledge
distillation is an effective approach to transfer knowledge from a large teacher neural network
to a small student (target) network for satisfying the low-memory and fast running requirements.
Whilst able to create stronger target networks than the vanilla non-teacher-based
learning strategy, this scheme must additionally train a large teacher model at expensive
computational cost and requires complex multi-stage training. This chapter first presents
a Self-Referenced Deep Learning (SRDL) strategy to accelerate the training process. Unlike
both vanilla optimisation and knowledge distillation, SRDL distils the knowledge discovered
by the in-training target model back to itself to regularise the subsequent learning procedure,
thereby eliminating the need to train a large teacher model. Secondly, an On-the-fly Native
Ensemble (ONE) learning strategy for one-stage knowledge distillation is proposed to resolve the
weakness of complex multi-stage training. Specifically, ONE trains only a single multi-branch
network while simultaneously establishing a strong teacher on-the-fly to enhance the learning of
the target network.
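The on-the-fly teacher idea can be sketched as follows: the branch logits are aggregated by a gate into an ensemble teacher, and each branch is regularised toward the teacher's prediction with a KL term. The uniform gate here is an assumption; in ONE the gate is itself learned.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def one_distillation(branch_logits, gate):
    """On-the-fly teacher: a gated average of branch logits; each branch is
    then distilled toward the ensemble prediction via KL divergence.
    branch_logits: (n_branches, batch, n_classes); gate: (n_branches,)."""
    teacher_logits = np.tensordot(gate, branch_logits, axes=([0], [0]))
    p_t = softmax(teacher_logits)
    losses = [float(np.sum(p_t * (np.log(p_t) - np.log(softmax(b))), axis=-1).mean())
              for b in branch_logits]
    return teacher_logits, losses

branches = np.random.randn(3, 4, 10)   # 3 branches, batch of 4, 10 classes
gate = np.ones(3) / 3                  # uniform gate for the sketch
teacher, kd_losses = one_distillation(branches, gate)
```

Because the teacher is assembled from the branches being trained, the whole scheme runs in a single stage with no separate pre-trained teacher.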
Chapter 6 Fourth, this thesis studies cross-task KT in coarse-grained object recognition.
This chapter focuses on the few-shot classification problem, which aims to train models capable
of recognising new, previously unseen categories from the novel task by using only limited training
samples. Existing metric learning approaches constitute a highly popular strategy, learning
discriminative representations such that images of different classes are well separated
in an embedding space. The commonly held assumption that each class is summarised by a single,
global representation (referred to as a prototype) that is then used as a reference to infer class
labels brings significant drawbacks. This formulation fails to capture the complex multi-modal
latent distributions that often exist in real-world problems, and yields models that are highly
sensitive to the prototype quality. To address these limitations, this chapter proposes a novel
Mixture of Prototypes (MP) approach that learns multi-modal class representations, and can be
integrated into existing metric based methods. MP models class prototypes as a group of feature
representations carefully designed to be highly diverse and maximise ensembling performance.
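The benefit of multiple prototypes per class can be sketched in one scoring rule: classify a query by its best-matching prototype within each class, rather than by a single class mean. The nearest-prototype scoring below is an illustrative choice; the thesis combines the group of prototypes through ensembling.

```python
import numpy as np

def mp_logits(query, prototypes):
    """Score each class by its nearest member of a *group* of prototypes.
    prototypes: (n_classes, n_protos, d); query: (d,).
    Returns per-class logits (negative squared distance to the best mode)."""
    d2 = ((prototypes - query) ** 2).sum(axis=-1)   # (n_classes, n_protos)
    return -d2.min(axis=-1)

# A multi-modal class 0 (modes at 0 and 10) vs a uni-modal class 1 (at 5):
protos = np.array([[[0.0], [10.0]],
                   [[5.0], [5.0]]])
query = np.array([9.6])
pred = int(np.argmax(mp_logits(query, protos)))   # class 0 wins via its 2nd mode
```

A single global prototype for class 0 would sit at 5, exactly on top of class 1, and the query near 10 would be ambiguous; the second mode resolves it.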
Furthermore, this thesis investigates the benefit of incorporating unlabelled data in cross-task
KT, and focuses on the problem of Semi-Supervised Few-shot Learning (SS-FSL). Recent SS-FSL
work has relied on popular Semi-Supervised Learning (SSL) concepts, involving iterative
pseudo-labelling, yet often yields models that are susceptible to error propagation and sensitive
to initialisation. To address this limitation, this chapter introduces a novel prototype-based approach
(Fewmatch) for SS-FSL that exploits model Consistency Regularization (CR) in a robust
manner and promotes cross-task unlabelled-data knowledge transfer. Fewmatch exploits unlabelled
data via a Dynamic Prototype Refinement (DPR) approach, where novel class prototypes
are alternately refined 1) explicitly, using unlabelled data with high-confidence class predictions,
and 2) implicitly, by model fine-tuning using a data-selective model CR loss. DPR affords
CR convergence, with the explicit refinement providing an increasingly stronger initialisation
and alleviating the issue of error propagation, due to the application of CR.
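The explicit refinement step can be sketched as confident pseudo-labelling against the prototypes: unlabelled points are assigned to their nearest prototype, low-confidence assignments are discarded, and each prototype moves toward its confident members. The confidence threshold and momentum here are illustrative assumptions.

```python
import numpy as np

def refine_prototypes(protos, unlabelled, conf_threshold=0.8):
    """Explicit prototype refinement with confident pseudo-labels.
    protos: (k, d) class prototypes; unlabelled: (n, d) unlabelled features."""
    d2 = ((unlabelled[:, None, :] - protos[None]) ** 2).sum(-1)   # (n, k)
    logits = -d2
    e = np.exp(logits - logits.max(1, keepdims=True))
    p = e / e.sum(1, keepdims=True)                # soft assignments
    conf, hard = p.max(1), p.argmax(1)
    new = protos.copy()
    for k in range(len(protos)):
        mask = (hard == k) & (conf > conf_threshold)
        if mask.any():                             # momentum update (0.5)
            new[k] = 0.5 * protos[k] + 0.5 * unlabelled[mask].mean(0)
    return new

protos = np.array([[0.0], [4.0]])
unlabelled = np.array([[0.2], [0.4], [3.8], [2.0]])   # last point is ambiguous
new = refine_prototypes(protos, unlabelled)
```

The ambiguous point (equidistant from both prototypes) fails the confidence test and is excluded, which is precisely the mechanism that limits error propagation.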
Chapter 7 draws conclusions and suggests future work that extends the ideas and methods
developed in this thesis.