
    Scalable deep feature learning for person re-identification

    Person Re-identification (Person Re-ID) is a fundamental and critical task in video surveillance systems. Given a probe image of a person obtained from one Closed-Circuit Television (CCTV) camera, the objective of Person Re-ID is to identify the same person in a large gallery of images captured by other cameras within the surveillance system. By successfully associating all the pedestrians, we can quickly search, track and even plot a movement trajectory of any person of interest within a CCTV system. Currently, most search and re-identification jobs are still processed manually by police or security officers. It is desirable to automate this process in order to reduce the enormous amount of human labour and increase pedestrian tracking and retrieval speed. However, Person Re-ID is a challenging problem because of the many uncontrolled properties of a multi-camera surveillance system: cluttered backgrounds, large illumination variations, different human poses and different camera viewing angles.

    The main goal of this thesis is to develop deep-learning-based person re-identification models for real-world deployment in surveillance systems. The thesis focuses on learning and extracting robust feature representations of pedestrians. We first propose two supervised deep neural network architectures. The first, an end-to-end Siamese network, is developed for real-time person-matching tasks and focuses on extracting the correspondence features between two images. For offline person retrieval, we follow the commonly used two-stage pipeline of feature extraction followed by distance-metric ranking, and propose a strong feature embedding extraction network. In addition, we survey many valuable training techniques proposed recently in the literature and integrate them with our newly proposed NP-Triplet loss to construct a strong Person Re-ID feature extraction model.

    However, during the deployment of the online matching and offline retrieval systems, we observed the poor scalability of most supervised models: a model trained on labelled images from one system cannot perform well on other, unseen systems. Aiming to make Person Re-ID models more scalable across surveillance systems, the third work of this thesis presents a cross-dataset feature transfer method (MMFA), which can train on one system and transfer the learned model to another simultaneously. Continuing towards a more scalable and robust person re-identification system, the last work of this thesis addresses the limitations of the MMFA structure and proposes a multi-dataset feature generalisation approach (MMFA-AAE), which aims to learn a universal feature representation from multiple labelled datasets. Finally, to facilitate research on Person Re-ID applications in more realistic scenarios, a new dataset, ROSE-IDENTITY-Outdoor (RE-ID-Outdoor), has been collected and annotated with the largest number of cameras and 40 mid-level attributes.
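    The abstract above describes a two-stage retrieval pipeline (feature embedding extraction followed by distance-metric ranking) trained with a triplet-style loss. Below is a minimal sketch of that pipeline in PyTorch; the backbone, embedding dimension and margin are illustrative assumptions, and a standard triplet loss stands in for the NP-Triplet loss, whose formulation is not given here.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class EmbeddingNet(nn.Module):
        """ResNet-50 backbone with a linear embedding head (illustrative sizes)."""
        def __init__(self, dim=512):
            super().__init__()
            backbone = models.resnet50(weights=None)
            backbone.fc = nn.Identity()       # drop the ImageNet classifier
            self.backbone = backbone
            self.head = nn.Linear(2048, dim)

        def forward(self, x):
            f = self.head(self.backbone(x))
            return nn.functional.normalize(f, dim=1)  # unit-length embeddings

    # Stage 1 (training): a standard triplet loss stands in for NP-Triplet.
    triplet = nn.TripletMarginLoss(margin=0.3)

    # Stage 2 (offline retrieval): rank the gallery by distance to the probe.
    def rank_gallery(probe_emb, gallery_embs):
        dists = torch.cdist(probe_emb.unsqueeze(0), gallery_embs).squeeze(0)
        return torch.argsort(dists)  # gallery indices, nearest first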

    Multi-Resolution Overlapping Stripes Network for Person Re-Identification

    This paper addresses the person re-identification (PReID) problem by combining global and local information at multiple feature resolutions with different loss functions. Many previous studies address this problem using either part-based features or global features. In the part-based representation, the spatial correlation between parts is not considered, while global representations are not sensitive to spatial variations. This paper presents a part-based model with a multi-resolution network that uses different levels of features. The outputs of the last two convolutional blocks are partitioned horizontally and processed in pairs with overlapping stripes, to cover the important information that might lie between parts. We use different loss functions to combine local and global information for classification. Experimental results on a benchmark dataset demonstrate that the presented method outperforms state-of-the-art methods.
    Comment: 5 pages
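    As a concrete illustration of the overlapping-stripes idea, the sketch below pools overlapping horizontal stripes of a convolutional feature map into local descriptors. The stripe count, overlap ratio and pooling operator are assumptions made for illustration; the paper's exact configuration is not stated in the abstract.

    import torch
    import torch.nn.functional as F

    def overlapping_stripes(feat, n_stripes=4, overlap=0.5):
        """Pool each overlapping horizontal stripe of a (N, C, H, W) map
        into one C-dimensional local descriptor."""
        N, C, H, W = feat.shape
        stripe_h = int(H / (n_stripes - (n_stripes - 1) * overlap))
        step = max(1, int(stripe_h * (1 - overlap)))
        descriptors = []
        for top in range(0, H - stripe_h + 1, step):
            stripe = feat[:, :, top:top + stripe_h, :]
            descriptors.append(F.adaptive_avg_pool2d(stripe, 1).flatten(1))
        return descriptors  # list of (N, C) local descriptors

    # Example: a 24x8 feature map from the last conv block yields four
    # stripes of height 9 that overlap their neighbours by 5 rows.
    stripes = overlapping_stripes(torch.randn(2, 2048, 24, 8))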

    Deep Multi-View Learning for Visual Understanding

    PhD Thesis
    Multi-view data is the result of an entity being perceived or represented from multiple perspectives. Many applications in visual understanding involve multi-view data. For example, the face images for training a recognition system are usually captured by different devices from multiple angles. This thesis focuses on cross-view visual recognition problems, e.g., identifying face images of the same person across different cameras. Several representative multi-view settings, from supervised multi-view learning to the more challenging unsupervised domain adaptive (UDA) multi-view learning, are investigated, and novel multi-view learning algorithms are proposed for each. Specifically, the proposed methods are based on advanced deep neural network (DNN) architectures for better handling of visual data. However, directly combining multi-view learning objectives with a DNN can raise issues, e.g., with scalability, and limit the application scenarios and model performance; corresponding novelties in the DNN methods are thus required. This thesis is organised into three parts; each chapter focuses on one multi-view learning setting with novel solutions, as follows.

    Chapter 3: A supervised multi-view learning setting with two different views is studied. To recognise data samples across views, one strategy is to align them in a common feature space via correlation maximisation, also known as canonical correlation analysis (CCA). Deep CCA has been proposed for better performance, using non-linear projections via deep neural networks. Existing deep CCA models typically decorrelate the deep feature dimensions of each view before their Euclidean distances are minimised in the common space. This feature decorrelation is achieved by enforcing an exact decorrelation constraint, which is computationally expensive due to the matrix inversion or SVD operations involved; existing deep CCA models are therefore inefficient and have scalability issues. Furthermore, exact decorrelation is incompatible with gradient-based deep model training and results in sub-optimal solutions. To overcome these issues, a novel deep CCA model, Soft CCA, is introduced in this thesis. Specifically, the exact decorrelation is replaced by soft decorrelation via a mini-batch-based Stochastic Decorrelation Loss (SDL), which can be jointly optimised with the other training objectives. In addition, our SDL loss can be applied to other deep models beyond multi-view learning.

    Chapter 4: The supervised multi-view learning setting in which more than two views exist is studied in this chapter. Recently developed deep multi-view learning algorithms either learn a latent visual representation based on a single semantic level and/or require laborious human annotation of the latent factors as attributes. A novel deep neural network architecture, called Multi-Level Factorisation Net (MLFN), is proposed to automatically factorise the visual appearance into latent discriminative factors at multiple semantic levels without manual annotation. The main purpose is to force different views to share the same latent factors so that they can be aligned at all layers. Specifically, MLFN is composed of multiple stacked blocks. Each block contains multiple factor modules to model latent factors at a specific level, and factor selection modules that dynamically select the factor modules to interpret the content of each input image. The outputs of the factor selection modules also provide a compact latent factor descriptor that is complementary to the conventional deeply learned feature, and the two can be fused efficiently. The effectiveness of the proposed MLFN is demonstrated not only on large-scale cross-view recognition problems but also on general object categorisation tasks.

    Chapter 5: The last problem is a special unsupervised domain adaptation setting called unsupervised domain adaptive (UDA) multi-view learning. It contains a fully annotated dataset as the source domain and another, unlabelled dataset with relevant tasks as the target domain. The main purpose is to improve performance on the unlabelled dataset using the annotated data from the other dataset. More importantly, this setting further requires that both the source and target domains be multi-view datasets with relevant tasks; the assumption of an aligned label space across domains is therefore inappropriate in UDA multi-view learning. For example, person re-identification (Re-ID) datasets built in different surveillance scenarios capture images of different people and should be given disjoint person identity labels. Existing methods for UDA multi-view learning align different domains either in the raw image space or in a feature embedding space. In this thesis, a different framework, multi-task learning, is adopted, with domain-specific objectives for learning a common space that enables knowledge transfer. The conventional supervised losses can be used for the labelled source data, while the unsupervised objectives for the target domain play the key role in domain adaptation. Two novel unsupervised objectives are introduced for UDA multi-view learning, resulting in the two models below. The first model, termed the common factorised space model (CFSM), is built on the assumption that semantic latent attributes are shared between the source and target domains, since they are relevant multi-view learning tasks. Different from existing methods based on domain alignment, CFSM emphasises transferring information across domains by discovering discriminative latent factors in the proposed common space. However, the multi-view data from the target domain are unlabelled; an unsupervised factorisation loss is therefore derived and applied to the common space for latent factor discovery across domains. The second model also learns a shared embedding space with multi-view data from both domains, but under a different assumption: it attempts to discover the latent correspondences of multi-view data in the unsupervised target data. The target data's contribution comes from a clustering process, where each cluster reveals the underlying cross-view correspondences across multiple views in the target domain. To this end, a novel Stochastic Inference for Deep Clustering (SIDC) method is proposed. It reduces the self-reinforcing errors that lead to premature convergence to a sub-optimal solution by changing the conventional deterministic cluster assignment to a stochastic one.
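    To make the soft decorrelation idea from Chapter 3 concrete, the sketch below shows a mini-batch decorrelation penalty in the spirit of SDL: rather than enforcing exact whitening via matrix inversion or SVD, it penalises the off-diagonal entries of the batch covariance, so it can be optimised jointly with other losses by gradient descent. The exact SDL formulation (e.g. any running estimate of the covariance) is not given in the abstract, so this is an illustrative simplification.

    import torch

    def soft_decorrelation_loss(z):
        """L1 penalty on the off-diagonal covariance of (batch, dim) features."""
        z = z - z.mean(dim=0, keepdim=True)           # centre each dimension
        cov = (z.t() @ z) / (z.shape[0] - 1)          # empirical covariance
        off_diag = cov - torch.diag(torch.diag(cov))  # zero out the diagonal
        return off_diag.abs().sum() / z.shape[1]

    # Hypothetical usage inside a deep CCA training step:
    #   total = cca_distance_loss + lam * (soft_decorrelation_loss(view1_feats)
    #                                      + soft_decorrelation_loss(view2_feats))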