
    Semantics-Aligned Representation Learning for Person Re-identification

    Full text link
    Person re-identification (reID) aims to match person images to retrieve the ones with the same identity. This is a challenging task, as the images to be matched are generally semantically misaligned due to the diversity of human poses and capture viewpoints, incompleteness of the visible bodies (due to occlusion), etc. In this paper, we propose a framework that drives the reID network to learn semantics-aligned feature representations through careful supervision designs. Specifically, we build a Semantics Aligning Network (SAN) which consists of a base network as an encoder (SA-Enc) for re-ID, and a decoder (SA-Dec) for reconstructing/regressing the densely semantics-aligned full texture image. We jointly train the SAN under the supervision of person re-identification and aligned texture generation. Moreover, at the decoder, besides the reconstruction loss, we add triplet re-ID constraints over the feature maps as perceptual losses. The decoder is discarded at inference, so our scheme is computationally efficient. Ablation studies demonstrate the effectiveness of our design. We achieve state-of-the-art performance on the benchmark datasets CUHK03, Market1501, MSMT17, and the partial person reID dataset Partial REID. Code for our proposed method is available at: https://github.com/microsoft/Semantics-Aligned-Representation-Learning-for-Person-Re-identification.
    Comment: Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20); code has been released.
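The joint supervision described above — a re-ID loss plus an aligned-texture reconstruction loss, with triplet re-ID constraints acting as perceptual losses at the decoder — can be sketched roughly as follows. The function names and loss weights here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet constraint: pull same-identity features together,
    push different-identity features at least `margin` further apart."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

def joint_loss(reid_loss, recon_loss, triplet_terms, w_recon=1.0, w_trip=1.0):
    """Combined SAN-style supervision (sketch): re-ID loss + aligned-texture
    reconstruction loss + triplet constraints over decoder feature maps.
    The weights w_recon and w_trip are hypothetical."""
    return reid_loss + w_recon * recon_loss + w_trip * sum(triplet_terms)
```

Because the decoder only contributes training-time losses, dropping it at inference leaves the encoder's cost unchanged, which is why the scheme stays computationally efficient.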

    An investigation into automatic people counting and person re-identification

    Get PDF
    We study two video surveillance problems in this thesis: people counting and person re-identification. To address the problem of people counting, we first propose a method called Random Projection Forest to utilise rich hand-crafted features. To achieve computational efficiency and scalability, we use a random forest as the regression model, whose tree structure is intrinsically fast and scalable. Unlike traditional approaches to random forest construction, we embed random projection in the tree nodes to simultaneously combat the curse of dimensionality and introduce randomness into the tree construction, thus making our new method very efficient and effective. We have also developed a deep learning model for people counting. We propose a multi-task deep learning model to simultaneously predict the number of people and the level of crowd density, which makes our method invariant to the image scale. To deal with the problem of insufficient training data, we propose an "ambiguous labelling" strategy to create various labels for the training images. In a series of experiments, we show that creating "ambiguous labels" is a simple but effective method to improve not only the deep learning model but also the Random Projection Forest model based on hand-crafted features. For the problem of person re-identification, we have developed a novel deep learning framework called Deep Augmented Attribute Network (DAAN) to learn augmented attribute features for person re-identification. We first manually label two large datasets with pre-defined mid-level semantic attributes. We then construct a deep neural network with two output branches. The first branch predicts the attributes of the input image, while the second branch generates complementary features that are fused with the output of the first branch to form the augmented attributes of the input image.
We optimize the attribute branch with a multi-label classification loss and apply a "Siamese" network structure to ensure that the augmented attributes of images from the same person are close to each other whilst those from different persons are far apart. The final learned augmented attribute features are then used for person re-identification based on Euclidean distance. As manually labelling images is a time-consuming process, we have also extended our method to datasets with only person ID information but without attribute labels. We have conducted comprehensive experiments, and the results show that our method outperforms state-of-the-art methods. As labelling identities and attributes for person images is time-consuming, we further propose an unsupervised method for person re-identification and apply it to a more challenging problem called partial person re-identification. We first use an established image segmentation method to generate superpixels and construct an Attributed Region Adjacency Graph (ARAG), in which nodes correspond to superpixels and edges represent correlations between superpixels. We then apply region-based Normalized Cut to the graph to merge similar neighbouring superpixels in order to form natural image regions corresponding to various body parts and backgrounds. To extract features from segmented patches, we apply a Denoising Autoencoder to learn discriminative representations of the image patches in each node of the graph. Finally, the similarity of an image pair is measured by the Earth Mover's Distance (EMD) between the robust image signatures of the nodes in the corresponding ARAGs.
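The random-projection node split at the heart of the Random Projection Forest described above can be sketched minimally as follows; the median split and the single projection direction are simplifying assumptions for illustration, not the thesis's exact per-node procedure:

```python
import numpy as np

def random_projection_split(features, rng=None):
    """One node of a random projection forest (sketch): project
    high-dimensional hand-crafted features onto a random unit direction,
    then split the samples at the median of the projections. The random
    direction both reduces dimensionality and injects the randomness that
    forest construction relies on."""
    rng = np.random.default_rng(rng)
    direction = rng.standard_normal(features.shape[1])
    direction /= np.linalg.norm(direction)
    projected = features @ direction          # (n_samples,) scalar projections
    threshold = np.median(projected)
    left = projected <= threshold             # boolean masks for the two children
    return direction, threshold, left, ~left
```

Recursing this split on each child, with leaf nodes storing a regression estimate of the people count, yields the forest's tree structure.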

    μ˜μƒ 기반 동일인 νŒλ³„μ„ μœ„ν•œ λΆ€λΆ„ μ •ν•© ν•™μŠ΅

    Get PDF
    Doctoral dissertation, Seoul National University, Department of Electrical and Computer Engineering, February 2019. Advisor: Kyoung Mu Lee. Person re-identification is the problem of identifying the same individual among the persons captured from different cameras. It is challenging because the same person captured from non-overlapping cameras usually shows dramatic appearance changes due to viewpoint, pose, and illumination variations. Since it is an essential tool for many surveillance applications, various research directions have been explored; however, the problem is far from solved. The goal of this thesis is to solve the person re-identification problem in surveillance systems. In particular, we focus on two critical components: designing 1) a better image representation model using human poses and 2) a better training method using hard sample mining. First, we propose a part-aligned representation model which represents an image as the bilinear pooling between appearance and part maps. Since the image similarity is computed independently at the locations of body parts, it addresses the body-part misalignment issue and effectively distinguishes different people by discriminating fine-grained local differences. Second, we propose a stochastic hard sample mining method that exploits class information to generate diverse, hard examples for training. It explores the training samples efficiently while avoiding getting stuck in a small subset of hard samples, thereby training the model effectively. Finally, we propose an integrated system that combines the two approaches and benefits from both components. Experimental results show that the proposed method works robustly on five datasets with diverse conditions and suggest its potential extension to more general conditions.
    Contents: 1. Introduction. 2. Part-Aligned Bilinear Representations (Two-Stream Network; Bilinear Pooling; Loss; Analysis: Part-Aware Image Similarity, Relationship to the Baseline Models, Decomposition of Appearance and Part Maps, Part-Alignment Effects on Reducing the Misalignment Issue; Implementation Details; Experiments; Summary). 3. Stochastic Class-Based Hard Sample Mining (Deep Metric Learning with Triplet Loss; Batch Construction for Metric Learning: Neighbor Class Mining by Class Signatures, Batch Construction, Scalable Extension to the Number of Classes; Loss; Feature Extractor; Experiments; Summary). 4. Integrated System for Person Re-identification (Hard Positive Mining; Integrated System; Experiments; Summary). 5. Conclusion (Contributions; Future Works).
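The part-aligned bilinear pooling in the thesis's first component can be illustrated with a small sketch. The array shapes and the average normalization are assumptions for illustration; in the thesis a two-stream CNN produces the appearance and part maps:

```python
import numpy as np

def part_aligned_pooling(appearance, parts):
    """Bilinear pooling of an appearance map (H, W, C) with a part map
    (H, W, P): each part channel aggregates appearance features weighted by
    that part's spatial map, giving a (P, C) part-aligned descriptor.
    Comparing two images part-by-part then sidesteps body-part misalignment."""
    H, W, C = appearance.shape
    P = parts.shape[2]
    a = appearance.reshape(H * W, C)
    p = parts.reshape(H * W, P)
    feat = p.T @ a / (H * W)   # (P, C): average outer product over locations
    return feat.ravel()

def part_aligned_distance(f1, f2):
    """Euclidean distance between pooled descriptors, which behaves like a
    sum of per-body-part appearance differences."""
    return np.linalg.norm(f1 - f2)
```

Because the pooling is a plain matrix product, the distance between two images decomposes over part channels, which is exactly the property the thesis uses to argue misalignment is addressed.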

    Deep learning with very few and no labels

    Get PDF
    Deep neural networks have achieved remarkable performance in many computer vision applications such as image classification, object detection, instance segmentation, image retrieval, and person re-identification. However, to achieve the desired performance, deep neural networks often need a tremendously large set of labeled training samples to learn their huge network models. Labeling a large dataset is labor-intensive, time-consuming, and sometimes requires expert knowledge. In this research, we study the following important question: how can deep neural networks be trained with very few or even no labeled samples? This leads to our research tasks in two major areas: semi-supervised and unsupervised learning. Specifically, for semi-supervised learning, we developed two major approaches. The first is the Snowball approach, which learns a deep neural network from very few samples based on iterative model evolution and confident sample discovery. The second is the learned model composition approach, which composes more efficient master networks from student models of past iterations through a network learning process. Critical sample discovery is developed to discover new critical unlabeled samples near the model decision boundary and provide the master model with lookahead access to these samples to enhance its guidance capability. For unsupervised learning, we have explored two major ideas. The first is transformed attention consistency, where the network is learned based on self-supervision information across images instead of within a single image. The second is spatial assembly networks for image representation learning. We introduce a new learnable module, called the spatial assembly network (SAN), which performs a learned re-organization and assembly of feature points and improves the network's capabilities in handling spatial variations and structural changes of the image scene.
Our experimental results on benchmark datasets demonstrate that our proposed methods significantly improve the state of the art in semi-supervised and unsupervised learning, outperforming existing methods by large margins. Includes bibliographical references.
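The confident-sample discovery step of the Snowball approach above can be sketched as a simple threshold on the model's class probabilities for unlabeled data; the threshold value and function name are illustrative assumptions, not the thesis's exact criterion:

```python
import numpy as np

def discover_confident_samples(probs, threshold=0.95):
    """Confident-sample discovery (sketch): given the model's predicted class
    probabilities on unlabeled samples, shape (n_samples, n_classes), keep the
    samples whose top prediction clears the confidence threshold and return
    their indices with pseudo-labels for the next training iteration."""
    top = probs.max(axis=1)                # confidence of the best class
    labels = probs.argmax(axis=1)          # candidate pseudo-labels
    mask = top >= threshold
    return np.flatnonzero(mask), labels[mask]
```

Iterating train → predict → discover → retrain grows the labeled pool gradually, which is the "snowball" behaviour the name suggests.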

    Person Re-identification by Local Maximal Occurrence Representation and Metric Learning

    Full text link
    Person re-identification is an important technique for the automatic search of a person's presence in a surveillance video. Two problems are fundamental to person re-identification: feature representation and metric learning. An effective feature representation should be robust to illumination and viewpoint changes, and a discriminant metric should be learned to match various person images. In this paper, we propose an effective feature representation called Local Maximal Occurrence (LOMO), and a subspace and metric learning method called Cross-view Quadratic Discriminant Analysis (XQDA). The LOMO feature analyzes the horizontal occurrence of local features and maximizes the occurrence to obtain a representation that is stable against viewpoint changes. In addition, to handle illumination variations, we apply the Retinex transform and a scale-invariant texture operator. To learn a discriminant metric, we propose to learn a discriminant low-dimensional subspace by cross-view quadratic discriminant analysis and, simultaneously, a QDA metric on the derived subspace. We also present a practical computation method for XQDA, as well as its regularization. Experiments on four challenging person re-identification databases, VIPeR, QMUL GRID, CUHK Campus, and CUHK03, show that the proposed method improves the state-of-the-art rank-1 identification rates by 2.2%, 4.88%, 28.91%, and 31.55%, respectively.
    Comment: This paper has been accepted by CVPR 2015. For source code and extracted features, please visit http://www.cbsr.ia.ac.cn/users/scliao/projects/lomo_xqda
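The horizontal max-occurrence pooling behind LOMO can be sketched as follows. This is a simplified illustration: the actual descriptor combines HSV color histograms and SILTP texture features over overlapping patches at multiple scales, which are omitted here:

```python
import numpy as np

def horizontal_max_occurrence(local_histograms):
    """LOMO-style pooling (sketch): given per-patch local feature histograms
    arranged as (rows, cols, bins), take the maximum occurrence of each bin
    across all patches in the same horizontal strip. Max-pooling along the
    horizontal axis stabilizes the representation against viewpoint changes,
    which mostly shift a person's parts sideways within a strip."""
    return local_histograms.max(axis=1).ravel()  # (rows * bins,)
```

Concatenating the strip-wise maxima top to bottom preserves vertical structure (head, torso, legs) while discarding the unstable horizontal position of each local pattern.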

    Person re-identification via efficient inference in fully connected CRF

    Full text link
    In this paper, we address the person re-identification problem, i.e., retrieving instances from the gallery that were generated by the same person as the given probe image. This is very challenging because a person's appearance usually undergoes significant variations due to changes in illumination, camera angle and view, background clutter, and occlusion over the camera network. We assume that the matched gallery images should not only be similar to the probe, but also be similar to each other, under a suitable metric. We express this assumption with a fully connected CRF model in which each node corresponds to a gallery image and every pair of nodes is connected by an edge. A label variable is associated with each node to indicate whether the corresponding image is of the target person. We define the unary potential for each node using existing feature calculation and matching techniques, which reflects the similarity between the probe and a gallery image, and define the pairwise potential for each edge as a weighted combination of Gaussian kernels, which encodes the appearance similarity between pairs of gallery images. This specific form of pairwise potential allows us to exploit an efficient inference algorithm to calculate the marginal distribution of each label variable in this densely connected CRF. We show the superiority of our method by applying it to public datasets and comparing it with the state of the art.
    Comment: 7 pages, 4 figures
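The efficient marginal computation described above is, in spirit, a mean-field update over the dense graph. Below is a minimal sketch under stated assumptions: a plain dense matrix product stands in for the paper's fast Gaussian-kernel filtering, and the potentials are simplified to a two-label (background/target) case:

```python
import numpy as np

def mean_field_crf(unary, similarity, n_iters=10):
    """Mean-field inference for a fully connected CRF (sketch).
    unary: (N, 2) negative log-potentials for each gallery node being
    background (0) or the target person (1). similarity: (N, N) Gaussian-kernel
    appearance similarity between gallery pairs. Each update pulls a node's
    marginal toward those of similar neighbours, encoding the assumption that
    matched gallery images should also resemble each other."""
    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    q = softmax(-unary)                            # init with unary marginals
    W = similarity - np.diag(np.diag(similarity))  # drop self-connections
    for _ in range(n_iters):
        message = W @ q                   # aggregate neighbours' beliefs
        q = softmax(-unary + message)     # combine with the unary evidence
    return q[:, 1]                        # marginal of being the target person
```

Each iteration costs one dense matrix product; the paper's kernel structure is what makes the equivalent message passing fast at realistic gallery sizes.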