Semantics-Aligned Representation Learning for Person Re-identification
Person re-identification (reID) aims to match person images to retrieve the
ones with the same identity. This is a challenging task, as the images to be
matched are generally semantically misaligned due to the diversity of human
poses and capture viewpoints, incompleteness of the visible bodies (due to
occlusion), etc. In this paper, we propose a framework that drives the reID
network to learn semantics-aligned feature representation through delicate
supervision designs. Specifically, we build a Semantics Aligning Network (SAN),
which consists of a base network as an encoder (SA-Enc) for reID and a decoder
(SA-Dec) for reconstructing/regressing the densely semantically aligned full
texture image. We jointly train the SAN under the supervision of person
re-identification and aligned texture generation. Moreover, at the decoder,
besides the reconstruction loss, we add Triplet ReID constraints over the
feature maps as perceptual losses. The decoder is discarded at inference, so
our scheme is computationally efficient. Ablation studies
demonstrate the effectiveness of our design. We achieve state-of-the-art
performance on the benchmark datasets CUHK03, Market1501, MSMT17, and the
partial person reID dataset Partial REID. Code for our proposed method is
available at:
https://github.com/microsoft/Semantics-Aligned-Representation-Learning-for-Person-Re-identification
Comment: Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20); code has been released.
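The joint training described above (re-ID supervision, reconstruction at the decoder, and triplet constraints over decoder feature maps as perceptual losses) can be sketched as follows; the loss weights and helper names are illustrative, not the paper's exact formulation:

```python
import numpy as np

def triplet_loss(anchor, pos, neg, margin=0.3):
    # Standard triplet loss on feature vectors: pull same-ID features
    # together, push different-ID features at least `margin` apart.
    d_ap = np.linalg.norm(anchor - pos)
    d_an = np.linalg.norm(anchor - neg)
    return max(0.0, d_ap - d_an + margin)

def san_joint_loss(reid_ce, recon_l1, dec_triplets, margin=0.3,
                   w_rec=1.0, w_tri=0.1):
    # Hypothetical combination of the SAN objectives: the re-ID
    # classification loss, the texture-reconstruction loss, and
    # triplet constraints applied to decoder feature maps as
    # perceptual losses (the weights are illustrative).
    tri = sum(triplet_loss(a, p, n, margin) for a, p, n in dec_triplets)
    return reid_ce + w_rec * recon_l1 + w_tri * tri
```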
An investigation into automatic people counting and person re-identification
We study two video surveillance problems in this thesis: people counting and person re-identification.
To address the problem of people counting, we first propose a method called Random Projection Forest to utilise rich hand-crafted features.
To achieve computational efficiency and scalability, we use a random forest as the regression model, whose tree structure is intrinsically fast and scalable. Unlike traditional approaches to random forest construction, we embed random projections in the tree nodes to simultaneously combat the curse of dimensionality and introduce randomness into the tree construction, making our new method very efficient and effective.
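The node-splitting idea can be sketched as follows; the candidate count, median threshold, and variance criterion are assumptions for illustration, not the thesis's exact procedure:

```python
import numpy as np

def random_projection_split(X, y, n_proj=8, seed=0):
    # At each tree node, project the high-dimensional hand-crafted
    # features onto a few random directions, then keep the
    # (direction, threshold) pair that best reduces regression
    # variance -- combating the curse of dimensionality while
    # injecting randomness into tree construction.
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_proj):
        w = rng.standard_normal(X.shape[1])
        z = X @ w
        thr = np.median(z)
        left, right = y[z <= thr], y[z > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        score = len(left) * left.var() + len(right) * right.var()
        if best is None or score < best[0]:
            best = (score, w, thr)
    return best  # (weighted variance, projection vector, threshold)
```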
We have also developed a deep learning model for people counting. We propose a multi-task deep learning model that simultaneously predicts the number of people and the level of crowd density, which makes our method invariant to image scale. To deal with the problem of insufficient training data, we propose an "ambiguous labelling" strategy to create various labels for the training images. In a series of experiments, we show that creating "ambiguous labels" is a simple but effective way to improve not only the deep learning model but also the Random Projection Forest model based on hand-crafted features.
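A minimal sketch of the "ambiguous labelling" idea, assuming labels are created by sampling around the ground-truth count (the spread and label count are hypothetical choices):

```python
import numpy as np

def ambiguous_labels(count, n_labels=5, spread=0.1, seed=0):
    # Turn a single ground-truth people count into several plausible
    # labels by sampling around it, so each training image is seen
    # with slightly different targets (`spread` and `n_labels` are
    # illustrative choices, not the thesis's exact settings).
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-spread, spread, size=n_labels)
    return np.clip(np.round(count * (1.0 + noise)), 0, None).astype(int)
```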
For the problem of person re-identification, we have developed a novel deep learning framework called Deep Augmented Attribute Network (DAAN) to learn augmented attribute features for person re-identification. We first manually label two large datasets with pre-defined mid-level semantic attributes. We then construct a deep neural network with two output branches. The first branch predicts the attributes of the input image, while the second branch generates complement features that are fused with the output of the first branch to form the augmented attributes of the input image. We optimize the attribute branch with a multi-label classification loss and apply a "Siamese" network structure to ensure that the augmented attributes of images from the same person are close to each other whilst those from different persons are far apart. The final learned augmented attribute features are then used for person re-identification based on Euclidean distance. As manually labelling images is a time-consuming process, we have also extended our method to datasets with only person ID information but without attribute labels. We have conducted comprehensive experiments, and the results show that our method outperforms state-of-the-art methods.
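The two-branch fusion and Siamese objective might look as follows; the sigmoid attribute scores and the contrastive formulation are common choices assumed here for illustration, not necessarily DAAN's exact losses:

```python
import numpy as np

def augmented_attributes(attr_logits, complement):
    # Fuse the predicted attribute scores (first branch) with the
    # learned complement features (second branch) into one augmented
    # attribute vector used for Euclidean-distance matching.
    attrs = 1.0 / (1.0 + np.exp(-attr_logits))  # sigmoid per attribute
    return np.concatenate([attrs, complement])

def siamese_loss(f1, f2, same_person, margin=1.0):
    # Contrastive objective of the Siamese structure: augmented
    # attributes of the same person are pulled together, and those of
    # different persons are pushed at least `margin` apart.
    d = np.linalg.norm(f1 - f2)
    return d ** 2 if same_person else max(0.0, margin - d) ** 2
```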
As labelling identities and attributes for person images is time-consuming, we further propose an unsupervised method for person re-identification and apply it to a more challenging problem called partial person re-identification. We first use an established image segmentation method to generate superpixels and construct an Attributed Region Adjacency Graph (ARAG), in which nodes correspond to superpixels and edges represent correlations between superpixels. We then apply region-based Normalized Cut to the graph to merge similar neighbouring superpixels, forming natural image regions corresponding to various body parts and backgrounds. To extract features from the segmented patches, we apply a Denoising Autoencoder to learn a discriminative representation of the image patches in each node of the graph. Finally, the similarity of an image pair is measured by the Earth Mover's Distance (EMD) between the robust image signatures of the nodes in the corresponding ARAGs.
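As a simplified illustration of the final matching step, EMD between two one-dimensional signatures over shared bins reduces to an L1 distance of cumulative sums; the general region-signature EMD used above requires a transportation solver:

```python
import numpy as np

def emd_1d(sig_a, sig_b):
    # Earth Mover's Distance between two 1-D signatures (histograms
    # over the same bins): in one dimension, EMD reduces to the L1
    # distance between the cumulative distributions -- a simplified
    # stand-in for general region-signature matching.
    a = np.asarray(sig_a, float)
    b = np.asarray(sig_b, float)
    a, b = a / a.sum(), b / b.sum()  # normalize to unit mass
    return np.abs(np.cumsum(a) - np.cumsum(b)).sum()
```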
Part-Alignment Learning for Image-Based Person Re-identification
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, College of Engineering, February 2019. Advisor: Kyoung Mu Lee.
Person re-identification is the problem of identifying the same individual among persons captured by different cameras. It is a challenging problem because the same person captured by non-overlapping cameras usually shows dramatic appearance changes due to viewpoint, pose, and illumination variations. Since it is an essential tool for many surveillance applications, various research directions have been explored; however, the problem is far from being solved.
The goal of this thesis is to solve the person re-identification problem under a surveillance system. In particular, we focus on two critical components: designing 1) a better image representation model using human poses and 2) a better training method using hard sample mining. First, we propose a part-aligned representation model which represents an image as the bilinear pooling between appearance and part maps. Since the image similarity is calculated independently of the locations of body parts, it addresses the body-part misalignment issue and effectively distinguishes different people by discriminating fine-grained local differences. Second, we propose a stochastic hard sample mining method that exploits class information to generate diverse and hard examples for training. It efficiently explores the training samples while avoiding getting stuck in a small subset of hard samples, thereby training the model effectively. Finally, we propose an integrated system that combines the two approaches and benefits from both components. Experimental results show that the proposed method works robustly on five datasets with diverse conditions and indicate its potential extension to more general conditions.
Abstract
Contents
List of Tables
List of Figures
1. Introduction
1.1 Part-Aligned Bilinear Representations
1.2 Stochastic Class-Based Hard Sample Mining
1.3 Integrated System for Person Re-identification
2. Part-Aligned Bilinear Representations
2.1 Introduction
2.2 Related Work
2.3 Our Approach
2.3.1 Two-Stream Network
2.3.2 Bilinear Pooling
2.3.3 Loss
2.4 Analysis
2.4.1 Part-Aware Image Similarity
2.4.2 Relationship to the Baseline Models
2.4.3 Decomposition of Appearance and Part Maps
2.4.4 Part-Alignment Effects on Reducing the Misalignment Issue
2.5 Implementation Details
2.6 Experiments
2.6.1 Datasets
2.6.2 Evaluation Metrics
2.6.3 Comparison with the Baselines
2.6.4 Comparison with State-of-the-Art Methods
2.7 Summary
3. Stochastic Class-Based Hard Sample Mining
3.1 Introduction
3.2 Related Works
3.3 Deep Metric Learning with Triplet Loss
3.3.1 Triplet Loss
3.3.2 Efficient Learning with Triplet Loss
3.4 Batch Construction for Metric Learning
3.4.1 Neighbor Class Mining by Class Signatures
3.4.2 Batch Construction
3.4.3 Scalable Extension to the Number of Classes
3.5 Loss
3.6 Feature Extractor
3.7 Experiments
3.7.1 Datasets
3.7.2 Implementation Details
3.7.3 Evaluation Metrics
3.7.4 Effect of the Stochastic Hard Example Mining
3.7.5 Comparison with the Existing Methods on Image Retrieval Datasets
3.8 Summary
4. Integrated System for Person Re-identification
4.1 Introduction
4.2 Hard Positive Mining
4.3 Integrated System for Person Re-identification
4.4 Experiments
4.4.1 Comparison with the Baselines
4.4.2 Comparison with the Existing Works
4.5 Summary
5. Conclusion
5.1 Contributions
5.2 Future Works
Abstract (In Korean)
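The part-aligned bilinear pooling at the heart of Chapter 2 can be sketched as the aggregation of appearance features weighted by each part map; the shapes and normalization below are illustrative choices:

```python
import numpy as np

def part_aligned_pooling(appearance, parts):
    # Bilinear pooling between an appearance map (H x W x C) and a
    # part map (H x W x P): aggregating appearance features weighted
    # by each part's spatial map yields a P x C representation whose
    # comparison no longer depends on where the body parts appear.
    H, W, C = appearance.shape
    P = parts.shape[2]
    A = appearance.reshape(H * W, C)
    M = parts.reshape(H * W, P)
    return M.T @ A / (H * W)  # (P, C) part-pooled features
```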
Deep learning with very few and no labels
Deep neural networks have achieved remarkable performance in many computer vision applications such as image classification, object detection, instance segmentation, image retrieval, and person re-identification. However, to achieve the desired performance, deep neural networks often need a tremendously large set of labeled training samples to learn their huge network models. Labeling a large dataset is labor-intensive, time-consuming, and sometimes requires expert knowledge. In this research, we study the following important question: how can deep neural networks be trained with very few or even no labeled samples? This leads to research tasks in two major areas: semi-supervised and unsupervised learning. Specifically, for semi-supervised learning, we developed two major approaches. The first is the Snowball approach, which learns a deep neural network from very few samples based on iterative model evolution and confident sample discovery. The second is the learned model composition approach, which composes more efficient master networks from student models of past iterations through a network learning process. Critical sample discovery is developed to discover new critical unlabeled samples near the model decision boundary and provide the master model with lookahead access to these samples to enhance its guidance capability. For unsupervised learning, we have explored two major ideas. The first is transformed attention consistency, where the network is learned based on self-supervision information across images instead of within one single image. The second is spatial assembly networks for image representation learning. We introduce a new learnable module, called the spatial assembly network (SAN), which performs a learned re-organization and assembly of feature points and improves the network's capabilities in handling spatial variations and structural changes of the image scene.
Our experimental results on benchmark datasets demonstrate that our proposed methods have significantly improved the state of the art in semi-supervised and unsupervised learning, outperforming existing methods by large margins.
Includes bibliographical references.
Person Re-identification by Local Maximal Occurrence Representation and Metric Learning
Person re-identification is an important technique towards automatic search
of a person's presence in a surveillance video. Two fundamental problems are
critical for person re-identification: feature representation and metric
learning. An effective feature representation should be robust to illumination
and viewpoint changes, and a discriminant metric should be learned to match
various person images. In this paper, we propose an effective feature
representation called Local Maximal Occurrence (LOMO), and a subspace and
metric learning method called Cross-view Quadratic Discriminant Analysis
(XQDA). The LOMO feature analyzes the horizontal occurrence of local features,
and maximizes the occurrence to make a stable representation against viewpoint
changes. Besides, to handle illumination variations, we apply the Retinex
transform and a scale invariant texture operator. To learn a discriminant
metric, we propose to learn a discriminant low dimensional subspace by
cross-view quadratic discriminant analysis, and simultaneously, a QDA metric is
learned on the derived subspace. We also present a practical computation method
for XQDA, as well as its regularization. Experiments on four challenging person
re-identification databases, VIPeR, QMUL GRID, CUHK Campus, and CUHK03, show
that the proposed method improves the state-of-the-art rank-1 identification
rates by 2.2%, 4.88%, 28.91%, and 31.55% on the four databases, respectively.
Comment: This paper has been accepted by CVPR 2015. For source code and
extracted features, please visit
http://www.cbsr.ia.ac.cn/users/scliao/projects/lomo_xqda
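The horizontal max-occurrence step at the core of LOMO can be sketched as follows, assuming local histograms have already been computed per patch; the array layout is an assumption for illustration:

```python
import numpy as np

def lomo_horizontal_max(hist_maps):
    # Core LOMO idea: within each horizontal stripe, take the maximal
    # occurrence of every local-histogram bin across all patches in
    # that stripe, making the descriptor stable against viewpoint
    # changes that shift a person horizontally.
    # `hist_maps` has shape (stripes, patches_per_stripe, bins).
    return np.asarray(hist_maps).max(axis=1).reshape(-1)
```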
Person re-identification via efficient inference in fully connected CRF
In this paper, we address the person re-identification problem, i.e.,
retrieving the instances from the gallery that belong to the same person
as the given probe image. This is very challenging because a person's
appearance usually undergoes significant variations due to changes in
illumination, camera angle and view, background clutter, and occlusion over the
camera network. In this paper, we assume that the matched gallery images should
not only be similar to the probe, but also be similar to each other, under
suitable metric. We express this assumption with a fully connected CRF model in
which each node corresponds to a gallery image and every pair of nodes is
connected by an edge. A label variable is associated with each node to indicate
whether the corresponding image is from the target person. We define the unary
potential for each node using existing feature calculation and matching
techniques, reflecting the similarity between the probe and a gallery image,
and define the pairwise potential for each edge as a weighted combination of
Gaussian kernels, which encodes the appearance similarity between a pair of
gallery images. This specific form of pairwise potential allows us to exploit
an efficient inference algorithm to calculate the marginal distribution of each
label variable in this densely connected CRF. We show the superiority of our
method by applying it to public datasets and comparing with the state of the art.
Comment: 7 pages, 4 figures
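A single-kernel mean-field sketch of the inference described above, assuming binary labels and one Gaussian appearance kernel; the paper's weighted kernel combination and exact update rules may differ:

```python
import numpy as np

def mean_field_dense_crf(unary, feats, w=1.0, sigma=1.0, iters=10):
    # Mean-field inference for a fully connected binary CRF over
    # gallery images: `unary` holds per-node scores for label 1
    # ("matches the probe"); the pairwise potential is a Gaussian
    # kernel on appearance features, encouraging similar gallery
    # images to take the same label.
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    K = w * np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(K, 0.0)                # no self-messages
    q = 1.0 / (1.0 + np.exp(-unary))        # initialize from unaries
    for _ in range(iters):
        msg = K @ (2 * q - 1)               # neighbours vote for/against
        q = 1.0 / (1.0 + np.exp(-(unary + msg)))
    return q  # marginal probability that each gallery image matches
```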
…