34 research outputs found

    Progressive Cross-camera Soft-label Learning for Semi-supervised Person Re-identification

    In this paper, we focus on the semi-supervised person re-identification (Re-ID) case, which has only intra-camera (within-camera) labels and no inter-camera (cross-camera) labels. In real-world applications, these intra-camera labels can be obtained more readily than cross-camera labels, either from tracking algorithms or from a few manual annotations. In this case, it is very difficult to explore the relationships between cross-camera persons in the training stage due to the lack of cross-camera label information. To deal with this issue, we propose a novel Progressive Cross-camera Soft-label Learning (PCSL) framework for the semi-supervised person Re-ID task, which can generate cross-camera soft-labels and utilize them to optimize the network. Concretely, we calculate an affinity matrix based on person-level features and adapt it to produce the similarities between cross-camera persons (i.e., cross-camera soft-labels). To exploit these soft-labels to train the network, we investigate the weighted cross-entropy loss and the weighted triplet loss from the classification and discrimination perspectives, respectively. In particular, the proposed framework alternately generates progressive cross-camera soft-labels and gradually improves feature representations over the whole learning course. Extensive experiments on five large-scale benchmark datasets show that PCSL significantly outperforms the state-of-the-art unsupervised methods that employ labeled source domains or the images generated by GAN-based models. Furthermore, the proposed method is even competitive with deep supervised Re-ID methods. Comment: Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT).
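
As a rough illustration of the soft-label idea in the abstract above, the sketch below turns similarities between one person's feature and the identities of another camera into soft labels, then uses them as weights in a cross-entropy term. This is not the PCSL implementation: the cosine similarity, the softmax with a `temperature` parameter, the function names, and the toy feature vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def soft_labels(query, gallery, temperature=0.1):
    """Softmax over similarities: one weight per identity in the other camera.

    The temperature value is an assumed hyper-parameter, not taken from the paper.
    """
    sims = [cosine(query, g) / temperature for g in gallery]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(probs, weights, eps=1e-12):
    """Cross-entropy in which each class term is scaled by its soft label."""
    return -sum(w * math.log(p + eps) for w, p in zip(weights, probs))

# Hypothetical person-level features for three identities seen by camera B.
cam_b = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
# Hypothetical feature of one person seen by camera A.
query = [0.9, 0.1]

labels = soft_labels(query, cam_b)
# Hypothetical classifier output over camera B's identities.
loss = weighted_cross_entropy([0.8, 0.05, 0.15], labels)
```

In an alternating scheme like the one the abstract describes, the soft labels would be recomputed from updated features after each training round.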

    Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment

    Multi-object tracking (MOT) has profound applications in a variety of fields, including surveillance, sports analytics, self-driving, and cooperative robotics. Despite considerable advancements, existing MOT methodologies tend to falter when faced with non-uniform movements, occlusions, and appearance-reappearance scenarios of the objects. Recognizing this inadequacy, we put forward an integrated MOT method that not only combines object detection and identity linkage within a single, end-to-end trainable framework but also equips the model with the ability to maintain object identity links over long periods of time. Our proposed model, named STMMOT, is built around four key modules: 1) candidate proposal generation, which generates object proposals via a vision-transformer encoder-decoder architecture that detects the object from each frame in the video; 2) scale variant pyramid, a progressive pyramid structure to learn the self-scale and cross-scale similarities in multi-scale feature maps; 3) spatio-temporal memory encoder, extracting the essential information from the memory associated with each object under tracking; and 4) spatio-temporal memory decoder, simultaneously resolving the tasks of object detection and identity association for MOT. Our system leverages a robust spatio-temporal memory module that retains extensive historical observations and effectively encodes them using an attention-based aggregator. The uniqueness of STMMOT lies in representing objects as dynamic query embeddings that are updated continuously, which enables the prediction of object states with attention mechanisms and eliminates the need for post-processing.

    Large-scale Fully-Unsupervised Re-Identification

    Fully-unsupervised Person and Vehicle Re-Identification have received increasing attention due to their broad applicability in surveillance, forensics, event understanding, and smart cities, without requiring any manual annotation. However, most of the prior art has been evaluated on datasets that have just a couple thousand samples. Such small-data setups often allow the use of techniques that are costly in time and memory, such as Re-Ranking, to improve clustering results. Moreover, some previous work even pre-selects the best clustering hyper-parameters for each dataset, which is unrealistic in a large-scale fully-unsupervised scenario. In this context, this work tackles a more realistic scenario and proposes two strategies to learn from large-scale unlabeled data. The first strategy performs local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships. The second strategy leverages a novel Re-Ranking technique, which has a lower upper bound on time complexity and reduces the memory complexity from O(n^2) to O(kn) with k << n. To avoid the pre-selection of specific hyper-parameter values for the clustering algorithm, we also present a novel scheduling algorithm that adjusts the density parameter during training, to leverage the diversity of samples and keep the learning robust to noisy labeling. Finally, due to the complementary knowledge learned by different models, we also introduce a co-training strategy that relies upon the permutation of predicted pseudo-labels among the backbones, with no need for any hyper-parameters or weighting optimization. The proposed methodology outperforms the state-of-the-art methods on well-known benchmarks and on the challenging large-scale Veri-Wild dataset, with a faster and memory-efficient Re-Ranking strategy and a large-scale, noise-robust, ensemble-based learning approach. Comment: This paper has been submitted for possible publication in an IEEE Transaction
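
The memory claim in the abstract above, O(n^2) reduced to O(kn), can be pictured with a generic top-k neighbor list. The sketch below is not the paper's Re-Ranking technique: it only shows that keeping the k nearest neighbors of each sample stores k entries per sample instead of a full n-by-n distance matrix. The function name, the squared Euclidean distance, and the toy points are assumptions, and this brute-force version still spends O(n^2) time.

```python
import heapq

def topk_neighbors(features, k):
    """Return, for each sample, its k nearest neighbors as (distance, index) pairs.

    Only k entries per sample are retained, so storage grows as O(k * n)
    rather than O(n^2). Distances are squared Euclidean (an assumed choice).
    """
    neighbors = []
    for i, fi in enumerate(features):
        # Generator: distances are produced one at a time and never stored in full.
        dists = (
            (sum((a - b) ** 2 for a, b in zip(fi, fj)), j)
            for j, fj in enumerate(features)
            if j != i
        )
        neighbors.append(heapq.nsmallest(k, dists))
    return neighbors

# Hypothetical 2-D features forming two loose groups.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.2]]
nbrs = topk_neighbors(points, k=2)
```

With k fixed and much smaller than n, the neighbor lists stay small even as the dataset grows, which is the property the abstract attributes to its own technique.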

    Understanding Complex Human Behaviour in Images and Videos.

    Understanding human motions and activities in images and videos is an important problem in many application domains, including surveillance, robotics, video indexing, and sports analysis. Although much progress has been made in classifying a single person's activities in simple videos, little effort has been made toward interpreting the behaviors of multiple people in natural videos. In this thesis, I present my research toward understanding the behaviors of multiple people in natural images and videos. I identify four major challenges in this problem: i) identifying individual properties of people in videos, ii) modeling and recognizing the behavior of multiple people, iii) understanding human activities at multiple levels of resolution, and iv) learning characteristic patterns of interactions between people, or between people and the surrounding environment. I discuss how we solve these challenging problems using various computer vision and machine learning technologies. I conclude with final remarks, observations, and possible future research directions. PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/99956/1/wgchoi_1.pd

    Discovering Discriminative Geometric Features with Self-Supervised Attention for Vehicle Re-Identification and Beyond

    In the literature on vehicle re-identification (ReID), intensive manual labels such as landmarks, critical parts, or semantic segmentation masks are often required to improve performance. Such extra information helps to detect local geometric features as part of representation learning for vehicles. In contrast, in this paper, we aim to address the challenge of automatically learning to detect geometric features as landmarks with no extra labels. To the best of our knowledge, we are the first to successfully learn discriminative geometric features for vehicle ReID based on self-supervised attention. Specifically, we implement an end-to-end trainable deep network architecture consisting of three branches: (1) a global branch as the backbone for image feature extraction, (2) an attentional branch for producing attention masks, and (3) a self-supervised branch for regularizing the attention learning with rotated images to locate geometric features. We conduct comprehensive experiments on three benchmark datasets for vehicle ReID, i.e., VeRi-776, CityFlow-ReID, and VehicleID, and demonstrate our state-of-the-art performance. We also show the good generalization of our approach in other ReID tasks such as person ReID and multi-target multi-camera (MTMC) vehicle tracking. Our demo code is attached in the supplementary file.