Progressive Cross-camera Soft-label Learning for Semi-supervised Person Re-identification
In this paper, we focus on the semi-supervised person re-identification
(Re-ID) case, which only has the intra-camera (within-camera) labels but not
inter-camera (cross-camera) labels. In real-world applications, these
intra-camera labels can be readily obtained by tracking algorithms or a few
manual annotations, compared with cross-camera labels. In this case, it is
very difficult to explore the relationships between cross-camera persons in the
training stage due to the lack of cross-camera label information. To deal with
this issue, we propose a novel Progressive Cross-camera Soft-label Learning
(PCSL) framework for the semi-supervised person Re-ID task, which can generate
cross-camera soft-labels and utilize them to optimize the network. Concretely,
we calculate an affinity matrix based on person-level features and adapt it
to produce the similarities between cross-camera persons (i.e., cross-camera
soft-labels). To exploit these soft-labels to train the network, we investigate
the weighted cross-entropy loss and the weighted triplet loss from the
classification and discrimination perspectives, respectively. Particularly, the
proposed framework alternately generates progressive cross-camera soft-labels
and gradually improves feature representations in the whole learning course.
Extensive experiments on five large-scale benchmark datasets show that PCSL
significantly outperforms the state-of-the-art unsupervised methods that employ
labeled source domains or the images generated by the GAN-based models.
Furthermore, the proposed method even has a competitive performance with
respect to deep supervised Re-ID methods.
Comment: Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT).
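The soft-label idea described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the cosine affinity, the softmax temperature, and the function names are all assumptions made for the example.

```python
import numpy as np

def cross_camera_soft_labels(feats_a, feats_b, temperature=0.1):
    """Hypothetical sketch: turn an affinity matrix between per-camera
    person features into cross-camera soft labels (rows sum to 1)."""
    # Cosine affinity between persons seen by camera A and camera B.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    affinity = a @ b.T
    # A temperature-scaled softmax over cross-camera identities
    # converts affinities into soft labels.
    logits = affinity / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def weighted_cross_entropy(pred_probs, soft_labels, eps=1e-12):
    """Soft-label-weighted cross-entropy, averaged over samples."""
    return float(-(soft_labels * np.log(pred_probs + eps)).sum(axis=1).mean())
```

In a training loop, the soft labels would be regenerated as the feature extractor improves, which is the "progressive" alternation the abstract describes.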
Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment
Multi-object tracking (MOT) has profound applications in a variety of fields,
including surveillance, sports analytics, self-driving, and cooperative
robotics. Despite considerable advancements, existing MOT methodologies tend to
falter when faced with non-uniform movements, occlusions, and
appearance-reappearance scenarios of the objects. Recognizing this inadequacy,
we put forward an integrated MOT method that not only marries object detection
and identity linkage within a singular, end-to-end trainable framework but also
equips the model with the ability to maintain object identity links over long
periods of time. Our proposed model, named STMMOT, is built around four key
modules: 1) candidate proposal generation, which generates object proposals via
a vision-transformer encoder-decoder architecture that detects the object from
each frame in the video; 2) scale variant pyramid, a progressive pyramid
structure to learn the self-scale and cross-scale similarities in multi-scale
feature maps; 3) spatio-temporal memory encoder, extracting the essential
information from the memory associated with each object under tracking; and 4)
spatio-temporal memory decoder, simultaneously resolving the tasks of object
detection and identity association for MOT. Our system leverages a robust
spatio-temporal memory module that retains extensive historical observations
and effectively encodes them using an attention-based aggregator. The
uniqueness of STMMOT lies in representing objects as dynamic query embeddings
that are updated continuously, which enables the prediction of object states
with attention mechanisms and eradicates the need for post-processing.
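The attention-based aggregation of a track's spatio-temporal memory can be sketched in miniature as below. This is a simplified single-query illustration under assumed shapes, not the STMMOT module itself.

```python
import numpy as np

def attention_aggregate(query, memory):
    """Hypothetical sketch: fuse a track's stored past observations
    (memory, shape (T, d)) into one state vector via scaled
    dot-product attention against the track's query embedding (d,)."""
    d = query.shape[-1]
    scores = memory @ query / np.sqrt(d)       # one score per stored frame
    scores -= scores.max()                     # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    return weights @ memory                    # attention-weighted summary
```

In the full model, the result would update the dynamic query embedding for the next frame, so each object's representation evolves with its history.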
Large-scale Fully-Unsupervised Re-Identification
Fully-unsupervised Person and Vehicle Re-Identification have received
increasing attention due to their broad applicability in surveillance,
forensics, event understanding, and smart cities, without requiring any manual
annotation. However, most of the prior art has been evaluated in datasets that
have just a couple of thousand samples. Such small-data setups often allow the use
of techniques with costly time and memory footprints, such as Re-Ranking, to
improve clustering results. Moreover, some previous work even pre-selects the
best clustering hyper-parameters for each dataset, which is unrealistic in a
large-scale fully-unsupervised scenario. In this context, this work tackles a
more realistic scenario and proposes two strategies to learn from large-scale
unlabeled data. The first strategy performs a local neighborhood sampling to
reduce the dataset size in each iteration without violating neighborhood
relationships. A second strategy leverages a novel Re-Ranking technique, which
has a lower upper bound on time complexity and reduces the memory complexity
from O(n^2) to O(kn) with k << n. To avoid the pre-selection of specific
hyper-parameter values for the clustering algorithm, we also present a novel
scheduling algorithm that adjusts the density parameter during training, to
leverage the diversity of samples and keep the learning robust to noisy
labeling. Finally, due to the complementary knowledge learned by different
models, we also introduce a co-training strategy that relies upon the
permutation of predicted pseudo-labels, among the backbones, with no need for
any hyper-parameters or weighting optimization. The proposed methodology
outperforms the state-of-the-art methods in well-known benchmarks and in the
challenging large-scale Veri-Wild dataset, with a faster and memory-efficient
Re-Ranking strategy, and a large-scale, noisy-robust, and ensemble-based
learning approach.
Comment: This paper has been submitted for possible publication in an IEEE Transaction.
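The first strategy above, local neighborhood sampling, can be illustrated with a toy sketch. The cosine-similarity neighborhood and the function name are assumptions for the example; the paper's actual sampling policy may differ.

```python
import numpy as np

def neighborhood_sample(feats, anchor_idx, k):
    """Hypothetical sketch: pick an anchor's k nearest neighbors
    (by cosine similarity) as this iteration's working set, so the
    clustering step never has to process the full dataset at once."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = f @ f[anchor_idx]
    # Sort by descending similarity; the anchor is its own nearest neighbor.
    order = np.argsort(-sims)
    return order[:k]
```

Each training iteration would then cluster and learn only from this local subset, preserving neighborhood relationships while keeping time and memory bounded.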
Understanding Complex Human Behaviour in Images and Videos.
Understanding human motions and activities in images and videos is an important problem in many application domains, including surveillance, robotics, video indexing, and sports analysis. Although much progress has been made in classifying a single person's activities in simple videos, little effort has been made toward the interpretation of the behaviors of multiple people in natural videos. In this thesis, I present my research toward understanding the behaviors of multiple people in natural images and videos. I identify four major challenges in this problem: i) identifying individual properties of people in videos, ii) modeling and recognizing the behavior of multiple people, iii) understanding human activities at multiple levels of resolution, and iv) learning characteristic patterns of interactions between people, or between people and the surrounding environment. I discuss how we solve these challenging problems using various computer vision and machine learning techniques, and conclude with final remarks, observations, and possible future research directions.
PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/99956/1/wgchoi_1.pd
Discovering Discriminative Geometric Features with Self-Supervised Attention for Vehicle Re-Identification and Beyond
In the literature of vehicle re-identification (ReID), intensive manual
labels such as landmarks, critical parts or semantic segmentation masks are
often required to improve the performance. Such extra information helps to
detect local geometric features as part of representation learning for
vehicles. In contrast, in this paper, we aim to address the challenge of
automatically learning to detect geometric features as landmarks with no
extra labels. To the best of our knowledge, we are the first to
successfully learn discriminative geometric features for vehicle ReID based on
self-supervised attention. Specifically, we implement an end-to-end trainable
deep network architecture consisting of three branches: (1) a global branch as
backbone for image feature extraction, (2) an attentional branch for producing
attention masks, and (3) a self-supervised branch for regularizing the
attention learning with rotated images to locate geometric features.
We conduct comprehensive experiments on three benchmark datasets for vehicle
ReID, i.e., VeRi-776, CityFlow-ReID, and VehicleID, and demonstrate our
state-of-the-art performance. We also show the good generalization of our
approach in other ReID tasks such as person ReID and multi-target
multi-camera (MTMC) vehicle tracking. Our demo code is attached in the
supplementary file.
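The rotation-based self-supervision used by the third branch can be sketched as a standard 4-way rotation pretext task. This is an assumed, minimal rendering of the idea; the paper's branch and loss details are not reproduced here.

```python
import numpy as np

def rotation_pretext_batch(images):
    """Hypothetical sketch of the rotation pretext task: each image is
    rotated by 0/90/180/270 degrees and labeled with its rotation class.
    A branch trained to predict these labels provides the self-supervised
    signal that regularizes the attention masks toward geometric features."""
    rotated, labels = [], []
    for img in images:
        for r in range(4):                      # r quarter-turns
            rotated.append(np.rot90(img, k=r))  # rotate by r * 90 degrees
            labels.append(r)                    # rotation class as label
    return np.stack(rotated), np.array(labels)
```

Because predicting the rotation requires attending to orientation-sensitive parts (e.g., geometric landmarks rather than texture), this auxiliary task needs no extra manual labels.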