13 research outputs found
Dynamic Prototype Mask for Occluded Person Re-Identification
Although person re-identification has achieved an impressive improvement in
recent years, occlusion caused by various obstacles remains an unresolved
issue in real-world application scenarios. Existing methods mainly
address this issue by employing body clues provided by an extra network to
distinguish the visible part. Nevertheless, the inevitable domain gap between
the auxiliary model and the ReID datasets greatly increases the difficulty of
obtaining an effective and efficient model. To dispense with extra
pre-trained networks and achieve automatic alignment in an end-to-end
trainable network, we propose a novel Dynamic Prototype Mask (DPM) based on two
pieces of self-evident prior knowledge. Specifically, we first devise a
Hierarchical Mask Generator that utilizes hierarchical semantics to select the visible
pattern space between the high-quality holistic prototype and the feature
representation of the occluded input image. In this way, the occluded
representation can be spontaneously aligned within the selected subspace.
Then, to enrich the feature representation of the high-quality holistic
prototype and provide a more complete feature space, we introduce a Head Enrich
Module to encourage different heads to aggregate representations of different
patterns across the whole image. Extensive experimental evaluations conducted
on occluded and holistic person re-identification benchmarks demonstrate the
superior performance of the DPM over the state-of-the-art methods. The code is
released at https://github.com/stone96123/DPM. Comment: Accepted by ACM MM 2022.
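The prototype-guided masking idea can be illustrated with a toy numpy sketch. This is a hypothetical simplification, not the paper's actual architecture: it assumes pattern-wise feature vectors and selects visible patterns by cosine similarity to a holistic prototype, whereas the actual Hierarchical Mask Generator is learned end-to-end.

```python
import numpy as np

def dynamic_prototype_mask(prototype, occluded_feat, keep_ratio=0.5):
    """Toy sketch of prototype-guided masking (not the paper's exact model).

    prototype:     (K, D) holistic prototype patterns
    occluded_feat: (K, D) pattern features of an occluded image
    Returns the features restricted to the visible-pattern subspace, plus the mask.
    """
    # cosine similarity between each prototype pattern and its feature
    num = (prototype * occluded_feat).sum(axis=1)
    den = (np.linalg.norm(prototype, axis=1)
           * np.linalg.norm(occluded_feat, axis=1) + 1e-8)
    sim = num / den
    # keep the top-k patterns most consistent with the holistic prototype,
    # treating the rest as occluded
    k = max(1, int(keep_ratio * len(sim)))
    mask = np.zeros_like(sim)
    mask[np.argsort(sim)[-k:]] = 1.0
    return occluded_feat * mask[:, None], mask
```

Patterns whose features were corrupted by an obstacle have low similarity to the prototype and are masked out, so matching happens only in the visible subspace.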
Approximated Prompt Tuning for Vision-Language Pre-trained Models
Prompt tuning is a parameter-efficient way to deploy large-scale pre-trained
models to downstream tasks by adding task-specific tokens. In terms of
vision-language pre-trained (VLP) models, prompt tuning often requires a large
number of learnable tokens to bridge the gap between the pre-training and
downstream tasks, which greatly exacerbates the already high computational
overhead. In this paper, we revisit the principle of prompt tuning for
Transformer-based VLP models and reveal that the impact of soft prompt tokens
can actually be approximated via independent information diffusion steps,
thereby avoiding the expensive global attention modeling and reducing the
computational complexity to a large extent. Based on this finding, we propose a
novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer
learning. To validate APT, we apply it to two representative VLP models, namely
ViLT and METER, and conduct extensive experiments on a range of downstream
tasks. Meanwhile, the generalization of APT is also validated on CLIP for image
classification. The experimental results not only show the superior performance
gains and computational efficiency of APT over conventional prompt tuning
methods, e.g., +6.6% accuracy and -64.62% additional computation overhead on
METER, but also confirm its merits over other parameter-efficient transfer
learning approaches.
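The observation that prompt influence can be computed independently of the global attention can be illustrated with the exact softmax decomposition that motivates it. The toy numpy sketch below (function names invented for illustration) shows that attention over the concatenation [prompts; inputs] splits into an input-attention term plus a separately computed prompt term sharing one normalisation; APT's actual approximation simplifies this decomposition further.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn_with_prompts_full(Q, K, V, Kp, Vp):
    """Standard attention over the concatenation [prompts; inputs]."""
    scores = Q @ np.concatenate([Kp, K], axis=0).T
    w = softmax(scores)
    return w @ np.concatenate([Vp, V], axis=0)

def attn_with_prompts_decomposed(Q, K, V, Kp, Vp):
    """Same output via two independent terms: input attention plus a
    prompt 'diffusion' term, merged under a shared normalisation."""
    sx = np.exp(Q @ K.T)    # input-token scores
    sp = np.exp(Q @ Kp.T)   # prompt scores, computed independently
    denom = sx.sum(-1, keepdims=True) + sp.sum(-1, keepdims=True)
    return (sx @ V + sp @ Vp) / denom
```

The decomposition means the prompt contribution never requires extending the quadratic attention map to the extra tokens, which is the source of APT's savings.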
Bag of Features with Dense Sampling for Visual Tracking
The bag-of-features model has become a state-of-the-art method for visual classification. Visual codebooks, built by quantizing robust appearance descriptors extracted from local image patches, capture image statistics for object detection and classification. In this paper, more information about target objects is captured by dense sampling rather than sparse sampling. A robust visual tracking method is then proposed based on dense sampling and the bag-of-features model. Firstly, local image patches are densely extracted by sliding windows and represented as invariant descriptors. Secondly, visual codebooks are generated by fast clustering algorithms such as hierarchical k-means. The object region and candidate regions are then represented by the bag-of-features model with the learnt codebooks. After that, tracking operates within a Bayesian inference framework. The bag-of-features tracking method with dense sampling is adaptive and flexible. It works independently in many situations without requiring complementary existing tracking algorithms. Experiments on various challenging videos demonstrate that the proposed tracker outperforms several state-of-the-art algorithms.
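The representation pipeline described above can be sketched in a few numpy functions. This is a minimal illustration under assumed simplifications (raw grayscale patches stand in for invariant descriptors, plain k-means for hierarchical k-means; all names are hypothetical):

```python
import numpy as np

def dense_patches(img, size=4, stride=2):
    """Densely extract flattened patches with a sliding window."""
    H, W = img.shape
    return np.array([img[i:i + size, j:j + size].ravel()
                     for i in range(0, H - size + 1, stride)
                     for j in range(0, W - size + 1, stride)])

def kmeans_codebook(feats, k, iters=10, seed=0):
    """Plain k-means as a stand-in for fast hierarchical k-means."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        d = ((feats[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = feats[labels == c].mean(0)
    return centers

def bof_histogram(patches, centers):
    """Quantise patches against the codebook; L1-normalised histogram."""
    d = ((patches[:, None] - centers[None]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(1), minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

In a tracker, the object region and each candidate region would be reduced to such histograms, and candidate likelihoods for the Bayesian inference step would come from a histogram similarity.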
Robust visual tracking via part-based sparsity model
Conference Name: 2013 38th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013. Conference Address: Vancouver, BC, Canada. Time: May 26, 2013 - May 31, 2013. IEEE Signal Processing Society.
The sparse representation has been widely used in many areas, including visual tracking. Part-based representation performs outstandingly by using non-holistic templates to resist occlusion. This paper combines the two and proposes a robust object tracking method using a part-based sparsity model for tracking an object in a video sequence. In the proposed model, an object is represented by image patches. The candidates for these patches are sparsely represented in the space spanned by the patch templates and trivial templates. The part-based method takes the spatial information of each patch into consideration, where the vote maps of multiple patches are used. Furthermore, the update scheme keeps the representative templates of each part current dynamically. Therefore, the tracker can effectively deal with appearance changes and heavy occlusion. Abundant experimental results on various public benchmark videos demonstrate that the proposed tracking method outperforms many existing state-of-the-art algorithms. © 2013 IEEE
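The core patch-level sparse coding step can be sketched with a minimal ISTA solver over a dictionary of patch templates plus trivial (identity) templates, the standard construction in sparse-representation tracking. This is an illustrative sketch, not the paper's solver; the vote maps and template update scheme are omitted, and all names are hypothetical.

```python
import numpy as np

def ista_sparse_code(D, y, lam=0.1, iters=200):
    """Minimal ISTA for  min_x 0.5*||Dx - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth part
    x = np.zeros(D.shape[1])
    for _ in range(iters):
        g = D.T @ (D @ x - y)                 # gradient step
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return x

def part_dictionary(templates):
    """Dictionary = [patch templates | trivial (identity) templates].
    Trivial templates absorb occluded pixels so real templates stay clean."""
    d = templates.shape[0]
    return np.concatenate([templates, np.eye(d)], axis=1)
```

A candidate patch that matches a template yields a sparse code concentrated on that template, while occlusion energy leaks into the trivial-template coefficients, which is what makes the reconstruction error a usable per-part vote.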
Dual Distribution Alignment Network for Generalizable Person Re-Identification
Domain generalization (DG) offers a preferable real-world setting for Person Re-Identification (Re-ID), in which a model is trained on multiple source domain datasets and expected to perform well on an unseen target domain without any model updating. Unfortunately, most DG approaches are designed explicitly for classification tasks, which fundamentally differ from Re-ID, a retrieval task. Moreover, existing applications of DG in Re-ID cannot correctly handle the massive variation among Re-ID datasets. In this paper, we identify two fundamental challenges in DG for Person Re-ID: domain-wise variations and identity-wise similarities. To this end, we propose an end-to-end Dual Distribution Alignment Network (DDAN) that learns domain-invariant features with dual-level constraints: domain-wise adversarial feature learning and identity-wise similarity enhancement. These constraints effectively reduce the domain shift among multiple source domains while remaining consistent with real-world scenarios. We evaluate our method on a large-scale DG Re-ID benchmark and compare it with various cutting-edge DG approaches. Quantitative results show that DDAN achieves state-of-the-art performance.
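Domain-wise adversarial feature learning is typically realised with a gradient-reversal mechanism. The numpy sketch below is a hypothetical minimal version with a one-layer logistic domain classifier (the paper's network is far richer): the classifier receives its ordinary gradient, while the feature side receives the sign-reversed gradient, pushing features toward domain invariance.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def domain_adv_grads(feat, w, domain_label):
    """Gradients of a binary-cross-entropy domain loss w.r.t. the classifier
    weights and, with reversed sign, w.r.t. the features (gradient reversal)."""
    p = sigmoid(feat @ w)        # predicted probability of the source domain
    err = p - domain_label       # dL/dlogit for BCE
    grad_w = err * feat          # classifier learns to discriminate domains
    grad_feat = -(err * w)       # reversed: features learn to fool it
    return grad_w, grad_feat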