Saliency Guided End-to-End Learning for Weakly Supervised Object Detection
Weakly supervised object detection (WSOD), which is the problem of learning
detectors using only image-level labels, has been attracting more and more
interest. However, this problem is quite challenging due to the lack of
location supervision. To address this issue, this paper integrates saliency
into a deep architecture, in which location information is explored both
explicitly and implicitly. Specifically, we select highly confident object
proposals under the guidance of class-specific saliency maps. The location,
semantic, and saliency information of the selected proposals is then used to
explicitly supervise the network by imposing two additional losses. Meanwhile,
a saliency prediction sub-network is built into the architecture, and its
predictions implicitly guide the localization procedure. The entire network is
trained end-to-end. Experiments on PASCAL VOC demonstrate that our approach
outperforms state-of-the-art methods.
Comment: Accepted to appear in IJCAI 201
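The proposal-selection step described above can be sketched as follows. This is a minimal NumPy sketch under our own assumptions (proposals scored by the mean class-specific saliency inside their boxes; the function name, box format, and keep ratio are illustrative, not the paper's implementation):

```python
import numpy as np

def select_confident_proposals(saliency_map, boxes, keep_ratio=0.3):
    """Score each proposal by the mean class-specific saliency inside its
    box and keep the top fraction as high-confidence pseudo supervision.

    saliency_map: (H, W) class-specific saliency values in [0, 1]
    boxes:        list of (x1, y1, x2, y2) proposal boxes
    """
    scores = np.array([saliency_map[y1:y2, x1:x2].mean()
                       for x1, y1, x2, y2 in boxes])
    keep = max(1, int(len(boxes) * keep_ratio))
    return np.argsort(-scores)[:keep]   # indices of the kept proposals
```

The kept proposals would then supply the location targets for the two additional losses mentioned in the abstract.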
Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification
This work aims to adapt large-scale pre-trained vision-language models, such
as contrastive language-image pretraining (CLIP), to enhance the performance of
object re-identification (Re-ID) across various supervision settings. Although
prompt learning has enabled a recent work named CLIP-ReID to achieve promising
performance, the underlying mechanisms and the necessity of prompt learning
remain unclear due to the absence of semantic labels in Re-ID tasks. In this
work, we first analyze the role of prompt learning in CLIP-ReID and identify its
limitations. Based on our investigations, we propose a simple yet effective
approach to adapt CLIP for supervised object Re-ID. Our approach directly
fine-tunes the image encoder of CLIP using a prototypical contrastive learning
(PCL) loss, eliminating the need for prompt learning. Experimental results on
both person and vehicle Re-ID datasets demonstrate the competitiveness of our
method compared to CLIP-ReID. Furthermore, we extend our PCL-based CLIP
fine-tuning approach to unsupervised scenarios, where we achieve
state-of-the-art performance.
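The prototype-based objective above can be sketched as follows. This is a minimal NumPy sketch under our own assumptions (cluster-mean prototypes, an illustrative temperature), not the released implementation:

```python
import numpy as np

def pcl_loss(features, labels, prototypes, temperature=0.07):
    """Prototypical contrastive loss: pull each embedding toward the
    prototype of its own class, push it away from all other prototypes.

    features:   (N, D) L2-normalized image-encoder outputs
    labels:     (N,)   class (or cluster) index of each image
    prototypes: (K, D) L2-normalized class centroids
    """
    logits = features @ prototypes.T / temperature        # (N, K) similarities
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```

Fine-tuning then amounts to minimizing this loss over the image encoder alone, with no text prompts involved.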
Learning Intra and Inter-Camera Invariance for Isolated Camera Supervised Person Re-identification
Supervised person re-identification assumes that a person has images captured
under multiple cameras. However, when cameras are placed far apart, a person
rarely appears in more than one camera. This paper therefore studies person
re-ID under the isolated camera supervised (ISCS) setting. Instead of trying to
generate fake cross-camera features like previous methods, we explore a novel
perspective by making efficient use of the variation in training data. Under
the ISCS setting, a person has only a limited number of images from a single
camera, so the
camera bias becomes a critical issue confounding ID discrimination.
Cross-camera images are prone to being recognized as different IDs simply by
camera style. To eliminate the confounding effect of camera bias, we propose to
learn both intra- and inter-camera invariance under a unified framework. First,
we construct style-consistent environments via clustering, and perform
prototypical contrastive learning within each environment. Meanwhile, strongly
augmented images are contrasted with original prototypes to enforce
intra-camera augmentation invariance. For inter-camera invariance, we further
design a much improved variant of multi-camera negative loss that optimizes the
distance of multi-level negatives. The resulting model learns to be invariant
to both subtle and severe style variation within and across cameras. On multiple
benchmarks, we conduct extensive experiments and validate the effectiveness and
superiority of the proposed method. Code will be available at
https://github.com/Terminator8758/IICI.
Comment: ACM MultiMedia 202
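As an illustration of the intra-camera invariance described above, the following sketch groups images into style-consistent environments (simplified here to one environment per camera, whereas the paper constructs environments via clustering) and contrasts a strongly augmented feature against the original prototypes of its environment. All names and the temperature are our own assumptions:

```python
import numpy as np

def build_env_prototypes(features, ids, cams):
    """Group images into style-consistent environments (one environment
    per camera in this simplification) and average the L2-normalized
    features of each identity within an environment."""
    protos = {}
    for env in np.unique(cams):
        mask = cams == env
        for pid in np.unique(ids[mask]):
            f = features[mask & (ids == pid)].mean(axis=0)
            protos[(env, pid)] = f / np.linalg.norm(f)
    return protos

def aug_invariance_loss(aug_feat, env, pid, protos, temperature=0.1):
    """Contrast a strongly augmented feature against the ORIGINAL
    prototypes of its environment: the positive is its own identity's
    prototype, the negatives are the other identities in the same
    environment."""
    keys = [k for k in protos if k[0] == env]
    sims = np.array([aug_feat @ protos[k] for k in keys]) / temperature
    sims -= sims.max()                      # numerical stability
    pos = keys.index((env, pid))
    return -np.log(np.exp(sims[pos]) / np.exp(sims).sum())
```

Because the prototypes are computed from the original (un-augmented) images, minimizing this loss pushes the model to ignore the strong augmentations, which is the intra-camera augmentation invariance the abstract refers to.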
Transformer Based Multi-Grained Features for Unsupervised Person Re-Identification
Multi-grained features extracted from convolutional neural networks (CNNs)
have demonstrated their strong discrimination ability in supervised person
re-identification (Re-ID) tasks. Inspired by them, this work investigates the
way of extracting multi-grained features from a pure transformer network to
address the unsupervised Re-ID problem that is label-free but much more
challenging. To this end, we build a dual-branch network architecture based
upon a modified Vision Transformer (ViT). The local tokens output by each
branch are reshaped and then uniformly partitioned into multiple stripes to
generate part-level features, while the global tokens of two branches are
averaged to produce a global feature. Further, building upon offline-online
associated camera-aware proxies (O2CAP), a top-performing unsupervised
Re-ID method, we define offline and online contrastive learning losses with
respect to both global and part-level features to conduct unsupervised
learning. Extensive experiments on three person Re-ID datasets show that the
proposed method outperforms state-of-the-art unsupervised methods by a
considerable margin, greatly mitigating the gap to supervised counterparts.
Code will be available soon at https://github.com/RikoLi/WACV23-workshop-TMGF.
Comment: Accepted by WACVW 2023, 3rd Workshop on Real-World Surveillance:
Applications and Challenges
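The reshape-and-stripe pooling described above can be sketched as follows; the grid shape, stripe count, and token dimensionality are illustrative assumptions rather than the released configuration:

```python
import numpy as np

def multi_grained_features(local_tokens, grid_hw, num_stripes):
    """Reshape a branch's local tokens back to a 2-D patch grid, split
    the grid into horizontal stripes, and average-pool each stripe into
    one part-level feature.

    local_tokens: (H*W, D) patch tokens from one branch (class token removed)
    grid_hw:      (H, W)   the original patch-grid shape
    num_stripes:  number of horizontal parts (requires H % num_stripes == 0)
    """
    H, W = grid_hw
    grid = local_tokens.reshape(H, W, -1)                      # (H, W, D)
    stripes = grid.reshape(num_stripes, H // num_stripes, W, -1)
    return stripes.mean(axis=(1, 2))                           # (num_stripes, D)
```

With different stripe counts per branch, the two branches yield parts at different granularities, while averaging the branches' global tokens gives the single global feature mentioned above.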
Promoting Public Participation in Post-Disaster Construction through Wechat Platform
Purpose - How can memory, heritage, and post-disaster construction be integrated in practice? The purpose of this paper is to introduce our approach to public participation in reconstruction planning after a raging fire destroyed part of the historic town of Shangri-la, China.
Approach – We developed two kinds of crowdsourcing platforms to collect and present memories of the vanished streets, which were destroyed completely by the fire. One is based on Wechat. Through secondary development on the Wechat platform, we built a public service account that allows users to upload photos, hand-painted pictures, and text; all of the files are saved automatically in our database. The other platform is on the web. The website is designed for users to upload photos based on the location where they were taken. All the images collected from the two platforms can be openly accessed and viewed with location information, which was sorted by volunteers.
The Wechat platform is also used to communicate and to provide education and information about the historic town, promoting awareness of its heritage value. Users can send text to the public account without privacy risk.
Findings – Spread with the help of a local non-governmental organization, the invitations of the Wechat public service account received a remarkable amount of attention, which, according to automatic web statistics, reached up to 40,000. About 150 people followed the Wechat public account. In the end, we received nearly 1000 photos and hand-painted pictures. About half of our users are from the local Shangri-la community; their uploaded files include historical photos of the community, providing us with a local perspective built on long-standing concern. The other half are travellers from all over the world, mostly from China but also from Europe. Their photos and paintings also contribute to the construction of memory.
Implications – The widespread use of smart mobile devices can make individuals more active as participants in public affairs, on the premise of carefully designed infrastructure. In this way, new technologies may contribute to a people-centred principle in our conservation and design process.
Value – Our approach applies so-called Volunteered Geographic Information (VGI) (Goodchild, 2007) to collecting memory fragments for post-disaster construction. Thanks to the convenience of uploading photos and texts from mobile devices, we successfully involved both local people and travellers. The case might bring insight into the field of public participation practice.
Camera-aware Proxies for Unsupervised Person Re-Identification
This paper tackles the purely unsupervised person re-identification (Re-ID)
problem that requires no annotations. Some previous methods adopt clustering
techniques to generate pseudo labels and use the produced labels to train Re-ID
models progressively. These methods are relatively simple but effective.
However, most clustering-based methods take each cluster as a pseudo identity
class, neglecting the large intra-ID variance caused mainly by the change of
camera views. To address this issue, we propose to split each single cluster
into multiple proxies and each proxy represents the instances coming from the
same camera. These camera-aware proxies enable us to deal with large intra-ID
variance and generate more reliable pseudo labels for learning. Based on the
camera-aware proxies, we design both intra- and inter-camera contrastive
learning components for our Re-ID model to effectively learn the ID
discrimination ability within and across cameras. Meanwhile, a proxy-balanced
sampling strategy is also designed, which facilitates our learning further.
Extensive experiments on three large-scale Re-ID datasets show that our
proposed approach outperforms most unsupervised methods by a significant
margin. Especially on the challenging MSMT17 dataset, we gain Rank-1
and mAP improvements compared with the second-best method. Code is
available at: \texttt{https://github.com/Terminator8758/CAP-master}.
Comment: Accepted to AAAI 2021. Code is available at:
https://github.com/Terminator8758/CAP-master
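The cluster-splitting step above can be sketched as follows. This is a minimal sketch of building the camera-aware proxies; the dictionary representation and the normalized-mean proxy features are our own simplifications, not the released code:

```python
import numpy as np

def camera_aware_proxies(features, pseudo_labels, cam_ids):
    """Split each pseudo-identity cluster into per-camera proxies: the
    instances of one cluster captured by one camera share a proxy, whose
    feature is their normalized mean.

    features:      (N, D) instance features
    pseudo_labels: (N,)   cluster index from the clustering step
    cam_ids:       (N,)   camera index of each instance
    Returns {(cluster, camera): proxy_feature}.
    """
    proxies = {}
    for c in np.unique(pseudo_labels):
        for cam in np.unique(cam_ids[pseudo_labels == c]):
            mask = (pseudo_labels == c) & (cam_ids == cam)
            p = features[mask].mean(axis=0)
            proxies[(c, cam)] = p / np.linalg.norm(p)
    return proxies
```

Each proxy then serves as a positive or negative anchor in the intra- and inter-camera contrastive losses, so instances of one identity are no longer forced onto a single centroid that mixes camera styles.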