1,647 research outputs found
Transformer Based Multi-Grained Features for Unsupervised Person Re-Identification
Multi-grained features extracted from convolutional neural networks (CNNs)
have demonstrated their strong discrimination ability in supervised person
re-identification (Re-ID) tasks. Inspired by them, this work investigates the
way of extracting multi-grained features from a pure transformer network to
address the unsupervised Re-ID problem that is label-free but much more
challenging. To this end, we build a dual-branch network architecture based
upon a modified Vision Transformer (ViT). The local tokens output in each
branch are reshaped and then uniformly partitioned into multiple stripes to
generate part-level features, while the global tokens of two branches are
averaged to produce a global feature. Further, based upon offline-online
associated camera-aware proxies (O2CAP) that is a top-performing unsupervised
Re-ID method, we define offline and online contrastive learning losses with
respect to both global and part-level features to conduct unsupervised
learning. Extensive experiments on three person Re-ID datasets show that the
proposed method outperforms state-of-the-art unsupervised methods by a
considerable margin, greatly mitigating the gap to supervised counterparts.
Code will be available soon at https://github.com/RikoLi/WACV23-workshop-TMGF.Comment: Accepted by WACVW 2023, 3rd Workshop on Real-World Surveillance:
Applications and Challenge
Tracking by Animation: Unsupervised Learning of Multi-Object Attentive Trackers
Online Multi-Object Tracking (MOT) from videos is a challenging computer
vision task which has been extensively studied for decades. Most of the
existing MOT algorithms are based on the Tracking-by-Detection (TBD) paradigm
combined with popular machine learning approaches which largely reduce the
human effort to tune algorithm parameters. However, the commonly used
supervised learning approaches require the labeled data (e.g., bounding boxes),
which is expensive for videos. Also, the TBD framework is usually suboptimal
since it is not end-to-end, i.e., it considers the task as detection and
tracking, but not jointly. To achieve both label-free and end-to-end learning
of MOT, we propose a Tracking-by-Animation framework, where a differentiable
neural model first tracks objects from input frames and then animates these
objects into reconstructed frames. Learning is then driven by the
reconstruction error through backpropagation. We further propose a
Reprioritized Attentive Tracking to improve the robustness of data association.
Experiments conducted on both synthetic and real video datasets show the
potential of the proposed model. Our project page is publicly available at:
https://github.com/zhen-he/tracking-by-animationComment: CVPR 201
Online Video Instance Segmentation via Robust Context Fusion
Video instance segmentation (VIS) aims at classifying, segmenting and
tracking object instances in video sequences. Recent transformer-based neural
networks have demonstrated their powerful capability of modeling
spatio-temporal correlations for the VIS task. Relying on video- or clip-level
input, they suffer from high latency and computational cost. We propose a
robust context fusion network to tackle VIS in an online fashion, which
predicts instance segmentation frame-by-frame with a few preceding frames. To
acquire the precise and temporal-consistent prediction for each frame
efficiently, the key idea is to fuse effective and compact context from
reference frames into the target frame. Considering the different effects of
reference and target frames on the target prediction, we first summarize
contextual features through importance-aware compression. A transformer encoder
is adopted to fuse the compressed context. Then, we leverage an
order-preserving instance embedding to convey the identity-aware information
and correspond the identities to predicted instance masks. We demonstrate that
our robust fusion network achieves the best performance among existing online
VIS methods and is even better than previously published clip-level methods on
the Youtube-VIS 2019 and 2021 benchmarks. In addition, visual objects often
have acoustic signatures that are naturally synchronized with them in
audio-bearing video recordings. By leveraging the flexibility of our context
fusion network on multi-modal data, we further investigate the influence of
audios on the video-dense prediction task, which has never been discussed in
existing works. We build up an Audio-Visual Instance Segmentation dataset, and
demonstrate that acoustic signals in the wild scenarios could benefit the VIS
task
- …