Rethinking Temporal Fusion for Video-based Person Re-identification on Semantic and Time Aspect
Recently, research interest in person re-identification (ReID) has
gradually shifted to video-based methods, which acquire a person representation
by aggregating the frame features of an entire video. However, existing video-based
ReID methods do not consider the semantic differences among the outputs of
different network stages, which potentially compromises the information
richness of the person features. Furthermore, traditional methods ignore
important relationships among frames, which causes information redundancy in
fusion along the time axis. To address these issues, we propose a novel general
temporal fusion framework that aggregates frame features along both the semantic
and the time aspects. For the semantic aspect, a multi-stage fusion network
fuses richer frame features at multiple semantic levels, which effectively
reduces the information loss caused by traditional single-stage fusion. For the
time aspect, the existing intra-frame attention method is improved by adding a
novel inter-frame attention module, which effectively reduces information
redundancy in temporal fusion by taking the relationships among frames into
consideration. Experimental results show that our approach effectively improves
video-based re-identification accuracy, achieving state-of-the-art performance.
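The inter-frame attention idea above can be illustrated with a minimal sketch: frames that are highly similar to the rest of the clip carry redundant information, so they receive lower fusion weights. The function name, the cosine-similarity redundancy score, and the softmax weighting are illustrative assumptions, not the paper's actual module.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def inter_frame_fusion(frames):
    """Fuse T frame features of shape (T, D) into one (D,) clip vector.

    Illustrative sketch: a frame's redundancy is its mean cosine
    similarity to the other frames; redundant frames are down-weighted
    before the weighted sum along the time axis.
    """
    T, D = frames.shape
    norm = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    sim = norm @ norm.T                                   # (T, T) cosine similarity
    redundancy = (sim.sum(axis=1) - 1.0) / max(T - 1, 1)  # mean similarity to others
    weights = softmax(-redundancy)                        # redundant frames get less weight
    return weights @ frames                               # (D,) fused representation
```

For example, in a clip of two near-identical frames and one distinctive frame, the distinctive frame receives the largest weight, which is the redundancy-reduction behaviour the abstract describes.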
VID-Trans-ReID: Enhanced Video Transformers for Person Re-identification
Video-based person Re-identification (Re-ID) has received increasing attention recently due to its important role within surveillance video analysis. Video-based Re-ID expands upon earlier image-based methods by extracting person features temporally across multiple video image frames. The key challenge within person Re-ID is extracting a robust feature representation that is invariant to the challenges of pose and illumination variation across multiple camera viewpoints. Whilst most contemporary methods use a CNN-based methodology, recent advances in vision transformer (ViT) architectures boost fine-grained feature discrimination via the use of multi-head attention without any loss of feature robustness. To specifically enable ViT architectures to effectively address the challenges of video person Re-ID, we propose two novel module constructs, Temporal Clip Shift and Shuffled (TCSS) and Video Patch Part Feature (VPPF), that boost the robustness of the resultant Re-ID feature representation. Furthermore, we combine our proposed approach with current best practices spanning both image- and video-based Re-ID, including camera view embedding. Our proposed approach outperforms existing state-of-the-art work on the MARS, PRID2011, and iLIDS-VID Re-ID benchmark datasets, achieving 96.36%, 96.63%, and 94.67% rank-1 accuracy respectively, and achieving 90.25% mAP on MARS.
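The abstract does not spell out how TCSS operates, but clip-level shift operations of this kind are commonly realised by moving a fraction of embedding channels between neighbouring frames so each frame's tokens mix temporal context. The sketch below shows that general mechanism on a (frames, patch tokens, channels) tensor; the function name, the shift fraction, and the channel layout are assumptions for illustration, not the paper's definition of TCSS.

```python
import numpy as np

def temporal_channel_shift(clip, shift_frac=0.25):
    """Illustrative temporal shift over patch tokens.

    clip: array of shape (T, N, D) -- T frames, N patch tokens, D channels.
    The first k channels are shifted forward in time and the next k
    backward, so each frame's tokens carry information from its
    temporal neighbours; the remaining channels are left unchanged.
    """
    T, N, D = clip.shape
    k = int(D * shift_frac)
    out = clip.copy()
    out[1:, :, :k] = clip[:-1, :, :k]        # channels 0..k-1: shift forward in time
    out[:-1, :, k:2 * k] = clip[1:, :, k:2 * k]  # channels k..2k-1: shift backward
    return out
```

A token-shuffle step (permuting patch tokens across frames of a clip) could be layered on top in the same spirit, but the details would likewise be implementation choices not given in the abstract.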
- …