25 research outputs found
Clip-level feature aggregation : a key factor for video-based person re-identification
In the task of video-based person re-identification, features of persons in the query and gallery sets are compared to search for the best match. Most existing methods aggregate frame-level features with a temporal method to generate clip-level features, rather than sequence-level representations. In this paper, we propose a new method that aggregates clip-level features to obtain sequence-level representations of persons, consisting of two parts: the Average Aggregation Strategy (AAS) and Raw Feature Utilization (RFU). AAS makes use of all frames in a video sequence to generate a better representation of a person, while RFU investigates how the batch normalization operation influences feature representations in person re-identification. The experimental results demonstrate that our method boosts the accuracy of existing models. In particular, we achieve 87.7% rank-1 and 82.3% mAP on the MARS dataset without any post-processing, outperforming the existing state of the art.
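The two-level aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the clip length, feature dimension, and the use of plain averaging at both levels are assumptions for demonstration.

```python
import numpy as np

def aggregate_sequence(frame_features: np.ndarray, clip_len: int = 4) -> np.ndarray:
    """Average frame-level features into clip-level features, then average
    the clip-level features into a single sequence-level representation.

    frame_features: (num_frames, feat_dim) array of per-frame embeddings.
    """
    num_frames, feat_dim = frame_features.shape
    # Split the sequence into non-overlapping clips (drop any remainder).
    num_clips = num_frames // clip_len
    clips = frame_features[: num_clips * clip_len].reshape(num_clips, clip_len, feat_dim)
    clip_features = clips.mean(axis=1)   # clip-level aggregation
    return clip_features.mean(axis=0)    # sequence-level representation

# Hypothetical example: 8 frames with 4-dimensional embeddings.
feats = np.arange(32, dtype=float).reshape(8, 4)
seq = aggregate_sequence(feats, clip_len=4)  # one 4-dim vector per sequence
```

Because all frames contribute to the sequence-level vector, no frame information is discarded, which is the motivation behind AAS.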
Person Re-identification in Videos by Analyzing Spatio-temporal Tubes
Typical person re-identification frameworks search for the k best matches in a gallery of images that are often collected under varying conditions. For video re-identification applications, the gallery usually contains image sequences, but such a process is time consuming because video re-identification carries out the matching process multiple times. In this paper, we propose a new method that extracts spatio-temporal frame sequences, or tubes, of moving persons and performs re-identification quickly. Initially, we apply a binary classifier to remove noisy images from the input query tube. Next, we use a key-pose detection-based query minimization technique. Finally, a hierarchical re-identification framework is proposed and used to rank the output tubes. Experiments with publicly available video re-identification datasets reveal that our framework outperforms existing methods, ranking the tubes with an average increase in CMC accuracy of 6-8% across multiple datasets. Our method also significantly reduces the number of false positives. A new video re-identification dataset, named the Tube-based Re-identification Video Dataset (TRiViD), has been prepared with the aim of helping the re-identification research community.
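The first stage of the pipeline above, filtering noisy frames out of a query tube with a binary classifier, can be sketched generically. The frame representation, the blur-score field, and the threshold are hypothetical stand-ins for whatever classifier the framework actually uses.

```python
from typing import Callable, List

def filter_query_tube(tube: List[dict],
                      is_noisy: Callable[[dict], bool]) -> List[dict]:
    """Drop frames that the binary classifier flags as noisy; the cleaned
    tube would then go on to key-pose query minimization and ranking."""
    return [frame for frame in tube if not is_noisy(frame)]

# Hypothetical usage: frames carry a precomputed blur score, and frames
# blurrier than a threshold are treated as noise.
tube = [{"id": 0, "blur": 0.1}, {"id": 1, "blur": 0.9}, {"id": 2, "blur": 0.2}]
clean = filter_query_tube(tube, is_noisy=lambda f: f["blur"] > 0.5)
```

Minimizing the query tube before matching is what lets the hierarchical framework run the expensive re-identification step on fewer frames.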
Rethinking Temporal Fusion for Video-based Person Re-identification on Semantic and Time Aspect
Recently, the research interest of person re-identification (ReID) has gradually turned to video-based methods, which acquire a person representation by aggregating the frame features of an entire video. However, existing video-based ReID methods do not consider the semantic differences among the outputs of different network stages, which potentially compromises the information richness of the person features. Furthermore, traditional methods ignore important relationships among frames, which causes information redundancy in fusion along the time axis. To address these issues, we propose a novel general temporal fusion framework that aggregates frame features on both the semantic aspect and the time aspect. For the semantic aspect, a multi-stage fusion network fuses richer frame features at multiple semantic levels, which effectively reduces the information loss caused by traditional single-stage fusion. For the time axis, the existing intra-frame attention method is improved by adding a novel inter-frame attention module, which effectively reduces information redundancy in temporal fusion by taking the relationships among frames into consideration. The experimental results show that our approach effectively improves video-based re-identification accuracy, achieving state-of-the-art performance.
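The inter-frame attention idea, weighting frames by their pairwise relationships before fusing along the time axis, can be sketched with a plain dot-product attention. This is an illustrative simplification under assumed shapes, not the module proposed in the paper.

```python
import numpy as np

def inter_frame_attention_fusion(frame_features: np.ndarray) -> np.ndarray:
    """Fuse frame features along the time axis using attention weights
    derived from pairwise frame similarity, so that highly similar
    (redundant) frames share weight instead of dominating the fusion.

    frame_features: (T, D) per-frame embeddings.
    """
    # Pairwise similarity between frames, shape (T, T).
    sim = frame_features @ frame_features.T
    # Row-wise softmax gives each frame's attention over all frames.
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    # Each frame attends to every frame; average the attended outputs.
    attended = attn @ frame_features            # (T, D)
    return attended.mean(axis=0)                # fused (D,) representation
```

If all frames were identical, the attention weights would be uniform and the fusion would reduce to a simple average; the attention only departs from averaging when frames actually differ.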