2 research outputs found
Continuous Adaptation of Multi-Camera Person Identification Models through Sparse Non-redundant Representative Selection
The problem of image-based person identification/recognition is to assign an
identity to the image of an individual based on learned models that describe
his/her appearance. Most traditional person identification systems rely on
learning a static model from tediously labeled training data. Although manual
labeling is an indispensable part of a supervised framework, labeling a huge
amount of data is a significant overhead for a large-scale identification
system. For the large multi-sensor data typically encountered in camera
networks, labeling more samples does not always mean more information, as
redundant images get labeled several times. In this work, we propose a
convex-optimization-based iterative framework that progressively and
judiciously chooses a sparse but informative set of samples for labeling, with
minimal overlap with previously labeled images. We also use a
structure-preserving sparse-reconstruction-based classifier to reduce the
training burden typically seen in discriminative classifiers. The two-stage
approach leads to a novel framework for online updating of the classifiers
that involves only the incorporation of newly labeled data rather than any
expensive training phase. We demonstrate the effectiveness of our approach on
multi-camera person re-identification datasets, showing the feasibility of
learning online classification models in multi-camera big-data applications.
Using three benchmark datasets, we validate our approach and show that our
framework achieves superior performance with significantly less manual
labeling.
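The abstract only names the classifier at a high level; the sketch below is a
minimal Python illustration, not the authors' implementation, of a
sparse-reconstruction-based classifier whose "training" is just appending newly
labeled samples to a dictionary, which is the property that makes the online
update cheap. The structure-preserving term of the paper is not reproduced
here; a plain l1-regularized coder (scikit-learn's Lasso) and the `alpha` value
are assumptions.

```python
# Sketch of an SRC-style classifier: test samples are sparse-coded over a
# dictionary of labeled samples and assigned to the class whose columns give
# the smallest reconstruction residual. Online update = append new columns.
import numpy as np
from sklearn.linear_model import Lasso


class SparseReconstructionClassifier:
    def __init__(self, alpha=0.01):          # alpha is an assumed value
        self.alpha = alpha
        self.D = None        # dictionary: one column per labeled sample
        self.labels = None   # class label of each dictionary column

    def add_labeled(self, X, y):
        """Online update: append newly labeled samples as dictionary columns;
        no discriminative model is retrained."""
        X = np.atleast_2d(X)                  # (n_samples, feat_dim)
        cols = X.T                            # columns = samples
        self.D = cols if self.D is None else np.hstack([self.D, cols])
        y = np.asarray(y).ravel()
        self.labels = y if self.labels is None else np.concatenate([self.labels, y])

    def predict(self, x):
        """Sparse-code x over the dictionary, then pick the class whose
        columns best reconstruct it."""
        coder = Lasso(alpha=self.alpha, fit_intercept=False, max_iter=5000)
        coder.fit(self.D, x)                  # solves min ||x - D c||^2 + alpha||c||_1
        code = coder.coef_
        best, best_res = None, np.inf
        for c in np.unique(self.labels):
            mask = self.labels == c
            res = np.linalg.norm(x - self.D[:, mask] @ code[mask])
            if res < best_res:
                best, best_res = c, res
        return best
```

Because prediction only reads the current dictionary, incorporating a freshly
labeled batch is a constant-cost append rather than a retraining pass.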
Where-and-When to Look: Deep Siamese Attention Networks for Video-based Person Re-identification
Video-based person re-identification (re-id) is a central task in surveillance
systems, with significant implications for security. Matching persons
across disjoint camera views in their video fragments is inherently challenging
due to large visual variations and uncontrolled frame rates. Two steps are
crucial to person re-id: discriminative feature learning and metric learning.
However, existing approaches treat these two steps independently and do not
make full use of the temporal and spatial information in videos. In this
paper, we propose a Siamese attention
architecture that jointly learns spatiotemporal video representations and their
similarity metrics. The network extracts local convolutional features from
regions of each frame and enhances their discriminative capability by focusing
on distinct regions when measuring the similarity with another pedestrian
video. The attention mechanism is embedded into spatial gated recurrent units
to selectively propagate relevant features and memorize their spatial
dependencies through the network. The model essentially learns which parts
(\emph{where}) of which frames (\emph{when}) are relevant and distinctive for
matching persons, and attaches higher importance to them. The proposed Siamese
model is end-to-end trainable to jointly learn comparable hidden
representations for paired pedestrian videos and their similarity value.
Extensive experiments on three benchmark datasets show the effectiveness of
each component of the proposed deep network, which outperforms
state-of-the-art methods.
Comment: Appearing in IEEE Transactions on Multimedia. arXiv admin note: text
overlap with arXiv:1606.0160
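As a rough illustration of the architecture this abstract describes, the
PyTorch snippet below (not the authors' code) wires together per-frame
convolutional region features, soft attention over regions (\emph{where}), a
recurrent layer over frames (\emph{when}), and a similarity score for a video
pair. The paper's spatial gated recurrent units and its training objective are
not reproduced; all layer sizes are assumptions.

```python
# Minimal Siamese video-matching sketch: CNN region features per frame,
# attention pooling over regions, GRU over frames, cosine similarity of the
# two video embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiameseVideoMatcher(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=256):   # assumed sizes
        super().__init__()
        # Small CNN producing a grid of region features per frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = nn.Linear(feat_dim, 1)     # scores each spatial region
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def encode(self, video):
        # video: (batch, time, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))            # (b*t, C, h, w)
        regions = feats.flatten(2).transpose(1, 2)       # (b*t, h*w, C)
        weights = F.softmax(self.attn(regions), dim=1)   # attend over regions ("where")
        frame_vec = (weights * regions).sum(dim=1)       # (b*t, C)
        _, h = self.gru(frame_vec.view(b, t, -1))        # aggregate over frames ("when")
        return h[-1]                                     # (b, hidden_dim)

    def forward(self, video_a, video_b):
        ha, hb = self.encode(video_a), self.encode(video_b)
        return F.cosine_similarity(ha, hb)               # similarity in [-1, 1]
```

Both branches share the same weights, so the pair of videos is embedded into a
comparable space and the similarity score can be trained end to end with any
pairwise ranking or verification loss.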