2,207 research outputs found
Deep Adaptive Feature Embedding with Local Sample Distributions for Person Re-identification
Person re-identification (re-id) aims to match pedestrians observed by
disjoint camera views. It attracts increasing attention in computer vision due
to its importance to surveillance system. To combat the major challenge of
cross-view visual variations, deep embedding approaches are proposed by
learning a compact feature space from images such that the Euclidean distances
correspond to their cross-view similarity metric. However, the global Euclidean
distance cannot faithfully characterize the ideal similarity in a complex
visual feature space because features of pedestrian images exhibit unknown
distributions due to large variations in poses, illumination and occlusion.
Moreover, intra-personal training samples within a local range are robust to
guide deep embedding against uncontrolled variations, which however, cannot be
captured by a global Euclidean distance. In this paper, we study the problem of
person re-id by proposing a novel sampling to mine suitable \textit{positives}
(i.e. intra-class) within a local range to improve the deep embedding in the
context of large intra-class variations. Our method is capable of learning a
deep similarity metric adaptive to local sample structure by minimizing each
sample's local distances while propagating through the relationship between
samples to attain the whole intra-class minimization. To this end, a novel
objective function is proposed to jointly optimize similarity metric learning,
local positive mining and robust deep embedding. This yields local
discriminations by selecting local-ranged positive samples, and the learned
features are robust to dramatic intra-class variations. Experiments on
benchmarks show state-of-the-art results achieved by our method.Comment: Published on Pattern Recognitio
SMAN : Stacked Multi-Modal Attention Network for cross-modal image-text retrieval
This article focuses on tackling the task of the cross-modal image-text retrieval which has been an interdisciplinary topic in both computer vision and natural language processing communities. Existing global representation alignment-based methods fail to pinpoint the semantically meaningful portion of images and texts, while the local representation alignment schemes suffer from the huge computational burden for aggregating the similarity of visual fragments and textual words exhaustively. In this article, we propose a stacked multimodal attention network (SMAN) that makes use of the stacked multimodal attention mechanism to exploit the fine-grained interdependencies between image and text, thereby mapping the aggregation of attentive fragments into a common space for measuring cross-modal similarity. Specifically, we sequentially employ intramodal information and multimodal information as guidance to perform multiple-step attention reasoning so that the fine-grained correlation between image and text can be modeled. As a consequence, we are capable of discovering the semantically meaningful visual regions or words in a sentence which contributes to measuring the cross-modal similarity in a more precise manner. Moreover, we present a novel bidirectional ranking loss that enforces the distance among pairwise multimodal instances to be closer. Doing so allows us to make full use of pairwise supervised information to preserve the manifold structure of heterogeneous pairwise data. Extensive experiments on two benchmark datasets demonstrate that our SMAN consistently yields competitive performance compared to state-of-the-art methods
- …