DNN-based cross-modal retrieval has become a research hotspot: it lets users retrieve results across different modalities, such as image and text. However, existing methods focus mainly on the pairwise correlation and reconstruction error of labeled data; they ignore the semantically similar and dissimilar constraints between different modalities and cannot take advantage of unlabeled data. This paper proposes Cross-modal Deep Metric Learning with
Multi-task Regularization (CDMLMR), which integrates quadruplet ranking loss
and semi-supervised contrastive loss for modeling cross-modal semantic
similarity in a unified multi-task learning architecture. The quadruplet ranking loss models the semantically similar and dissimilar constraints to preserve cross-modal relative similarity ranking information, as illustrated below.
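For intuition, one common margin-based instantiation of a cross-modal quadruplet ranking loss (an illustrative form only; the paper's exact formulation may differ) takes an image $i$, a semantically similar text $t^{+}$, a dissimilar text $t^{-}$, and a dissimilar image $i^{-}$, with modality embeddings $f(\cdot)$, $g(\cdot)$ and a distance $d$:
$$
\mathcal{L}_{\mathrm{quad}} = \max\bigl(0,\; m_1 + d(f(i), g(t^{+})) - d(f(i), g(t^{-}))\bigr) + \max\bigl(0,\; m_2 + d(f(i), g(t^{+})) - d(f(i^{-}), g(t^{-}))\bigr),
$$
where $m_1, m_2$ are margins. The first term ranks a similar text above a dissimilar one for the same image; the second enforces a margin even across different anchors, which is what distinguishes a quadruplet from a triplet loss.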
The semi-supervised contrastive loss maximizes the semantic similarity on both labeled and unlabeled data; a sketch follows.
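One plausible form (again illustrative; the pseudo-label $\hat{y}$ and its use on unlabeled pairs are assumptions, not necessarily the paper's exact design) extends the standard contrastive loss of Hadsell et al. to unlabeled pairs:
$$
\mathcal{L}_{\mathrm{contr}} = y\, d(f(i), g(t))^{2} + (1 - y)\, \max\bigl(0,\; m - d(f(i), g(t))\bigr)^{2},
$$
where $y \in \{0, 1\}$ is the ground-truth similarity of a labeled pair and is replaced by a pseudo-label $\hat{y}$ estimated by the current network for an unlabeled pair, so that similar pairs are pulled together and dissimilar pairs are pushed beyond the margin $m$ in both regimes.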
Compared to the existing methods, CDMLMR exploits not only the similarity ranking information but also unlabeled cross-modal data, and thus boosts cross-modal retrieval accuracy.
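Under the above assumptions, the unified multi-task objective could simply combine the two tasks with a trade-off weight $\lambda$ (a hypothetical hyper-parameter introduced here for illustration):
$$
\mathcal{L} = \mathcal{L}_{\mathrm{quad}} + \lambda\, \mathcal{L}_{\mathrm{contr}}.
$$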
To appear in the Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2017), July 10-14, 2017, Hong Kong.