3 research outputs found
Revisiting Cross Modal Retrieval
This paper proposes a cross-modal retrieval system that leverages image and
text encoding. Most multimodal architectures employ separate networks for
each modality to capture the semantic relationship between them. In our work,
however, a fused image-text encoding achieves comparable cross-modal
retrieval results without requiring a separate network for each modality. We
show that text encodings can capture semantic relationships between multiple
modalities. To our knowledge, this work is the first of its kind to employ a
single network and a fused image-text embedding for cross-modal retrieval. We
evaluate our approach on two well-known multimodal datasets: MS-COCO and
Flickr30K.
Comment: 14 pages. Under review at ECCVW (MULA 2018)
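A minimal sketch of the single-network, fused-embedding idea, assuming
PyTorch; the encoder inputs, layer sizes, and the FusedEmbeddingNet name are
hypothetical, since the abstract does not specify the architecture:

```python
# Hypothetical sketch: one shared network maps pre-extracted image and text
# features to a fused joint embedding (all dimensions are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedEmbeddingNet(nn.Module):
    """A single projection network for both modalities: image and text
    features are concatenated, then mapped into one joint space."""
    def __init__(self, img_dim=2048, txt_dim=300, joint_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, joint_dim),
        )

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)  # unit-norm embedding

net = FusedEmbeddingNet()
img = torch.randn(4, 2048)  # e.g. pre-extracted CNN features (assumed)
txt = torch.randn(4, 300)   # e.g. pooled word vectors (assumed)
emb = net(img, txt)         # (4, 512) fused embeddings
scores = emb @ emb.t()      # cosine similarities used for ranking
```

How a single-modality query is fed through the shared network at retrieval
time is a detail of the paper that this sketch does not reproduce.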
Multitask Text-to-Visual Embedding with Titles and Clickthrough Data
Text-visual (also called semantic-visual) embedding is a central problem in
vision-language research. It typically involves mapping an image and a text
description to a common feature space through a CNN image encoder and an RNN
language encoder. In this paper, we propose a new method for learning a
text-visual embedding using both image titles and click-through data from an
image search engine. We also propose a new triplet loss function that models
positive awareness of the embedding, and introduce a novel mini-batch-based
hard negative sampling approach for better data efficiency during learning.
Experimental results show that our proposed method outperforms existing
methods and is also effective for real-world text-to-visual retrieval.
Comment: 4 pages. Language and Vision Workshop, in conjunction with CVPR 201
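The mini-batch-based hard negative sampling resembles the batch-hardest
strategy common in embedding learning; below is a sketch of that standard
variant in PyTorch. The margin value is an assumption, and the paper's
positive-aware weighting of the triplet loss is not reproduced here.

```python
# Sketch of a batch-hardest triplet loss for text-visual embedding.
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings; row i of each
    matrix is a matching image-text pair."""
    scores = img_emb @ txt_emb.t()  # (B, B) cosine similarities
    pos = scores.diag()             # similarity of each matching pair
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    # Mask the positives, then take the hardest negative per image and text.
    hardest_txt = scores.masked_fill(mask, -1.0).max(dim=1).values
    hardest_img = scores.masked_fill(mask, -1.0).max(dim=0).values
    loss_i2t = torch.clamp(margin + hardest_txt - pos, min=0)
    loss_t2i = torch.clamp(margin + hardest_img - pos, min=0)
    return (loss_i2t + loss_t2i).mean()

img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(batch_hard_triplet_loss(img, txt))
```

Mining negatives inside the mini-batch avoids an expensive offline search
over the whole training set, which is presumably the source of the data
efficiency the abstract claims.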
Learning Inward Scaled Hypersphere Embedding: Exploring Projections in Higher Dimensions
The majority of current dimensionality reduction and retrieval techniques
rely on embedding the learned feature representations onto a computable
metric space. Once the learned features are mapped, a distance metric bridges
the gap between similar instances. Since these methods do not exploit the
scaled projection, discriminative embedding onto a hyperspace remains a
challenge. In this paper, we propose to inwardly scale feature
representations in proportion to projecting them onto a hypersphere manifold
for discriminative analysis. We further propose a novel, yet simple,
convolutional neural network based architecture and extensively evaluate the
proposed methodology on classification and retrieval tasks, obtaining results
comparable to state-of-the-art techniques.
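One plausible reading of inward scaling is rescaling each feature vector onto
a fixed-radius hypersphere so that only angular information remains for the
discriminative analysis; the PyTorch sketch below shows that interpretation
(the function name and the radius are assumptions, not the paper's
formulation).

```python
# Hypothetical sketch: project features onto a hypersphere of fixed radius.
import torch
import torch.nn.functional as F

def hypersphere_project(features, radius=16.0):
    """Rescale each vector to lie on a hypersphere of the given radius;
    directions are preserved, so only angles carry discriminative signal."""
    return radius * F.normalize(features, p=2, dim=-1)

feats = torch.randn(8, 128) * 5.0  # unconstrained embeddings
on_sphere = hypersphere_project(feats)
print(on_sphere.norm(dim=-1))      # every norm equals the radius
```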