Visual-semantic embedding aims to learn a joint embedding space where related
video and sentence instances are located close to each other. Most existing
methods embed instances into a single embedding space. However, they struggle to
embed instances well because it is difficult to match the visual dynamics of videos
to the textual features of sentences, and a single space is often insufficient to
accommodate the wide variety of videos and sentences. In this paper, we propose a novel framework that
maps instances into multiple individual embedding spaces so that we can capture
multiple relationships between instances, leading to compelling video
retrieval. The final similarity between instances is produced by fusing the
similarities measured in the individual embedding spaces with a weighted sum,
where the weights are determined from the sentence. This allows the model to
flexibly emphasize particular embedding spaces. We conducted sentence-to-video retrieval
experiments on a benchmark dataset. The proposed method achieved superior
performance, and the results are competitive with state-of-the-art methods. These
experimental results demonstrated the effectiveness of the proposed multiple
embedding approach compared to existing methods.
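
For concreteness, the following is a minimal sketch of the sentence-conditioned weighted-sum fusion described above. The choice of cosine similarity, the softmax over weight logits, and all names (fused_similarity, weight_logits, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def fused_similarity(video_embs, sent_embs, weight_logits):
    """Fuse per-space similarities with sentence-conditioned weights.

    video_embs, sent_embs : lists of K vectors, one per embedding space
                            (hypothetical outputs of K projection heads).
    weight_logits         : length-K vector predicted from the sentence,
                            turned into fusion weights via softmax.
    """
    weights = softmax(weight_logits)                           # (K,)
    sims = np.array([cosine_similarity(v, s)
                     for v, s in zip(video_embs, sent_embs)])  # (K,)
    return float(weights @ sims)                               # weighted-sum fusion

# Toy example with K = 3 embedding spaces of dimension 4.
rng = np.random.default_rng(0)
video_embs = [rng.standard_normal(4) for _ in range(3)]
sent_embs = [rng.standard_normal(4) for _ in range(3)]
weight_logits = rng.standard_normal(3)  # in practice, predicted from the sentence
print(fused_similarity(video_embs, sent_embs, weight_logits))
```

Because the weights depend on the sentence, different queries can emphasize different embedding spaces when the per-space similarities are combined.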