C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual
  Text-Video Retrieval

Chuang, Yung-Sung; Feris, Rogerio; Glass, James; Harwath, David; Karlinsky, Leonid; Kingsbury, Brian; Kuehne, Hilde; Rouditchenko, Andrew; Shvetsova, Nina; Thomas, Samuel

Authors: Yung-Sung Chuang
Rogerio Feris
James Glass
David Harwath
Leonid Karlinsky
Brian Kingsbury
Hilde Kuehne
Andrew Rouditchenko
Nina Shvetsova
Samuel Thomas
Publication date: 7 October 2022

Abstract

Multilingual text-video retrieval methods have improved significantly in recent years, but performance in other languages still lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English. We propose a cross-entropy-based objective which forces the distribution over the student's text-video similarity scores to be similar to that of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets such as Multi-MSRVTT and VATEX. We also conduct an analysis of the effectiveness of different multilingual text models as teachers.
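
The objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the temperature parameter, and the use of a single teacher similarity matrix are assumptions made for illustration (the abstract mentions several teacher models but does not say how their predictions are combined). Each row holds the similarity scores between one caption and every video in a batch; the student sees the non-English caption, the teacher sees the English one.

```python
import math


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over one row of similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def distillation_loss(student_sim, teacher_sim, temperature=1.0):
    """Cross entropy between teacher and student text-to-video distributions.

    student_sim, teacher_sim: lists of rows; row i holds the similarity of
    caption i to each video in the batch. `temperature` is a hypothetical
    smoothing parameter, not taken from the source.
    """
    total = 0.0
    for s_row, t_row in zip(student_sim, teacher_sim):
        # Teacher scores become the soft target distribution over videos.
        p_teacher = [math.exp(v) for v in log_softmax(t_row, temperature)]
        log_p_student = log_softmax(s_row, temperature)
        total -= sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))
    return total / len(student_sim)


# Toy batch of 3 captions and 3 videos (made-up numbers).
teacher = [[5.0, 1.0, 0.0], [0.5, 4.0, 1.0], [0.0, 1.0, 3.0]]
uninformed_student = [[1.0, 1.0, 1.0]] * 3

loss_uninformed = distillation_loss(uninformed_student, teacher)
loss_matched = distillation_loss(teacher, teacher)
```

In this sketch the loss is smallest when the student's distribution over videos equals the teacher's, so `loss_matched` is lower than `loss_uninformed`; a student that scores every video equally incurs a loss of log(3) on a batch of three videos.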

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2210.03625

Last time updated on 24/11/2022