Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck
Deep generative models have led to significant advances in cross-modal
generation such as text-to-image synthesis. Training these models typically
requires paired data with direct correspondence between modalities. We
introduce the novel problem of translating instances from one modality to
another without paired data by leveraging an intermediate modality shared by
the two other modalities. To demonstrate this, we take the problem of
translating images to speech. In this case, one could leverage disjoint
datasets with one shared modality, e.g., image-text pairs and text-speech
pairs, with text as the shared modality. We call this problem "skip-modal
generation" because the shared modality is skipped during the generation
process. We propose a multimodal information bottleneck approach that learns
the correspondence between modalities from unpaired data (image and speech) by
leveraging the shared modality (text). We address fundamental challenges of
skip-modal generation: 1) learning multimodal representations using a single
model, 2) bridging the domain gap between two unrelated datasets, and 3)
learning the correspondence between modalities from unpaired data. We show
qualitative results on image-to-speech synthesis; this is the first time such
results have been reported in the literature. We also show that our approach
improves performance on traditional cross-modal generation, suggesting that it
improves data efficiency in solving individual tasks.
Comment: ICCV 2019
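To make the skip-modal setup concrete, the sketch below shows one plausible way to wire up training on two disjoint datasets (image-text pairs and text-speech pairs) with a shared latent space and an information-bottleneck penalty. This is not the paper's actual architecture: the encoders, decoders, feature dimensions, MSE alignment losses, and the `beta` weight are all illustrative assumptions; real image, text, and speech features would replace the random tensors used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical feature dimensions; the paper does not specify these values.
IMG_DIM, TXT_DIM, SPK_DIM, LATENT = 512, 300, 128, 64

class GaussianEncoder(nn.Module):
    """Maps a modality feature vector to a Gaussian over the shared latent space."""
    def __init__(self, in_dim, latent):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent)
        self.logvar = nn.Linear(in_dim, latent)
    def forward(self, x):
        return self.mu(x), self.logvar(x)

def sample(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def kl_to_prior(mu, logvar):
    # KL(q(z|x) || N(0, I)): the bottleneck term that compresses every
    # posterior toward one common prior, so both datasets share one space.
    return 0.5 * torch.sum(mu ** 2 + logvar.exp() - logvar - 1, dim=-1).mean()

# One encoder per modality, all targeting the same latent space;
# simple linear decoders stand in for real image/speech generators.
enc_img = GaussianEncoder(IMG_DIM, LATENT)
enc_txt = GaussianEncoder(TXT_DIM, LATENT)
enc_spk = GaussianEncoder(SPK_DIM, LATENT)
dec_img = nn.Linear(LATENT, IMG_DIM)
dec_spk = nn.Linear(LATENT, SPK_DIM)

modules = (enc_img, enc_txt, enc_spk, dec_img, dec_spk)
opt = torch.optim.Adam([p for m in modules for p in m.parameters()], lr=1e-4)
beta = 1e-3  # bottleneck weight (hypothetical value)

def training_step(img, txt_a, txt_b, spk):
    """One update from disjoint batches: (img, txt_a) and (txt_b, spk)."""
    mu_i, lv_i = enc_img(img)
    mu_ta, lv_ta = enc_txt(txt_a)
    mu_tb, lv_tb = enc_txt(txt_b)
    mu_s, lv_s = enc_spk(spk)
    # Pull each paired modality toward its text partner; text acts as
    # the bridge because the same text encoder serves both datasets.
    align = F.mse_loss(mu_i, mu_ta) + F.mse_loss(mu_s, mu_tb)
    # Reconstruct each non-text modality through the shared code.
    recon = (F.mse_loss(dec_img(sample(mu_ta, lv_ta)), img)
             + F.mse_loss(dec_spk(sample(mu_tb, lv_tb)), spk))
    ib = (kl_to_prior(mu_i, lv_i) + kl_to_prior(mu_ta, lv_ta)
          + kl_to_prior(mu_tb, lv_tb) + kl_to_prior(mu_s, lv_s))
    loss = align + recon + beta * ib
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Dummy batches standing in for the two disjoint datasets.
training_step(torch.randn(8, IMG_DIM), torch.randn(8, TXT_DIM),
              torch.randn(8, TXT_DIM), torch.randn(8, SPK_DIM))

# Skip-modal inference: image -> shared latent -> speech, skipping text.
with torch.no_grad():
    mu, _ = enc_img(torch.randn(1, IMG_DIM))
    speech_feats = dec_spk(mu)
```

The key design point the sketch tries to capture is that text never appears at inference time: it only anchors the two datasets to a common latent space during training, and the bottleneck penalty keeps that space compact enough for the image and speech posteriors to overlap.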