As a combination of visual and audio signals, video is inherently
multi-modal. However, existing video generation methods primarily synthesize visual frames and disregard the audio signals that accompany realistic videos. In this work, we concentrate on the rarely investigated problem of text-guided sounding video generation and propose the Sounding Video
Generator (SVG), a unified framework for generating realistic videos along with
audio signals. Specifically, we present SVG-VQGAN, which transforms visual frames and audio mel-spectrograms into discrete tokens.
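To make the tokenization step concrete, the following is a minimal PyTorch sketch of vector quantization; the class name, codebook size, and feature shapes are illustrative assumptions, not the paper's actual SVG-VQGAN modules.

```python
# Minimal vector-quantization sketch (assumed names and shapes): continuous
# encoder features are mapped to their nearest codebook entries, yielding
# discrete token ids for both visual frames and mel-spectrograms.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (batch, length, dim) continuous features from a visual or audio encoder
        codes = self.codebook.weight                     # (num_codes, dim)
        # squared Euclidean distance to every codebook entry
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ codes.t()
                + codes.pow(2).sum(-1))                  # (batch, length, num_codes)
        tokens = dist.argmin(dim=-1)                     # discrete token ids
        z_q = self.codebook(tokens)                      # quantized features
        z_q = z + (z_q - z).detach()                     # straight-through estimator
        return tokens, z_q

vq = VectorQuantizer()
frame_tokens, _ = vq(torch.randn(2, 64, 256))  # hypothetical frame features
audio_tokens, _ = vq(torch.randn(2, 32, 256))  # hypothetical mel-spectrogram features
```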
SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and to improve the quantized representations. A cross-modal attention module extracts associated features of visual frames and audio signals for contrastive learning.
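As a hedged illustration of how such a hybrid objective can be assembled, the sketch below combines a cross-modal attention step with inter-modal and intra-modal InfoNCE terms; the pooling, temperature, and intra-modal pairing are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a hybrid contrastive objective: InfoNCE applied across
# modalities (video vs. audio) and within each modality. Pooling, temperature,
# and the intra-modal pairing are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    # a, b: (batch, dim); matched rows are positive pairs
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

def hybrid_contrastive_loss(video_feats, audio_feats):
    # video_feats, audio_feats: (batch, length, 256) encoder features
    v_ctx, _ = cross_attn(video_feats, audio_feats, audio_feats)  # video attends to audio
    a_ctx, _ = cross_attn(audio_feats, video_feats, video_feats)  # audio attends to video
    v, a = v_ctx.mean(dim=1), a_ctx.mean(dim=1)                   # pooled embeddings
    inter = info_nce(v, a)                                        # inter-modal consistency
    intra = (info_nce(v, video_feats.mean(dim=1))                 # intra-modal consistency
             + info_nce(a, audio_feats.mean(dim=1)))
    return inter + intra
```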
A Transformer-based decoder then models associations among texts, visual frames, and audio signals at the token level for auto-regressive sounding video generation.
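Generation then reduces to a standard decoder-only rollout over the token sequence. The sketch below assumes a model that maps a token sequence to next-token logits; the sampling scheme and sequence layout are assumptions, not the paper's confirmed procedure.

```python
# Sketch of token-level auto-regressive generation: a text prefix conditions
# the decoder, which emits visual and audio tokens one at a time. `model` is
# assumed to return (batch, length, vocab) logits.
import torch

@torch.no_grad()
def generate(model, text_tokens, num_av_tokens, temperature=1.0):
    seq = text_tokens                                  # (1, T) conditioning prefix
    for _ in range(num_av_tokens):
        logits = model(seq)[:, -1, :] / temperature    # next-token distribution
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)  # sample one token
        seq = torch.cat([seq, nxt], dim=1)
    return seq  # visual/audio tokens are decoded back by the VQGAN decoders
```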
AudioSetCap, a human-annotated text-video-audio paired dataset, is produced for training SVG.
Experimental results demonstrate the superiority of our method over existing text-to-video generation methods as well as audio generation methods on the Kinetics and VAS datasets.