Recently developed discrete diffusion models perform extraordinarily well on the
text-to-image task, showing great promise for handling multimodal signals. In
this work, we harness these traits and present a unified multimodal generation
model that can handle both "modality translation" and "multi-modality
generation" tasks with a single model, performing text-based, image-based, and
even simultaneous vision-language generation. Specifically, we unify the
discrete diffusion process for
multimodal signals by proposing a unified transition matrix. Moreover, we
design a mutual attention module with a fused embedding layer and a unified
objective function to emphasise the inter-modal linkages, which are vital for
multi-modality generation. Extensive experiments indicate that our proposed
method performs comparably to state-of-the-art solutions on various generation
tasks.
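
The abstract does not spell out the exact form of the unified transition matrix, so the following is only a minimal sketch, under assumed details, of one common construction for discrete diffusion: a mask-and-replace (absorbing-state) transition defined over a single vocabulary that concatenates text tokens and image VQ tokens. The vocabulary sizes, noise rates, and the function name unified_transition_matrix are illustrative assumptions, not the paper's actual configuration.

import torch

# Sketch (assumed, not the paper's exact formulation): a mask-and-replace
# transition matrix over a unified vocabulary that concatenates text tokens,
# image VQ tokens, and a shared [MASK] state. Sizes are illustrative.
TEXT_V, IMG_V = 100, 200
V = TEXT_V + IMG_V + 1              # unified vocabulary; last index is [MASK]
MASK = V - 1

def unified_transition_matrix(alpha, gamma):
    # Q[i, j] = probability that token i moves to token j in one forward step.
    # alpha: keep probability; gamma: probability of absorbing into [MASK];
    # the remainder is spread uniformly within the token's own modality,
    # so text tokens never turn into image tokens and vice versa.
    Q = torch.zeros(V, V)
    for lo, hi in [(0, TEXT_V), (TEXT_V, TEXT_V + IMG_V)]:
        beta = (1.0 - alpha - gamma) / (hi - lo)
        Q[lo:hi, lo:hi] = beta
        idx = torch.arange(lo, hi)
        Q[idx, idx] = alpha + beta
        Q[lo:hi, MASK] = gamma
    Q[MASK, MASK] = 1.0             # [MASK] is an absorbing state
    return Q

Q1 = unified_transition_matrix(alpha=0.9, gamma=0.05)
x0 = torch.tensor([3, 150])         # one text token, one image token
x1 = torch.multinomial(Q1[x0], num_samples=1).squeeze(-1)   # sample q(x_1 | x_0)
assert torch.allclose(Q1.sum(dim=1), torch.ones(V))          # rows are distributions

Because both modalities share one transition matrix and one [MASK] state, the same forward corruption and reverse denoising machinery can be applied to a concatenated text-and-image token sequence, which is the property a unified model of this kind relies on.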