Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings

Zhu, Yuanzhi; Wang, Xi; Lathuilière, Stéphane; Kalogeiton, Vicky

conference paper

oai:HAL:hal-05566623v1

Authors: Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton
Publication date: 23 April 2026
Publisher: 'Centre pour la Communication Scientifique Directe (CCSD)'

Abstract

One-step generators distilled from Masked Diffusion Models (MDMs) compress multiple sampling steps into a single forward pass, enabling efficient text and image synthesis. However, they suffer from two key limitations: they inherit modeling bias from the teacher, and their discrete token outputs block gradient flow, preventing post-distillation refinements such as adversarial training, reward-based fine-tuning, and Test-Time Embedding Optimization (TTEO). In this work, we introduce soft embeddings, a simple relaxation that replaces discrete tokens with the expected embeddings under the generator's output distribution. Soft embeddings preserve representation fidelity for one-step discrete generators while providing a fully differentiable continuous surrogate that is compatible with teacher backbones and tokenizer decoders. Integrating soft embeddings into the Di[M]O distillation framework (denoted Soft-Di[M]O) makes one-step generators end-to-end trainable and enables straightforward application of GAN-based refinement, differentiable reward fine-tuning, and TTEO. Empirically, across multiple MDM teachers (e.g., MaskBit, MaskGen), Soft-Di[M]O achieves state-of-the-art one-step results: improved class-to-image performance, a one-step FID of 1.56 on ImageNet-256 with GAN-based refinement, higher GenEval and HPS scores on text-to-image with reward fine-tuning, and further gains from TTEO.
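
The soft-embedding relaxation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract states only that discrete tokens are replaced by the expected embeddings under the generator's output distribution. The softmax over logits, the toy codebook values, and the function names `soft_embedding` and `hard_embedding` are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_embedding(logits, codebook):
    """Expected embedding: sum_k p_k * codebook[k], with p = softmax(logits).

    Assumption: the output distribution is a softmax over per-token logits.
    """
    probs = softmax(logits)
    dim = len(codebook[0])
    return [sum(p * row[d] for p, row in zip(probs, codebook)) for d in range(dim)]

def hard_embedding(logits, codebook):
    """Discrete lookup of the most likely token (argmax is not differentiable)."""
    k = max(range(len(logits)), key=lambda i: logits[i])
    return list(codebook[k])

# Toy codebook: 3 tokens with 2-dimensional embeddings (hypothetical values).
CODEBOOK = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform = soft_embedding([0.0, 0.0, 0.0], CODEBOOK)  # equal weight on every token
peaked = soft_embedding([0.0, 50.0, 0.0], CODEBOOK)  # almost all mass on token 1
```

In this toy setting, a sharply peaked distribution gives a soft embedding that is numerically close to the hard lookup of the most likely token, while the soft version remains a smooth function of the logits. That smoothness is the property that lets gradients from a downstream loss reach the generator.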

Last updated on 27/03/2026

This paper was published in Portail HAL de Télécom Paris.

Licence: info:eu-repo/semantics/OpenAccess