3 research outputs found
Debiasing Scores and Prompts of 2D Diffusion for Robust Text-to-3D Generation
The view inconsistency problem in score-distilling text-to-3D generation,
also known as the Janus problem, arises from the intrinsic bias of 2D diffusion
models, which leads to the unrealistic generation of 3D objects. In this work,
we explore score-distilling text-to-3D generation and identify the main causes
of the Janus problem. Based on these findings, we propose two approaches to
debias the score-distillation frameworks for robust text-to-3D generation. Our
first approach, called score debiasing, involves gradually increasing the
truncation value for the score estimated by 2D diffusion models throughout the
optimization process. Our second approach, called prompt debiasing, uses a
language model to identify words in the user prompt that conflict with the view
prompts, and adjusts the discrepancy between view prompts and object-space
camera poses. Our experimental results show that our methods improve realism by
significantly reducing artifacts and achieve a good trade-off between
faithfulness to the 2D diffusion models and 3D consistency, with little
overhead.
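As a rough illustration of the score-debiasing idea, the sketch below clips the 2D diffusion score residual used in score distillation to a truncation bound that grows over the optimization. The linear schedule, the bounds, and the function names are assumptions made here for illustration, not the authors' implementation.

```python
import torch

def truncation_value(step: int, total_steps: int,
                     c_min: float = 0.5, c_max: float = 2.0) -> float:
    """Linearly grow the truncation bound from c_min to c_max (assumed schedule)."""
    t = min(step / max(total_steps - 1, 1), 1.0)
    return c_min + t * (c_max - c_min)

def debiased_sds_gradient(noise_pred: torch.Tensor, noise: torch.Tensor,
                          weight: torch.Tensor, step: int,
                          total_steps: int) -> torch.Tensor:
    """Score-distillation gradient with the score residual clipped (debiased)."""
    c = truncation_value(step, total_steps)
    residual = noise_pred - noise          # standard SDS residual (estimated score minus injected noise)
    residual = residual.clamp(-c, c)       # truncate extreme, view-biased score values
    return weight * residual               # gradient w.r.t. the rendered image
```

The intuition is that a small bound early in optimization suppresses view-biased outliers in the 2D score while the coarse geometry forms, and relaxing the bound later lets finer detail through.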
Improving Sample Quality of Diffusion Models Using Self-Attention Guidance
Following generative adversarial networks (GANs), the de facto standard model
for image generation, denoising diffusion models (DDMs) have been actively
researched and have attracted strong attention for their ability to generate
images with high quality and diversity. However, how the internal
self-attention mechanism works inside the UNet of DDMs remains under-explored.
To shed light on it, in this paper, we first investigate the self-attention
operations within black-box diffusion models and build hypotheses. Next, we
verify the hypotheses about the self-attention map by conducting frequency
analysis and testing its relationship with the generated objects. As a result,
we find that the attention map is closely related to the quality of generated
images. Meanwhile, diffusion guidance methods based on additional information
such as labels have been proposed to improve the quality of generated images.
Inspired by these methods, we present label-free guidance based on the
intermediate self-attention map that can guide existing pretrained diffusion
models to generate images with higher fidelity. In addition to the enhanced
sample quality when used alone, we show that the results are further improved
by combining our method with classifier guidance on ImageNet 128x128.
Comment: Project Page: https://ku-cvlab.github.io/Self-Attention-Guidanc
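The label-free guidance described above can be pictured with the following sketch: the model's noise prediction on the original latent is pushed away from its prediction on a degraded latent whose highly-attended regions have been blurred. The Gaussian blur, the mean-threshold mask, the guidance scale, and the `model(x, t)` / `attn_map` interfaces are assumptions for illustration, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(x: torch.Tensor, kernel_size: int = 9, sigma: float = 1.0) -> torch.Tensor:
    """Depthwise Gaussian blur used to degrade salient regions."""
    coords = torch.arange(kernel_size, dtype=x.dtype, device=x.device) - kernel_size // 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = (g / g.sum()).view(1, 1, -1)
    k2d = (g.transpose(1, 2) @ g).expand(x.shape[1], 1, kernel_size, kernel_size)
    return F.conv2d(x, k2d, padding=kernel_size // 2, groups=x.shape[1])

def attention_guided_eps(model, x_t, t, attn_map, scale: float = 0.8):
    """Label-free guidance from an intermediate self-attention map.

    attn_map: per-pixel saliency derived from a self-attention map, resized to
    x_t's spatial shape (e.g. N x 1 x H x W) and normalized to [0, 1].
    """
    mask = (attn_map > attn_map.mean()).float()              # keep highly-attended regions
    x_degraded = mask * gaussian_blur(x_t) + (1 - mask) * x_t
    eps = model(x_t, t)                                      # prediction on original latent
    eps_degraded = model(x_degraded, t)                      # prediction on degraded latent
    return eps + scale * (eps - eps_degraded)                # guide away from the degraded prediction
```

The structure mirrors classifier-free guidance, but the "conditioning" signal comes from the model's own attention map rather than a label.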
Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation
In the paradigm of AI-generated content (AIGC), there has been increasing
interest in extending pre-trained text-to-image (T2I) models to text-to-video
(T2V) generation. Despite their effectiveness, these frameworks face challenges
in maintaining consistent narratives and handling rapid shifts in scene
composition or object placement from a single user prompt. This paper
introduces a new framework, dubbed DirecT2V, which leverages instruction-tuned
large language models (LLMs) to generate frame-by-frame descriptions from a
single abstract user prompt. DirecT2V utilizes LLM directors to divide user
inputs into separate prompts for each frame, enabling the inclusion of
time-varying content and facilitating consistent video generation. To maintain
temporal consistency and prevent object collapse, we propose a novel value
mapping method and dual-softmax filtering. Extensive experimental results
validate the effectiveness of the DirecT2V framework in producing visually
coherent and consistent videos from abstract user prompts, addressing the
challenges of zero-shot video generation.
Comment: The code and demo will be available at
https://github.com/KU-CVLAB/DirecT2
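To make the frame-level "director" idea concrete, the sketch below expands one abstract user prompt into per-frame descriptions via an instruction-tuned LLM, which are then fed one by one to a frozen T2I model. The prompt template, the `llm` callable, and the parsing logic are assumptions for illustration only, not the DirecT2V implementation.

```python
from typing import Callable, List

DIRECTOR_TEMPLATE = (
    "You are a video director. Expand the following abstract prompt into {n} "
    "numbered frame descriptions, one per line, keeping characters and scene "
    "attributes consistent while describing how the scene evolves over time.\n"
    "Prompt: {prompt}"
)

def direct_frames(prompt: str, n_frames: int, llm: Callable[[str], str]) -> List[str]:
    """Ask the LLM director for one description per frame and parse its reply."""
    reply = llm(DIRECTOR_TEMPLATE.format(n=n_frames, prompt=prompt))
    frames = []
    for line in reply.splitlines():
        line = line.strip()
        if not line:
            continue
        # Strip a leading "1.", "2)", etc. if the LLM numbered its lines.
        frames.append(line.lstrip("0123456789.) ").strip())
    return frames[:n_frames]
```

Each returned description becomes the prompt for its frame, while mechanisms such as the paper's value mapping and dual-softmax filtering (not shown here) keep attention to shared objects consistent across frames.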