806,088 research outputs found
SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models
Diffusion models, which have emerged to become popular text-to-image
generation models, can produce high-quality and content-rich images guided by
textual prompts. However, there are limitations to semantic understanding and
commonsense reasoning in existing models when the input prompts are concise
narrative, resulting in low-quality image generation. To improve the capacities
for narrative prompts, we propose a simple-yet-effective parameter-efficient
fine-tuning approach called the Semantic Understanding and Reasoning adapter
(SUR-adapter) for pre-trained diffusion models. To reach this goal, we first
collect and annotate a new dataset SURD which consists of more than 57,000
semantically corrected multi-modal samples. Each sample contains a simple
narrative prompt, a complex keyword-based prompt, and a high-quality image.
Then, we align the semantic representation of narrative prompts to the complex
prompts and transfer knowledge of large language models (LLMs) to our
SUR-adapter via knowledge distillation so that it can acquire the powerful
semantic understanding and reasoning capabilities to build a high-quality
textual semantic representation for text-to-image generation. We conduct
experiments by integrating multiple LLMs and popular pre-trained diffusion
models to show the effectiveness of our approach in enabling diffusion models
to understand and reason concise natural language without image quality
degradation. Our approach can make text-to-image diffusion models easier to use
with better user experience, which demonstrates our approach has the potential
for further advancing the development of user-friendly text-to-image generation
models by bridging the semantic gap between simple narrative prompts and
complex keyword-based prompts. The code is released at
https://github.com/Qrange-group/SUR-adapter.Comment: accepted by ACM MM 202
MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation
Recent advances in text-to-image generation with diffusion models present
transformative capabilities in image quality. However, user controllability of
the generated image, and fast adaptation to new tasks still remains an open
challenge, currently mostly addressed by costly and long re-training and
fine-tuning or ad-hoc adaptations to specific image generation tasks. In this
work, we present MultiDiffusion, a unified framework that enables versatile and
controllable image generation, using a pre-trained text-to-image diffusion
model, without any further training or finetuning. At the center of our
approach is a new generation process, based on an optimization task that binds
together multiple diffusion generation processes with a shared set of
parameters or constraints. We show that MultiDiffusion can be readily applied
to generate high quality and diverse images that adhere to user-provided
controls, such as desired aspect ratio (e.g., panorama), and spatial guiding
signals, ranging from tight segmentation masks to bounding boxes. Project
webpage: https://multidiffusion.github.i
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
Despite the stunning ability to generate high-quality images by recent
text-to-image models, current approaches often struggle to effectively compose
objects with different attributes and relationships into a complex and coherent
scene. We propose T2I-CompBench, a comprehensive benchmark for open-world
compositional text-to-image generation, consisting of 6,000 compositional text
prompts from 3 categories (attribute binding, object relationships, and complex
compositions) and 6 sub-categories (color binding, shape binding, texture
binding, spatial relationships, non-spatial relationships, and complex
compositions). We further propose several evaluation metrics specifically
designed to evaluate compositional text-to-image generation. We introduce a new
approach, Generative mOdel fine-tuning with Reward-driven Sample selection
(GORS), to boost the compositional text-to-image generation abilities of
pretrained text-to-image models. Extensive experiments and evaluations are
conducted to benchmark previous methods on T2I-CompBench, and to validate the
effectiveness of our proposed evaluation metrics and GORS approach. Project
page is available at https://karine-h.github.io/T2I-CompBench/.Comment: Project page: https://karine-h.github.io/T2I-CompBench
Conditional Generation from Unconditional Diffusion Models using Denoiser Representations
Denoising diffusion models have gained popularity as a generative modeling
technique for producing high-quality and diverse images. Applying these models
to downstream tasks requires conditioning, which can take the form of text,
class labels, or other forms of guidance. However, providing conditioning
information to these models can be challenging, particularly when annotations
are scarce or imprecise. In this paper, we propose adapting pre-trained
unconditional diffusion models to new conditions using the learned internal
representations of the denoiser network. We demonstrate the effectiveness of
our approach on various conditional generation tasks, including
attribute-conditioned generation and mask-conditioned generation. Additionally,
we show that augmenting the Tiny ImageNet training set with synthetic images
generated by our approach improves the classification accuracy of ResNet
baselines by up to 8%. Our approach provides a powerful and flexible way to
adapt diffusion models to new conditions and generate high-quality augmented
data for various conditional generation tasks
Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models
The revolution of artificial intelligence content generation has been rapidly
accelerated with the booming text-to-image (T2I) diffusion models. Within just
two years of development, it was unprecedentedly of high-quality, diversity,
and creativity that the state-of-the-art models could generate. However, a
prevalent limitation persists in the effective communication with these popular
T2I models, such as Stable Diffusion, using natural language descriptions. This
typically makes an engaging image hard to obtain without expertise in prompt
engineering with complex word compositions, magic tags, and annotations.
Inspired by the recently released DALLE3 - a T2I model directly built-in
ChatGPT that talks human language, we revisit the existing T2I systems
endeavoring to align human intent and introduce a new task - interactive text
to image (iT2I), where people can interact with LLM for interleaved
high-quality image generation/edit/refinement and question answering with
stronger images and text correspondences using natural language. In addressing
the iT2I problem, we present a simple approach that augments LLMs for iT2I with
prompting techniques and off-the-shelf T2I models. We evaluate our approach for
iT2I in a variety of common-used scenarios under different LLMs, e.g., ChatGPT,
LLAMA, Baichuan, and InternLM. We demonstrate that our approach could be a
convenient and low-cost way to introduce the iT2I ability for any existing LLMs
and any text-to-image models without any training while bringing little
degradation on LLMs' inherent capabilities in, e.g., question answering and
code generation. We hope this work could draw broader attention and provide
inspiration for boosting user experience in human-machine interactions
alongside the image quality of the next-generation T2I systems.Comment: Technical report. Project page at https://minidalle3.github.io
Copy mechanism and tailored training for character-based data-to-text generation
In the last few years, many different methods have been focusing on using
deep recurrent neural networks for natural language generation. The most widely
used sequence-to-sequence neural methods are word-based: as such, they need a
pre-processing step called delexicalization (conversely, relexicalization) to
deal with uncommon or unknown words. These forms of processing, however, give
rise to models that depend on the vocabulary used and are not completely
neural.
In this work, we present an end-to-end sequence-to-sequence model with
attention mechanism which reads and generates at a character level, no longer
requiring delexicalization, tokenization, nor even lowercasing. Moreover, since
characters constitute the common "building blocks" of every text, it also
allows a more general approach to text generation, enabling the possibility to
exploit transfer learning for training. These skills are obtained thanks to two
major features: (i) the possibility to alternate between the standard
generation mechanism and a copy one, which allows to directly copy input facts
to produce outputs, and (ii) the use of an original training pipeline that
further improves the quality of the generated texts.
We also introduce a new dataset called E2E+, designed to highlight the
copying capabilities of character-based models, that is a modified version of
the well-known E2E dataset used in the E2E Challenge. We tested our model
according to five broadly accepted metrics (including the widely used BLEU),
showing that it yields competitive performance with respect to both
character-based and word-based approaches.Comment: ECML-PKDD 2019 (Camera ready version
Quark Stars as inner engines for Gamma Ray Bursts?
A model for Gamma ray bursts inner engine based on quark stars (speculated to
exist in nature) is presented. We describe how and why these objects might
constitute new candidates for GRB inner engines. At the heart of the model is
the onset of exotic phases of quark matter at the surface of such stars, in
particular the 2-flavor color superconductivity. A novel feature of such a
phase is the generation of particles which are unstable to photon decay
providing a natural mechanism for a fireball generation; an approach which is
fundamentally different from models where the fireball is generated during
collapse or conversion of neutron star to quark star processes. The model is
capable of reproducing crucial features of Gamma ray bursts, such as the
episodic activity of the engine (multiple and random shell emission) and the
two distinct categories of the bursts (two regimes are isolated in the model
with \sim 2 s and \sim 81 s burst total duration).Comment: 8 pages, 3 figures, new and more appropriate title. Major changes in
the text (aspects of the models discussed in more details), better quality
Figure 1 and Figure 2 and added Figure 3, version to appear in
Astronomy&Astrophysic
- …