Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models
Generative models have recently exhibited exceptional capabilities in
text-to-image generation, but still struggle to generate image sequences
coherently. In this work, we focus on a novel, yet challenging task of
generating a coherent image sequence based on a given storyline, denoted as
open-ended visual storytelling. We make the following three contributions: (i)
to fulfill the task of visual storytelling, we propose a learning-based
auto-regressive image generation model, termed StoryGen, with a novel
vision-language context module that enables generating the current frame by
conditioning on the corresponding text prompt and preceding image-caption
pairs; (ii) to address the data shortage of visual storytelling, we collect
paired image-text sequences by sourcing from online videos and open-source
E-books, establishing a processing pipeline for constructing a large-scale
dataset with diverse characters, storylines, and artistic styles, named
StorySalon; (iii) quantitative experiments and human evaluations validate
the superiority of our StoryGen: it generalizes to unseen characters without
any optimization and generates image sequences with coherent content and
consistent characters. Code, dataset, and models are
available at https://haoningwu3639.github.io/StoryGen_Webpage/
Comment: Accepted by CVPR 2024. Project Page: https://haoningwu3639.github.io/StoryGen_Webpage
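A minimal, hypothetical sketch of the auto-regressive loop the abstract describes: each frame is generated from its caption plus all preceding image-caption pairs. The class and function names below are illustrative stand-ins, not the authors' released code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StoryContext:
    """Running context of previously generated (image, caption) pairs."""
    pairs: List[Tuple[str, str]] = field(default_factory=list)


def generate_frame(caption: str, context: StoryContext) -> str:
    """Placeholder for a diffusion call conditioned on the caption plus visual context.

    A real system would encode context.pairs with a vision-language context
    module and feed it to the denoising network via cross-attention.
    """
    return f"image<{caption} | ctx={len(context.pairs)}>"


def generate_story(storyline: List[str]) -> List[str]:
    context = StoryContext()
    frames = []
    for caption in storyline:
        frame = generate_frame(caption, context)   # condition on all earlier pairs
        context.pairs.append((frame, caption))     # grow the context auto-regressively
        frames.append(frame)
    return frames


print(generate_story(["A fox meets a crow.", "The crow drops the cheese."]))
```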
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
Text-to-image diffusion models are typically trained to optimize the
log-likelihood objective, which presents challenges in meeting specific
requirements for downstream tasks, such as image aesthetics and image-text
alignment. Recent research addresses this issue by refining the diffusion U-Net
using human rewards through reinforcement learning or direct backpropagation.
However, many of them overlook the importance of the text encoder, which is
typically pretrained and fixed during training. In this paper, we demonstrate
that by finetuning the text encoder through reinforcement learning, we can
enhance the text-image alignment of the results, thereby improving the visual
quality. Our primary motivation comes from the observation that the current
text encoder is suboptimal, often requiring careful prompt adjustment. While
fine-tuning the U-Net can partially improve performance, it still suffers
from the suboptimal text encoder. Therefore, we propose to use reinforcement
learning with low-rank adaptation to finetune the text encoder based on
task-specific rewards, referred to as **TexForce**. We first show that
finetuning the text encoder can improve the performance of diffusion models.
Then, we illustrate that TexForce can be simply combined with existing U-Net
finetuned models to get much better results without additional training.
Finally, we showcase the adaptability of our method in diverse applications,
including the generation of high-quality face and hand images.
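A heavily simplified sketch of the core idea: only low-rank adapter (LoRA) weights on a frozen text encoder are trained to increase a task-specific reward. For brevity, the toy reward below is differentiable and maximized by direct backpropagation, whereas the paper uses reinforcement learning; the encoder and reward here are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                # keep pretrained weights fixed
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T


# Toy "text encoder": one LoRA-wrapped projection over stand-in prompt embeddings.
encoder = LoRALinear(nn.Linear(32, 16))
optimizer = torch.optim.Adam([encoder.A, encoder.B], lr=1e-3)

for step in range(100):
    tokens = torch.randn(8, 32)                    # stand-in for tokenized prompts
    text_emb = encoder(tokens)
    # Stand-in for a task-specific reward, e.g. image-text alignment or
    # aesthetic score of the image the diffusion model would produce.
    reward = -((text_emb - 1.0) ** 2).mean(dim=1)
    loss = -reward.mean()                          # maximize expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```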
Q-Refine: A Perceptual Quality Refiner for AI-Generated Image
With the rapid evolution of Text-to-Image (T2I) models in recent years,
their unsatisfactory generation results have become a challenge. However,
uniformly refining AI-Generated Images (AIGIs) of different qualities not only
limits the optimization gains for low-quality AIGIs but also introduces
negative optimization to high-quality AIGIs. To address this issue, a
quality-aware refiner named Q-Refine is proposed. Based on the preferences of
the Human Visual System (HVS), Q-Refine is the first to use an Image Quality
Assessment (IQA) metric to guide the refining process, modifying images of
different qualities through three adaptive pipelines. Experiments show that
for mainstream T2I models, Q-Refine effectively optimizes AIGIs of different
qualities. It can serve as a general refiner that optimizes AIGIs at both the
fidelity and aesthetic quality levels, thus expanding the applications of
T2I generation models.
Comment: 6 pages, 5 figures
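A hypothetical sketch of quality-aware routing: an IQA score selects one of three refinement pipelines, so low-quality images receive aggressive refinement while high-quality ones are barely touched. The thresholds and pipeline stubs are illustrative assumptions, not the paper's actual design.

```python
def iqa_score(image) -> float:
    """Placeholder for any no-reference IQA metric, returning a score in [0, 1]."""
    return 0.42  # stand-in value


def heavy_refine(image):   # low quality: aggressive re-generation of content
    return image

def medium_refine(image):  # medium quality: moderate detail enhancement
    return image

def light_refine(image):   # high quality: gentle touch-up to avoid degrading it
    return image


def refine(image):
    q = iqa_score(image)
    if q < 0.3:
        return heavy_refine(image)
    elif q < 0.7:
        return medium_refine(image)
    return light_refine(image)


print(refine("example_aigi.png"))
```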
Exploring the Naturalness of AI-Generated Images
The proliferation of Artificial Intelligence-Generated Images (AGIs) has
greatly expanded the Image Naturalness Assessment (INA) problem. Different from
early definitions that mainly focus on tone-mapped images with limited
distortions (e.g., exposure, contrast, and color reproduction), INA on
AI-generated images is especially challenging as it has more diverse contents
and could be affected by factors from multiple perspectives, including
low-level technical distortions and high-level rationality distortions. In this
paper, we take the first step to benchmark and assess the visual naturalness of
AI-generated images. First, we construct the AI-Generated Image Naturalness
(AGIN) database by conducting a large-scale subjective study to collect human
opinions on the overall naturalness as well as perceptions from technical and
rationality perspectives. AGIN verifies that naturalness is universally and
disparately affected by technical and rationality distortions. Second, we
propose the Joint Objective Image Naturalness evaluaTor (JOINT) to
automatically predict the naturalness of AGIs in alignment with human ratings.
Specifically, JOINT imitates human reasoning in naturalness evaluation by
jointly learning both technical and rationality features. We demonstrate that
JOINT significantly outperforms baselines in providing more subjectively
consistent results on naturalness assessment.
Comment: 33 pages
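A minimal sketch of the two-branch idea the abstract describes: one branch extracts low-level technical features, the other high-level rationality features, and the two are fused into a single naturalness score. Backbones and layer sizes are arbitrary placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn


class TwoBranchNaturalness(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Branch 1: low-level technical distortions (blur, noise, artifacts).
        self.technical = nn.Sequential(
            nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        # Branch 2: high-level rationality (implausible content, broken semantics).
        self.rationality = nn.Sequential(
            nn.Conv2d(3, 16, 5, 4, 2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        # Fusion head predicting an overall naturalness score.
        self.head = nn.Linear(2 * feat_dim, 1)

    def forward(self, x):
        feats = torch.cat([self.technical(x), self.rationality(x)], dim=1)
        return self.head(feats)


model = TwoBranchNaturalness()
score = model(torch.randn(2, 3, 224, 224))   # one naturalness score per image
print(score.shape)
```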
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision
The rapid evolution of Multi-modality Large Language Models (MLLMs) has
catalyzed a shift in computer vision from specialized models to general-purpose
foundation models. Nevertheless, the abilities of MLLMs on low-level visual
perception and understanding remain inadequately assessed. To address
this gap, we present Q-Bench, a holistic benchmark crafted to systematically
evaluate potential abilities of MLLMs on three realms: low-level visual
perception, low-level visual description, and overall visual quality
assessment. a) To evaluate the low-level perception ability, we construct the
LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped
with a human-asked question focusing on its low-level attributes. We then
measure the correctness of MLLMs on answering these questions. b) To examine
the description ability of MLLMs on low-level information, we propose the
LLDescribe dataset consisting of long expert-labelled golden low-level text
descriptions on 499 images, and a GPT-involved comparison pipeline between
outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we
further measure their visual quality assessment ability to align with human
opinion scores. Specifically, we design a softmax-based strategy that enables
MLLMs to predict quantifiable quality scores, and evaluate them on various
existing image quality assessment (IQA) datasets. Our evaluation across the
three abilities confirms that MLLMs possess preliminary low-level visual
skills. However, these skills are still unstable and relatively imprecise,
indicating the need for specific enhancements on MLLMs towards these abilities.
We hope that our benchmark can encourage the research community to delve deeper
to discover and enhance these untapped potentials of MLLMs. Project Page:
https://vqassessment.github.io/Q-Bench
Comment: 25 pages, 14 figures, 9 tables, preprint version
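A small sketch of the softmax-based scoring strategy described above: the logits an MLLM assigns to opposing answer tokens (for example "good" versus "poor") at the answer position are converted into a continuous quality score that can be correlated with human opinion scores. The token pair and interface are illustrative assumptions.

```python
import torch


def quality_score(logits_good: torch.Tensor, logits_poor: torch.Tensor) -> torch.Tensor:
    """Map a pair of answer-token logits to a quality score in [0, 1]."""
    probs = torch.softmax(torch.stack([logits_good, logits_poor], dim=-1), dim=-1)
    return probs[..., 0]   # probability mass on the positive token


# Example: logits an MLLM might assign at the answer position for three images.
good = torch.tensor([2.1, -0.5, 0.3])
poor = torch.tensor([0.4, 1.2, 0.2])
print(quality_score(good, poor))   # higher = predicted better quality
```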
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
Multi-modality foundation models, as represented by GPT-4V, have brought a
new paradigm for low-level visual perception and understanding tasks, in which
a single model can respond to a broad range of natural human instructions. While
existing foundation models have shown exciting potential on low-level visual
tasks, their related abilities are still preliminary and need to be improved.
In order to enhance these models, we conduct a large-scale subjective
experiment to collect a vast amount of real human feedback on low-level
vision. Each feedback item follows a pathway that starts with a detailed
description of the low-level visual appearance (e.g., clarity, color,
brightness) of an image and ends with an overall conclusion, with an average
length of 45 words. The constructed **Q-Pathway** dataset includes 58K detailed
human feedback items on
18,973 images with diverse low-level appearance. Moreover, to enable foundation
models to robustly respond to diverse types of questions, we design a
GPT-participated conversion to process these feedbacks into diverse-format 200K
instruction-response pairs. Experimental results indicate that the
**Q-Instruct** consistently elevates low-level perception and understanding
abilities across several foundation models. We anticipate that our datasets
can pave the way for a future in which general intelligence can perceive and
understand low-level visual appearance and evaluate visual quality like a
human. Our dataset, model zoo, and demo are published at:
https://q-future.github.io/Q-Instruct
Comment: 16 pages, 11 figures, pages 12-16 as appendix
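A hypothetical sketch of the conversion step: one detailed feedback pathway is reformatted into several instruction-response pairs. Fixed templates are used here purely for illustration; the paper describes a GPT-participated pipeline rather than hand-written rules.

```python
import json

feedback = {
    "image": "example_0001.jpg",
    "pathway": "The image is slightly blurry with muted colors, but overall bright. "
               "Overall, the quality is acceptable.",
}

# Two of the many instruction formats such a pathway could be converted into.
pairs = [
    {   # open-ended description instruction
        "image": feedback["image"],
        "instruction": "Describe the low-level visual appearance of this image.",
        "response": feedback["pathway"],
    },
    {   # short question derived from the same feedback
        "image": feedback["image"],
        "instruction": "Is this image sharp?",
        "response": "No, it is slightly blurry.",
    },
]

print(json.dumps(pairs, indent=2))
```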
NTIRE 2023 Quality Assessment of Video Enhancement Challenge
This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. The challenge addresses a major problem in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. It uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), which contains a total of 1211 enhanced videos, including 600 videos with color, brightness, and contrast enhancements, 310 videos with deblurring, and 301 deshaked videos. The challenge attracted a total of 167 registered participants. 61 participating teams submitted prediction results during the development phase, with a total of 3168 submissions. A total of 176 submissions were made by 37 participating teams during the final testing phase. Finally, 19 participating teams submitted their models and fact sheets and detailed the methods they used. Some methods achieved better results than the baseline methods, and the winning methods demonstrated superior prediction performance.