Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
Text-to-image diffusion models are typically trained to optimize the
log-likelihood objective, which presents challenges in meeting specific
requirements for downstream tasks, such as image aesthetics and image-text
alignment. Recent research addresses this issue by refining the diffusion U-Net
using human rewards through reinforcement learning or direct backpropagation.
However, many of them overlook the importance of the text encoder, which is
typically pretrained and fixed during training. In this paper, we demonstrate
that by finetuning the text encoder through reinforcement learning, we can
enhance the text-image alignment of the results, thereby improving the visual
quality. Our primary motivation comes from the observation that the current
text encoder is suboptimal, often requiring careful prompt adjustment. While
fine-tuning the U-Net can partially improve performance, it still suffers
from the suboptimal text encoder. Therefore, we propose to use reinforcement
learning with low-rank adaptation to finetune the text encoder based on
task-specific rewards, referred to as \textbf{TexForce}. We first show that
finetuning the text encoder can improve the performance of diffusion models.
Then, we illustrate that TexForce can simply be combined with existing
U-Net-finetuned models to obtain much better results without additional training.
Finally, we showcase the adaptability of our method in diverse applications,
including the generation of high-quality face and hand images.
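As a rough illustration (not the authors' released code), the core idea of attaching trainable low-rank adapters to a frozen text encoder and updating only those adapters with a reward-weighted objective can be sketched as follows; the `LoRALinear` class and `reward_weighted_loss` helper are hypothetical names used only for this sketch.

```python
# Hypothetical sketch: low-rank adaptation of a frozen text-encoder layer,
# trained with a REINFORCE-style, reward-weighted objective.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # keep pretrained weights fixed
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

def reward_weighted_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Push up the log-probability of generated samples in proportion to their
    baseline-subtracted task reward (e.g. aesthetics or text-image alignment)."""
    advantages = rewards - rewards.mean()
    return -(advantages.detach() * logprobs).mean()
```

In use, only the `lora_a`/`lora_b` parameters of the wrapped encoder layers would be passed to the optimizer, so the pretrained text encoder itself stays untouched.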
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision
The rapid evolution of Multi-modality Large Language Models (MLLMs) has
catalyzed a shift in computer vision from specialized models to general-purpose
foundation models. Nevertheless, there is still an inadequacy in assessing the
abilities of MLLMs on low-level visual perception and understanding. To address
this gap, we present Q-Bench, a holistic benchmark crafted to systematically
evaluate the potential abilities of MLLMs in three realms: low-level visual
perception, low-level visual description, and overall visual quality
assessment. a) To evaluate the low-level perception ability, we construct the
LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped
with a human-asked question focusing on its low-level attributes. We then
measure the correctness of MLLMs on answering these questions. b) To examine
the description ability of MLLMs on low-level information, we propose the
LLDescribe dataset consisting of long expert-labelled golden low-level text
descriptions on 499 images, and a GPT-involved comparison pipeline between
outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we
further measure their visual quality assessment ability to align with human
opinion scores. Specifically, we design a softmax-based strategy that enables
MLLMs to predict quantifiable quality scores, and evaluate them on various
existing image quality assessment (IQA) datasets. Our evaluation across the
three abilities confirms that MLLMs possess preliminary low-level visual
skills. However, these skills are still unstable and relatively imprecise,
indicating the need for specific enhancements on MLLMs towards these abilities.
We hope that our benchmark can encourage the research community to delve deeper
to discover and enhance these untapped potentials of MLLMs. Project Page:
https://vqassessment.github.io/Q-Bench
Comment: 25 pages, 14 figures, 9 tables, preprint version
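For intuition, the softmax-based scoring strategy can be sketched as below: given the MLLM's next-token logits after a quality-probing prompt, the probability mass on a positive token relative to a negative token is read out as a scalar score. The prompt wording and token ids here are placeholders, not necessarily the exact ones used in Q-Bench.

```python
# Minimal sketch of a softmax-based quality readout from an MLLM's logits.
import torch

def quality_score(next_token_logits: torch.Tensor,
                  good_token_id: int, poor_token_id: int) -> float:
    """Map next-token logits (after a prompt such as
    'The quality of this image is') to a score in [0, 1]."""
    pair = torch.stack([next_token_logits[good_token_id],
                        next_token_logits[poor_token_id]])
    prob_good = torch.softmax(pair, dim=0)[0]
    return prob_good.item()   # higher means better predicted quality
```

Because the score is a probability rather than a generated word, it can be correlated directly with human mean opinion scores on existing IQA datasets.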
Can Large Language Models Understand Real-World Complex Instructions?
Large language models (LLMs) can understand human instructions, showing their
potential for pragmatic applications beyond traditional NLP tasks. However,
they still struggle with complex instructions, which can be either complex task
descriptions that require multiple tasks and constraints, or complex input that
contains long context, noise, heterogeneous information and multi-turn format.
Due to these features, LLMs often ignore semantic constraints from task
descriptions, generate incorrect formats, violate length or sample count
constraints, and are unfaithful to the input text. Existing benchmarks are
insufficient to assess LLMs' ability to understand complex instructions, as
they are close-ended and simple. To bridge this gap, we propose CELLO, a
benchmark for evaluating LLMs' ability to follow complex instructions
systematically. We design eight features for complex instructions and construct
a comprehensive evaluation dataset from real-world scenarios. We also establish
four criteria and develop corresponding metrics, as current ones are
inadequate, biased or too strict and coarse-grained. We compare the performance
of representative Chinese-oriented and English-oriented models in following
complex instructions through extensive experiments. Resources of CELLO are
publicly available at https://github.com/Abbey4799/CELLO
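To make the failure modes concrete, here is an illustrative check for two of the constraint types mentioned above, output format and item count; this is a toy example under my own assumptions, not CELLO's actual metric code.

```python
# Illustrative only: toy compliance checks for format and count constraints.
import json

def follows_json_format(output: str) -> bool:
    """Did the model return valid JSON when the instruction asked for JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def respects_item_count(output: str, expected_items: int) -> bool:
    """Did the model produce exactly the requested number of items,
    assuming one item per non-empty line?"""
    items = [line for line in output.splitlines() if line.strip()]
    return len(items) == expected_items
```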
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
Multi-modality foundation models, as represented by GPT-4V, have brought a
new paradigm for low-level visual perception and understanding tasks, in which a
single model can respond to a broad range of natural human instructions. While
existing foundation models have shown exciting potential on low-level visual
tasks, their related abilities are still preliminary and need to be improved.
In order to enhance these models, we conduct a large-scale subjective
experiment collecting a vast number of real human feedbacks on low-level
vision. Each feedback follows a pathway that starts with a detailed description
of the low-level visual appearance (*e.g., clarity, color, brightness*) of an
image and ends with an overall conclusion, with an average length of 45 words.
The constructed **Q-Pathway** dataset includes 58K detailed human feedbacks on
18,973 images with diverse low-level appearance. Moreover, to enable foundation
models to robustly respond to diverse types of questions, we design a
GPT-participated conversion to process these feedbacks into 200K
instruction-response pairs in diverse formats. Experimental results indicate that the
**Q-Instruct** consistently elevates low-level perception and understanding
abilities across several foundational models. We anticipate that our datasets
can pave the way for a future in which general intelligence can perceive and
understand low-level visual appearance and evaluate visual quality like a
human. Our dataset, model zoo, and demo are published at:
https://q-future.github.io/Q-Instruct
Comment: 16 pages, 11 figures, pages 12-16 as appendix
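A hedged sketch of what such a GPT-participated conversion could look like is shown below: a single feedback pathway is wrapped in a prompt asking the assisting LLM to rewrite it as a question-answer pair. The prompt wording and the `call_llm` helper are placeholders, not the pipeline used to build Q-Instruct.

```python
# Sketch of converting one human feedback pathway into an instruction-response pair.
# `call_llm` stands in for whatever chat-completion API is actually used.
def build_conversion_prompt(feedback: str) -> str:
    return (
        "Below is a human description of an image's low-level quality.\n"
        f"Feedback: {feedback}\n"
        "Rewrite it as one question a user might ask about the image and a concise "
        "answer grounded only in the feedback. Format:\nQ: <question>\nA: <answer>"
    )

def feedback_to_pair(feedback: str, call_llm) -> dict:
    reply = call_llm(build_conversion_prompt(feedback))   # one LLM call per feedback
    question, _, answer = reply.partition("\nA:")          # split on the answer marker
    return {"instruction": question.removeprefix("Q:").strip(),
            "response": answer.strip()}
```

Running such a template over all 58K feedbacks, with varied output formats (Q&A, multiple choice, extended conversation), is one way a corpus of this kind could be scaled to 200K pairs.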
Sciences for The 2.5-meter Wide Field Survey Telescope (WFST)
The Wide Field Survey Telescope (WFST) is a dedicated photometric survey
facility under construction jointly by the University of Science and Technology
of China and Purple Mountain Observatory. It is equipped with a primary mirror
of 2.5m in diameter, an active optical system, and a mosaic CCD camera of 0.73
Gpix on the main focus plane to achieve high-quality imaging over a field of
view of 6.5 square degrees. The installation of WFST at the Lenghu observing
site is planned for the summer of 2023, and operation is scheduled to commence
within three months afterward. WFST will scan the northern sky in
four optical bands (u, g, r, and i) at cadences from hourly/daily to
semi-weekly in the deep high-cadence survey (DHS) and the wide field survey
(WFS) programs, respectively. WFS reaches a depth of 22.27, 23.32, 22.84, and
22.31 in AB magnitudes in a nominal 30-second exposure in the four bands during
a photometric night, respectively, enabling us to search for a tremendous number
of transients in the low-z universe and systematically investigate the variability
of Galactic and extragalactic objects. Intra-night 90 s exposures as deep as 23
and 24 mag in the u and g bands via DHS provide a unique opportunity to facilitate
explorations of energetic transients that demand high sensitivity, including
the electromagnetic counterparts of gravitational-wave events detected by the
second/third-generation GW detectors, supernovae within a few hours of their
explosions, tidal disruption events and luminous fast optical transients even
beyond a redshift of 1. Meanwhile, the final 6-year co-added images,
anticipated to reach g about 25.5 mag in WFS or even deeper by 1.5 mag in DHS,
will be of significant value to general Galactic and extragalactic sciences.
The highly uniform legacy surveys of WFST will also serve as an indispensable
complement to those of LSST, which monitors the southern sky.
Comment: 46 pages, submitted to SCMP
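As a back-of-the-envelope check (not stated in the abstract), the depth gain from co-adding N background-limited exposures of equal single-epoch depth follows the standard stacking relation:

```latex
% Depth gain from stacking N uncorrelated, background-limited exposures
\Delta m \;=\; 2.5\,\log_{10}\!\sqrt{N} \;=\; 1.25\,\log_{10} N
```

With the single-visit g-band depth of 23.32 mag quoted above, a co-added depth of about 25.5 mag would correspond to roughly N ≈ 10^{(25.5-23.32)/1.25} ≈ 55 visits per field over the six-year WFS, a plausible order of magnitude for a semi-weekly, multi-band cadence.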