MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers
Diffusion-model-based text-guided image generation has recently made
astounding progress, producing fascinating results in open-domain image
manipulation tasks. Few models, however, currently have complete zero-shot
capabilities for both global and local image editing due to the complexity and
diversity of image manipulation tasks. In this work, we propose a method with
mixture-of-expert (MOE) controllers to align the text-guided capacity of
diffusion models with different kinds of human instructions, enabling our model
to handle various open-domain image manipulation tasks with natural language
instructions. First, we use large language models (ChatGPT) and conditional
image synthesis models (ControlNet) to generate a large global image
transfer dataset in addition to the instruction-based local image editing
dataset. Then, using an MOE technique and task-specific adaptation training on
a large-scale dataset, our conditional diffusion model can edit images globally
and locally. Extensive experiments demonstrate that our approach performs
surprisingly well on various image manipulation tasks when dealing with
open-domain images and arbitrary human instructions. Please refer to our
project page: https://oppo-mente-lab.github.io/moe_controller/
Comment: 5 pages, 6 figures
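The mixture-of-expert idea mentioned in the abstract above can be illustrated with a minimal gating sketch. This is not the MoEController architecture, which the abstract does not describe in detail: the function names `softmax` and `moe_combine`, the plain-list expert outputs, and the gate scores are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw gate scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_combine(expert_outputs, gate_scores):
    """Blend expert outputs using softmax gate weights (assumed scheme)."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
```

With equal gate scores, the result is simply the average of the expert outputs; a larger score for one expert shifts the blend toward that expert.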
Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning
Recent progress in personalized image generation using diffusion models has
been significant. However, development in the area of open-domain and
non-fine-tuning personalized image generation is proceeding rather slowly. In
this paper, we propose Subject-Diffusion, a novel open-domain personalized
image generation model that, in addition to not requiring test-time
fine-tuning, also only requires a single reference image to support
personalized generation of single or multiple subjects in any domain. Firstly, we
construct an automatic data labeling tool and use the LAION-Aesthetics dataset
to construct a large-scale dataset consisting of 76M images and their
corresponding subject detection bounding boxes, segmentation masks and text
descriptions. Secondly, we design a new unified framework that combines text
and image semantics by incorporating coarse location and fine-grained reference
image control to maximize subject fidelity and generalization. Furthermore, we
also adopt an attention control mechanism to support multi-subject generation.
Extensive qualitative and quantitative results demonstrate that our method
outperforms other SOTA frameworks in single, multiple, and human customized
image generation. Please refer to our project page:
https://oppo-mente-lab.github.io/subject_diffusion/
Comment: 14 pages, 10 figures
NCC: Natural Concurrency Control for Strictly Serializable Datastores by Avoiding the Timestamp-Inversion Pitfall
Strictly serializable datastores greatly simplify the development of correct
applications by providing strong consistency guarantees. However, existing
techniques pay unnecessary costs for naturally consistent transactions, which
arrive at servers in an order that is already strictly serializable. We find
these transactions are prevalent in datacenter workloads. We exploit this
natural arrival order by executing transaction requests with minimal costs
while optimistically assuming they are naturally consistent, and then leverage
a timestamp-based technique to efficiently verify if the execution is indeed
consistent. In the process of designing such a timestamp-based technique, we
identify a fundamental pitfall in relying on timestamps to provide strict
serializability, and name it the timestamp-inversion pitfall. We find
timestamp-inversion has affected several existing works.
We present Natural Concurrency Control (NCC), a new concurrency control
technique that guarantees strict serializability and ensures minimal costs --
i.e., one-round latency, lock-free, and non-blocking execution -- in the best
(and common) case by leveraging natural consistency. NCC is enabled by three
key components: non-blocking execution, decoupled response control, and
timestamp-based consistency check. NCC avoids timestamp-inversion with a new
technique: response timing control, and proposes two optimization techniques,
asynchrony-aware timestamps and smart retry, to reduce false aborts. Moreover,
NCC designs a specialized protocol for read-only transactions, which is the
first to achieve the optimal best-case performance while ensuring strict
serializability, without relying on synchronized clocks. Our evaluation shows
that NCC outperforms state-of-the-art solutions by an order of magnitude on
many workloads.
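The abstract does not spell out how the timestamp-based consistency check works, so the following is only a generic sketch of one common form such a check can take: each response carries a validity interval, and the transaction passes if all intervals share a common point. The function name `is_consistent` and the `(low, high)` pair representation are assumptions for illustration, not NCC's protocol.

```python
def is_consistent(intervals):
    """Return True if all (low, high) validity intervals overlap.

    Each pair is assumed to mean "the value this request observed was
    valid from timestamp low through timestamp high". If some single
    timestamp lies inside every interval, the requests can be ordered
    at that point.
    """
    latest_start = max(low for low, _ in intervals)
    earliest_end = min(high for _, high in intervals)
    return latest_start <= earliest_end
```

A transaction that fails such a check would be retried or aborted; how NCC handles that case (for example, via the smart retry mentioned above) is not shown here.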
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation
Text-to-image diffusion models are well-known for their ability to generate
realistic images based on textual prompts. However, the existing works have
predominantly focused on English, lacking support for non-English text-to-image
models. The most commonly used translation methods cannot solve the generation
problem related to language culture, while training from scratch on a specific
language dataset is prohibitively expensive. In this paper, we are inspired to
propose a simple plug-and-play language transfer method based on knowledge
distillation. All we need to do is train a lightweight MLP-like
parameter-efficient adapter (PEA) with only 6M parameters under teacher
knowledge distillation along with a small parallel data corpus. We are
surprised to find that freezing the parameters of UNet can still achieve
remarkable performance on the language-specific prompt evaluation set,
demonstrating that PEA can stimulate the potential generation ability of the
original UNet. Additionally, it closely approaches the performance of the
English text-to-image model on a general prompt evaluation set. Furthermore,
our adapter can be used as a plugin to achieve significant results in
downstream tasks in cross-lingual text-to-image generation. Code will be
available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion
Comment: 17 pages, 13 figures
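The idea of training a small MLP-like adapter under teacher distillation can be sketched in a few lines. This is a toy illustration under stated assumptions, not the PEA-Diffusion implementation: the residual connection, the two-layer shape, the mean-squared-error loss, and every name below (`adapter`, `distill_loss`, `w_in`, `w_out`) are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


def adapter(x, w_in, w_out):
    """Two-layer MLP with a residual connection (assumed design)."""
    hidden = relu(matvec(w_in, x))
    delta = matvec(w_out, hidden)
    return [a + b for a, b in zip(x, delta)]


def distill_loss(student, teacher):
    """Mean squared error between student and teacher features."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
```

In a setup like this, only `w_in` and `w_out` would be updated to reduce `distill_loss`, while the teacher and the frozen backbone stay fixed.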
GammaE: Gamma Embeddings for Logical Queries on Knowledge Graphs
Embedding knowledge graphs (KGs) for multi-hop logical reasoning is a
challenging problem due to massive and complicated structures in many KGs.
Recently, many promising works projected entities and queries into a geometric
space to efficiently find answers. However, it remains challenging to model the
negation and union operators. The negation operator has no strict boundaries,
which generates overlapping embeddings and leads to ambiguous answers.
An additional limitation is that the union operator is not closed, which
undermines the model's ability to handle a series of union operators. To address these
problems, we propose a novel probabilistic embedding model, namely Gamma
Embeddings (GammaE), for encoding entities and queries to answer different
types of first-order logic (FOL) queries on KGs. We utilize the linear property and strong boundary
support of the Gamma distribution to capture more features of entities and
queries, which dramatically reduces model uncertainty. Furthermore, GammaE
implements the Gamma mixture method to design the closed union operator. The
performance of GammaE is validated on three large logical query datasets.
Experimental results show that GammaE significantly outperforms
state-of-the-art models on public benchmarks.
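Why a Gamma mixture yields a closed union can be shown with a small sketch: combining two mixtures of Gamma densities gives another mixture of Gamma densities, so unions can be chained. The representation of a mixture as a list of `(weight, alpha, beta)` triples, the shape/rate parameterization, and the fixed blend weight `w1` are assumptions for illustration; the abstract does not state how GammaE sets the mixture weights.

```python
import math


def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and rate beta."""
    if x <= 0:
        return 0.0
    log_p = (alpha * math.log(beta) + (alpha - 1) * math.log(x)
             - beta * x - math.lgamma(alpha))
    return math.exp(log_p)


def mixture_pdf(x, mixture):
    """Density of a mixture given as (weight, alpha, beta) triples."""
    return sum(w * gamma_pdf(x, a, b) for w, a, b in mixture)


def union(m1, m2, w1=0.5):
    """Union of two Gamma mixtures is again a Gamma mixture."""
    return ([(w1 * w, a, b) for w, a, b in m1]
            + [((1 - w1) * w, a, b) for w, a, b in m2])
```

Because `union` returns the same kind of object it takes as input, its result can be passed to `union` again, which is the closure property in question.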
CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout
Recent advances have shown promise in merging neural radiance fields (NeRFs)
with pre-trained diffusion models for text-to-3D object generation. However,
one enduring challenge is their inadequate capability to accurately parse and
regenerate consistent multi-object environments. Specifically, these models
encounter difficulties in accurately representing quantity and style prompted
by multi-object texts, often resulting in a collapse of the rendering fidelity
that fails to match the semantic intricacies. Moreover, amalgamating these
elements into a coherent 3D scene is a substantial challenge, stemming from
generic distribution inherent in diffusion models. To tackle the issue of
'guidance collapse' and enhance consistency, we propose a novel framework,
dubbed CompoNeRF, by integrating an editable 3D scene layout with
object-specific and scene-wide guidance mechanisms. It begins by interpreting a
complex text into an editable 3D layout populated with multiple NeRFs, each
paired with a corresponding subtext prompt for precise object depiction. Next,
a tailored composition module seamlessly blends these NeRFs, promoting
consistency, while the dual-level text guidance reduces ambiguity and boosts
accuracy. Notably, the unique modularity of CompoNeRF permits NeRF
decomposition. This enables flexible scene editing and recomposition into new
scenes based on the edited layout or text prompts. Utilizing the open source
Stable Diffusion model, CompoNeRF not only generates scenes with high fidelity
but also paves the way for innovative multi-object composition using editable
3D layouts. Remarkably, our framework achieves up to a 54% improvement in
performance, as measured by the multi-view CLIP score metric. Code is available
at https://github.com/hbai98/Componerf
Prompt Space Optimizing Few-shot Reasoning Success with Large Language Models
Prompt engineering is an essential technique for enhancing the abilities of
large language models (LLMs) by providing explicit and specific instructions.
It enables LLMs to excel in various tasks, such as arithmetic reasoning,
question answering, summarization, relation extraction, machine translation,
and sentiment analysis. Researchers have been actively exploring different
prompt engineering strategies, such as Chain of Thought (CoT), Zero-CoT, and
In-context learning. However, an unresolved problem arises from the fact that
current approaches lack a solid theoretical foundation for determining optimal
prompts. To address this issue in prompt engineering, we propose a new and
effective approach called Prompt Space. Our methodology utilizes text
embeddings to obtain basis vectors by matrix decomposition, and then constructs
a space for representing all prompts. Prompt Space significantly outperforms
state-of-the-art prompt paradigms on ten public reasoning benchmarks. Notably,
without the help of the CoT method and the prompt "Let's think step by step",
Prompt Space shows superior performance over the few-shot method. Overall, our
approach provides a robust and fundamental theoretical framework for selecting
simple and effective prompts. This advancement marks a significant step towards
improving prompt engineering for a wide variety of applications in LLMs.
Comment: Natural language processing (NLP)
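The step of obtaining basis vectors from text embeddings by matrix decomposition can be sketched as follows. The abstract does not say which decomposition is used, so this example substitutes power iteration on the Gram matrix, which recovers the top right singular vector of the embedding matrix. The function name `top_basis_vector` and the toy embeddings in the usage note are hypothetical.

```python
import math


def top_basis_vector(embeddings, iterations=100):
    """Approximate the dominant direction of a set of embeddings.

    Runs power iteration on the Gram matrix E^T E, where E stacks the
    embeddings as rows. Assumes the uniform start vector is not
    orthogonal to the dominant direction.
    """
    dim = len(embeddings[0])
    gram = [
        [sum(row[i] * row[j] for row in embeddings) for j in range(dim)]
        for i in range(dim)
    ]
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all-zero embeddings: no direction to recover
        v = [x / norm for x in w]
    return v
```

For example, `top_basis_vector([[2.0, 0.0], [1.0, 0.0], [3.0, 0.0]])` points along the first axis, since every toy embedding lies on it. Further basis vectors could be obtained by removing this direction from the embeddings and repeating.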