Low-Rank Adaptations (LoRAs) have emerged as a powerful and popular technique
in the field of image generation, offering a highly effective way to adapt and
refine pre-trained deep learning models for specific tasks without the need for
comprehensive retraining. By employing pre-trained LoRA models, such as those
representing a specific cat and a particular dog, the objective is to generate
an image that faithfully embodies both animals as defined by the LoRAs.
However, the task of seamlessly blending multiple concept LoRAs to capture a
variety of concepts in one image proves to be a significant challenge. Common
approaches often fall short, primarily because the attention mechanisms within
different LoRA models overlap, leading to scenarios where one concept may be
completely ignored (e.g., omitting the dog) or where concepts are incorrectly
combined (e.g., producing an image of two cats instead of one cat and one dog).
To overcome these issues, CLoRA addresses them by updating the attention maps
of multiple LoRA models and leveraging them to create semantic masks that
facilitate the fusion of latent representations. Our method enables the
creation of composite images that truly reflect the characteristics of each
LoRA, successfully merging multiple concepts or styles. Our comprehensive
evaluations, both qualitative and quantitative, demonstrate that our approach
outperforms existing methodologies, marking a significant advancement in the
field of image generation with LoRAs. Furthermore, we share our source code,
benchmark dataset, and trained LoRA models to promote further research on this
topic