Concept personalization methods enable large text-to-image models to learn
specific subjects (e.g., objects/poses/3D models) and synthesize renditions in
new contexts. Because the reference images are strongly biased toward particular
visual attributes, state-of-the-art personalization models tend to overfit the
entire subject and cannot disentangle individual visual characteristics in pixel space. In this
study, we propose a more challenging setting, namely fine-grained visual
appearance personalization. Unlike existing methods, we allow users to
provide a sentence describing the desired attributes. A novel decoupled
self-augmentation strategy is proposed to generate target-related and
non-target samples for learning user-specified visual attributes. The augmented
data refine the model's understanding of the target attribute while
mitigating the influence of unrelated attributes. At inference time,
adjustments are made in the semantic space using the learned target and
non-target embeddings to further enhance the disentanglement of the target
attributes. Extensive experiments on various kinds of visual attributes with
state-of-the-art personalization methods demonstrate that the proposed method can mimic
the target visual appearance in novel contexts, thereby improving the controllability
and flexibility of personalization.
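
For concreteness, the inference-time adjustment in semantic space might take a form like the sketch below, which suppresses the learned non-target embedding direction inside the target embedding before it conditions the text-to-image model. The function name, the orthogonal-projection rule, and the `strength` parameter are illustrative assumptions, not the paper's actual formulation.

```python
import torch


def adjust_target_embedding(e_tgt: torch.Tensor,
                            e_neg: torch.Tensor,
                            strength: float = 1.0) -> torch.Tensor:
    """Hypothetical semantic-space adjustment: project the non-target
    direction out of the learned target embedding so that unrelated
    attributes are attenuated before conditioning the generator."""
    # Unit vector along the non-target embedding (guard against zero norm).
    e_neg_unit = e_neg / e_neg.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    # Scalar overlap between target and non-target directions.
    overlap = (e_tgt * e_neg_unit).sum(dim=-1, keepdim=True)
    # Remove (a scaled amount of) the shared component.
    return e_tgt - strength * overlap * e_neg_unit


# Toy usage with random 768-d embeddings (e.g., a CLIP text-token dimension).
e_tgt = torch.randn(1, 768)
e_neg = torch.randn(1, 768)
e_adj = adjust_target_embedding(e_tgt, e_neg, strength=0.8)
```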