Fashion illustration is used by designers to communicate their vision and to
bring the design idea from conceptualization to realization, showing how
clothes interact with the human body. In this context, computer vision can
be used to improve the fashion design process. Unlike previous works,
which mainly focused on the virtual try-on of garments, we propose the task of
multimodal-conditioned fashion image editing, guiding the generation of
human-centric fashion images by following multimodal prompts, such as text,
human body poses, and garment sketches. We tackle this problem by proposing a
new architecture based on latent diffusion models, an approach that has not
been used before in the fashion domain. Given the lack of datasets
suitable for the task, we also extend two existing fashion datasets, namely
Dress Code and VITON-HD, with multimodal annotations collected in a
semi-automatic manner. Experimental results on these new datasets demonstrate
the effectiveness of our proposal in terms of both realism and coherence with
the given multimodal inputs. Source code and collected multimodal annotations
will be publicly released at:
https://github.com/aimagelab/multimodal-garment-designer