The utilization of Large Language Models (LLMs) for the construction of AI
systems has garnered significant attention across diverse fields. The extension
of LLMs to the domain of fashion holds substantial commercial potential but
also presents inherent challenges due to the intricate semantic interactions
involved in fashion-related generation. To address this issue, we developed a
hierarchical AI system called Fashion Matrix, dedicated to editing photos
simply through conversation.
This system facilitates diverse prompt-driven tasks, encompassing garment or
accessory replacement, recoloring, addition, and removal. Specifically, Fashion
Matrix employs an LLM as its foundational support and engages in iterative
interactions with users. It uses a range of Semantic Segmentation Models
(e.g., Grounded-SAM, MattingAnything) to delineate the specific editing
masks based on user instructions. Subsequently, Visual Foundation Models (e.g.,
Stable Diffusion, ControlNet, etc.) are leveraged to generate edited images
from text prompts and masks, thereby facilitating the automation of fashion
editing processes. Experiments demonstrate the outstanding ability of Fashion
Matrix to explore the collaborative potential of functionally diverse
pre-trained models in the domain of fashion editing.