Fashion vision-language pre-training models have shown efficacy for a wide
range of downstream tasks. However, general vision-language pre-training models
pay less attention to fine-grained domain features, while these features are
important in distinguishing the specific domain tasks from general tasks. We
propose a method for fine-grained fashion vision-language pre-training based on
fashion Symbols and Attributes Prompt (FashionSAP) to model fine-grained
multi-modalities fashion attributes and characteristics. Firstly, we propose
the fashion symbols, a novel abstract fashion concept layer, to represent
different fashion items and to generalize various kinds of fine-grained fashion
features, making modelling fine-grained attributes more effective. Secondly,
the attributes prompt method is proposed to make the model learn specific
attributes of fashion items explicitly. We design proper prompt templates
according to the format of fashion data. Comprehensive experiments are
conducted on two public fashion benchmarks, i.e., FashionGen and FashionIQ, and
FashionSAP gets SOTA performances for four popular fashion tasks. The ablation
study also shows the proposed abstract fashion symbols, and the attribute
prompt method enables the model to acquire fine-grained semantics in the
fashion domain effectively. The obvious performance gains from FashionSAP
provide a new baseline for future fashion task research