Previous virtual try-on methods usually focus on aligning a clothing item
with a person, which limits their ability to exploit the person's complex
pose, shape, and skin color, as well as the overall structure of the clothing,
all of which are vital to photo-realistic virtual try-on. To address this
weakness, we propose the Fill in Fabrics (FIFA) model, a self-supervised
conditional generative adversarial network-based framework comprising a
Fabricator and a unified virtual try-on pipeline with a Segmenter, a Warper, and a
Fuser. Given a masked clothing item as input, the Fabricator aims to
reconstruct the clothing image and, by filling in fabrics, learns the overall
structure of the clothing. The virtual try-on pipeline is then trained by
transferring the learned representations from the Fabricator to the Warper in an effort to warp
and refine the target clothing. We also propose a multi-scale structural
constraint that enforces global context at multiple scales while warping the
target clothing to better fit the person's pose and shape. Extensive
experiments demonstrate that our FIFA model achieves state-of-the-art results
on the standard VITON dataset for virtual try-on of clothing items, and is
shown to be effective at handling complex poses and retaining the texture and
embroidery of the clothing.
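
The abstract does not spell out the exact form of the multi-scale structural constraint. As a minimal sketch of one plausible realization, the snippet below averages an L1 penalty between the warped and ground-truth clothing over several downsampled resolutions, so coarse scales penalize global-structure errors while the full scale penalizes texture errors; the function name and scale factors are our assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def multi_scale_structural_loss(warped, target, scales=(1, 2, 4)):
    """Hypothetical multi-scale structural constraint.

    Averages an L1 loss between `warped` and `target` clothing tensors
    (shape: batch x channels x height x width) over several resolutions.
    Scale 1 compares the images directly; larger scales compare
    average-pooled copies, which emphasizes global structure.
    """
    loss = 0.0
    for s in scales:
        if s == 1:
            w, t = warped, target
        else:
            # Downsample both images by the same factor before comparing.
            w = F.avg_pool2d(warped, kernel_size=s)
            t = F.avg_pool2d(target, kernel_size=s)
        loss = loss + F.l1_loss(w, t)
    return loss / len(scales)
```

In training, such a term would simply be added to the Warper's objective alongside the adversarial and refinement losses.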