Human generation has achieved significant progress. Nonetheless, existing
methods still struggle to synthesize specific regions such as faces and hands.
We argue that the main reason is rooted in the training data: a holistic human
dataset inevitably provides insufficient, low-resolution information on local
parts. Therefore, we propose to use multi-source datasets with images of
various resolutions to jointly learn a high-resolution human generative model.
However, multi-source data inherently a) contains different parts that do not
spatially align into a coherent human, and b) comes with different scales. To
tackle these challenges, we propose an end-to-end framework, UnitedHuman, that
empowers a continuous GAN with the ability to effectively utilize multi-source
data for high-resolution human generation. Specifically, 1) we design a
Multi-Source Spatial Transformer that spatially aligns multi-source images to
full-body space with a human parametric model. 2) Next, a continuous GAN is
proposed with global-structural guidance and CutMix consistency. Patches from
different datasets are then sampled and transformed to supervise the training
of this scale-invariant generative model. Extensive experiments demonstrate
that our model, jointly learned from multi-source data, achieves superior
quality to models learned from a holistic dataset.

Accepted by ICCV 2023. Project page: https://unitedhuman.github.io/
Github: https://github.com/UnitedHuman/UnitedHuma
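
As a hedged illustration of the first component, the sketch below fits a similarity transform between matching 2D keypoints (e.g., joints of a human parametric model such as SMPL projected into both images) and uses it to paste a part crop, such as a face image, into a canonical full-body canvas. The function names, tensor shapes, and the normalized-coordinate assumption are ours, not the paper's implementation.

```python
# Hypothetical sketch: align a part crop into full-body space via a
# similarity transform fit to matching 2D keypoints (e.g., projected
# joints of a parametric body model such as SMPL). Names and interfaces
# are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def fit_similarity(src_kpts, dst_kpts):
    """Least-squares scale/rotation/translation mapping src -> dst,
    for matching (N, 2) keypoint sets (Procrustes/Umeyama)."""
    src_c = src_kpts - src_kpts.mean(dim=0)
    dst_c = dst_kpts - dst_kpts.mean(dim=0)
    U, S, Vt = torch.linalg.svd(dst_c.t() @ src_c)
    R = U @ Vt  # optimal 2x2 rotation (reflection correction omitted for brevity)
    scale = S.sum() / (src_c ** 2).sum()
    t = dst_kpts.mean(dim=0) - scale * (R @ src_kpts.mean(dim=0))
    return scale, R, t

def paste_into_fullbody(crop, scale, R, t, out_hw):
    """Inverse-warp a (C, H, W) crop onto the full-body canvas.
    Assumes keypoints were given in the normalized [-1, 1] coordinates
    used by affine_grid, so the fitted transform inverts directly into
    a sampling grid: p_src = R_inv @ p_dst - R_inv @ t."""
    R_inv = R.t() / scale                        # inverse rotation/scale
    theta = torch.cat([R_inv, (-R_inv @ t).unsqueeze(1)], dim=1)
    grid = F.affine_grid(theta.unsqueeze(0),
                         [1, crop.shape[0], out_hw[0], out_hw[1]],
                         align_corners=False)
    return F.grid_sample(crop.unsqueeze(0), grid, align_corners=False)
```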
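For the second component, assuming a scale-invariant generator with a hypothetical interface G(z, box) that renders any sub-region of the full-body canvas at a fixed resolution, a CutMix-style consistency term can require a zoomed-in render to agree with the corresponding crop of the full-body render. The box sampling and loss below are a minimal sketch under that assumption, not the paper's exact objective.

```python
# Hypothetical sketch of a CutMix-style scale-consistency loss for a
# continuous generator G(z, box): the same latent z rendered at two
# scales should agree on the overlapping region. G's interface is an
# assumption made for illustration.
import torch
import torch.nn.functional as F

def sample_box(min_size=0.25, max_size=1.0):
    """Sample a random square sub-region (x0, y0, size) in [0, 1] coords."""
    size = torch.empty(1).uniform_(min_size, max_size).item()
    x0 = torch.empty(1).uniform_(0.0, 1.0 - size).item()
    y0 = torch.empty(1).uniform_(0.0, 1.0 - size).item()
    return x0, y0, size

def scale_consistency_loss(G, z, res=256):
    """A zoomed-in render of a region should match the same region cut
    out of the full-body render, up to resampling."""
    full = G(z, (0.0, 0.0, 1.0))        # (N, 3, res, res), whole body
    x0, y0, s = sample_box()
    patch = G(z, (x0, y0, s))           # same region at finer detail
    i0, j0 = int(y0 * res), int(x0 * res)
    k = max(1, int(s * res))
    crop = full[:, :, i0:i0 + k, j0:j0 + k]
    crop = F.interpolate(crop, size=patch.shape[-2:],
                         mode='bilinear', align_corners=False)
    return F.l1_loss(patch, crop)
```

In training, a term of this kind would let high-resolution patches from part-specific datasets (faces, hands) supervise the same generator that produces the full body, which is the multi-source benefit the abstract describes.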