UnitedHuman: Harnessing Multi-Source Data for High-Resolution Human Generation

¹Shanghai AI Laboratory, ²S-Lab, Nanyang Technological University, ³CUHK


Human generation has achieved significant progress. Nonetheless, existing methods still struggle to synthesize specific regions such as faces and hands. We argue that the main reason lies in the training data: a holistic human dataset inevitably carries insufficient, low-resolution information on local parts. We therefore propose to use multi-source datasets with images of various resolutions to jointly learn a high-resolution human generative model. However, multi-source data inherently a) contains different parts that do not spatially align into a coherent human, and b) comes at different scales. To tackle these challenges, we propose an end-to-end framework, UnitedHuman, that enables a continuous GAN to effectively utilize multi-source data for high-resolution human generation. Specifically, 1) we design a Multi-Source Spatial Transformer that spatially aligns multi-source images to the full-body space using a human parametric model, and 2) we propose a continuous GAN with global-structural guidance. Patches from different datasets are then sampled and transformed to supervise the training of this scale-invariant generative model. Extensive experiments demonstrate that our model, jointly learned from multi-source data, achieves higher quality than models learned from a holistic dataset.


Overview Video


Overview. Given the images I_p from the multi-source datasets D_ms, the Multi-Source Spatial Transformer maps each partial-body image into the full-body image space as I_{p→f}, giving a unified spatial distribution. With sampling parameters v, s, r and a latent code z drawn from the prior distribution, our Continuous GAN generates patches P_n centered at v with scale s. The patches over the full-body space are stitched to form the high-resolution full-body image I_f.
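The stitching step in the overview can be illustrated with a minimal sketch. It is not the UnitedHuman implementation: `toy_generator` is a hypothetical stand-in for the Continuous GAN, the non-overlapping patch grid is an assumption, and s is assumed here to be the patch extent expressed as a fraction of the full-body image. The point it shows is that a generator conditioned only on global coordinates yields patches that tile into one seamless image.

```python
import numpy as np

def patch_centers(full_h, full_w, patch):
    """Centers v = (cy, cx) of a non-overlapping patch grid, in normalized [0, 1] coordinates."""
    ys = (np.arange(full_h // patch) + 0.5) * patch / full_h
    xs = (np.arange(full_w // patch) + 0.5) * patch / full_w
    return [(cy, cx) for cy in ys for cx in xs]

def toy_generator(z, v, s, patch):
    """Hypothetical stand-in for the generator: pixel values depend only on
    global coordinates, so neighbouring patches agree along their borders."""
    offsets = (np.arange(patch) + 0.5) / patch - 0.5   # pixel centers in [-0.5, 0.5)
    gy = v[0] + offsets * s[0]                         # global row coordinates
    gx = v[1] + offsets * s[1]                         # global column coordinates
    yy, xx = np.meshgrid(gy, gx, indexing="ij")
    return np.sin(2 * np.pi * z[0] * yy) + np.cos(2 * np.pi * z[1] * xx)

def stitch_full_body(z, full_h, full_w, patch):
    """Generate every patch of the grid and paste it at its location on the canvas."""
    s = (patch / full_h, patch / full_w)               # assumed meaning of the scale s
    canvas = np.zeros((full_h, full_w))
    for cy, cx in patch_centers(full_h, full_w, patch):
        top = int(round(cy * full_h - patch / 2))
        left = int(round(cx * full_w - patch / 2))
        canvas[top:top + patch, left:left + patch] = toy_generator(z, (cy, cx), s, patch)
    return canvas

image = stitch_full_body(np.array([1.0, 2.0]), full_h=64, full_w=32, patch=16)
print(image.shape)  # (64, 32)
```

Because the same latent code z is shared by all patches and only v changes, the output resolution is set by how densely the full-body space is sampled rather than by a fixed network output size.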
Multi-Source Spatial Transformer. Given an image I_p, we first predict the camera α_est and pose θ_est of SMPL. After optimization by SMPLify-P on both visible and invisible parts, the optimized camera α_opt is used to calculate the matrix H that transforms the patch into the full-body image space.
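One way such a matrix H could be built is sketched below. The formula is an illustrative assumption, not the one from the paper: both the partial image and the full-body space are assumed to use a weak-perspective camera (scale, tx, ty), so a body point X projects to scale * X + t in each space, and H is the similarity transform that carries one projection onto the other. The names `alignment_matrix`, `project`, and `apply_homography` are hypothetical.

```python
import numpy as np

def project(cam, body_pts):
    """Weak-perspective projection (assumed camera model): x = scale * X + (tx, ty)."""
    scale, tx, ty = cam
    return scale * body_pts + np.array([tx, ty])

def alignment_matrix(cam_partial, cam_full):
    """3x3 matrix H mapping partial-image pixels to full-body-space pixels.

    From x_p = s_p * X + t_p and x_f = s_f * X + t_f it follows that
    x_f = k * (x_p - t_p) + t_f with k = s_f / s_p.
    """
    s_p, tx_p, ty_p = cam_partial
    s_f, tx_f, ty_f = cam_full
    k = s_f / s_p
    return np.array([
        [k,   0.0, tx_f - k * tx_p],
        [0.0, k,   ty_f - k * ty_p],
        [0.0, 0.0, 1.0],
    ])

def apply_homography(H, pts):
    """Apply H to an (N, 2) array of pixel coordinates."""
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    out = homo @ H.T
    return out[:, :2] / out[:, 2:3]

# Hypothetical cameras: a close-up partial crop and the canonical full-body frame.
cam_partial = (220.0, 128.0, 40.0)
cam_full = (400.0, 256.0, 512.0)
H = alignment_matrix(cam_partial, cam_full)

body_pts = np.random.default_rng(0).normal(size=(5, 2))
aligned = apply_homography(H, project(cam_partial, body_pts))
```

In this sketch, warping the partial image with H places its visible body parts where the same parts would fall in the full-body frame, which is the role the caption assigns to H.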

Comparison with SOTA Methods


Here are the comparison results of StyleGAN-Human, InsetGAN, AnyRes, and UnitedHuman. We show the full-body human images generated in each experiment at a resolution of 1024px, as well as the face and hand patches cropped from the 2048px images.

More generated results


Here are more results generated by the proposed method.


One of the interpolation results between two latent codes.
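For readers unfamiliar with latent interpolation, a minimal sketch follows. It assumes plain linear interpolation between two latent vectors; the page does not state which latent space or interpolation scheme was used, and `interpolate_latents` is a hypothetical helper. Each intermediate code would be fed to the generator to render one frame of the transition.

```python
import numpy as np

def interpolate_latents(z0, z1, steps):
    """Return `steps` latent codes moving linearly from z0 to z1 (endpoints included)."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - t) * z0 + t * z1 for t in ts])

rng = np.random.default_rng(0)
z_start, z_end = rng.normal(size=8), rng.normal(size=8)   # toy 8-dimensional codes
path = interpolate_latents(z_start, z_end, steps=5)
print(path.shape)  # (5, 8)
```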

Related Works

StyleGAN-Human proposes SHHQ, a full-body human dataset.

Text2Human proposes a text-driven controllable human image synthesis framework.


      title   = {UnitedHuman: Harnessing Multi-Source Data for High-Resolution Human Generation},
      author  = {Fu, Jianglin and Li, Shikai and Jiang, Yuming and Lin, Kwan-Yee and Wu, Wayne and Liu, Ziwei},
      journal = {arXiv preprint},
      volume  = {arXiv:2309.14335},
      year    = {2023}