Full-body High-resolution Anime Generation with Progressive Structure-conditional Generative Adversarial Networks
Friday, May 11, 2018
We propose Progressive Structure-conditional Generative Adversarial Networks (PSGAN), a new framework that can generate full-body and high-resolution character images based on pose information.
Recent progress in generative adversarial networks (GANs) with hierarchical and progressive structures has made it possible to generate high-resolution images. However, existing approaches have limitations in generating structural objects, such as full-body characters, which are important for industrial applications. On the other hand, although GANs that generate images based on structured conditions (e.g., poses and facial landmarks) have been proposed, their image quality is insufficient. To tackle these limitations, we introduce PSGAN, which progressively increases the resolution of generated images with structural conditions at each scale during training, enabling detailed generation of structured objects such as full-body characters. We also impose arbitrary latent variables and structural conditions on the network to enable diverse and controllable video generation based on target pose sequences.
In this report we empirically demonstrate the effectiveness of our approach, showing experimental results of detailed, pose-conditioned anime character video generation at 512x512 resolution.
Overview of generated results
We show examples of a variety of anime characters and animations generated by PSGAN. We first generate many anime characters from random latent variables using PSGAN. Next, we generate new anime characters by interpolating the latent values corresponding to existing characters. Then, an animation of the interpolated anime character is generated with continuous pose sequences.
Generation of a new full-body anime character
We generate new full-body anime characters by interpolating the latent values corresponding to anime characters with different costumes (characters 1 and 2) using PSGAN. Note that only a single pose condition is imposed here.
Adding action to the generated anime character
The following shows examples of animation generation with the specified anime characters and target pose.
By fixing the latent variables and giving continuous pose sequences to PSGAN, we can generate an animation of the character. More specifically, we map the representation of the specified anime character into latent variables in the latent space, which serve as the input vector of PSGAN; feeding these latent variables together with a target pose sequence then produces an arbitrary animation of the specified character.
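The character interpolation described above can be sketched as linear blending between two latent vectors, each of which would be fed to the generator under a fixed pose condition. A minimal NumPy sketch, with the latent dimensionality (512) and function name chosen for illustration:

```python
import numpy as np

def interpolate_latents(z1, z2, num_steps=8):
    """Linearly interpolate between two latent vectors z1 and z2."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    return np.stack([(1.0 - a) * z1 + a * z2 for a in alphas])

# Two hypothetical 512-dim latent codes for characters 1 and 2.
rng = np.random.default_rng(0)
z1, z2 = rng.standard_normal(512), rng.standard_normal(512)

# Each row would be passed to the generator with the same pose condition,
# yielding a smooth morph from character 1 to character 2.
frames = interpolate_latents(z1, z2, num_steps=8)
print(frames.shape)  # (8, 512)
```

For animation, the roles are reversed: the latent vector is held fixed and the pose condition varies frame by frame.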
Recently, automatic image and video generation using deep generative models has been studied ([Goodfellow+14], [Karras+18], [Vondrick+16]). These models are useful for media creation tools such as photo editing, animation production, and movie making.
Focusing on anime creation, automatic character generation can inspire experts to create new characters, and also can contribute to reducing cost for drawing animation.
[Jin+17] focuses on image generation for anime character faces with a GAN architecture. However, its quality does not satisfy what is required for animation production, and full-body character generation has not been proposed yet.
Generating full-body characters automatically and adding action to them at high quality would be a great help for creating new characters and drawing animations. Therefore, we work on generating full-body character images and adding action to them (i.e., video generation) with high quality.
Two problems remain in applying full-body character generation to animation production: (i) high-resolution generation, and (ii) generation with a specified sequence of poses.
Generative Adversarial Networks (GANs) ([Goodfellow+14]) are one of the most promising frameworks for a diverse range of image generation tasks ([Radford+16], [Reed+16], [Isola+17], [Zhu+17], [Ma+17a], [Jin+17]). Recent progress in GANs with hierarchical and progressive structures has realized high-resolution and detailed image synthesis ([Karras+18]) and text-to-image generation ([Zhang+18], [Zhang+17]). However, high-quality generation is still restricted to certain objects, such as faces and birds. Generating structural objects with global structures is a challenge for GANs ([Goodfellow17]), and the same holds for high-resolution generation. On the other hand, GANs with structured conditions, such as poses and facial landmarks, have also been proposed ([Ma+17a], [Ma+17b], [Qiao+18]); however, their image quality is insufficient.
We propose Progressive Structure-conditional GAN (PSGAN) to tackle these problems. We show that PSGAN can generate full-body anime characters and animations with target pose sequences at 512x512 resolution. Because PSGAN generates images from latent variables and structural conditions, it is able to generate controllable animations with target pose sequences.
Progressive Structure-conditional GAN
Our key idea is to learn image representation with structural conditions progressively. PSGAN increases the resolution of generated images with structural conditions at each scale and generates high-resolution images with detailed pose conditions. We adopt the same architecture of the image generator and discriminator as [Karras+18], except that we impose structural conditions on both the generator and discriminator at each scale by adding pose maps with corresponding resolutions.
With the proposed network architecture, image generation proceeds progressively from low-resolution layers to high-resolution layers with corresponding condition maps. Imposing the structural condition at every NxN resolution makes the training of the generator and discriminator structure-conditional at each scale and significantly stabilizes it.
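The multi-scale condition maps described above can be illustrated with a small NumPy sketch: a full-resolution pose map is repeatedly reduced by 2x2 max-pooling (as in our reduction layers) to obtain one condition map per resolution stage. The function names and the example keypoint location are chosen for illustration:

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max-pooling with stride 2 over an (M, N, N) condition map."""
    m, n, _ = x.shape
    return x.reshape(m, n // 2, 2, n // 2, 2).max(axis=(2, 4))

def condition_pyramid(pose_map, min_res=4):
    """Build per-scale condition maps, from full resolution down to min_res."""
    maps = {pose_map.shape[-1]: pose_map}
    while pose_map.shape[-1] > min_res:
        pose_map = max_pool2x2(pose_map)
        maps[pose_map.shape[-1]] = pose_map
    return maps

# Hypothetical 20-keypoint map at 512x512: 1 at each keypoint, -1 elsewhere.
pose = -np.ones((20, 512, 512), dtype=np.float32)
pose[np.arange(20), 100, 200] = 1.0

pyramid = condition_pyramid(pose)
print(sorted(pyramid))  # [4, 8, 16, 32, 64, 128, 256, 512]
```

Max-pooling (rather than average-pooling) keeps each keypoint activation at 1 at every scale, so even the coarsest stages receive a sharp structural condition.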
Generator (G) and discriminator (D) architecture of PSGAN. NxN white boxes stand for learnable convolution layers operating on NxN spatial resolution. NxN gray boxes stand for non-learnable downsampling layers, which reduce the spatial resolution of the structural condition map to NxN.
Training data preparation
In this section, we describe our dataset preparation methodology. PSGAN requires pairs of images and keypoint coordinates. We prepare an original Avatar Anime-Character Dataset synthesized with Unity, and the DeepFashion dataset with keypoints detected by OpenPose.
Avatar Anime-Character Dataset
We construct the new dataset for PSGAN to fulfill three requirements:
- Pose diversity. To generate smooth and natural animation, we prepare a very wide variety of pose conditions.
- Number of training images. An infinite number of synthetic images with keypoint maps are obtained by generating 3D modeled avatars using Unity, without any manual annotations.
- Background elimination. We set the background color to white and erase unnecessary information to avoid negative effects on image generation.
We divide several continuous actions of one avatar into 600 poses and capture keypoints in each pose. We repeated this process for 79 kinds of costumes and obtained 47,400 images in total. We also obtained 20 keypoints based on the locations of the bones of the 3D model.
The following figure shows samples of training data. Anime characters (top row) and pose images (bottom row).
DeepFashion Dataset
PSGAN exploits pose information to impose structural conditions on the image generation network. We use OpenPose [Cao+16] to extract keypoint coordinates from the images, which have no keypoint annotations. The number of keypoints is 18, and examples with fewer than 10 detected keypoints are omitted. Channels for missing keypoints are filled with -1, and detected keypoints are set to 1 at their locations.
Training settings for experiments
We use the same stage design and the same loss function as [Karras+18]. We show the discriminator 600k real images and structure conditions at each stage and use the WGAN-GP loss ([Gulrajani+17]) with n_critic = 1. We use a minibatch size of 16 for the stages from 4x4 to 128x128 image generation, and gradually decrease it to 12 for 256x256 images and 5 for 512x512 images to save GPU memory.
We use M channels to represent M keypoints as structural conditions. In each channel, the pixel at the corresponding keypoint is set to 1 and all other pixels to -1. We use max-pooling with kernel size 2 and stride 2 at each NxN resolution as the reduction layers for the structural conditions.
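The keypoint representation above can be sketched as follows: each of the M keypoints gets its own channel, with 1 at the keypoint's pixel and -1 everywhere else, and a missing keypoint leaves its channel at -1. The function name and example coordinates are hypothetical:

```python
import numpy as np

def keypoints_to_map(keypoints, resolution, num_keypoints=20):
    """Encode (x, y) keypoint coordinates as an (M, N, N) condition map:
    1 at each keypoint's pixel, -1 elsewhere; missing keypoints stay all -1."""
    cond = -np.ones((num_keypoints, resolution, resolution), dtype=np.float32)
    for ch, (x, y) in enumerate(keypoints):
        if 0 <= x < resolution and 0 <= y < resolution:  # skip missing points
            cond[ch, y, x] = 1.0
    return cond

# Hypothetical keypoints in pixel coordinates; (-1, -1) marks a missing point.
kps = [(256, 64), (250, 120), (-1, -1)] + \
      [(200 + 5 * i, 150 + 10 * i) for i in range(17)]
cond = keypoints_to_map(kps, 512)
print(cond.shape, int((cond == 1).sum()))  # (20, 512, 512) 19
```

The same encoding applies to the 18-keypoint DeepFashion setting, with `num_keypoints=18`.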
Avatar Anime-Character Dataset: We train the networks using Adam [Kingma+15] with β1 = 0, β2 = 0.99. We use α = 0.001 for the stages from 4x4 to 64x64 image generation and gradually decrease it to α = 0.0008 for 128x128 images, α = 0.0006 for 256x256 images, and α = 0.0002 for 512x512 images. The number of pose keypoints is 20.
DeepFashion Dataset: We train the networks using Adam with α =0.0008, β1 =0, β2 =0.99 for all stages. The number of pose channels is 18.
Comparison of PSGAN, PG2, Disentangled PG2, and Progressive GAN
Here we investigate the diversity of generated images of PSGAN. The following figure shows generated images of PSGAN where the latent variables are randomly set. PSGAN generates a wide range of images for each pose condition.
Next, we evaluate the reproducibility of PSGAN compared to Pose Guided Person Image Generation (PG2) [Ma+17a] and Disentangled Person Image Generation (DPG2) [Ma+17b]. PG2 and DPG2 require a source image and the corresponding target pose to convert the source image into an image with the structure of the target pose. Meanwhile, PSGAN generates an image with the structure of the target pose from latent variables and the target pose alone. PG2 and DPG2 are thus more strongly conditioned by the source image and the corresponding target pose than PSGAN.
The following figure shows generated images of PSGAN, PG2, and DPG2. The input images of PG2 and DPG2 are omitted. We observe that images generated by PSGAN are as natural and plausible as those of PG2 and DPG2, reflecting the imposed pose condition. Since PSGAN also generates from latent variables, it can in principle generate a greater variety of images than PG2 and DPG2.
Finally, we evaluate the structural consistency of PSGAN compared to Progressive GAN ([Karras+18]). The following figure compares images generated by Progressive GAN and PSGAN. We observe that Progressive GAN is not capable of generating natural images of structural objects consistent with their global structures (for example, the left two images). On the other hand, PSGAN can generate plausible images consistent with their global structures by imposing the structural conditions at each scale.
This report has demonstrated smooth and high-resolution animation generation with PSGAN. We have shown that PSGAN can generate full-body anime characters and their animations based on target pose sequences at 512x512.
PSGAN progressively increases the resolution of generated images with structural conditions at each scale during training and generates detailed images of structured objects, such as full-body characters. Because PSGAN generates images from latent variables and structural conditions, it is able to generate controllable animations with target pose sequences.
Our experimental results demonstrate that PSGAN can generate a variety of anime characters from random latent variables, and smooth animations by imposing continuous pose sequences as structural conditions. Since the experimental setting is still limited (e.g., one avatar and several actions), we plan to conduct experiments and evaluations under more varied conditions.
In the future, we plan to release the Avatar Anime-Character Dataset.
[Goodfellow+14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In NIPS 2014.
[Karras+18] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In ICLR, 2018.
[Vondrick+16] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In NIPS, 2016.
[Radford+16] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In ICLR, 2016.
[Reed+16] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative Adversarial Text to Image Synthesis. In ICML, 2016.
[Isola+17] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. In CVPR, 2017.
[Zhu+17] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In ICCV, 2017.
[Ma+17a] Liqian Ma, Qianru Sun, Xu Jia, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose Guided Person Image Generation. In NIPS, 2017.
[Jin+17] Yanghua Jin, Jiakai Zhang, Minjun Li, Yingtao Tian, and Huachun Zhu. Towards the High-quality Anime Characters Generation with Generative Adversarial Networks. In Machine Learning for Creativity and Design, NIPS 2017 Workshop, 2017.
[Zhang+18] Zizhao Zhang, Yuanpu Xie, and Lin Yang, Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network. In CVPR, 2018.
[Zhang+17] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks. CoRR, abs/1710.10916, 2017.
[Ma+17b] Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, and Mario Fritz. Disentangled Person Image Generation. CoRR, abs/1712.02621, 2017.
[Qiao+18] Fengchun Qiao, Naiming Yao, Zirui Jiao, Zhihao Li, Hui Chen, and Hongan Wang. Geometry-Contrastive Generative Adversarial Network for Facial Expression Synthesis. CoRR, abs/1802.01822, 2018.
[Goodfellow17] Ian Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. CoRR, abs/1701.00160, 2017.
[Gulrajani+17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved Training of Wasserstein GANs. In NIPS, 2017.
[Kingma+15] Diederik P. Kingma, and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
[Unity] Unity, https://unity3d.com
[Cao+16] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In CVPR, 2017.