Journal of Graphics ›› 2023, Vol. 44 ›› Issue (1): 120-130. DOI: 10.11996/JG.j.2095-302X.2023010120
• Computer Graphics and Virtual Reality •
PAN Dong-hui, JIN Ying-han, SUN Xu, LIU Yu-sheng, ZHANG Dong-liang
Received: 2022-04-24
Revised: 2022-07-01
Online: 2023-10-31
Published: 2023-02-16
Contact: ZHANG Dong-liang
About author: PAN Dong-hui (1997-), master student. His main research interests include digital image processing. E-mail: 417969567@qq.com
PAN Dong-hui, JIN Ying-han, SUN Xu, LIU Yu-sheng, ZHANG Dong-liang. CTH-Net: CNN-Transformer hybrid network for garment image generation from sketches and color points[J]. Journal of Graphics, 2023, 44(1): 120-130.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2023010120
Fig. 5 Pre-processing of a garment image ((a) Garment image; (b) Image normalization; (c) Watercolor image; (d) Smoothed image; (e) Sketches; (f) Color points; (g) Blurred image)
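The exact pre-processing operators are defined in the paper body; as a rough illustration of the Fig. 5 stages, the sketch below uses plain OpenCV (ref. [32]) built-ins as stand-ins. The file name, output size, and all filter parameters are assumptions, not the paper's settings.

```python
import cv2

# Hedged sketch of the Fig. 5 pre-processing stages using OpenCV built-ins;
# "garment.jpg", the 256x256 size, and the filter parameters are illustrative only.
img = cv2.imread("garment.jpg")                                # (a) garment image
norm = cv2.resize(img, (256, 256))                             # (b) image normalization
watercolor = cv2.stylization(norm, sigma_s=60, sigma_r=0.45)   # (c) watercolor-like image
smoothed = cv2.edgePreservingFilter(norm, flags=1,
                                    sigma_s=60, sigma_r=0.4)   # (d) smoothed image
gray = cv2.cvtColor(smoothed, cv2.COLOR_BGR2GRAY)
sketch = 255 - cv2.Canny(gray, 100, 200)                       # (e) sketch from Canny edges (ref. [35])
# (f) color points could be sampled from K-means clustered colors (ref. [33]); omitted here
blurred = cv2.GaussianBlur(norm, (21, 21), 0)                  # (g) blurred image
```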
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 0.0001 |
| Epochs | 2000 |
| CPU | E3-1230 v2 |
| GPU | GTX 1660s |
| Memory | 16 GB |
| OS | Windows 10 |

Table 1 Parameter settings for training the sketch extraction model
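As a point of reference, the Table 1 settings map onto a one-line PyTorch optimizer configuration; `sketch_net` below is a stand-in module, not the paper's sketch extraction network (which follows ref. [34]).

```python
import torch

sketch_net = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)     # stand-in for the sketch extraction model
optimizer = torch.optim.AdamW(sketch_net.parameters(), lr=1e-4)  # AdamW, lr = 0.0001 (Table 1)
# Training then runs for 2000 epochs per Table 1.
```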
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 0.00001 |
| λ∗ | λrec: 1, λfea: 0.01, λadv: 0.001, λR: 0.01 |
| Epochs | Drafting: 500, Refinement: 500, Tuning: 50 |
| CPU | E5-2695 v3 |
| GPU | Tesla P100 |
| Memory | 32 GB |
| OS | Ubuntu 20.04 |

Table 2 Parameter settings for training CTH-Net
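The λ∗ row in Table 2 weights four loss terms. The sketch below shows how such a weighted objective is commonly assembled in PyTorch; the individual loss values and the `cth_net` module are placeholders, not the paper's exact definitions.

```python
import torch

# Loss weights taken from Table 2.
LAMBDA_REC, LAMBDA_FEA, LAMBDA_ADV, LAMBDA_R = 1.0, 0.01, 0.001, 0.01

def total_loss(l_rec, l_fea, l_adv, l_r):
    """Weighted sum of reconstruction, feature (perceptual), adversarial,
    and regularization terms using the Table 2 coefficients."""
    return (LAMBDA_REC * l_rec + LAMBDA_FEA * l_fea
            + LAMBDA_ADV * l_adv + LAMBDA_R * l_r)

cth_net = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)     # stand-in for the CTH-Net generator
optimizer = torch.optim.AdamW(cth_net.parameters(), lr=1e-5)  # AdamW, lr = 0.00001 (Table 2)
```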
Fig. 8 Comparisons of generation results between CTH-Net and other methods ((a) Large area pattern; (b) Stripe pattern; (c) Solid color pattern; (d) Small area pattern; (e) Blue mickey pattern)
Fig. 10 More comparisons of generation results between CTH-Net and other networks ((a) Inputs; (b) MUNIT; (c) UNet; (d) Pix2PixHD; (e) Attention-UNet; (f) TransGAN; (g) VQGAN; (h) CTH-Net)
| Method | HPR | IS | FID |
|---|---|---|---|
| MUNIT | 0.0 | 3.852 | 5.790 |
| UNet | 1.3 | 4.266 | 2.340 |
| Pix2PixHD | 1.0 | 4.133 | 2.464 |
| Attention-UNet | 1.7 | 4.287 | 2.191 |
| TransGAN | 0.8 | 4.174 | 2.419 |
| VQGAN | 1.9 | 4.304 | 2.085 |
| FashionImageDesign | 3.4 | - | - |
| CTH-Net† (ours) | 4.4 | 4.427 | 1.872 |
| CTH-Net (ours) | 5.5 | 4.583 | 1.496 |

Table 3 Quantitative comparison between CTH-Net and other methods
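Table 3 reports a human preference rating (HPR), Inception Score (IS, ref. [42]), and Fréchet Inception Distance (FID, ref. [43]). The snippet below is a minimal sketch of computing IS and FID, assuming the torchmetrics package (with its image extras) is available; the random tensors stand in for the real test images and the generated garment images, and an actual evaluation would use the full test set.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Stand-ins for real test images and generated images:
# float tensors in [0, 1] with shape (N, 3, H, W).
real_images = torch.rand(16, 3, 256, 256)
fake_images = torch.rand(16, 3, 256, 256)

fid = FrechetInceptionDistance(feature=2048, normalize=True)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())   # lower is better

inception_score = InceptionScore(normalize=True)
inception_score.update(fake_images)
is_mean, is_std = inception_score.compute()
print("IS:", is_mean.item())          # higher is better
```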
| Method | Batch size* | Time per epoch (s) | GPU memory (GB) | IS | FID |
|---|---|---|---|---|---|
| With ToPatch and ToFeatureMap | 4 | 365.2 | 2.223 | 4.583 | 1.496 |
| Without ToPatch and ToFeatureMap | 4 | 1204.6 | 7.854 | 4.989 | 1.367 |

Table 4 Performance of ToPatch and ToFeatureMap
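Table 4 indicates that ToPatch and ToFeatureMap trade a small drop in IS/FID for roughly 3x faster epochs and far lower GPU memory. The modules themselves are defined in the paper; the sketch below only illustrates the general idea assumed here, namely folding a feature map into small patches before a heavy stage and unfolding it back afterwards, so that stage operates on short sequences.

```python
import torch

def to_patch(x: torch.Tensor, p: int) -> torch.Tensor:
    """Split a (B, C, H, W) feature map into non-overlapping p x p patches,
    returning (B * H/p * W/p, C, p, p) so each patch is processed independently."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // p, p, w // p, p)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, c, p, p)

def to_feature_map(x: torch.Tensor, p: int, b: int, h: int, w: int) -> torch.Tensor:
    """Inverse of to_patch: reassemble patches into a (B, C, H, W) feature map."""
    c = x.shape[1]
    x = x.reshape(b, h // p, w // p, c, p, p).permute(0, 3, 1, 4, 2, 5)
    return x.reshape(b, c, h, w)

# Round-trip check on a dummy feature map.
feat = torch.randn(4, 64, 32, 32)
patches = to_patch(feat, p=8)                    # (64, 64, 8, 8): 16 patches per image
restored = to_feature_map(patches, 8, 4, 32, 32)
assert torch.equal(feat, restored)
```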
[1] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//The 31st Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2017: 5998-6008. |
[2] | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778. |
[3] | ISOLA P, ZHU J Y, ZHOU T H, et al. Image-to-image translation with conditional adversarial networks[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 5967-5976. |
[4] | SANGKLOY P, LU J W, FANG C, et al. Scribbler: controlling deep image synthesis with sketch and color[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 6836-6845. |
[5] | ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2242-2251. |
[6] | WANG T C, LIU M Y, ZHU J Y, et al. High-resolution image synthesis and semantic manipulation with conditional GANs[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 8798-8807. |
[7] | ZHU J Y, ZHANG R, PATHAK D, et al. Toward multimodal image-to-image translation[C]//The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 465-476. |
[8] | ZHANG R, ISOLA P, EFROS A A. Colorful image colorization[M]//Computer Vision - ECCV 2016. Cham: Springer International Publishing, 2016: 649-666. |
[9] | YOU S, YOU N, PAN M X. PI-REC: progressive image reconstruction network with edge and color domain[EB/OL]. (2019-03-25) [2022-01-28]. https://arxiv.org/abs/1903.10146. |
[10] | REN H, LI J, GAO N. Two-stage sketch colorization with color parsing[J]. IEEE Access, 2019, 8: 44599-44610. |
[11] | CHONG M J, FORSYTH D. JoJoGAN: one shot face stylization[EB/OL]. [2022-01-28]. https://arxiv.org/abs/2112.11641. |
[12] | KARRAS T, LAINE S, AILA T M. A style-based generator architecture for generative adversarial networks[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE Press, 2019: 4396-4405. |
[13] | LIN Z, ZHANG Z, ZHANG K R, et al. Interactive style transfer: all is your palette[EB/OL]. [2022-01-28]. https://arxiv.org/abs/2203.13470. |
[14] | CHEN P, ZHANG Y, LI Z, et al. Few-shot incremental learning for label-to-image translation[C]//2022 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 3697-3707. |
[15] | LI Y, YU X G, HAN X G, et al. A deep learning based interactive sketching system for fashion images design[EB/OL]. (2020-10-09) [2022-01-12]. https://arxiv.org/abs/2010.04413. |
[16] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. (2021-06-03) [2022-01-12]. https://arxiv.org/abs/2010.11929. |
[17] | CHEN H T, WANG Y H, GUO T Y, et al. Pre-trained image processing transformer[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE Press, 2021: 12294-12305. |
[18] | CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[M]//Computer Vision - ECCV 2020. Cham: Springer International Publishing, 2020: 213-229. |
[19] | JIANG Y F, CHANG S Y, WANG Z Y. TransGAN: two pure transformers can make one strong GAN, and that can scale up[EB/OL]. (2021-12-09) [2022-01-28]. https://arxiv.org/abs/2102.07074. |
[20] | DENG Y Y, TANG F, DONG W M, et al. StyTr2: image style transfer with transformers[EB/OL]. [2022-01-12]. https://arxiv.org/abs/2105.14576. |
[21] | LIU Z, LIN Y, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 10012-10022. |
[22] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. [2022-01-28]. https://arxiv.org/abs/2010.11929. |
[23] | CHEN M, RADFORD A, CHILD R, et al. Generative pretraining from pixels[C]//The 37th International Conference on Machine Learning. New York: ACM, 2020: 1691-1703. |
[24] | RONNEBERGER O, FISCHER P, BROX T. U-net: convolutional networks for biomedical image segmentation[M]//Lecture Notes in Computer Science. Cham: Springer International Publishing, 2015: 234-241. |
[25] | HAN K, XIAO A, WU E H, et al. Transformer in transformer[EB/OL]. (2021-10-26) [2022-01-28]. https://arxiv.org/abs/2103.00112. |
[26] | ODENA A, DUMOULIN V, OLAH C. Deconvolution and checkerboard artifacts[EB/OL]. (2016-10-17) [2022-01-28]. https://distill.pub/2016/deconv-checkerboard/. |
[27] | ZHANG Z F, WANG Z W, LIN Z, et al. Image super-resolution by neural texture transfer[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 7974-7983. |
[28] | GULRAJANI I, AHMED F, ARJOVSKY M, et al. Improved training of Wasserstein GANs[C]//The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 5769-5779. |
[29] | JOHNSON J, ALAHI A, LI F F. Perceptual losses for real-time style transfer and super-resolution[M]//Computer Vision - ECCV 2016. Cham: Springer International Publishing, 2016: 694-711. |
[30] | SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2022-01-28]. https://arxiv.org/abs/1409.1556. |
[31] | GULRAJANI I, AHMED F, ARJOVSKY M, et al. Improved training of Wasserstein GANs[C]//The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 5769-5779. |
[32] | BRADSKI G. The openCV library[J]. Dr. Dobb's Journal: Software Tools for the Professional Programmer, 2000, 25(11): 120-123. |
[33] | HARTIGAN J A, WONG M A. Algorithm AS 136: a K-means clustering algorithm[J]. Applied Statistics, 1979, 28(1): 100-108. |
[34] | SIMO-SERRA E, IIZUKA S, SASAKI K, et al. Learning to simplify: fully convolutional networks for rough sketch cleanup[J]. ACM Transactions on Graphics, 2016, 35(4): 121.1-121.11. |
[35] | CANNY J. A computational approach to edge detection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986, PAMI-8(6): 679-698. |
[36] | YAN C, VANDERHAEGHE D, GINGOLD Y. A benchmark for rough sketch cleanup[J]. ACM Transactions on Graphics, 2020, 39(6): 163.1-163.14. |
[37] | ZHANG Y L, LI K P, LI K, et al. Image super-resolution using very deep residual channel attention networks[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 286-301. |
[38] | HUANG X, LIU M Y, BELONGIE S, et al. Multimodal unsupervised image-to-image translation[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 172-189. |
[39] | RONNEBERGER O, FISCHER P, BROX T. U-net: convolutional networks for biomedical image segmentation[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer International Publishing, 2015: 234-241. |
[40] | OKTAY O, SCHLEMPER J, FOLGOC L L, et al. Attention U-net: learning where to look for the pancreas[EB/OL]. (2018-05-20) [2022-01-28]. https://arxiv.org/abs/1804.03999. |
[41] | ESSER P, ROMBACH R, OMMER B. Taming transformers for high-resolution image synthesis[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 12868-12878. |
[42] | SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs[C]//The 30th International Conference on Neural Information Processing Systems. New York: ACM, 2016: 2234-2242. |
[43] | HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]//The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6629-6640. |