图学学报 (Journal of Graphics) ›› 2023, Vol. 44 ›› Issue (1): 120-130. DOI: 10.11996/JG.j.2095-302X.2023010120
收稿日期: 2022-04-24
修回日期: 2022-07-01
出版日期: 2023-10-31
发布日期: 2023-02-16
通讯作者: 张东亮
作者简介: 潘东辉(1997-), 男, 硕士研究生。主要研究方向为数字图像处理。E-mail: 417969567@qq.com
PAN Dong-hui, JIN Ying-han, SUN Xu, LIU Yu-sheng, ZHANG Dong-liang
Received: 2022-04-24
Revised: 2022-07-01
Online: 2023-10-31
Published: 2023-02-16
Contact: ZHANG Dong-liang
About author: PAN Dong-hui (1997-), master's student. His main research interests include digital image processing. E-mail: 417969567@qq.com
摘要: 绘制服装效果图是服装设计过程中重要的一环,针对目前存在智能化程度不足、对用户绘画水平和想象能力要求较高等问题,提出了一种使用线稿和颜色点生成服装图像的CNN-Transformer混合网络CTH-Net。CTH-Net结合卷积神经网络(CNN)在提取局部信息和Transformer在处理长距离依赖方面的优势,将2个模型架构进行高效混合,并设计ToPatch和ToFeatureMap模块减小输入Transformer的数据量和维度以降低计算资源消耗。CTH-Net由3个阶段组成:一是草图阶段,旨在预测服装的颜色分布,获得没有渐变和阴影的水彩式图像;二是细化阶段,将水彩式图像细化为有光影效果的服装图像;三是调优阶段,组合一、二阶段的输出进一步优化生成质量。实验结果表明,仅需输入线稿和少量颜色点,CTH-Net便能生成出高质量的服装图像。与现有的方法相比,该网络生成图像的真实感和准确性均有较大优势。
Abstract: Drawing garment renderings is an important part of the garment design process. To address the limited intelligent assistance of existing tools and their high demands on users' drawing skills and imagination, this paper proposes CTH-Net, a CNN-Transformer hybrid network that generates garment images from line-art sketches and color points. CTH-Net combines the strength of convolutional neural networks (CNN) in extracting local information with the Transformer's strength in modeling long-range dependencies, blending the two architectures efficiently, and introduces ToPatch and ToFeatureMap modules that reduce the amount and dimensionality of the data fed into the Transformer to lower computational cost. CTH-Net consists of three stages: a drafting stage that predicts the garment's color distribution and produces a watercolor-style image without gradients or shadows; a refinement stage that refines the watercolor-style image into a garment image with lighting and shading; and a tuning stage that combines the outputs of the first two stages to further improve generation quality. Experimental results show that, given only a sketch and a few color points, CTH-Net can generate high-quality garment images, and compared with existing methods the generated images have clear advantages in both realism and accuracy.
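To make the three-stage data flow described in the abstract concrete, the sketch below is a minimal PyTorch mock-up: a line-art sketch and a rasterized color-point map enter the drafting stage, its watercolor-style output is refined, and the tuning stage fuses both intermediate results. The `StageNet` blocks, channel counts, and input packing are illustrative assumptions only; they are not the paper's actual CNN-Transformer sub-networks.

```python
import torch
import torch.nn as nn

class StageNet(nn.Module):
    """Placeholder encoder-decoder standing in for one CTH-Net stage."""
    def __init__(self, in_ch, out_ch=3, width=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, out_ch, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.body(x)

class CTHPipeline(nn.Module):
    """Drafting -> refinement -> tuning, following the description in the abstract."""
    def __init__(self):
        super().__init__()
        self.drafting = StageNet(in_ch=1 + 3)    # sketch (1 ch) + rasterized color points (3 ch)
        self.refinement = StageNet(in_ch=3)      # watercolor-style draft
        self.tuning = StageNet(in_ch=3 + 3)      # draft + refined image

    def forward(self, sketch, color_points):
        draft = self.drafting(torch.cat([sketch, color_points], dim=1))  # flat colors, no shading
        refined = self.refinement(draft)                                 # add lighting/shading
        final = self.tuning(torch.cat([draft, refined], dim=1))          # fuse stage 1 and 2 outputs
        return draft, refined, final

if __name__ == "__main__":
    net = CTHPipeline()
    sketch = torch.rand(1, 1, 256, 256)   # line-art input
    points = torch.rand(1, 3, 256, 256)   # sparse color hints rasterized to an RGB map
    draft, refined, final = net(sketch, points)
    print(final.shape)                    # torch.Size([1, 3, 256, 256])
```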
潘东辉, 金映含, 孙旭, 刘玉生, 张东亮. CTH-Net:从线稿和颜色点生成服装图像的CNN-Transformer混合网络[J]. 图学学报, 2023, 44(1): 120-130.
PAN Dong-hui, JIN Ying-han, SUN Xu, LIU Yu-sheng, ZHANG Dong-liang. CTH-Net: CNN-Transformer hybrid network for garment image generation from sketches and color points[J]. Journal of Graphics, 2023, 44(1): 120-130.
图5 服装图像预处理((a)服装效果图;(b)图像规范化;(c)水彩式图像;(d)平滑处理;(e)线稿图像;(f)颜色点;(g)模糊处理)
Fig. 5 Pre-processing of a garment image ((a) Garment image; (b) Image normalization; (c) Watercolor image; (d) Smoothed image; (e) Sketches; (f) Color points; (g) Blurred image)
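Figure 5 enumerates the pre-processing steps (normalization, watercolor-style image, smoothing, sketch, color points, blurring). Since the paper cites OpenCV [32] and K-means clustering [33], one plausible way to build the flat "watercolor" target and the sparse, blurred color hints is sketched below; the cluster count, number of sampled points, and blur kernel are illustrative assumptions, not the authors' settings.

```python
import cv2
import numpy as np

def watercolor_and_points(img_bgr, k=8, n_points=20, blur_ksize=21):
    """Quantize colors with K-means into a flat 'watercolor' image and sample sparse color hints."""
    h, w = img_bgr.shape[:2]
    pixels = img_bgr.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 3, cv2.KMEANS_PP_CENTERS)
    watercolor = centers[labels.flatten()].reshape(h, w, 3).astype(np.uint8)
    watercolor = cv2.medianBlur(watercolor, 5)            # smooth jagged cluster borders (Fig. 5(d))

    # Sample a few pixels as "color points" and blur them into a soft hint map (Fig. 5(f)-(g)).
    hints = np.zeros_like(img_bgr)
    ys = np.random.randint(0, h, size=n_points)
    xs = np.random.randint(0, w, size=n_points)
    for y, x in zip(ys, xs):
        cv2.circle(hints, (x, y), 4, img_bgr[y, x].tolist(), -1)
    blurred_hints = cv2.GaussianBlur(hints, (blur_ksize, blur_ksize), 0)
    return watercolor, blurred_hints

# Example: watercolor, hints = watercolor_and_points(cv2.imread("garment.png"))
```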
图6 线稿提取流程((a)服装图像;(b)脏线稿;(c~d)训练数据示例;(e~f)目标数据示例)
Fig. 6 Process of sketch extraction ((a) Garment image; (b) Image with noise; (c-d) Examples of training data; (e-f) Examples of target data)
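Figure 6(b) is a noisy ("dirty") line drawing extracted from the garment image and used as training input for the sketch-extraction model. A simple edge-based way to produce such rough line art, in the spirit of Canny edge detection [35], might look like the following; the blur size and thresholds are illustrative assumptions.

```python
import cv2

def rough_sketch(img_bgr, low=50, high=150):
    """Return dark edge lines on a white background, as a rough stand-in for Fig. 6(b)."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (3, 3), 0)   # suppress texture noise before edge detection
    edges = cv2.Canny(gray, low, high)         # Canny edges [35]: white edges on black
    return 255 - edges                         # invert so lines are dark on a white canvas
```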
表1 线稿提取模型训练参数
Table 1 Parameter settings for training the sketch extraction model
| Parameter (参数名称) | Value (参数值) |
|---|---|
| Optimizer | AdamW |
| Learning rate | 0.0001 |
| Epochs | 2000 |
| CPU | E3-1230 v2 |
| GPU | GTX 1660s |
| Memory | 16 GB |
| OS | Windows 10 |
表2 CTH-Net模型训练参数
Table 2 Parameter settings for training CTH-Net
| Parameter (参数名称) | Value (参数值) |
|---|---|
| Optimizer | AdamW |
| Learning rate | 0.00001 |
| λ∗ | λrec: 1, λfea: 0.01, λadv: 0.001, λR: 0.01 |
| Epochs | Drafting: 500, Refinement: 500, Tuning: 50 |
| CPU | E5-2695 v3 |
| GPU | Tesla P100 |
| Memory | 32 GB |
| OS | Ubuntu 20.04 |
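The λ∗ row in Table 2 suggests the training objective is a weighted sum of a reconstruction term, a feature (perceptual) term, an adversarial term, and a regularization term. The sketch below only shows how such weights could be combined with the AdamW settings from Table 2; the concrete loss definitions are the paper's, and the placeholder terms here (L1, MSE, a WGAN-style generator term) are assumptions.

```python
import torch
import torch.nn as nn

# Loss weights λrec, λfea, λadv, λR taken from Table 2. The individual loss terms below are
# placeholders; the paper defines the actual reconstruction/feature/adversarial/regularization terms.
LAMBDA = {"rec": 1.0, "fea": 0.01, "adv": 0.001, "R": 0.01}

def total_loss(pred, target, fea_pred, fea_target, d_fake, reg_term):
    l_rec = nn.functional.l1_loss(pred, target)           # pixel-level reconstruction
    l_fea = nn.functional.mse_loss(fea_pred, fea_target)  # feature / perceptual distance
    l_adv = -d_fake.mean()                                 # generator adversarial term (WGAN-style)
    return (LAMBDA["rec"] * l_rec + LAMBDA["fea"] * l_fea
            + LAMBDA["adv"] * l_adv + LAMBDA["R"] * reg_term)

if __name__ == "__main__":
    pred, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
    fea_p, fea_t = torch.rand(2, 256), torch.rand(2, 256)
    d_fake, reg = torch.rand(2, 1), torch.tensor(0.5)
    print(total_loss(pred, target, fea_p, fea_t, d_fake, reg).item())
    # Per Table 2, the optimizer would be: torch.optim.AdamW(model.parameters(), lr=1e-5)
```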
图8 CTH-Net与其他网络模型生成结果的对比((a)大面积图案;(b)条纹;(c)纯色;(d)小面积图案;(e)蓝色米老鼠)
Fig. 8 Comparisons of generation results between CTH-Net and other methods ((a) Large area pattern; (b) Stripe pattern; (c) Solid color pattern; (d) Small area pattern; (e) Blue mickey pattern)
图9 CTH-Net与其他网络模型生成结果的局部对比
Fig. 9 Comparisons of generation details between CTH-Net and other networks ((a) Inputs; (b) Attention-UNet; (c) Pix2PixHD; (d) VQGAN; (e) CTH-Net)
图10 CTH-Net与其他网络模型生成结果的更多对比
Fig. 10 More comparisons of generation results between CTH-Net and other networks ((a) Inputs; (b) MUNIT; (c) UNet; (d) Pix2PixHD; (e) Attention-UNet; (f) TransGAN; (g) VQGAN; (h) CTH-Net)
表3 CTH-Net与其他方法的量化对比
Table 3 Comparisons of quantitative evaluations between CTH-Net and other methods
| Method (方法) | HPR | IS | FID |
|---|---|---|---|
| MUNIT | 0.0 | 3.852 | 5.790 |
| UNet | 1.3 | 4.266 | 2.340 |
| Pix2PixHD | 1.0 | 4.133 | 2.464 |
| Attention-UNet | 1.7 | 4.287 | 2.191 |
| TransGAN | 0.8 | 4.174 | 2.419 |
| VQGAN | 1.9 | 4.304 | 2.085 |
| FashionImageDesign | 3.4 | - | - |
| CTH-Net† (Ours) | 4.4 | 4.427 | 1.872 |
| CTH-Net (Ours) | 5.5 | 4.583 | 1.496 |
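The IS [42] and FID [43] columns in Table 3 can be reproduced in spirit with the torchmetrics implementations (installed via `pip install torchmetrics[image]`, which pulls in torch-fidelity). The snippet below uses random tensors purely as stand-ins; it is not the authors' evaluation code, and reliable FID estimates require far more images than shown here.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Stand-ins for real and generated garment images (uint8, NCHW). Replace with real data loaders.
real = torch.randint(0, 256, (128, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (128, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=64)   # small feature size keeps this toy example stable
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()
inception.update(fake)
is_mean, is_std = inception.compute()
print("IS:", is_mean.item(), "+/-", is_std.item())
```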
表4 ToPatch与ToFeatureMap模块的作用
Table 4 Performance of ToPatch and ToFeatureMap
| Method (方法) | Batch size* | Time per epoch (s) | GPU memory (GB) | IS | FID |
|---|---|---|---|---|---|
| With ToPatch & ToFeatureMap | 4 | 365.2 | 2.223 | 4.583 | 1.496 |
| Without ToPatch & ToFeatureMap | 4 | 1204.6 | 7.854 | 4.989 | 1.367 |
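Table 4 ablates the ToPatch and ToFeatureMap modules, which the abstract describes as reducing the amount and dimensionality of data entering the Transformer. One plausible reading, sketched below, is a strided patch embedding that folds a CNN feature map into a short token sequence and an inverse module that unfolds the tokens back into a feature map; the module names follow the paper, but the internals, patch size, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ToPatch(nn.Module):
    """Fold an (N, C, H, W) feature map into (N, L, D) tokens with L = (H/p)*(W/p)."""
    def __init__(self, in_ch, dim, patch=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        x = self.proj(x)                      # (N, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)   # (N, L, D) token sequence for the Transformer

class ToFeatureMap(nn.Module):
    """Unfold (N, L, D) tokens back into an (N, C, H, W) feature map."""
    def __init__(self, dim, out_ch, patch=4):
        super().__init__()
        self.proj = nn.ConvTranspose2d(dim, out_ch, kernel_size=patch, stride=patch)

    def forward(self, tokens, hw):
        h, w = hw                              # spatial size of the token grid
        x = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
        return self.proj(x)                    # (N, out_ch, h*p, w*p)

if __name__ == "__main__":
    feat = torch.rand(2, 64, 32, 32)
    tokens = ToPatch(64, dim=128, patch=4)(feat)   # (2, 64, 128): 64 tokens instead of 1024 pixels
    encoded = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)(tokens)
    back = ToFeatureMap(128, 64, patch=4)(encoded, (8, 8))
    print(tokens.shape, back.shape)                # (2, 64, 128) and (2, 64, 32, 32)
```

With a patch size of 4, a 32×32 feature map becomes 64 tokens instead of 1024 per-pixel tokens, which is consistent with the much lower memory use and per-epoch time reported in Table 4.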
[1] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//The 31st Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2017: 5998-6008. |
[2] | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778. |
[3] | ISOLA P, ZHU J Y, ZHOU T H, et al. Image-to-image translation with conditional adversarial networks[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 5967-5976. |
[4] | SANGKLOY P, LU J W, FANG C, et al. Scribbler: controlling deep image synthesis with sketch and color[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 6836-6845. |
[5] | ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2242-2251. |
[6] | WANG T C, LIU M Y, ZHU J Y, et al. High-resolution image synthesis and semantic manipulation with conditional GANs[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 8798-8807. |
[7] | ZHU J Y, ZHANG R, PATHAK D, et al. Toward multimodal image-to-image translation[C]//The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 465-476. |
[8] | ZHANG R, ISOLA P, EFROS A A. Colorful image colorization[M]//Computer Vision - ECCV 2016. Cham: Springer International Publishing, 2016: 649-666. |
[9] | YOU S, YOU N, PAN M X. PI-REC: progressive image reconstruction network with edge and color domain[EB/OL]. (2019-03-25) [2022-01-28]. https://arxiv.org/abs/1903.10146. |
[10] | REN H, LI J, GAO N. Two-stage sketch colorization with color parsing[J]. IEEE Access, 2019, 8: 44599-44610. |
[11] | CHONG M J, FORSYTH D. JoJoGAN: one shot face stylization[EB/OL]. [2022-01-28].https://arxiv.org/abs/2112.11641. |
[12] | KARRAS T, LAINE S, AILA T M. A style-based generator architecture for generative adversarial networks[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE Press, 2019: 4396-4405. |
[13] | LIN Z, ZHANG Z, ZHANG K R, et al. Interactive style transfer: all is your palette[EB/OL]. [2022-01-28]. https://arxiv.org/abs/2203.13470. |
[14] | CHEN P, ZHANG Y, LI Z, et al. Few-shot incremental learning for label-to-image translation[C]//2022 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 3697-3707. |
[15] | LI Y, YU X G, HAN X G, et al. A deep learning based interactive sketching system for fashion images design[EB/OL]. (2020-10-09) [2022-01-12].https://arxiv.org/abs/2010.04413. |
[16] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. (2021-06-03) [2022-01-12]. https://arxiv.org/abs/2010.11929. |
[17] | CHEN H T, WANG Y H, GUO T Y, et al. Pre-trained image processing transformer[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE Press, 2021: 12294-12305. |
[18] | CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[M]//Computer Vision - ECCV 2020. Cham: Springer International Publishing, 2020: 213-229. |
[19] | JIANG Y F, CHANG S Y, WANG Z Y. TransGAN: two pure transformers can make one strong GAN, and that can scale up[EB/OL]. (2021-12-09) [2022-01-28].https://arxiv.org/abs/2102.07074. |
[20] | DENG Y Y, TANG F, DONG W M, et al. StyTr2: image style transfer with transformers[EB/OL]. [2022-01-12]. https://arxiv.org/abs/2105.14576. |
[21] | LIU Z, LIN Y, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 10012-10022. |
[22] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[EB/OL]. [2022-01-28].https://arxiv.org/abs/2010.11929. |
[23] | CHEN M, RADFORD A, CHILD R, et al. Generative pretraining from pixels[C]//The 37th International Conference on Machine Learning. New York: ACM, 2020: 1691-1703. |
[24] | RONNEBERGER O, FISCHER P, BROX T. U-net: convolutional networks for biomedical image segmentation[M]//Lecture Notes in Computer Science. Cham: Springer International Publishing, 2015: 234-241. |
[25] | HAN K, XIAO A, WU E H, et al. Transformer in transformer[EB/OL]. (2021-10-26) [2022-01-28]. https://arxiv.org/abs/2103.00112. |
[26] | ODENA A, DUMOULIN V, OLAH C. Deconvolution and checkerboard artifacts[EB/OL]. (2016-10-17) [2022-01-28]. https://distill.pub/2016/deconv-checkerboard/. |
[27] | ZHANG Z F, WANG Z W, LIN Z, et al. Image super-resolution by neural texture transfer[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 7974-7983. |
[28] | GULRAJANI I, AHMED F, ARJOVSKY M, et al. Improved training of Wasserstein GANs[C]//The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 5769-5779. |
[29] | JOHNSON J, ALAHI A, LI F F. Perceptual losses for real-time style transfer and super-resolution[M]//Computer Vision - ECCV 2016. Cham: Springer International Publishing, 2016: 694-711. |
[30] | SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2022-01- 28]. https://arxiv.org/abs/1409.1556. |
[31] | GULRAJANI I, AHMED F, ARJOVSKY M, et al. Improved training of Wasserstein GANs[C]//The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 5769-5779. |
[32] | BRADSKI G. The openCV library[J]. Dr. Dobbʹs Journal: Software Tools for the Professional Programmer, 2000, 25(11): 120-123. |
[33] | HARTIGAN J A, WONG M A. Algorithm AS 136: a K-means clustering algorithm[J]. Applied Statistics, 1979, 28(1): 100-108. |
[34] | SIMO-SERRA E, IIZUKA S, SASAKI K, et al. Learning to simplify: fully convolutional networks for rough sketch cleanup[J]. ACM Transactions on Graphics, 2016, 35(4): 121. 1-121.11. |
[35] | CANNY J. A computational approach to edge detection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986, PAMI-8(6): 679-698. |
[36] | YAN C, VANDERHAEGHE D, GINGOLD Y. A benchmark for rough sketch cleanup[J]. ACM Transactions on Graphics, 2020, 39(6): 163. 1-163.14. |
[37] | ZHANG Y L, LI K P, LI K, et al. Image super-resolution using very deep residual channel attention networks[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 286-301. |
[38] | HUANG X, LIU M Y, BELONGIE S, et al. Multimodal unsupervised image-to-image translation[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 172-189. |
[39] | RONNEBERGER O, FISCHER P, BROX T. U-net: convolutional networks for biomedical image segmentation[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer International Publishing, 2015: 234-241. |
[40] | OKTAY O, SCHLEMPER J, FOLGOC L L, et al. Attention U-net: learning where to look for the pancreas[EB/OL]. (2018-05-20) [2022-01-28]. https://arxiv.org/abs/1804.03999. |
[41] | ESSER P, ROMBACH R, OMMER B. Taming transformers for high-resolution image synthesis[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 12868-12878. |
[42] | SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs[C]//The 30th International Conference on Neural Information Processing Systems. New York: ACM, 2016: 2234-2242. |
[43] | HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]//The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6629-6640. |