Journal of Graphics ›› 2025, Vol. 46 ›› Issue (1): 139-149. DOI: 10.11996/JG.j.2095-302X.2025010139
Received: 2024-07-29
Accepted: 2024-09-18
Published: 2025-02-28
Online: 2025-02-14
TU Qinghao, LI Yuanqi, LIU Yifan, GUO Jie, GUO Yanwen
Contact: GUO Jie (1986-), associate professor, Ph.D. His main research interests cover graphic image processing, computer vision, etc. E-mail: guojie@nju.edu.cn
First author: TU Qinghao (1999-), master student. Her main research interests cover digital image processing and pattern recognition. E-mail: qinghaotu@126.com
Abstract:
Existing material texture map datasets suffer from insufficient textual descriptions, while purely image-based datasets are very large in scale; in addition, when a traditional generative model infers incorrectly, it is difficult to obtain additional hyperparameters to generate new results. To address these problems, a generalization optimization method for text-driven material texture map generation based on the stable diffusion model was proposed, with the model trained in stages: the diffusion model was first fine-tuned on a large-scale image-only dataset to fit image generation; a small-scale text-annotated dataset was then used to learn semantic information; finally, a new decoder was introduced to reconstruct the latent codes produced by the diffusion model into material texture maps. As a result, inputting a text description yields multiple sets of randomly generated material texture maps that match the description. The method organizes its code with the Colossal framework, which greatly lowers the hardware requirements for training. By separating image fitting from semantic learning, fitting model parameters on the large-scale image dataset and learning semantics from the small-scale text data, the method improves generalization and reduces the required scale of multimodal datasets.
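The staged training schedule described in the abstract can be sketched as follows. This is a minimal illustrative sketch only: all names (`Stage`, `run_stages`, the stage labels) are hypothetical and not the authors' code, which uses the Colossal framework with a stable-diffusion backbone.

```python
# Illustrative sketch of the three-stage training schedule from the abstract.
# Every identifier here is a hypothetical stand-in, not the paper's API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    dataset: str             # which dataset this stage consumes
    trainable: str           # which module is updated in this stage
    step: Callable[[], None]  # placeholder for the actual optimization loop


def run_stages(stages: List[Stage], log: List[str]) -> None:
    """Run each training stage in order; freezing the untouched modules
    is implied but not shown in this sketch."""
    for stage in stages:
        log.append(f"{stage.name}: fit {stage.trainable} on {stage.dataset}")
        stage.step()


log: List[str] = []
run_stages([
    # Stage 1: fine-tune the diffusion model on a large image-only dataset.
    Stage("stage1", "large image-only set", "diffusion model", lambda: None),
    # Stage 2: learn semantics from a small text-annotated dataset.
    Stage("stage2", "small text-annotated set", "text conditioning", lambda: None),
    # Stage 3: train a new decoder that reconstructs latent codes
    # into material texture maps.
    Stage("stage3", "latent codes", "material-map decoder", lambda: None),
], log)
print("\n".join(log))
```

Separating stages 1 and 2 is what lets a large unlabeled image set carry the burden of fitting, while only a small multimodal set is needed for semantics.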
TU Qinghao, LI Yuanqi, LIU Yifan, GUO Jie, GUO Yanwen. Generalization optimization method for text to material texture maps based on diffusion model[J]. Journal of Graphics, 2025, 46(1): 139-149.
Fig. 8 Multimodal dataset ((a) Wood tiles with herringbone and zigzag pattern; (b) White tiles with black grid and small flower pattern; (c) Tiles with blue and white checkerboard pattern; (d) Red, black and white fabric with diamond and checkerboard pattern; (e) Brown reflective shiny marble with cracked pattern; (f) Dark brown dirty gravel with pitted pattern; (g) Cliff rock with stratified pattern; (h) Brown old brick wall with staggered pattern)
Table 1 Quantitative analysis results

| Model | Material type | IS↑ | FID↓ |
|---|---|---|---|
| Text2Mat | Fabric | 3.23 | 91.45 |
| | Ceramic tiles | 7.49 | 69.06 |
| | Floor tiles | 1.92 | 76.29 |
| | Wood | 3.17 | 49.38 |
| | Rock | 2.50 | 43.33 |
| | Mean over all test data | 4.02 | 75.31 |
| Ours | Fabric | 4.08 | 62.86 |
| | Ceramic tiles | 10.14 | 73.22 |
| | Floor tiles | 6.45 | 55.13 |
| | Wood | 2.92 | 32.18 |
| | Rock | 1.48 | 47.94 |
| | Mean over all test data | 4.63 | 64.73 |
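The FID column in Table 1 measures the Fréchet distance between Gaussian fits to Inception features of real and generated images. Below is a minimal numpy sketch of that distance computed directly on feature vectors; the Inception-v3 feature extraction is omitted, and in practice a maintained implementation (e.g. pytorch-fid or torchmetrics) would be used. All names here are illustrative.

```python
import numpy as np


def _psd_sqrt(mat: np.ndarray) -> np.ndarray:
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # clip tiny negative eigenvalues from noise
    return (vecs * np.sqrt(vals)) @ vecs.T


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) computed through the symmetric form
    # (cov_a^{1/2} cov_b cov_a^{1/2})^{1/2}, which stays PSD.
    sa = _psd_sqrt(cov_a)
    covmean = _psd_sqrt(sa @ cov_b @ sa)
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))


rng = np.random.default_rng(0)
same = rng.normal(size=(512, 8))
shifted = same + 3.0  # a pure mean shift should raise the distance
print(frechet_distance(same, same) < 1e-6)   # identical sets -> ~0
print(frechet_distance(same, shifted) > 1.0)  # shifted -> clearly larger
```

Lower FID means the generated-feature distribution sits closer to the real one, which is why "FID↓" marks lower as better in the table.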
Fig. 9 Comparison of results between our work and Text2Mat/Polycam ((a) Black and blue checkerboard tiles; (b) Dirty ground; (c) Dark brown leather with diamond pattern; (d) Arc-paved pavement; (e) White and purple ceramic with chequered pattern; (f) Clean, regular, red smooth tiles with I-shaped pattern; (g) Shiny silver metal; (h) Black fabric; (i) Leather)
Fig. 10 Seamless textures ((a) Light green and white tiles with flower pattern; (b) Blue tiles with chequered pattern)
Fig. 11 Results generated by the proposed method using different formats of text descriptions ((a) Yellow chequered tiles; (b) Yellow tiles with chequered pattern; (c) Yellow and brown tiles arranged in chequered pattern; (d) A tile texture with brown chequered pattern; (e) A tile texture which has yellow and brown chequered and camouflage pattern)
[1] ZHOU Z M, CHEN G J, DONG Y, et al. Sparse-as-possible SVBRDF acquisition[J]. ACM Transactions on Graphics (TOG), 2016, 35(6): 189.
[2] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[3] DONG Y, WANG J P, TONG X, et al. Manifold bootstrapping for SVBRDF capture[J]. ACM Transactions on Graphics (TOG), 2010, 29(4): 98.
[4] DESCHAINTRE V, AITTALA M, DURAND F, et al. Single-image SVBRDF capture with a rendering-aware deep network[J]. ACM Transactions on Graphics (TOG), 2018, 37(4): 128.
[5] GAO D, LI X, DONG Y, et al. Deep inverse rendering for high-resolution SVBRDF estimation from an arbitrary number of images[J]. ACM Transactions on Graphics (TOG), 2019, 38(4): 134.
[6] GUO J, LAI S C, TAO C Z, et al. Highlight-aware two-stream network for single-image SVBRDF acquisition[J]. ACM Transactions on Graphics (TOG), 2021, 40(4): 123.
[7] ZHOU X L, KALANTARI N K. Adversarial single-image SVBRDF estimation with hybrid training[J]. Computer Graphics Forum, 2021, 40(2): 315-325.
[8] HENZLER P, DESCHAINTRE V, MITRA N J, et al. Generative modelling of BRDF textures from flash images[J]. ACM Transactions on Graphics (TOG), 2021, 40(6): 284.
[9] HU Y W, DORSEY J, RUSHMEIER H. A novel framework for inverse procedural texture modeling[J]. ACM Transactions on Graphics (TOG), 2019, 38(6): 186.
[10] SHI L, LI B C, HAŠAN M, et al. MATch: differentiable material graphs for procedural material capture[J]. ACM Transactions on Graphics (TOG), 2020, 39(6): 196.
[11] GUO Y, SMITH C, HAŠAN M, et al. MaterialGAN: reflectance capture using a generative SVBRDF model[J]. ACM Transactions on Graphics (TOG), 2020, 39(6): 254.
[12] ZHOU X L, HASAN M, DESCHAINTRE V, et al. TileGen: tileable, controllable material generation and capture[C]// The SIGGRAPH Asia 2022 Conference. New York: ACM, 2022: 34.
[13] GUERRERO P, HAŠAN M, SUNKAVALLI K, et al. MatFormer: a generative model for procedural materials[J]. ACM Transactions on Graphics (TOG), 2022, 41(4): 46.
[14] SOHL-DICKSTEIN J, WEISS E A, MAHESWARANATHAN N, et al. Deep unsupervised learning using nonequilibrium thermodynamics[EB/OL]. [2024-05-29]. https://dl.acm.org/doi/10.5555/3045118.3045358.
[15] HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[C]// The 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 574.
[16] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 10684-10695.
[17] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]// The 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2015: 234-241.
[18] PEEBLES W, XIE S N. Scalable diffusion models with transformers[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 4195-4205.
[19] LIU L P, REN Y, LIN Z J, et al. Pseudo numerical methods for diffusion models on manifolds[EB/OL]. (2022-02-18) [2024-05-31]. https://arxiv.org/abs/2202.09778.
[20] BAO F, LI C X, ZHU J, et al. Analytic-DPM: an analytic estimate of the optimal reverse variance in diffusion probabilistic models[EB/OL]. (2022-01-16) [2024-05-31]. https://arxiv.org/abs/2201.06503.
[21] RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. (2022-04-21) [2024-05-31]. https://arxiv.org/abs/2204.06125.
[22] SAHARIA C, CHAN W, SAXENA S, et al. Photorealistic text-to-image diffusion models with deep language understanding[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 2643.
[23] SCHUHMANN C, BEAUMONT R, VENCU R, et al. LAION-5B: an open large-scale dataset for training next generation image-text models[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 1833.
[24] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2024-05-29]. https://dblp.uni-trier.de/db/conf/icml/icml2021.html#RadfordKHRGASAM21.
[25] MCINNES L, HEALY J, MELVILLE J. UMAP: uniform manifold approximation and projection for dimension reduction[EB/OL]. (2018-02-09) [2024-04-31]. https://arxiv.org/abs/1802.03426.
[26] LIANG W X, ZHANG Y H, KWON Y, et al. Mind the gap: understanding the modality gap in multi-modal contrastive representation learning[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 1280.
[27] RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. The Journal of Machine Learning Research, 2020, 21(1): 140.
[28] ZHOU Y F, LIU B C, ZHU Y Z, et al. Shifted diffusion for text-to-image generation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10157-10166.
[29] SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs[C]// The 30th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2016: 2234-2242.
[30] IKEUCHI K. Computer vision: a reference guide[M]. 2nd ed. Cham: Springer, 2021: 40.
[31] HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]// The 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6629-6640.
[32] HE Z, GUO J, ZHANG Y, et al. Text2Mat: generating materials from text[EB/OL]. [2024-05-29]. https://diglib.eg.org/items/0216dc7e-da3d-4305-8698-fd0463337316.
[33] SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 2818-2826.