Journal of Graphics ›› 2025, Vol. 46 ›› Issue (1): 139-149.DOI: 10.11996/JG.j.2095-302X.2025010139
• Computer Graphics and Virtual Reality •
TU Qinghao, LI Yuanqi, LIU Yifan, GUO Jie, GUO Yanwen
Received: 2024-07-29
Accepted: 2024-09-18
Online: 2025-02-28
Published: 2025-02-14
Contact: GUO Jie
About author: TU Qinghao (1999-), master student. Her main research interests include digital image processing and pattern recognition. E-mail: qinghaotu@126.com
TU Qinghao, LI Yuanqi, LIU Yifan, GUO Jie, GUO Yanwen. Generalization optimization method for text to material texture maps based on diffusion model[J]. Journal of Graphics, 2025, 46(1): 139-149.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2025010139
Fig. 8 Multimodal dataset ((a) Wood tiles with herringbone and zigzagged pattern; (b) White tiles with black grid and small flower pattern; (c) Tiles with blue and white checkboard pattern; (d) Red black and white fabric with diamond and checkboard pattern; (e) Brown reflective shiny marble with cracked pattern; (f) Dark brown dirty gravel with pitted pattern; (g) Cliff rock with stratified pattern; (h) Brown old brick wall with staggered pattern)
| Model | Material type | IS↑ | FID↓ |
|---|---|---|---|
| Text2Mat | Fabric | 3.23 | 91.45 |
| | Ceramic tile | 7.49 | 69.06 |
| | Floor tile | 1.92 | 76.29 |
| | Wood plank | 3.17 | 49.38 |
| | Rock | 2.50 | 43.33 |
| | Mean over all test data | 4.02 | 75.31 |
| Ours | Fabric | 4.08 | 62.86 |
| | Ceramic tile | 10.14 | 73.22 |
| | Floor tile | 6.45 | 55.13 |
| | Wood plank | 2.92 | 32.18 |
| | Rock | 1.48 | 47.94 |
| | Mean over all test data | 4.63 | 64.73 |
Table 1 Quantitative analysis results
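The IS↑ and FID↓ columns in Table 1 follow the standard definitions of the Inception Score [29] and the Fréchet Inception Distance [31]. A minimal sketch of both metrics, assuming Inception-v3 class probabilities and feature statistics have already been extracted (the function names and inputs below are illustrative, not from the paper):

```python
import numpy as np
from scipy.linalg import sqrtm


def inception_score(probs):
    """IS from an (N, K) matrix of per-image class probabilities.

    exp of the mean KL divergence between each conditional p(y|x)
    and the marginal p(y); higher is better (IS↑ in Table 1).
    """
    eps = 1e-12
    p_y = probs.mean(axis=0)  # marginal class distribution
    kl = probs * (np.log(np.clip(probs, eps, None)) - np.log(p_y))
    return float(np.exp(kl.sum(axis=1).mean()))


def frechet_inception_distance(mu1, sigma1, mu2, sigma2):
    """FID between two Gaussians fitted to Inception features.

    mu*: mean feature vectors; sigma*: feature covariances.
    Lower is better (FID↓ in Table 1).
    """
    diff = mu1 - mu2
    # Matrix square root of the covariance product; drop the tiny
    # imaginary parts that numerical error can introduce.
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

For identical feature statistics the FID is zero, and for perfectly confident, evenly spread class predictions the IS equals the number of classes, which matches the usual sanity checks for these metrics.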
Fig. 9 Comparison of results between our work and Text2Mat/Polycam ((a) Black and blue checkboard tiled tiles; (b) Dirty ground; (c) Dark brown leather with diamond pattern; (d) Arc paved pavement; (e) White and purple ceramic with chequered pattern; (f) Clean red smooth tiles with I-shaped pattern; (g) Shiny silver metal; (h) Black fabric; (i) Leather)
Fig. 11 Results generated by the proposed method using different formats of text descriptions ((a) Yellow tiles chequered; (b) Yellow tiles with chequered pattern; (c) Yellow and brown tiles arranged in chequered pattern; (d) A tile texture with brown chequered pattern; (e) A tile texture which has yellow and brown chequered and camouflage pattern)
[1] | ZHOU Z M, CHEN G J, DONG Y, et al. Sparse-as-possible SVBRDF acquisition[J]. ACM Transactions on Graphics (TOG), 2016, 35(6): 189. |
[2] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010. |
[3] | DONG Y, WANG J P, TONG X, et al. Manifold bootstrapping for SVBRDF capture[J]. ACM Transactions on Graphics (TOG), 2010, 29(4): 98. |
[4] | DESCHAINTRE V, AITTALA M, DURAND F, et al. Single-image SVBRDF capture with a rendering-aware deep network[J]. ACM Transactions on Graphics (TOG), 2018, 37(4): 128. |
[5] | GAO D, LI X, DONG Y, et al. Deep inverse rendering for high-resolution SVBRDF estimation from an arbitrary number of images[J]. ACM Transactions on Graphics (TOG), 2019, 38(4): 134. |
[6] | GUO J, LAI S C, TAO C Z, et al. Highlight-aware two-stream network for single-image SVBRDF acquisition[J]. ACM Transactions on Graphics (TOG), 2021, 40(4): 123. |
[7] | ZHOU X L, KALANTARI N K. Adversarial single‐image SVBRDF estimation with hybrid training[J]. Computer Graphics Forum, 2021, 40(2): 315-325. |
[8] | HENZLER P, DESCHAINTRE V, MITRA N J, et al. Generative modelling of BRDF textures from flash images[J]. ACM Transactions on Graphics (TOG), 2021, 40(6): 284. |
[9] | HU Y W, DORSEY J, RUSHMEIER H. A novel framework for inverse procedural texture modeling[J]. ACM Transactions on Graphics (TOG), 2019, 38(6): 186. |
[10] | SHI L, LI B C, HAŠAN M, et al. Match: differentiable material graphs for procedural material capture[J]. ACM Transactions on Graphics (TOG), 2020, 39(6): 196. |
[11] | GUO Y, SMITH C, HAŠAN M, et al. MaterialGAN: reflectance capture using a generative SVBRDF model[J]. ACM Transactions on Graphics (TOG), 2020, 39(6): 254. |
[12] | ZHOU X L, HASAN M, DESCHAINTRE V, et al. TileGen: tileable, controllable material generation and capture[C]// The SIGGRAPH Asia 2022 Conference. New York: ACM, 2022: 34. |
[13] | GUERRERO P, HAŠAN M, SUNKAVALLI K, et al. MatFormer: a generative model for procedural materials[J]. ACM Transactions on Graphics (TOG), 2022, 41(4): 46. |
[14] | SOHL-DICKSTEIN J, WEISS E A, MAHESWARANATHAN N, et al. Deep unsupervised learning using nonequilibrium thermodynamics[EB/OL]. [2024-05-29]. https://dl.acm.org/doi/10.5555/3045118.3045358. |
[15] | HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[C]// The 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 574. |
[16] | ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 10684-10695. |
[17] | RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]// The 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2015: 234-241. |
[18] | PEEBLES W, XIE S N. Scalable diffusion models with transformers[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 4195-4205. |
[19] | LIU L P, REN Y, LIN Z J, et al. Pseudo numerical methods for diffusion models on manifolds[EB/OL]. (2022-02-18) [2024-05-31]. https://arxiv.org/abs/2202.09778. |
[20] | BAO F, LI C X, ZHU J, et al. Analytic-DPM: an analytic estimate of the optimal reverse variance in diffusion probabilistic models[EB/OL]. (2022-01-16) [2024-05-31]. https://arxiv.org/abs/2201.06503. |
[21] | RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. (2022-04-21) [2024-05-31]. https://arxiv.org/abs/2204.06125. |
[22] | SAHARIA C, CHAN W, SAXENA S, et al. Photorealistic text-to-image diffusion models with deep language understanding[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 2643. |
[23] | SCHUHMANN C, BEAUMONT R, VENCU R, et al. LAION-5B: an open large-scale dataset for training next generation image-text models[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 1833. |
[24] | RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2024-05-29]. https://dblp.uni-trier.de/db/conf/icml/icml2021.html#RadfordKHRGASAM21. |
[25] | MCINNES L, HEALY J, MELVILLE J. UMAP: uniform manifold approximation and projection for dimension reduction[EB/OL]. (2018-02-09) [2024-04-31]. https://arxiv.org/abs/1802.03426. |
[26] | LIANG W X, ZHANG Y H, KWON Y, et al. Mind the gap: understanding the modality gap in multi-modal contrastive representation learning[C]// The 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 1280. |
[27] | RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. The Journal of Machine Learning Research, 2020, 21(1): 140. |
[28] | ZHOU Y F, LIU B C, ZHU Y Z, et al. Shifted diffusion for text-to-image generation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10157-10166. |
[29] | SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs[C]// The 30th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2016: 2234-2242. |
[30] | IKEUCHI K. Computer vision: a reference guide[M]. 2nd ed. Cham: Springer, 2021: 40. |
[31] | HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]// The 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6629-6640. |
[32] | HE Z, GUO J, ZHANG Y, et al. Text2Mat: generating materials from text[EB/OL]. [2024-05-29]. https://diglib.eg.org/items/0216dc7e-da3d-4305-8698-fd0463337316. |
[33] | SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 2818-2826. |