Image to 3D vase generation technology combining procedural content generation and diffusion models

doi:10.11996/JG.j.2095-302X.2025020332

Abstract

In traditional manual 3D content production, meshes and textures are the foundational elements of a 3D asset. To improve visual quality and rendering performance, meshes are typically built from quadrilateral faces and require clean topology and UV mapping, while textures must conform to the geometry and remain globally consistent. Current 3D content generation methods based on latent diffusion models fail to meet these standards, which limits their practical use. Procedural content generation, by contrast, is widely used in the gaming and architecture industries because it can systematically produce large numbers of 3D assets that conform to industry best practices. To improve the usability of generated assets, an integrated solution combining procedural content generation with diffusion models is proposed. Taking the vase, a 3D rotational body, as an example, image-to-3D asset generation is divided into two tasks: 3D mesh reconstruction and 3D texture generation. For 3D mesh reconstruction, a new vase generation program was developed, and a deep neural network was trained to learn the mapping from image features to procedural parameters, enabling reconstruction of a 3D model from a 2D image. For 3D texture generation, a new two-stage texturing strategy was introduced that combines multi-view image synthesis with multi-view consistency sampling to produce high-quality, globally coherent texture maps. In summary, a scheme for automatically constructing 3D vase assets from images is presented; it can be generalized to other 3D rotational bodies and holds promise for generating other types of 3D content.
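The abstract does not describe the internals of the vase generation program, so the following is only a minimal sketch of the general idea behind a procedural rotational body: a 2D side profile is revolved around the vertical axis to yield an all-quad mesh with a rectangular UV layout. The function name `revolve_profile` and its parameters `profile` and `segments` are hypothetical and are not taken from the paper.

```python
import math


def revolve_profile(profile, segments=32):
    """Revolve a side profile [(radius, height), ...] around the vertical axis.

    Hypothetical helper for illustration only. Returns (vertices, uvs, quads):
    vertices are (x, y, z) tuples, uvs are (u, v) tuples in [0, 1], and quads
    are 4-tuples of vertex indices.
    """
    if len(profile) < 2:
        raise ValueError("profile needs at least two points")
    if segments < 3:
        raise ValueError("segments must be at least 3")

    rows = len(profile)
    cols = segments + 1  # one extra column duplicates the seam for clean UVs

    vertices, uvs = [], []
    for i, (radius, height) in enumerate(profile):
        v = i / (rows - 1)  # v runs along the profile, bottom to top
        for j in range(cols):
            u = j / segments  # u runs around the axis
            theta = 2.0 * math.pi * u
            vertices.append(
                (radius * math.cos(theta), height, radius * math.sin(theta))
            )
            uvs.append((u, v))

    # Each grid cell between adjacent rows and columns becomes one quad.
    quads = []
    for i in range(rows - 1):
        for j in range(segments):
            a = i * cols + j
            quads.append((a, a + 1, a + 1 + cols, a + cols))
    return vertices, uvs, quads
```

Calling `revolve_profile([(1.0, 0.0), (1.5, 1.0), (0.8, 2.0), (1.0, 3.0)], segments=8)` produces a 4 x 9 vertex grid and 24 quads. Because the mesh is fully determined by a handful of profile parameters, a network only has to predict those parameters rather than raw geometry, which is the appeal of the procedural approach described above.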

Key words: diffusion models, procedural content generation, 3D reconstruction, texture generation, deep learning

CLC Number:

TP391

SUN Heyi, LI Yixiao, TIAN Xi, ZHANG Songhai. Image to 3D vase generation technology combining procedural content generation and diffusion models[J]. Journal of Graphics, 2025, 46(2): 332-344.

Figures/Tables 10

Fig. 1 Structure of the vase generation program

Fig. 2 Illustration of curve self-intersection optimization ((a) Self-intersection occurs in the side profile curve of the vase; (b) Calculation of the intersection point (PC) position; (c) Adjustment of the value range of parameter a to eliminate self-intersection)

Fig. 3 Illustration of UV layout optimization ((a) Split line detection; (b) Initial UV layout; (c) Optimized UV layout)

Fig. 4 Illustration of the texture generation process

Fig. 5 Illustration of prediction results at different sampling steps

Table 1 Comparison of reconstruction accuracy

Method	Chamfer Distance
Meshy	0.0203
TripoSR	0.0152
Ours	0.0098
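For readers unfamiliar with the metric in Table 1, the sketch below shows one common form of the Chamfer Distance between two point sets: the mean squared nearest-neighbor distance in each direction, summed. This excerpt does not state which variant, normalization, or point-sampling scheme the paper uses, so the definition here is an assumption, and the function name `chamfer_distance` is illustrative. Lower values indicate closer agreement between the reconstructed and reference shapes.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two point sets (lists of tuples).

    Assumed variant: mean squared nearest-neighbor distance from A to B
    plus the same from B to A. Brute force, O(len(A) * len(B)).
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def squared_distance(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def mean_nearest(src, dst):
        # Average, over src, of the squared distance to the closest dst point.
        return sum(min(squared_distance(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)
```

In practice the two point sets would be sampled from the surfaces of the reconstructed and ground-truth meshes, and a spatial index such as a k-d tree would replace the brute-force nearest-neighbor search.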

Fig. 6 Comparison of reconstruction results ((a) Input image; (b) Ours; (c) TripoSR; (d) Meshy)

Fig. 7 Comparison of texture generation results ((a) Ours; (b) TEXTure; (c) Meshy)

Table 2 Comparison of texture generation results and efficiency

Method	FID	Generation time/s
TEXTure [23]	72.45	98
Meshy	77.81	120
Ours (coarse texture)	79.12	5
Ours (fine texture)	61.20	85
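As background for the FID column in Table 2, the standard definition of the Fréchet Inception Distance is given below. The symbols are the usual ones rather than notation taken from the paper: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of Inception features for the reference and generated image sets. This excerpt does not state which reference set was used. Lower values indicate that the generated distribution is closer to the reference.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```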

Fig. 8 Results of user study

References 33

[1]	HENDRIKX M, MEIJER S, VAN DER VELDEN J, et al. Procedural content generation for games: a survey[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2013, 9(1): 1.
[2]	PEARL O, LANG I, HU Y H, et al. GeoCode: interpretable shape programs[EB/OL]. [2024-06-18]. https://arxiv.org/abs/2212.11715.
[3]	LONG X X, GUO Y C, LIN C, et al. Wonder3D: single image to 3D using cross-domain diffusion[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 9970-9980.
[4]	KIM J, KOO J, YEO K, et al. SyncTweedies: a general generative framework based on synchronized diffusions[EB/OL]. [2024-06-18]. https://arxiv.org/abs/2403.14370.
[5]	SU H, HUANG Q X, MITRA N J, et al. Estimating image depth using shape collections[J]. ACM Transactions on Graphics, 2014, 33(4): 37.
[6]	KAR A, TULSIANI S, CARREIRA J, et al. Category-specific object reconstruction from a single image[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 1966-1974.
[7]	HUANG Q X, WANG H, KOLTUN V. Single-view reconstruction via joint analysis of image and shape collections[J]. ACM Transactions on Graphics, 2015, 34(4): 87.
[8]	SUN X Y, WU J J, ZHANG X M, et al. Pix3D: dataset and methods for single-image 3D shape modeling[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 2974-2983.
[9]	CHOY C B, XU D F, GWAK J, et al. 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction[C]// The 14th European Conference on Computer Vision. Cham: Springer, 2016: 628-644.
[10]	XIE H Z, YAO H X, SUN X S, et al. Pix2Vox: context-aware 3D reconstruction from single and multi-view images[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 2690-2698.
[11]	LIU R S, WU R D, VAN HOORICK B, et al. Zero-1-to-3: zero-shot one image to 3D object[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 9298-9309.
[12]	SHI R X, CHEN H S, ZHANG Z Y, et al. Zero123++: a single image to consistent multi-view diffusion base model[EB/OL]. [2024-06-18]. https://arxiv.org/abs/2310.15110.
[13]	LIN Y K, HAN H N, GONG C Q, et al. Consistent123: one image to highly consistent 3D asset using case-aware diffusion priors[C]// The 32nd ACM International Conference on Multimedia. New York: ACM, 2024: 6715-6724.
[14]	CHEN H S, SHI R X, LIU Y L, et al. Generic 3D diffusion adapter using controlled multi-view editing[EB/OL]. [2024-06-18]. https://arxiv.org/abs/2403.12032.
[15]	VOLETI V, YAO C H, BOSS M, et al. SV3D: novel multi-view synthesis and 3D generation from a single image using latent video diffusion[C]// The 18th European Conference on Computer Vision. Cham: Springer, 2025: 439-457.
[16]	BEKINS D, ALIAGA D G. Build-by-number: rearranging the real world to visualize novel architectural spaces[C]// 2005 VIS 05. IEEE Visualization. New York: IEEE Press, 2005: 143-150.
[17]	MÜLLER P, ZENG G, WONKA P, et al. Image-based procedural modeling of facades[J]. ACM Transactions on Graphics, 2007, 26(3): 85-es.
[18]	ZHOU Y C, QI H Z, ZHAI Y X, et al. Learning to reconstruct 3D Manhattan wireframes from a single image[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 7698-7707.
[19]	LI C J, PAN H, BOUSSEAU A, et al. Sketch2CAD: sequential CAD modeling by sketching in context[J]. ACM Transactions on Graphics, 2020, 39(6): 164.
[20]	LI C J, PAN H, BOUSSEAU A, et al. Free2CAD: parsing freehand drawings into CAD commands[J]. ACM Transactions on Graphics, 2022, 41(4): 93.
[21]	YU X, DAI P, LI W B, et al. Texture generation on 3D meshes with point-UV diffusion[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 4206-4216.
[22]	ZENG X F, CHEN X, QI Z Q, et al. Paint3D: paint anything 3D with lighting-less texture diffusion models[EB/OL]. [2024-06-18]. https://arxiv.org/abs/2312.13913.
[23]	RICHARDSON E, METZER G, ALALUF Y, et al. TEXTure: text-guided texturing of 3D shapes[C]// 2023 ACM SIGGRAPH Conference Proceedings. New York: ACM, 2023: 54.
[24]	CHEN D Z, SIDDIQUI Y, LEE H Y, et al. Text2Tex: text-driven texture synthesis via diffusion models[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 18558-18568.
[25]	TANG J X, LU R J, CHEN X K, et al. InTeX: interactive text-to-texture synthesis via unified depth-aware inpainting[EB/OL]. [2024-06-18]. https://arxiv.org/abs/2403.11878.
[26]	SHI Y C, WANG P, YE J L, et al. MVDream: multi-view diffusion for 3D generation[EB/OL]. [2024-06-18]. https://arxiv.org/abs/2308.16512.
[27]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[28]	BHAT S F, BIRKL R, WOFK D, et al. ZoeDepth: zero-shot transfer by combining relative and metric depth[EB/OL]. [2024-06-18]. https://arxiv.org/abs/2302.12288.
[29]	HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[C]// The 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 574.
[30]	SONG J, MENG C L, ERMON S. Denoising diffusion implicit models[EB/OL]. [2024-06-18]. https://arxiv.org/abs/2010.02502.
[31]	LUGMAYR A, DANELLJAN M, ROMERO A, et al. RePaint: inpainting using denoising diffusion probabilistic models[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 11461-11471.
[32]	AVRAHAMI O, FRIED O, LISCHINSKI D. Blended latent diffusion[J]. ACM Transactions on Graphics, 2023, 42(4): 149.
[33]	LIU Y X, XIE M S, LIU H Y, et al. Text-guided texturing by synchronized multi-view diffusion[C]// 2024 SIGGRAPH Asia Conference Papers. New York: ACM, 2024: 60.