Journal of Graphics ›› 2025, Vol. 46 ›› Issue (5): 980-989.DOI: 10.11996/JG.j.2095-302X.2025050980
• Image Processing and Computer Vision •
YE Wenlong1,3, CHEN Bin2,3
Received: 2024-12-11
Accepted: 2025-02-20
Online: 2025-10-30
Published: 2025-09-10
Contact: CHEN Bin
About author: YE Wenlong (2000-), master student. His main research interests cover diffusion models. E-mail: 2397726787@qq.com
YE Wenlong, CHEN Bin. PanoLoRA: an efficient finetuning method for panoramic image generation based on Stable Diffusion[J]. Journal of Graphics, 2025, 46(5): 980-989.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2025050980
| Method | Params/M | FID (Indoor) | FID (Outdoor) | KID×1000 (Indoor) | KID×1000 (Outdoor) | CLIP score (Indoor) | CLIP score (Outdoor) |
|---|---|---|---|---|---|---|---|
| BitFit | 0.34 | 24.07 | 24.73 | 11.24 | 6.64 | 22.38 | 22.06 |
| Bias-Norm tuning | 0.44 | 21.36 | 24.44 | 8.37 | 6.94 | 22.60 | 22.15 |
| Adapter (dim=48) | 3.63 | 19.56 | 22.13 | 5.84 | 6.11 | 22.42 | 22.11 |
| LoRA (r=8) | 3.39 | 20.08 | 22.64 | 6.92 | 5.98 | 22.60 | 22.18 |
| Lycoris (r=2) | 2.85 | 19.62 | 22.48 | 6.08 | 5.71 | 22.66 | 22.30 |
| PanoLoRA (γ=64) | 3.14 | 18.63 | 20.97 | 5.26 | 4.81 | 22.66 | 22.30 |

Table 1 Quantitative evaluations on the test set
Fig. 5 Comparison of visualization results of 3 kinds of scenes on test set among the state-of-the-art methods and our PanoLoRA ((a) Wild; (b) Urban; (c) Indoor)
| Module | Params/M | FID | KID×1000 | CLIP score |
|---|---|---|---|---|
| PanoLoRA (default) | 3.14 | 19.80 | 5.03 | 22.48 |
| w/o Sphere LoRA | 3.28 | 30.36 | 14.48 | 22.12 |
| w/o SA Q/K LoRA | 3.17 | 21.10 | 6.03 | 22.38 |
| w/o spherical convolution | 3.15 | 20.98 | 5.91 | 22.46 |
| w/o channel merging | 3.15 | 20.40 | 5.45 | 22.48 |
| w/o weight copying | 3.14 | 20.39 | 5.11 | 22.43 |

Table 2 Ablation studies of each module
[1] | ARGYRIOU L, ECONOMOU D, BOUKI V. Design methodology for 360° immersive video applications: the case study of a cultural heritage virtual tour[J]. Personal and Ubiquitous Computing, 2020, 24(6): 843-859. |
[2] | KITTEL A, LARKIN P, CUNNINGHAM I, et al. 360° virtual reality: a SWOT analysis in comparison to virtual reality[J]. Frontiers in Psychology, 2020, 11: 563474. |
[3] | SOMANATH G, KURZ D. HDR environment map estimation for real-time augmented reality[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 11293-11301. |
[4] | KINZIG C, CORTÉS I, FERNÁNDEZ C, et al. Real-time seamless image stitching in autonomous driving[C]// 2022 25th International Conference on Information Fusion. New York: IEEE Press, 2022: 1-8. |
[5] | WU S S, TANG H, JING X Y, et al. Cross-view panorama image synthesis[J]. IEEE Transactions on Multimedia, 2022, 25: 3546-3559. |
[6] | FENG M Y, LIU J L, CUI M M, et al. Diffusion360: seamless 360 degree panoramic image generation based on diffusion models[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2311.13141. |
[7] | HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[C]// The 34th International Conference on Neural Information Processing Systems. New York: ACM, 2020: 574. |
[8] | ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 10674-10685. |
[9] | ZAKEN E B, GOLDBERG Y, RAVFOGEL S. BitFit: simple parameter-efficient fine-tuning for transformer-based masked language-models[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2106.10199. |
[10] | HU E J, SHEN Y L, WALLIS P, et al. LoRA: low-rank adaptation of large language models[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2106.09685. |
[11] | HOULSBY N, GIURGIU A, JASTRZEBSKI S, et al. Parameter-efficient transfer learning for NLP[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/1902.00751. |
[12] | YEH S Y, HSIEH Y G, GAO Z D, et al. Navigating text-to-image customization: from LyCORIS fine-tuning to model evaluation[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2309.14859 |
[13] | TANG N Y, FU M H, ZHU K, et al. Low-rank attention side-tuning for parameter-efficient fine-tuning[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2402.04009. |
[14] | COORS B, CONDURACHE A P, GEIGER A. SphereNet: learning spherical representations for detection and classification in omnidirectional images[C]// The 15th European Conference on Computer Vision. Cham: Springer, 2018: 525-541. |
[15] | HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6629-6640. |
[16] | BIŃKOWSKI M, SUTHERLAND D J, ARBEL M, et al. Demystifying MMD GANs[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/1801.01401. |
[17] | HESSEL J, HOLTZMAN A, FORBES M, et al. CLIPScore: a reference-free evaluation metric for image captioning[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2104.08718. |
[18] | RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. [2024-12-01]. https://3dvar.com/Ramesh2022Hierarchical.pdf. |
[19] | SAHARIA C, HO J, CHAN W, et al. Image super-resolution via iterative refinement[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(4): 4713-4726. |
[20] | BROOKS T, PEEBLES B, HOMES C, et al. Video generation models as world simulators[EB/OL]. [2024-12-01]. https://openai.com/research/video-generation-models-as-world-simulators. |
[21] | ZHANG J, CUI W S, ZHANG R H, et al. A text-driven 3D scene editing method based on key views[J]. Journal of Graphics, 2024, 45(4): 834-844 (in Chinese). |
[22] | WANG J, WANG S, JIANG Z W, et al. Zero-shot text-driven avatar generation based on depth-conditioned diffusion model[J]. Journal of Graphics, 2023, 44(6): 1218-1226 (in Chinese). |
[23] | SONG Y, SOHL-DICKSTEIN J, KINGMA D P, et al. Score-based generative modeling through stochastic differential equations[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2011.13456. |
[24] | SONG J M, MENG C L, ERMON S. Denoising diffusion implicit models[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2010.02502. |
[25] | DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[C]// The 35th International Conference on Neural Information Processing Systems. New York: ACM, 2021: 672. |
[26] | AKIMOTO N, MATSUO Y, AOKI Y. Diverse plausible 360-degree image outpainting for efficient 3DCG background creation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 11431-11440. |
[27] | DASTJERDI M R K, HOLD-GEOFFROY Y, EISENMANN J, et al. Guided co-modulated GAN for 360° field of view extrapolation[C]// 2022 International Conference on 3D Vision. New York: IEEE Press, 2022: 475-485. |
[28] | WU T H, ZHENG C X, CHAM T J. IPO-LDM: depth-aided 360-degree indoor RGB panorama outpainting via latent diffusion model[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2307.03177v1. |
[29] | CHEN Z X, WANG G C, LIU Z W. Text2Light: zero-shot text-driven HDR panorama generation[J]. ACM Transactions on Graphics (TOG), 2022, 41(6): 195. |
[30] | ESSER P, ROMBACH R, OMMER B. Taming transformers for high-resolution image synthesis[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 12868-12878. |
[31] | TANG S T, ZHANG F Y, CHEN J C, et al. MVDiffusion: enabling holistic multi-view image generation with correspondence-aware diffusion[C]// The 37th International Conference on Neural Information Processing Systems. New York: ACM, 2023: 2229. |
[32] | RUIZ N, LI Y Z, JAMPANI V, et al. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 22500-22510. |
[33] | ACHIAM J, ADLER S, AGARWAL S, et al. GPT-4 technical report[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2303.08774. |
[34] | ESSER P, KULAL S, BLATTMANN A, et al. Scaling rectified flow transformers for high-resolution image synthesis[C]// The 41st International Conference on Machine Learning. New York: ACM, 2024: 503. |
[35] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. [2024-12-01]. https://dblp.org/db/conf/iclr/iclr2021.html#DosovitskiyB0WZ21. |
[36] | ZHANG R R, HAN J M, LIU C, et al. LLaMA-adapter: efficient fine-tuning of language models with zero-init attention[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2303.16199. |
[37] | KINGMA D P, WELLING M. Auto-encoding variational Bayes[EB/OL]. [2024-12-01]. https://dblp.org/db/conf/iclr/iclr2014.html#KingmaW13. |
[38] | RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2103.00020. |
[39] | SIFRE L, MALLAT S. Rigid-motion scattering for texture classification[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/1403.1687. |
[40] | ZHENG J, ZHANG J F, LI J, et al. Structured3d: a large photo-realistic dataset for structured 3D modeling[C]// The 16th European Conference on Computer Vision. Cham: Springer, 2020: 519-535. |
[41] | YANG W Y, QIAN Y L, KÄMÄRÄINEN J K, et al. Object detection in equirectangular panorama[C]// The 24th International Conference on Pattern Recognition. New York: IEEE Press, 2018: 2190-2195. |
[42] | CIRIK V, BERG-KIRKPATRICK T, MORENCY L P. Refer360°: a referring expression recognition dataset in 360° images[C]// The 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 7189-7202. |
[43] | DENG X, WANG H, XU M, et al. LAU-Net: latitude adaptive upscaling network for omnidirectional image super-resolution[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 9185-9194. |
[44] | ZHANG Y D, SONG S R, TAN P, et al. PanoContext: a whole-room 3D context model for panoramic scene understanding[C]// The 13th European Conference on Computer Vision. Cham: Springer, 2014: 668-686. |
[45] | CAO M D, MOU C, YU F H, et al. NTIRE 2023 challenge on 360° omnidirectional image and video super-resolution: datasets, methods and results[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 1731-1745. |
[46] | ORHAN S, BASTANLAR Y. Semantic segmentation of outdoor panoramic images[J]. Signal, Image and Video Processing, 2022, 16(3): 643-650. |
[47] | CHANG S H, CHIU C Y, CHANG C S, et al. Generating 360 outdoor panorama dataset with reliable sun position estimation[C]// SIGGRAPH Asia 2018 Posters. New York: ACM, 2018: 22. |
[48] | LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2201.12086. |
[49] | AKIMOTO N, KASAI S, HAYASHI M, et al. 360-degree image completion by two-stage conditional GANs[C]// 2019 IEEE International Conference on Image Processing. New York: IEEE Press, 2019: 4704-4708. |
[50] | HO J, SALIMANS T. Classifier-free diffusion guidance[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2207.12598. |
[51] | LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/1711.05101. |
[52] | LIU L P, REN Y, LIN Z J, et al. Pseudo numerical methods for diffusion models on manifolds[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2202.09778. |
[53] | SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 2818-2826. |