Journal of Graphics, 2025, Vol. 46, Issue (5): 980-989. DOI: 10.11996/JG.j.2095-302X.2025050980
Received: 2024-12-11
Accepted: 2025-02-20
Published: 2025-10-30
Online: 2025-09-10
Corresponding author: CHEN Bin (1973-), male, professor, Ph.D. His main research interest is virtual geographic environments. E-mail: gischen@pku.edu.cn
First author: YE Wenlong (2000-), male, master's student. His main research interest is diffusion models. E-mail: 2397726787@qq.com
YE Wenlong1,3, CHEN Bin2,3
Abstract: Panoramic images convey the overall information of the surrounding environment and have become an important representation for constructing virtual scenes. However, amid the rise of artificial intelligence generated content (AIGC), in particular diffusion models trained on large-scale text-image datasets and parameter-efficient fine-tuning (PEFT) techniques, research on the generation and rapid transfer of panoramic images remains insufficient. To address the scarcity of panoramic image datasets and the spatial distortion of panoramic imagery, an open-source dataset of 14 000 panoramic images was collected, then refined with text annotation and filtering via projection transformation. On this basis, the PanoLoRA method is proposed. Alongside the original convolution and self-attention modules that extract spatial features, it adds sphere convolution and LoRA modules to explicitly extract the spherical features of panoramic images and fuse them with the original planar features, thereby retaining the strong text-to-image generation capability of Stable Diffusion while achieving efficient transfer learning for panoramic image generation. Experimental results show that, on the collected text-panorama dataset, PanoLoRA outperforms five recent parameter-efficient fine-tuning methods across the board, improving image quality and text-image consistency; a series of ablation experiments verifies the effectiveness of each module.
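The fusion described in the abstract, a frozen pretrained path plus a planar LoRA branch plus a spherical branch, can be illustrated with a minimal numpy sketch. All dimensions are hypothetical, 1-D feature vectors stand in for convolutional feature maps, and `x_sph` is a placeholder for spherically resampled features; this is a sketch of the general idea, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_branch(x, A, B, scale):
    """Low-rank update: scale * (B @ A) @ x, trained while W stays frozen."""
    return scale * (B @ (A @ x))

# Frozen pretrained weight (stands in for a Stable Diffusion conv/attention weight).
d_in, d_out, r = 64, 64, 8
W = rng.normal(size=(d_out, d_in))      # frozen
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, rank r
B = np.zeros((d_out, r))                # trainable, zero-init so training starts at W

# A second low-rank pair standing in for the sphere-convolution branch,
# which sees latitude-corrected (spherically sampled) features.
A_sph = rng.normal(size=(r, d_in)) * 0.01
B_sph = np.zeros((d_out, r))

x = rng.normal(size=(d_in,))      # planar features
x_sph = rng.normal(size=(d_in,))  # spherically sampled features (placeholder)

# Fusion: frozen planar path + planar LoRA + spherical LoRA.
y = W @ x + lora_branch(x, A, B, 1.0) + lora_branch(x_sph, A_sph, B_sph, 1.0)

# With B zero-initialized, the adapted layer starts identical to the frozen one.
assert np.allclose(y, W @ x)
```

Zero-initializing the `B` matrices is the standard LoRA trick: the adapted model is exactly the pretrained model at step 0, so fine-tuning starts from the pretrained behavior rather than from noise.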
YE Wenlong, CHEN Bin. PanoLoRA: an efficient finetuning method for panoramic image generation based on Stable Diffusion[J]. Journal of Graphics, 2025, 46(5): 980-989.
Fig. 2 Sphere convolution ((a) Sphere convolution kernel; (b) Meridian circle and convolution kernel projection; (c) Elimination of spatial distortion)
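The distortion that sphere convolution compensates for can be sketched in numpy. This models only the horizontal 1/cos(latitude) stretch of the equirectangular projection, not the full spherical kernel sampling of SphereNet [14]; the resolution is an assumed example.

```python
import numpy as np

H, W = 512, 1024  # assumed equirectangular panorama size (height, width)

def horizontal_step_px(row):
    """Pixels a kernel must skip horizontally at a given image row to keep
    a constant angular footprint on the sphere (1/cos(latitude) stretch)."""
    lat = (0.5 - (row + 0.5) / H) * np.pi  # row 0 maps to +pi/2 (north pole)
    return 1.0 / max(np.cos(lat), 1e-6)

# Near the equator one pixel covers one "unit" of angle...
assert abs(horizontal_step_px(H // 2) - 1.0) < 0.01
# ...but near the poles the same angular footprint spans many pixels,
# which is why a fixed planar kernel sees heavily stretched content there.
assert horizontal_step_px(10) > 10
```

This latitude-dependent stretch is the reason a planar convolution, applied directly to an equirectangular image, sees very different effective receptive fields at the poles and at the equator.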
| Method | Params/M | FID (indoor) | FID (outdoor) | KID×1000 (indoor) | KID×1000 (outdoor) | CLIP score (indoor) | CLIP score (outdoor) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BitFit | 0.34 | 24.07 | 24.73 | 11.24 | 6.64 | 22.38 | 22.06 |
| Bias-Norm tuning | 0.44 | 21.36 | 24.44 | 8.37 | 6.94 | 22.60 | 22.15 |
| Adapter (dim=48) | 3.63 | 19.56 | 22.13 | 5.84 | 6.11 | 22.42 | 22.11 |
| LoRA (r=8) | 3.39 | 20.08 | 22.64 | 6.92 | 5.98 | 22.60 | 22.18 |
| Lycoris (r=2) | 2.85 | 19.62 | 22.48 | 6.08 | 5.71 | 22.66 | 22.30 |
| PanoLoRA (γ=64) | 3.14 | 18.63 | 20.97 | 5.26 | 4.81 | 22.66 | 22.30 |

Table 1 Quantitative evaluations on the test set
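KID, reported ×1000 in Table 1, is an unbiased estimate of the squared maximum mean discrepancy (MMD) with a cubic polynomial kernel [16]. A minimal numpy sketch on synthetic Gaussian features follows; real evaluations compute it on Inception features of real and generated images, which this sketch omits.

```python
import numpy as np

def poly_kernel(X, Y):
    """Cubic polynomial kernel used by KID: k(x, y) = (x.y / d + 1)^3."""
    d = X.shape[1]
    return (X @ Y.T / d + 1.0) ** 3

def kid(X, Y):
    """Unbiased MMD^2 estimate between feature sets X (real) and Y (generated)."""
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = poly_kernel(X, X), poly_kernel(Y, Y), poly_kernel(X, Y)
    # Diagonal terms are excluded for the unbiased within-set estimates.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 64))
close = rng.normal(size=(500, 64))           # same distribution as `real`
far = rng.normal(loc=0.5, size=(500, 64))    # shifted distribution

# Samples from the matching distribution score lower (closer to zero).
assert kid(real, close) < kid(real, far)
```

Unlike FID, the unbiased KID estimator has no bias from the sample size, which is why both metrics are commonly reported side by side as in Table 1.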
Fig. 5 Visual comparison between the state-of-the-art methods and PanoLoRA on three scene types from the test set ((a) Wild; (b) Urban; (c) Indoor)
| Module | Params/M | FID | KID×1000 | CLIP score |
| --- | --- | --- | --- | --- |
| PanoLoRA (default) | 3.14 | 19.80 | 5.03 | 22.48 |
| w/o Sphere LoRA | 3.28 | 30.36 | 14.48 | 22.12 |
| w/o SA Q/K LoRA | 3.17 | 21.10 | 6.03 | 22.38 |
| w/o sphere convolution | 3.15 | 20.98 | 5.91 | 22.46 |
| w/o channel merging | 3.15 | 20.40 | 5.45 | 22.48 |
| w/o weight copying | 3.14 | 20.39 | 5.11 | 22.43 |

Table 2 Ablation studies of each module
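The parameter counts reported in megaparameters above follow the usual LoRA bookkeeping: a rank-r adapter on a weight of shape (d_out, d_in) adds r·(d_in + d_out) trainable parameters. The layer list below is purely illustrative, not the actual set of Stable Diffusion layers adapted by any method in the tables.

```python
def lora_params(layers, r):
    """Trainable parameters added by rank-r LoRA over a list of
    (d_in, d_out) weight shapes: r * (d_in + d_out) per weight."""
    return sum(r * (d_in + d_out) for d_in, d_out in layers)

# Hypothetical set of adapted projections at three U-Net channel widths.
layers = [(320, 320)] * 24 + [(640, 640)] * 24 + [(1280, 1280)] * 36
total = lora_params(layers, r=8)
print(f"{total / 1e6:.2f} M trainable parameters")
```

Because the added count scales linearly with the rank, halving r roughly halves the adapter size, which is how the methods in the tables trade parameter budget against fidelity.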
[1] | ARGYRIOU L, ECONOMOU D, BOUKI V. Design methodology for 360° immersive video applications: the case study of a cultural heritage virtual tour[J]. Personal and Ubiquitous Computing, 2020, 24(6): 843-859. |
[2] | KITTEL A, LARKIN P, CUNNINGHAM I, et al. 360° virtual reality: a SWOT analysis in comparison to virtual reality[J]. Frontiers in Psychology, 2020, 11: 563474. |
[3] | SOMANATH G, KURZ D. HDR environment map estimation for real-time augmented reality[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 11293-11301. |
[4] | KINZIG C, CORTÉS I, FERNÁNDEZ C, et al. Real-time seamless image stitching in autonomous driving[C]// 2022 25th International Conference on Information Fusion. New York: IEEE Press, 2022: 1-8. |
[5] | WU S S, TANG H, JING X Y, et al. Cross-view panorama image synthesis[J]. IEEE Transactions on Multimedia, 2022, 25: 3546-3559. |
[6] | FENG M Y, LIU J L, CUI M M, et al. Diffusion360: seamless 360 degree panoramic image generation based on diffusion models[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2311.13141. |
[7] | HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[C]// The 34th International Conference on Neural Information Processing Systems. New York: ACM, 2020: 574. |
[8] | ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 10674-10685. |
[9] | ZAKEN E B, GOLDBERG Y, RAVFOGEL S. BitFit: simple parameter-efficient fine-tuning for transformer-based masked language-models[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2106.10199. |
[10] | HU E J, SHEN Y L, WALLIS P, et al. LoRA: low-rank adaptation of large language models[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2106.09685. |
[11] | HOULSBY N, GIURGIU A, JASTRZEBSKI S, et al. Parameter-efficient transfer learning for NLP[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/1902.00751. |
[12] | YEH S Y, HSIEH Y G, GAO Z D, et al. Navigating text-to-image customization: from LyCORIS fine-tuning to model evaluation[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2309.14859 |
[13] | TANG N Y, FU M H, ZHU K, et al. Low-rank attention side-tuning for parameter-efficient fine-tuning[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2402.04009. |
[14] | COORS B, CONDURACHE A P, GEIGER A. SphereNet: learning spherical representations for detection and classification in omnidirectional images[C]// The 15th European Conference on Computer Vision. Cham: Springer, 2018: 525-541. |
[15] | HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6629-6640. |
[16] | BIŃKOWSKI M, SUTHERLAND D J, ARBEL M, et al. Demystifying MMD GANs[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/1801.01401. |
[17] | HESSEL J, HOLTZMAN A, FORBES M, et al. CLIPScore: a reference-free evaluation metric for image captioning[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2104.08718. |
[18] | RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. [2024-12-01]. https://3dvar.com/Ramesh2022Hierarchical.pdf. |
[19] | SAHARIA C, HO J, CHAN W, et al. Image super-resolution via iterative refinement[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(4): 4713-4726. |
[20] | BROOKS T, PEEBLES B, HOMES C, et al. Video generation models as world simulators[EB/OL]. [2024-12-01]. https://openai.com/research/video-generation-models-as-world-simulators. |
[21] | ZHANG J, CUI W S, ZHANG R H, et al. A text-driven 3D scene editing method based on key views[J]. Journal of Graphics, 2024, 45(4): 834-844 (in Chinese). |
[22] | WANG J, WANG S, JIANG Z W, et al. Zero-shot text-driven avatar generation based on depth-conditioned diffusion model[J]. Journal of Graphics, 2023, 44(6): 1218-1226 (in Chinese). |
[23] | SONG Y, SOHL-DICKSTEIN J, KINGMA D P, et al. Score-based generative modeling through stochastic differential equations[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2011.13456. |
[24] | SONG J M, MENG C L, ERMON S. Denoising diffusion implicit models[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2010.02502. |
[25] | DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[C]// The 35th International Conference on Neural Information Processing Systems. New York: ACM, 2021: 672. |
[26] | AKIMOTO N, MATSUO Y, AOKI Y. Diverse plausible 360-degree image outpainting for efficient 3DCG background creation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 11431-11440. |
[27] | DASTJERDI M R K, HOLD-GEOFFROY Y, EISENMANN J, et al. Guided co-modulated GAN for 360° field of view extrapolation[C]// 2022 International Conference on 3D Vision. New York: IEEE Press, 2022: 475-485. |
[28] | WU T H, ZHENG C X, CHAM T J. IPO-LDM: depth-aided 360-degree indoor RGB panorama outpainting via latent diffusion model[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2307.03177v1. |
[29] | CHEN Z X, WANG G C, LIU Z W. Text2Light: zero-shot text-driven HDR panorama generation[J]. ACM Transactions on Graphics (TOG), 2022, 41(6): 195. |
[30] | ESSER P, ROMBACH R, OMMER B. Taming transformers for high-resolution image synthesis[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 12868-12878. |
[31] | TANG S T, ZHANG F Y, CHEN J C, et al. MVDiffusion: enabling holistic multi-view image generation with correspondence-aware diffusion[C]// The 37th International Conference on Neural Information Processing Systems. New York: ACM, 2023: 2229. |
[32] | RUIZ N, LI Y Z, JAMPANI V, et al. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 22500-22510. |
[33] | ACHIAM J, ADLER S, AGARWAL S, et al. GPT-4 technical report[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2303.08774. |
[34] | ESSER P, KULAL S, BLATTMANN A, et al. Scaling rectified flow transformers for high-resolution image synthesis[C]// The 41st International Conference on Machine Learning. New York: ACM, 2024: 503. |
[35] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. [2024-12-01]. https://dblp.org/db/conf/iclr/iclr2021.html#DosovitskiyB0WZ21. |
[36] | ZHANG R R, HAN J M, LIU C, et al. LLaMA-adapter: efficient fine-tuning of language models with zero-init attention[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2303.16199. |
[37] | KINGMA D P, WELLING M. Auto-encoding variational Bayes[EB/OL]. [2024-12-01]. https://dblp.org/db/conf/iclr/iclr2014.html#KingmaW13. |
[38] | RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2103.00020. |
[39] | SIFRE L, MALLAT S. Rigid-motion scattering for texture classification[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/1403.1687. |
[40] | ZHENG J, ZHANG J F, LI J, et al. Structured3d: a large photo-realistic dataset for structured 3D modeling[C]// The 16th European Conference on Computer Vision. Cham: Springer, 2020: 519-535. |
[41] | YANG W Y, QIAN Y L, KÄMÄRÄINEN J K, et al. Object detection in equirectangular panorama[C]// The 24th International Conference on Pattern Recognition. New York: IEEE Press, 2018: 2190-2195. |
[42] | CIRIK V, BERG-KIRKPATRICK T, MORENCY L P. Refer360°: a referring expression recognition dataset in 360° images[C]// The 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 7189-7202. |
[43] | DENG X, WANG H, XU M, et al. LAU-Net: latitude adaptive upscaling network for omnidirectional image super-resolution[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 9185-9194. |
[44] | ZHANG Y D, SONG S R, TAN P, et al. PanoContext: a whole-room 3D context model for panoramic scene understanding[C]// The 13th European Conference on Computer Vision. Cham: Springer, 2014: 668-686. |
[45] | CAO M D, MOU C, YU F H, et al. NTIRE 2023 challenge on 360° omnidirectional image and video super-resolution: datasets, methods and results[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 1731-1745. |
[46] | ORHAN S, BASTANLAR Y. Semantic segmentation of outdoor panoramic images[J]. Signal, Image and Video Processing, 2022, 16(3): 643-650. |
[47] | CHANG S H, CHIU C Y, CHANG C S, et al. Generating 360 outdoor panorama dataset with reliable sun position estimation[C]// SIGGRAPH Asia 2018 Posters. New York: ACM, 2018: 22. |
[48] | LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2201.12086. |
[49] | AKIMOTO N, KASAI S, HAYASHI M, et al. 360-degree image completion by two-stage conditional gans[C]// 2019 IEEE International Conference on Image Processing. New York: IEEE Press, 2019: 4704-4708. |
[50] | HO J, SALIMANS T. Classifier-free diffusion guidance[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2207.12598. |
[51] | LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/1711.05101. |
[52] | LIU L P, REN Y, LIN Z J, et al. Pseudo numerical methods for diffusion models on manifolds[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2202.09778. |
[53] | SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 2818-2826. |