Journal of Graphics, 2025, Vol. 46, Issue (5): 980-989. DOI: 10.11996/JG.j.2095-302X.2025050980
Received: 2024-12-11
Accepted: 2025-02-20
Published: 2025-10-30
Online: 2025-09-10
Corresponding author: CHEN Bin (1973-), male, professor, Ph.D. His main research interest is virtual geographic environments. E-mail: gischen@pku.edu.cn
First author: YE Wenlong (2000-), male, master's student. His main research interest is diffusion models. E-mail: 2397726787@qq.com
YE Wenlong1,3, CHEN Bin2,3
Abstract: Panoramic images convey the overall information of the surrounding environment and have become an important representation for constructing virtual scenes. However, amid the rise of artificial intelligence generated content (AIGC), in particular diffusion models trained on large-scale text-image datasets and parameter-efficient fine-tuning (PEFT) techniques, research on the generation and rapid transfer of panoramic images remains insufficient. To address the scarcity of panoramic image datasets and the spatial distortion of panoramic imagery, an open-source dataset of 14 000 panoramic images was collected, then refined with text annotation and filtering via projection transformation. On this basis, the PanoLoRA method is proposed. Alongside the original convolution and self-attention modules that extract spatial features, it adds sphere convolution and LoRA modules to explicitly extract the spherical features of panoramic images and fuse them with the original planar features, thereby retaining the strong text-to-image generation capability of Stable Diffusion while achieving efficient transfer learning for panoramic image generation. Experimental results show that, on the collected text-panorama dataset, PanoLoRA outperforms five recent parameter-efficient fine-tuning methods across the board, improving image quality and text-image consistency; a series of ablation experiments verifies the effectiveness of each module.
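The fusion described in the abstract, a frozen pretrained path plus a planar LoRA branch plus a spherical branch, can be illustrated with a minimal numpy sketch. All dimensions are hypothetical, 1-D feature vectors stand in for convolutional feature maps, and `x_sph` is a placeholder for spherically resampled features; this is a sketch of the general idea, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_branch(x, A, B, scale):
    """Low-rank update: scale * (B @ A) @ x, trained while W stays frozen."""
    return scale * (B @ (A @ x))

# Frozen pretrained weight (stands in for a Stable Diffusion conv/attention weight).
d_in, d_out, r = 64, 64, 8
W = rng.normal(size=(d_out, d_in))      # frozen
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, rank r
B = np.zeros((d_out, r))                # trainable, zero-init so training starts at W

# A second low-rank pair standing in for the sphere-convolution branch,
# which sees latitude-corrected (spherically sampled) features.
A_sph = rng.normal(size=(r, d_in)) * 0.01
B_sph = np.zeros((d_out, r))

x = rng.normal(size=(d_in,))      # planar features
x_sph = rng.normal(size=(d_in,))  # spherically sampled features (placeholder)

# Fusion: frozen planar path + planar LoRA + spherical LoRA.
y = W @ x + lora_branch(x, A, B, 1.0) + lora_branch(x_sph, A_sph, B_sph, 1.0)

# With B zero-initialized, the adapted layer starts identical to the frozen one.
assert np.allclose(y, W @ x)
```

Zero-initializing the `B` matrices is the standard LoRA trick: the adapted model is exactly the pretrained model at step 0, so fine-tuning starts from the pretrained behavior rather than from noise.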
YE Wenlong, CHEN Bin. PanoLoRA: an efficient finetuning method for panoramic image generation based on Stable Diffusion[J]. Journal of Graphics, 2025, 46(5): 980-989.
Fig. 2 Sphere convolution ((a) Sphere convolution kernel; (b) Meridian circle and convolution kernel projection; (c) Elimination of spatial distortion)
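The distortion that sphere convolution compensates for can be sketched in numpy. This models only the horizontal 1/cos(latitude) stretch of the equirectangular projection, not the full spherical kernel sampling of SphereNet [14]; the resolution is an assumed example.

```python
import numpy as np

H, W = 512, 1024  # assumed equirectangular panorama size (height, width)

def horizontal_step_px(row):
    """Pixels a kernel must skip horizontally at a given image row to keep
    a constant angular footprint on the sphere (1/cos(latitude) stretch)."""
    lat = (0.5 - (row + 0.5) / H) * np.pi  # row 0 maps to +pi/2 (north pole)
    return 1.0 / max(np.cos(lat), 1e-6)

# Near the equator one pixel covers one "unit" of angle...
assert abs(horizontal_step_px(H // 2) - 1.0) < 0.01
# ...but near the poles the same angular footprint spans many pixels,
# which is why a fixed planar kernel sees heavily stretched content there.
assert horizontal_step_px(10) > 10
```

This latitude-dependent stretch is the reason a planar convolution, applied directly to an equirectangular image, sees very different effective receptive fields at the poles and at the equator.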
| Method | Params/M | FID (indoor) | FID (outdoor) | KID×1000 (indoor) | KID×1000 (outdoor) | CLIP score (indoor) | CLIP score (outdoor) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BitFit | 0.34 | 24.07 | 24.73 | 11.24 | 6.64 | 22.38 | 22.06 |
| Bias-Norm tuning | 0.44 | 21.36 | 24.44 | 8.37 | 6.94 | 22.60 | 22.15 |
| Adapter (dim=48) | 3.63 | 19.56 | 22.13 | 5.84 | 6.11 | 22.42 | 22.11 |
| LoRA (r=8) | 3.39 | 20.08 | 22.64 | 6.92 | 5.98 | 22.60 | 22.18 |
| Lycoris (r=2) | 2.85 | 19.62 | 22.48 | 6.08 | 5.71 | 22.66 | 22.30 |
| PanoLoRA (γ=64) | 3.14 | 18.63 | 20.97 | 5.26 | 4.81 | 22.66 | 22.30 |

Table 1 Quantitative evaluations on the test set
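KID, reported ×1000 in Table 1, is an unbiased estimate of the squared maximum mean discrepancy (MMD) with a cubic polynomial kernel [16]. A minimal numpy sketch on synthetic Gaussian features follows; real evaluations compute it on Inception features of real and generated images, which this sketch omits.

```python
import numpy as np

def poly_kernel(X, Y):
    """Cubic polynomial kernel used by KID: k(x, y) = (x.y / d + 1)^3."""
    d = X.shape[1]
    return (X @ Y.T / d + 1.0) ** 3

def kid(X, Y):
    """Unbiased MMD^2 estimate between feature sets X (real) and Y (generated)."""
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = poly_kernel(X, X), poly_kernel(Y, Y), poly_kernel(X, Y)
    # Diagonal terms are excluded for the unbiased within-set estimates.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 64))
close = rng.normal(size=(500, 64))           # same distribution as `real`
far = rng.normal(loc=0.5, size=(500, 64))    # shifted distribution

# Samples from the matching distribution score lower (closer to zero).
assert kid(real, close) < kid(real, far)
```

Unlike FID, the unbiased KID estimator has no bias from the sample size, which is why both metrics are commonly reported side by side as in Table 1.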
Fig. 5 Visual comparison between the state-of-the-art methods and PanoLoRA on three scene types from the test set ((a) Wild; (b) Urban; (c) Indoor)
| Module | Params/M | FID | KID×1000 | CLIP score |
| --- | --- | --- | --- | --- |
| PanoLoRA (default) | 3.14 | 19.80 | 5.03 | 22.48 |
| w/o Sphere LoRA | 3.28 | 30.36 | 14.48 | 22.12 |
| w/o SA Q/K LoRA | 3.17 | 21.10 | 6.03 | 22.38 |
| w/o sphere convolution | 3.15 | 20.98 | 5.91 | 22.46 |
| w/o channel merging | 3.15 | 20.40 | 5.45 | 22.48 |
| w/o weight copying | 3.14 | 20.39 | 5.11 | 22.43 |

Table 2 Ablation studies of each module
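The parameter counts reported in megaparameters above follow the usual LoRA bookkeeping: a rank-r adapter on a weight of shape (d_out, d_in) adds r·(d_in + d_out) trainable parameters. The layer list below is purely illustrative, not the actual set of Stable Diffusion layers adapted by any method in the tables.

```python
def lora_params(layers, r):
    """Trainable parameters added by rank-r LoRA over a list of
    (d_in, d_out) weight shapes: r * (d_in + d_out) per weight."""
    return sum(r * (d_in + d_out) for d_in, d_out in layers)

# Hypothetical set of adapted projections at three U-Net channel widths.
layers = [(320, 320)] * 24 + [(640, 640)] * 24 + [(1280, 1280)] * 36
total = lora_params(layers, r=8)
print(f"{total / 1e6:.2f} M trainable parameters")
```

Because the added count scales linearly with the rank, halving r roughly halves the adapter size, which is how the methods in the tables trade parameter budget against fidelity.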
[1] | ARGYRIOU L, ECONOMOU D, BOUKI V. Design methodology for 360° immersive video applications: the case study of a cultural heritage virtual tour[J]. Personal and Ubiquitous Computing, 2020, 24(6): 843-859. |
[2] | KITTEL A, LARKIN P, CUNNINGHAM I, et al. 360° virtual reality: a SWOT analysis in comparison to virtual reality[J]. Frontiers in Psychology, 2020, 11: 563474. |
[3] | SOMANATH G, KURZ D. HDR environment map estimation for real-time augmented reality[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 11293-11301. |
[4] | KINZIG C, CORTÉS I, FERNÁNDEZ C, et al. Real-time seamless image stitching in autonomous driving[C]// 2022 25th International Conference on Information Fusion. New York: IEEE Press, 2022: 1-8. |
[5] | WU S S, TANG H, JING X Y, et al. Cross-view panorama image synthesis[J]. IEEE Transactions on Multimedia, 2022, 25: 3546-3559. |
[6] | FENG M Y, LIU J L, CUI M M, et al. Diffusion360: seamless 360 degree panoramic image generation based on diffusion models[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2311.13141. |
[7] | HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[C]// The 34th International Conference on Neural Information Processing Systems. New York: ACM, 2020: 574. |
[8] | ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 10674-10685. |
[9] | ZAKEN E B, GOLDBERG Y, RAVFOGEL S. BitFit: simple parameter-efficient fine-tuning for transformer-based masked language-models[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2106.10199. |
[10] | HU E J, SHEN Y L, WALLIS P, et al. LoRA: low-rank adaptation of large language models[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2106.09685. |
[11] | HOULSBY N, GIURGIU A, JASTRZEBSKI S, et al. Parameter-efficient transfer learning for NLP[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/1902.00751. |
[12] | YEH S Y, HSIEH Y G, GAO Z D, et al. Navigating text-to-image customization: from LyCORIS fine-tuning to model evaluation[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2309.14859 |
[13] | TANG N Y, FU M H, ZHU K, et al. Low-rank attention side-tuning for parameter-efficient fine-tuning[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2402.04009. |
[14] | COORS B, CONDURACHE A P, GEIGER A. SphereNet: learning spherical representations for detection and classification in omnidirectional images[C]// The 15th European Conference on Computer Vision. Cham: Springer, 2018: 525-541. |
[15] | HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]// The 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6629-6640. |
[16] | BIŃKOWSKI M, SUTHERLAND D J, ARBEL M, et al. Demystifying MMD GANs[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/1801.01401. |
[17] | HESSEL J, HOLTZMAN A, FORBES M, et al. CLIPScore: a reference-free evaluation metric for image captioning[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2104.08718. |
[18] | RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. [2024-12-01]. https://3dvar.com/Ramesh2022Hierarchical.pdf. |
[19] | SAHARIA C, HO J, CHAN W, et al. Image super-resolution via iterative refinement[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(4): 4713-4726. |
[20] | BROOKS T, PEEBLES B, HOMES C, et al. Video generation models as world simulators[EB/OL]. [2024-12-01]. https://openai.com/research/video-generation-models-as-world-simulators. |
[21] | ZHANG J, CUI W S, ZHANG R H, et al. A text-driven 3D scene editing method based on key views[J]. Journal of Graphics, 2024, 45(4): 834-844 (in Chinese). |
[22] | WANG J, WANG S, JIANG Z W, et al. Zero-shot text-driven avatar generation based on depth-conditioned diffusion model[J]. Journal of Graphics, 2023, 44(6): 1218-1226 (in Chinese). |
[23] | SONG Y, SOHL-DICKSTEIN J, KINGMA D P, et al. Score-based generative modeling through stochastic differential equations[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2011.13456. |
[24] | SONG J M, MENG C L, ERMON S. Denoising diffusion implicit models[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2010.02502. |
[25] | DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[C]// The 35th International Conference on Neural Information Processing Systems. New York: ACM, 2021: 672. |
[26] | AKIMOTO N, MATSUO Y, AOKI Y. Diverse plausible 360-degree image outpainting for efficient 3DCG background creation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 11431-11440. |
[27] | DASTJERDI M R K, HOLD-GEOFFROY Y, EISENMANN J, et al. Guided co-modulated GAN for 360° field of view extrapolation[C]// 2022 International Conference on 3D Vision. New York: IEEE Press, 2022: 475-485. |
[28] | WU T H, ZHENG C X, CHAM T J. IPO-LDM: depth-aided 360-degree indoor RGB panorama outpainting via latent diffusion model[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2307.03177v1. |
[29] | CHEN Z X, WANG G C, LIU Z W. Text2Light: zero-shot text-driven HDR panorama generation[J]. ACM Transactions on Graphics (TOG), 2022, 41(6): 195. |
[30] | ESSER P, ROMBACH R, OMMER B. Taming transformers for high-resolution image synthesis[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 12868-12878. |
[31] | TANG S T, ZHANG F Y, CHEN J C, et al. MVDiffusion: enabling holistic multi-view image generation with correspondence-aware diffusion[C]// The 37th International Conference on Neural Information Processing Systems. New York: ACM, 2023: 2229. |
[32] | RUIZ N, LI Y Z, JAMPANI V, et al. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 22500-22510. |
[33] | ACHIAM J, ADLER S, AGARWAL S, et al. GPT-4 technical report[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2303.08774. |
[34] | ESSER P, KULAL S, BLATTMANN A, et al. Scaling rectified flow transformers for high-resolution image synthesis[C]// The 41st International Conference on Machine Learning. New York: ACM, 2024: 503. |
[35] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. [2024-12-01]. https://dblp.org/db/conf/iclr/iclr2021.html#DosovitskiyB0WZ21. |
[36] | ZHANG R R, HAN J M, LIU C, et al. LLaMA-adapter: efficient fine-tuning of language models with zero-init attention[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2303.16199. |
[37] | KINGMA D P, WELLING M. Auto-encoding variational Bayes[EB/OL]. [2024-12-01]. https://dblp.org/db/conf/iclr/iclr2014.html#KingmaW13. |
[38] | RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2103.00020. |
[39] | SIFRE L, MALLAT S. Rigid-motion scattering for texture classification[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/1403.1687. |
[40] | ZHENG J, ZHANG J F, LI J, et al. Structured3d: a large photo-realistic dataset for structured 3D modeling[C]// The 16th European Conference on Computer Vision. Cham: Springer, 2020: 519-535. |
[41] | YANG W Y, QIAN Y L, KÄMÄRÄINEN J K, et al. Object detection in equirectangular panorama[C]// The 24th International Conference on Pattern Recognition. New York: IEEE Press, 2018: 2190-2195. |
[42] | CIRIK V, BERG-KIRKPATRICK T, MORENCY L P. Refer360°: a referring expression recognition dataset in 360° images[C]// The 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 7189-7202. |
[43] | DENG X, WANG H, XU M, et al. LAU-Net: latitude adaptive upscaling network for omnidirectional image super-resolution[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 9185-9194. |
[44] | ZHANG Y D, SONG S R, TAN P, et al. PanoContext: a whole-room 3D context model for panoramic scene understanding[C]// The 13th European Conference on Computer Vision. Cham: Springer, 2014: 668-686. |
[45] | CAO M D, MOU C, YU F H, et al. NTIRE 2023 challenge on 360° omnidirectional image and video super-resolution: datasets, methods and results[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 1731-1745. |
[46] | ORHAN S, BASTANLAR Y. Semantic segmentation of outdoor panoramic images[J]. Signal, Image and Video Processing, 2022, 16(3): 643-650. |
[47] | CHANG S H, CHIU C Y, CHANG C S, et al. Generating 360 outdoor panorama dataset with reliable sun position estimation[C]// SIGGRAPH Asia 2018 Posters. New York: ACM, 2018: 22. |
[48] | LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2201.12086. |
[49] | AKIMOTO N, KASAI S, HAYASHI M, et al. 360-degree image completion by two-stage conditional gans[C]// 2019 IEEE International Conference on Image Processing. New York: IEEE Press, 2019: 4704-4708. |
[50] | HO J, SALIMANS T. Classifier-free diffusion guidance[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2207.12598. |
[51] | LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/1711.05101. |
[52] | LIU L P, REN Y, LIN Z J, et al. Pseudo numerical methods for diffusion models on manifolds[EB/OL]. [2024-12-01]. https://arxiv.org/pdf/2202.09778. |
[53] | SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 2818-2826. |