
Journal of Graphics ›› 2026, Vol. 47 ›› Issue (2): 332-340. DOI: 10.11996/JG.j.2095-302X.2026020332

• Image Processing and Computer Vision •

Panoramic image quality assessment fusing semantic features

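BAO Yongtang (包永堂), WANG Moqin (王谟钦), WANG Zhihui (王智慧), MA Guangxiao (马光晓)

  1. College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, China
  • Received: 2025-10-09 Accepted: 2025-12-06 Published: 2026-04-30 Online: 2026-05-20
  • Corresponding author: MA Guangxiao, E-mail: mgx@sdust.edu.cn
  • Funding:
    Shandong Provincial Natural Science Foundation (ZR2024QF267); Qingdao Natural Science Foundation (24-4-4-zrjj-90-jch); Qingdao Natural Science Foundation (24-4-4-zrjj-126-jch); Qingdao Science and Technology Benefits the People Project (25-1-5-smjk-18-nsh)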

Perceptually-aligned panoramic image quality assessment via global semantic feature fusion

BAO Yongtang, WANG Moqin, WANG Zhihui, MA Guangxiao

  1. College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, China
  • Received: 2025-10-09 Accepted: 2025-12-06 Published: 2026-04-30 Online: 2026-05-20
  • Contact: MA Guangxiao, E-mail: mgx@sdust.edu.cn
  • Supported by:
    Shandong Provincial Natural Science Foundation (ZR2024QF267); Qingdao Natural Science Foundation (24-4-4-zrjj-90-jch); Qingdao Natural Science Foundation (24-4-4-zrjj-126-jch); Qingdao Science and Technology Benefits the People Project (25-1-5-smjk-18-nsh)

Abstract:

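Panoramic image quality assessment aims to objectively reflect the subjective perceptual quality of immersive visual content. However, existing deep learning models for this task often rely too heavily on low-level distortion features, which causes their objective predictions to deviate significantly from human subjective perception. To address this key problem, a novel hierarchical semantic-guided network is proposed, whose core idea is to emulate the “top-down” cognitive mechanism of the human visual system. Current mainstream methods mostly follow a “bottom-up” paradigm that aggregates quality scores from pixel-level features; because this process lacks effective integration of high-level semantic information such as global image structure and compositional aesthetics, their performance ceiling is limited. To this end, the framework builds a dual-path parallel information processing architecture centered on a “top-down” semantic attention modulation mechanism. In this architecture, the semantic prior path uses a vision-language model to parse the input image into a structured semantic embedding vector, while the visual representation path extracts multi-scale feature maps with a deep convolutional network. The modulation mechanism takes the semantic embedding vector as a conditional input and generates dynamic attention weights that recalibrate the multi-scale features of the visual path in real time, so that the entire feature extraction process is guided by high-level semantics and focuses on the information most critical to human subjective judgment. To keep the ranking of the model’s predictions consistent with human perception, the whole framework is optimized end to end with a composite objective function that incorporates a listwise ranking loss. Comprehensive experiments on three public benchmark datasets, CVIQD, OIQA, and OIQ-10K, show that the framework significantly outperforms existing state-of-the-art methods, verifying the effectiveness and advancement of the semantic-guided paradigm for perceptual quality assessment.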
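The semantic attention modulation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the mechanism, so the following assumes a simple channel-wise sigmoid gate computed from a linear projection of the semantic embedding and shared across scales; the names `channel_gates`, `modulate`, `weight`, and `bias` are hypothetical and not taken from the paper.

```python
import math
from typing import List

Vector = List[float]
FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def channel_gates(semantic: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Project the semantic embedding to one gate in (0, 1) per channel.

    Assumption: a single linear layer followed by a sigmoid. The paper may
    use a different (for example, spatial or per-scale) form of attention.
    """
    return [
        sigmoid(sum(w * s for w, s in zip(row, semantic)) + b)
        for row, b in zip(weight, bias)
    ]


def modulate(features: FeatureMap, gates: Vector) -> FeatureMap:
    """Rescale every channel of a feature map by its semantic gate."""
    return [
        [[value * g for value in row] for row in channel]
        for channel, g in zip(features, gates)
    ]


# Toy example: a 2-d semantic embedding gates two feature scales,
# each with 2 channels (hypothetical values, normally learned or extracted).
semantic = [1.0, -1.0]
weight = [[2.0, 0.0], [0.0, 2.0]]
bias = [0.0, 0.0]
gates = channel_gates(semantic, weight, bias)

scale_a = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]  # 2 x 2 x 2
scale_b = [[[5.0]], [[6.0]]]                                     # 2 x 1 x 1
modulated = [modulate(f, gates) for f in (scale_a, scale_b)]
```

In this toy setting the first channel is emphasized and the second is suppressed at every scale, which is the sense in which the semantic path "recalibrates" the visual features. A real model would typically learn a separate projection for each scale, since channel counts usually differ between scales.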

Key words: panoramic image quality assessment, perceptual consistency, vision-language model, no-reference quality assessment

Abstract:

Panoramic image quality assessment aims to objectively reflect the subjective perceptual quality of immersive visual content. However, the objective predictions of current deep learning models often deviate significantly from human subjective perception, primarily because the models rely too heavily on low-level distortion features. To address this issue, a novel hierarchical semantic-guided network was proposed, which emulated the “top-down” cognitive mechanism of the human visual system. Prevailing methods predominantly follow a “bottom-up” paradigm, aggregating quality scores from pixel-level features; however, this process fails to effectively integrate high-level semantic information such as global image structure and compositional aesthetics, which limits the performance ceiling. To this end, a dual-path parallel information processing architecture was constructed, centered on a “top-down” semantic attention modulation mechanism. Within this architecture, a semantic prior path leveraged a vision-language model to parse the input image into a structured semantic embedding, while a visual representation path extracted multi-scale feature maps using a deep convolutional network. The modulation mechanism took the semantic embedding as a conditional input to generate dynamic attention weights, which recalibrated the multi-scale features in the visual path in real time. This design ensured that the entire feature extraction process was guided by high-level semantics and thus focused on the information most critical to human subjective judgment. To ensure that the rank order of the model’s predictions was consistent with human perception, the entire framework was optimized end to end via a composite objective function that incorporated a listwise ranking loss. Comprehensive experiments on three public benchmark datasets, CVIQD, OIQA, and OIQ-10K, demonstrated that the proposed framework significantly outperformed state-of-the-art methods, validating the effectiveness and superiority of the semantic-guided paradigm for perceptual quality assessment.
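As an illustration of a composite objective that incorporates a listwise ranking loss, the sketch below combines mean squared error with a ListNet-style top-one cross-entropy. The abstract names neither the specific listwise loss nor the weighting, so the choice of ListNet, the function names, and the `rank_weight` parameter are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def log_softmax(xs: Sequence[float]) -> List[float]:
    """Log of the softmax distribution, computed stably."""
    m = max(xs)
    log_total = math.log(sum(math.exp(x - m) for x in xs))
    return [x - m - log_total for x in xs]


def listnet_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """ListNet top-one loss: cross-entropy between the softmax of the
    subjective scores and the softmax of the predicted scores."""
    target_probs = [math.exp(v) for v in log_softmax(target)]
    pred_log_probs = log_softmax(pred)
    return -sum(t * lp for t, lp in zip(target_probs, pred_log_probs))


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and subjective scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def composite_loss(pred: Sequence[float], target: Sequence[float],
                   rank_weight: float = 1.0) -> float:
    """Regression term plus a weighted listwise ranking term.

    rank_weight is a hypothetical hyperparameter for this sketch.
    """
    return mse_loss(pred, target) + rank_weight * listnet_loss(pred, target)


# Toy batch of three images with subjective scores (hypothetical values).
mos = [1.0, 2.0, 3.0]
well_ordered = [1.1, 2.1, 3.1]   # same ranking as the subjective scores
mis_ordered = [3.1, 2.1, 1.1]    # reversed ranking

loss_good = composite_loss(well_ordered, mos)
loss_bad = composite_loss(mis_ordered, mos)
```

The ranking term depends only on the relative differences between scores within a list, so a constant offset in the predictions leaves it unchanged, while reversing the order increases it. The regression term is what anchors the predictions to the absolute score scale.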

Key words: panoramic image quality assessment, perceptual alignment, vision-language model, no-reference quality assessment

CLC Number: