Journal of Graphics (图学学报), 2026, Vol. 47, Issue (2): 286-295. DOI: 10.11996/JG.j.2095-302X.2026020286

• Image Processing and Computer Vision •

Cross-modal consistency detection via graph topological feature extraction

FANG Youjiang, WANG Shihao, ZHANG Liang, DUAN Keran, LIU Yue, WEI Xiaopeng, YANG Xin

  1. Key Laboratory of Social Computing and Cognitive Intelligence, Ministry of Education, Dalian University of Technology, Dalian, Liaoning 116024, China
  • Received: 2025-06-03 Accepted: 2025-12-13 Published: 2026-04-30 Online: 2026-05-20
  • Contact: YANG Xin, E-mail: xinyang@dlut.edu.cn
  • Supported by:
    The Ministry of Education Humanities and Social Sciences Research Project Youth Fund Project (22YJCZH116)

Abstract:

With the rapid growth of social media, large volumes of multimodal content spread widely during public opinion events, making automated public opinion monitoring an important tool for social governance and early risk warning. Complex expressions such as sarcasm and metaphor appear frequently in this content and typically manifest as inconsistencies between the textual and visual modalities, which makes automatic detection considerably harder. Existing cross-modal consistency detection methods remain limited in their structured modeling of unimodal and cross-modal information and in deep semantic understanding, which hinders an accurate grasp of real-world public opinion trends. To address these issues, a Graph-structure-aware Cross-modal Public Opinion Network (GCPNet) was proposed. First, the CLIP (Contrastive Language-Image Pre-training) model was used as the feature encoder, and fully connected graph topologies were constructed with text words and image patches as nodes; graph convolutional networks (GCNs) were then applied to explicitly mine and strengthen the semantic and structural correlations within the multimodal information. Second, a hierarchical interactive attention graph module was designed to improve global modeling of complex contexts and deep cross-modal interaction through three stages: fine-grained cross-attention alignment, global adaptive gated fusion, and dynamic graph structure enhancement. Finally, an adaptive weighted fusion strategy was adopted to dynamically integrate unimodal structured features with cross-modal interaction features. Experimental results on the public benchmark dataset MMSD2.0 showed that GCPNet accurately captured cross-modal consistency cues and effectively identified complex public opinion content such as sarcasm and metaphor, outperforming existing state-of-the-art methods in both accuracy and robustness. This research provides a new technical pathway and theoretical support for the multimodal public opinion monitoring (MPOM) task, and offers a practical tool for real-world public opinion governance and social security.

Key words: multimodal public opinion monitoring, graph topology extraction, cross-modal feature fusion, attention mechanism, multimodal fusion
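
As an illustrative aside, the first and last steps described in the abstract (graph convolution over fully connected word and patch graphs, then adaptive weighted fusion) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' GCPNet implementation: random arrays stand in for CLIP word and patch embeddings, one fully connected graph with unit edge weights and self-loops is built per modality, and the function names (`fully_connected_adjacency`, `gcn_layer`, `adaptive_fusion`), the dimensions, the shared GCN weight, and the softmax gate are all hypothetical choices made for illustration.

```python
import numpy as np


def fully_connected_adjacency(num_nodes):
    """Symmetrically normalized adjacency D^{-1/2} A D^{-1/2} of a fully
    connected graph with self-loops (assumption: unit edge weights)."""
    adj = np.ones((num_nodes, num_nodes))   # every node linked to every node
    deg = adj.sum(axis=1)                   # each degree equals num_nodes
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]


def gcn_layer(features, adj_norm, weight):
    """One GCN propagation step: ReLU(A_norm @ X @ W)."""
    return np.maximum(adj_norm @ features @ weight, 0.0)


def adaptive_fusion(structured, interactive, gate_weight):
    """Softmax-weighted sum of two feature vectors; the weights depend on
    the inputs through a (hypothetical) linear scoring vector."""
    scores = np.array([structured @ gate_weight, interactive @ gate_weight])
    scores = scores - scores.max()          # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    fused = alpha[0] * structured + alpha[1] * interactive
    return fused, alpha


rng = np.random.default_rng(0)
dim_in, dim_out = 8, 6
text_nodes = rng.normal(size=(5, dim_in))    # stand-in for 5 word embeddings
image_nodes = rng.normal(size=(4, dim_in))   # stand-in for 4 patch embeddings
weight = rng.normal(size=(dim_in, dim_out))  # shared here only for brevity

text_out = gcn_layer(text_nodes, fully_connected_adjacency(5), weight)
image_out = gcn_layer(image_nodes, fully_connected_adjacency(4), weight)

# Pool each graph to a vector and concatenate as the "structured" feature.
structured = np.concatenate([text_out.mean(axis=0), image_out.mean(axis=0)])
# Stand-in for the cross-modal interaction feature (module not sketched here).
interactive = rng.normal(size=structured.shape)
gate_weight = rng.normal(size=structured.shape)

fused, alpha = adaptive_fusion(structured, interactive, gate_weight)
```

With unit edge weights, the normalized adjacency of a fully connected graph simply averages over all nodes, so every node receives the same output after one layer. The abstract does not state how edges are weighted; a practical model would presumably use learned or similarity-based weights so that nodes stay distinguishable.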

CLC number: