Cross-modal consistency detection via graph topological feature extraction

doi:10.11996/JG.j.2095-302X.2026020286

Abstract

With the rapid growth of social media, massive volumes of multimodal content circulate during public opinion events, making automated public opinion monitoring an important tool for social governance and early risk warning. Complex expressions such as sarcasm and metaphor appear frequently in this content and typically manifest as inconsistency between the textual and visual information, which makes automatic detection considerably harder. Existing cross-modal consistency detection methods remain limited in the structured modeling of unimodal and cross-modal information and in deep semantic understanding, which hinders an accurate grasp of real-world public opinion trends. To address these issues, a Graph-structure-aware Cross-modal Public Opinion Network (GCPNet) is proposed. First, the CLIP (Contrastive Language-Image Pretraining) model is used as the feature encoder, fully connected graph topologies are constructed with text words and image patches as nodes, and Graph Convolutional Networks (GCNs) are employed to explicitly mine and enhance the semantic and structural correlations within the multimodal information. Second, a hierarchical interactive attention graph module is designed to improve global modeling of, and deep interaction with, complex contexts through three stages: fine-grained cross-attention alignment, global adaptive gating fusion, and dynamic graph structure enhancement. Finally, an adaptive weighted fusion strategy dynamically integrates the unimodal structured features with the cross-modal interactive features. Experimental results on the public benchmark dataset MMSD2.0 show that GCPNet accurately captures cross-modal consistency cues and effectively identifies complex public opinion content such as sarcasm and metaphor, outperforming existing mainstream methods in accuracy and robustness. This research provides a new technical pathway and theoretical support for the multimodal public opinion monitoring (MPOM) task, as well as a practical tool for real-world public opinion governance and social security.

Key words: multimodal public opinion monitoring, graph topology extraction, cross-modal feature fusion, attention mechanism, multimodal fusion
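
As a rough illustration of the graph step described in the abstract, the following Python sketch builds a fully connected graph over a handful of nodes and applies one graph-convolution layer. It is a minimal sketch under assumptions, not the GCPNet implementation: the toy feature values stand in for CLIP word or patch embeddings, the weight matrix is hypothetical, and the symmetric normalization with self-loops is the common GCN choice rather than something the abstract specifies.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fully_connected_adjacency(n):
    """Adjacency with self-loops for a fully connected graph: every entry is 1."""
    return [[1.0] * n for _ in range(n)]


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}, where D holds the row sums."""
    deg = [sum(row) for row in adj]
    size = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(size)]
            for i in range(size)]


def gcn_layer(features, weight, adj_norm):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    propagated = matmul(adj_norm, features)   # aggregate neighbour features
    projected = matmul(propagated, weight)    # linear projection
    return [[max(0.0, v) for v in row] for row in projected]


# Toy node features standing in for CLIP embeddings (assumed values).
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical projection matrix; in a trained model this would be learned.
weight = [[1.0, -1.0], [0.5, 0.5]]

adj_norm = normalize_adjacency(fully_connected_adjacency(len(nodes)))
updated = gcn_layer(nodes, weight, adj_norm)
```

With uniform edges, one such layer gives every node the same averaged vector, so any node-specific behaviour would have to come from learned or dynamic edge weights, which this sketch does not model.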

FANG Youjiang, WANG Shihao, ZHANG Liang, DUAN Keran, LIU Yue, WEI Xiaopeng, YANG Xin. Cross-modal consistency detection via graph topological feature extraction[J]. Journal of Graphics, 2026, 47(2): 286-295 (in Chinese).

Figures/Tables (6)

Fig. 1 Examples of multimodal public opinion interaction ((a) Sarcastic examples; (b) Non-sarcastic examples)

Fig. 2 Architecture of the proposed GCPNet, comprising the graph topology extraction and enhancement module and the hierarchical interactive attention graph module
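
The first two stages that the abstract lists for the hierarchical interactive attention graph module (cross-attention alignment and gated fusion) can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the single-head scaled dot-product attention with keys reused as values, the scalar sigmoid gate, and all toy values are assumptions. The third stage (dynamic graph structure enhancement) and the final adaptive weighted fusion are not modeled here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query node attends over the nodes of the other modality.

    Keys double as values, which is a simplification assumed for this sketch.
    """
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                         for d in range(len(q))])
    return attended


def gated_fusion(own, attended, gate_weights, gate_bias=0.0):
    """Blend each node with its cross-modal context through a scalar sigmoid gate.

    gate_weights has one entry per element of the concatenated [own; attended] vector.
    """
    fused = []
    for h, c in zip(own, attended):
        gate = 1.0 / (1.0 + math.exp(-(dot(gate_weights, h + c) + gate_bias)))
        fused.append([gate * a + (1.0 - gate) * b for a, b in zip(h, c)])
    return fused


# Toy text-word and image-patch features (assumed values, not real CLIP outputs).
text_nodes = [[1.0, 0.0], [0.0, 1.0]]
patch_nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

context = cross_attention(text_nodes, patch_nodes)
# Zero gate weights give a gate of 0.5, i.e. an even blend of both sources.
fused = gated_fusion(text_nodes, context, gate_weights=[0.0] * 4)
```

A learned gate would let the model lean on the text node where the image adds little and on the attended image context where the two modalities disagree.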

Table 1 Dataset statistics for MMSD2.0

Category	Train	Validation	Test
All	19 812	2 410	2 409
Positive	9 572	1 042	1 037
Negative	10 240	1 368	1 372
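
The counts in Table 1 can be sanity-checked with a few lines of Python. This is only a small sketch: the dictionary layout and the helper name `positive_share` are assumptions made for illustration, while the numbers are copied from the table.

```python
# Sample counts per split, copied from Table 1.
splits = {
    "Train": {"All": 19812, "Positive": 9572, "Negative": 10240},
    "Validation": {"All": 2410, "Positive": 1042, "Negative": 1368},
    "Test": {"All": 2409, "Positive": 1037, "Negative": 1372},
}


def positive_share(counts):
    """Fraction of samples in a split that carry the positive label."""
    return counts["Positive"] / counts["All"]


for name, counts in splits.items():
    # Each split total should equal the sum of its two classes.
    consistent = counts["Positive"] + counts["Negative"] == counts["All"]
    print(f"{name}: consistent={consistent}, positive share={positive_share(counts):.1%}")
```

The positive class is slightly in the minority in every split, which is one reason to read precision, recall, and F1 alongside accuracy.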

Table 2 Experimental results on the MMSD2.0 dataset (%)

Modality	Model	Acc	P	R	F1
Text	TextCNN [32]	71.61	64.62	75.22	69.52
Text	Bi-LSTM [33]	72.48	68.02	68.08	68.05
Text	SMSD [34]	73.56	68.45	71.55	69.97
Text	RoBERTa [35]	79.66	76.74	75.70	76.21
Image	ResNet [36]	65.50	61.17	54.39	57.58
Image	ViT [37]	72.02	65.26	74.83	69.72
Cross-modal	HFM [22]	70.57	64.84	69.05	66.88
Cross-modal	Att-BERT [20]	80.03	76.28	77.82	77.04
Cross-modal	CMGCN [23]	79.83	75.82	78.01	76.90
Cross-modal	HKE [24]	76.50	73.48	71.07	72.25
Cross-modal	DIP [1]	80.59	75.52	81.14	78.23
Cross-modal	DynRT [11]	70.37	63.02	75.15	68.55
Cross-modal	G²SAM [12]	79.43	72.04	85.20	78.04
Cross-modal	Multi-view CLIP [31]	85.64	80.33	88.24	84.10
Cross-modal	DAIE [38]	84.33	82.43	81.91	82.17
This work	GCPNet (ours)	86.06	83.74	87.15	85.41
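
The F1 column in Table 2 appears to be the harmonic mean of the P and R columns. The short Python sketch below recomputes it for three rows copied from the table; the function name `f1_score` and the dictionary layout are assumptions for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) in percent, copied from Table 2.
rows = {
    "TextCNN": (64.62, 75.22, 69.52),
    "Multi-view CLIP": (80.33, 88.24, 84.10),
    "GCPNet": (83.74, 87.15, 85.41),
}

# Absolute gap between the recomputed and the reported F1 for each model.
gaps = {model: abs(f1_score(p, r) - reported)
        for model, (p, r, reported) in rows.items()}

for model, gap in gaps.items():
    print(f"{model}: recomputed F1 differs from the table by {gap:.3f} points")
```

For these three rows the recomputed values agree with the reported F1 up to rounding of the published P and R figures.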

Fig. 3 Ablation study results

Fig. 4 Case study of cross-modal graph topology feature extraction ((a) Input sarcastic sample; (b) Image patching and fully connected graph construction; (c) Topological activation after GCN enhancement (dark red regions indicate highly weighted key conflict cues))

References (38)

[1]	WEN C S, JIA G L, YANG J F. DIP: dual incongruity perceiving network for sarcasm detection[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 2540-2550.
[2]	VERMA P, SHUKLA N, SHUKLA A P. Techniques of sarcasm detection: a review[C]// 2021 International Conference on Advance Computing and Innovative Technologies in Engineering. New York: IEEE Press, 2021: 968-972.
[3]	GODARA J, ARON R, SHABAZ M. Sentiment analysis and sarcasm detection from social network to train health-care professionals[J]. World Journal of Engineering, 2022, 19(1): 124-133.
[4]	LI J N, PAN H L, LIN Z, et al. Sarcasm detection with commonsense knowledge[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3192-3201.
[5]	RAO M V, C S. Detection of sarcasm on amazon product reviews using machine learning algorithms under sentiment analysis[C]// The 6th International Conference on Wireless Communications, Signal Processing and Networking. New York: IEEE Press, 2021: 196-199.
[6]	ZHANG Y Z, WANG J L, LIU Y C, et al. A multitask learning model for multimodal sarcasm, sentiment and emotion recognition in conversations[J]. Information Fusion, 2023, 93: 282-301.
[7]	DUTTA P, BHATTACHARYYA C K. Multi-modal sarcasm detection in social networks: a comparative review[C]// The 6th International Conference on Computing Methodologies and Communication. New York: IEEE Press, 2022: 207-214.
[8]	SCHIFANELLA R, DE JUAN P, TETREAULT J, et al. Detecting sarcasm in multimodal social platforms[C]// The 24th ACM International Conference on Multimedia. New York: ACM, 2016: 1136-1145.
[9]	XU N, ZENG Z X, MAO W J. Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association[EB/OL]. [2025-04-03]. https://aclanthology.org/2020.acl-main.349/.
[10]	LIANG B, LOU C W, LI X, et al. Multi-modal sarcasm detection with interactive in-modal and cross-modal graphs[C]// The 29th ACM International Conference on Multimedia. New York: ACM, 2021: 4707-4715.
[11]	TIAN Y, XU N, ZHANG R K, et al. Dynamic routing transformer network for multimodal sarcasm detection[EB/OL]. [2025-04-03]. https://aclanthology.org/2023.acl-long.139/.
[12]	WEI Y W, YUAN S Z, ZHOU H Y, et al. G²SAM: graph-based global semantic awareness method for multimodal sarcasm detection[C]// The 38th AAAI Conference on Artificial Intelligence. Washington: AAAI Press, 2024: 9151-9159.
[13]	RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2025-04-03]. https://proceedings.mlr.press/v139/radford21a.
[14]	GAN Y X, WANG Y B, XUE J X, et al. Public opinion evolution analysis of “COVID-19 epidemic” based on sentiment feature[J]. Journal of Graphics, 2021, 42(2): 222-229 (in Chinese).
[15]	HUANG H, SUN L J, CAO Y, et al. Multimodal sentiment analysis of short videos based on attention[J]. Journal of Graphics, 2021, 42(1): 8-14 (in Chinese).
[16]	ALQAHTANI A, ALHENAKI L, ALSHEDDI A. Text-based sarcasm detection on social networks: a systematic review[J]. International Journal of Advanced Computer Science and Applications, 2023, 14(3): 313-328.
[17]	SHRIVASTAVA M, KUMAR S. A pragmatic and intelligent model for sarcasm detection in social media text[J]. Technology in Society, 2021, 64: 101489.
[18]	GUPTA S, SINGH R, SINGLA V. Emoticon and text sarcasm detection in sentiment analysis[C]// The 1st International Conference on Sustainable Technologies for Computational Intelligence. Cham: Springer, 2020: 1-10.
[19]	LIU J, TIAN S W, YU L, et al. Image-text fusion transformer network for sarcasm detection[J]. Multimedia Tools and Applications, 2024, 83(14): 41895-41909.
[20]	PAN H L, LIN Z, FU P, et al. Modeling intra and inter-modality incongruity for multi-modal sarcasm detection[EB/OL]. [2025-04-03]. https://aclanthology.org/2020.findings-emnlp.124/.
[21]	SANGWAN S, AKHTAR M S, BEHERA P, et al. I didn’t mean what I wrote! Exploring multimodality for sarcasm detection[C]// 2020 International Joint Conference on Neural Networks. New York: IEEE Press, 2020: 1-8.
[22]	CAI Y T, CAI H Y, WAN X J. Multi-modal sarcasm detection in twitter with hierarchical fusion model[EB/OL]. [2025-04-03]. https://aclanthology.org/P19-1239/.
[23]	LIANG B, LOU C W, LI X, et al. Multi-modal sarcasm detection via cross-modal graph convolutional network[EB/OL]. [2025-04-03]. https://aclanthology.org/2022.acl-long.124/.
[24]	LIU H, WANG W Y, LI H L. Towards multi-modal sarcasm detection via hierarchical congruity modeling with knowledge enhancement[EB/OL]. [2025-04-03]. https://aclanthology.org/2022.emnlp-main.333/.
[25]	MU D Q, LI T. Face anti-spoofing technology based on multi-modal fusion[J]. Journal of Graphics, 2020, 41(5): 750-756 (in Chinese).
[26]	SUN Y N, WEN Y H, SHU Y Z, et al. Multimodal emotion recognition with action features[J]. Journal of Graphics, 2022, 43(6): 1159-1169 (in Chinese).
[27]	YU Z, YU J, FAN J P, et al. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 1839-1848.
[28]	YU Z, YU J, CUI Y H, et al. Deep modular co-attention networks for visual question answering[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 6274-6283.
[29]	LU J S, YANG J W, BATRA D, et al. Hierarchical question-image co-attention for visual question answering[C]// The 30th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2016: 289-297.
[30]	HAMILTON W, YING Z, LESKOVEC J. Inductive representation learning on large graphs[C]// The 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017, 30: 1025-1035.
[31]	QIN L B, HUANG S J, CHEN Q G, et al. MMSD2.0: towards a reliable multi-modal sarcasm detection system[EB/OL]. [2025-04-03]. https://aclanthology.org/2023.findings-acl.689/.
[32]	KIM Y. Convolutional neural networks for sentence classification[EB/OL]. [2025-04-03]. https://aclanthology.org/D14-1181/.
[33]	GRAVES A, SCHMIDHUBER J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures[J]. Neural Networks, 2005, 18(5/6): 602-610.
[34]	XIONG T, ZHANG P R, ZHU H B, et al. Sarcasm detection with self-matching networks and low-rank bilinear pooling[C]// The World Wide Web Conference. New York: ACM, 2019: 2115-2124.
[35]	LIU Y H, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. (2019-07-26) [2025-04-03]. http://arxiv.org/abs/1907.11692.
[36]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[37]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. [2025-04-03]. https://openreview.net/pdf?id=YicbFdNTTy.
[38]	WU Q F, FANG W L, ZHONG W Y, et al. Dual-level adaptive incongruity-enhanced model for multimodal sarcasm detection[J]. Neurocomputing, 2025, 612: 128689.