Cross-modal consistency detection via graph topological feature extraction

doi:10.11996/JG.j.2095-302X.2026020286

Abstract

With the rapid growth of social media, large volumes of multimodal content spread during public opinion events, making automated public opinion monitoring an important tool for social governance and early risk warning. Complex linguistic devices such as sarcasm and metaphor are common in online discourse and are often marked by inconsistency between the text and the accompanying image, which makes automatic detection considerably harder. Existing cross-modal consistency detection methods are limited in how they structurally model unimodal and multimodal information and in how well they capture deep semantic correlations, which hinders accurate tracking of real-world public opinion trends. To address these issues, a Graph-structure-aware Cross-modal Public Opinion Network (GCPNet) is proposed. First, the CLIP (Contrastive Language-Image Pretraining) model is used as the feature encoder, and fully connected graphs are constructed with text words and image patches as nodes; Graph Convolutional Networks (GCNs) are then applied to explicitly mine and strengthen the semantic and structural correlations within the multimodal information. Second, a hierarchical interactive attention graph module is designed to improve global modeling and deep interaction for complex contexts in three stages: fine-grained cross-attention alignment, global adaptive gating fusion, and dynamic graph structure enhancement. Finally, an adaptive weighted fusion strategy dynamically integrates the unimodal structured features with the cross-modal interactive features. Experiments on the public benchmark dataset MMSD2.0 show that GCPNet captures cross-modal consistency cues and identifies complex public opinion content such as sarcasm and metaphor; with an accuracy of 86.06 and an F1 score of 85.41, it outperforms existing state-of-the-art methods in accuracy and robustness. This work offers a new methodological approach and theoretical basis for multimodal public opinion understanding, and a practical tool for real-world opinion governance and social risk mitigation.
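
The graph construction and gating steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how edges are weighted or how the gate is parameterized, so the non-negative cosine-similarity edge weights, the separate graph per modality, the mean pooling, and every function name below are assumptions made for illustration. Random vectors stand in for CLIP word and patch embeddings.

```python
import numpy as np


def similarity_adjacency(nodes: np.ndarray) -> np.ndarray:
    """Fully connected graph over the nodes.

    Assumption: edge weight = cosine similarity clipped at zero.
    The abstract does not specify the edge weighting.
    """
    unit = nodes / np.linalg.norm(nodes, axis=1, keepdims=True)
    adj = np.clip(unit @ unit.T, 0.0, None)
    np.fill_diagonal(adj, 1.0)  # self-loops keep each node's own feature
    return adj


def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(nodes: np.ndarray, adj_norm: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One graph-convolution step: ReLU(A_norm X W)."""
    return np.maximum(adj_norm @ nodes @ weight, 0.0)


def gated_fusion(text_vec, image_vec, w_gate, b_gate):
    """Elementwise gate g in (0, 1): g * text + (1 - g) * image (hypothetical form)."""
    logits = np.concatenate([text_vec, image_vec]) @ w_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-logits))
    return gate * text_vec + (1.0 - gate) * image_vec


rng = np.random.default_rng(0)
dim = 8
text_nodes = rng.normal(size=(5, dim))   # stand-in for 5 word embeddings
image_nodes = rng.normal(size=(4, dim))  # stand-in for 4 patch embeddings
weight = rng.normal(size=(dim, dim)) * 0.1

# Assumption: one graph per modality, each passed through a GCN layer.
text_adj = normalize_adjacency(similarity_adjacency(text_nodes))
image_adj = normalize_adjacency(similarity_adjacency(image_nodes))
text_graph = gcn_layer(text_nodes, text_adj, weight)
image_graph = gcn_layer(image_nodes, image_adj, weight)

# Mean-pool each graph to one vector, then fuse with a learned-style gate.
text_vec = text_graph.mean(axis=0)
image_vec = image_graph.mean(axis=0)
w_gate = rng.normal(size=(2 * dim, dim)) * 0.1
fused = gated_fusion(text_vec, image_vec, w_gate, np.zeros(dim))
print(fused.shape)  # (8,)
```

Because the gate lies strictly between 0 and 1, each fused coordinate is a convex combination of the text and image coordinates, which is one simple way to let the model weight the two modalities per feature.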

Key words: multimodal public opinion monitoring, graph topology extraction, cross-modal feature fusion, attention mechanism, multimodal fusion

FANG Youjiang, WANG Shihao, ZHANG Liang, DUAN Keran, LIU Yue, WEI Xiaopeng, YANG Xin. Cross-modal consistency detection via graph topological feature extraction[J]. Journal of Graphics, 2026, 47(2): 286-295.

References

[1]	WEN C S, JIA G L, YANG J F. DIP: dual incongruity perceiving network for sarcasm detection[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 2540-2550.
[2]	VERMA P, SHUKLA N, SHUKLA A P. Techniques of sarcasm detection: a review[C]// 2021 International Conference on Advance Computing and Innovative Technologies in Engineering. New York: IEEE Press, 2021: 968-972.
[3]	GODARA J, ARON R, SHABAZ M. Sentiment analysis and sarcasm detection from social network to train health-care professionals[J]. World Journal of Engineering, 2022, 19(1): 124-133.
[4]	LI J N, PAN H L, LIN Z, et al. Sarcasm detection with commonsense knowledge[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3192-3201.
[5]	RAO M V, C S. Detection of sarcasm on amazon product reviews using machine learning algorithms under sentiment analysis[C]// The 6th International Conference on Wireless Communications, Signal Processing and Networking. New York: IEEE Press, 2021: 196-199.
[6]	ZHANG Y Z, WANG J L, LIU Y C, et al. A multitask learning model for multimodal sarcasm, sentiment and emotion recognition in conversations[J]. Information Fusion, 2023, 93: 282-301.
[7]	DUTTA P, BHATTACHARYYA C K. Multi-modal sarcasm detection in social networks: a comparative review[C]// The 6th International Conference on Computing Methodologies and Communication. New York: IEEE Press, 2022: 207-214.
[8]	SCHIFANELLA R, DE JUAN P, TETREAULT J, et al. Detecting sarcasm in multimodal social platforms[C]// The 24th ACM International Conference on Multimedia. New York: ACM, 2016: 1136-1145.
[9]	XU N, ZENG Z X, MAO W J. Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association[EB/OL]. [2025-04-03]. https://aclanthology.org/2020.acl-main.349/.
[10]	LIANG B, LOU C W, LI X, et al. Multi-modal sarcasm detection with interactive in-modal and cross-modal graphs[C]// The 29th ACM International Conference on Multimedia. New York: ACM, 2021: 4707-4715.
[11]	TIAN Y, XU N, ZHANG R K, et al. Dynamic routing transformer network for multimodal sarcasm detection[EB/OL]. [2025-04-03]. https://aclanthology.org/2023.acl-long.139/.
[12]	WEI Y W, YUAN S Z, ZHOU H Y, et al. G²SAM: graph-based global semantic awareness method for multimodal sarcasm detection[C]// The 38th AAAI Conference on Artificial Intelligence. Washington: AAAI Press, 2024: 9151-9159.
[13]	RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2025-04-03]. https://proceedings.mlr.press/v139/radford21a.
[14]	GAN Y X, WANG Y B, XUE J X, et al. Public opinion evolution analysis of “COVID-19 epidemic” based on sentiment feature[J]. Journal of Graphics, 2021, 42(2): 222-229 (in Chinese).
[15]	HUANG H, SUN L J, CAO Y, et al. Multimodal sentiment analysis of short videos based on attention[J]. Journal of Graphics, 2021, 42(1): 8-14 (in Chinese).
[16]	ALQAHTANI A, ALHENAKI L, ALSHEDDI A. Text-based sarcasm detection on social networks: a systematic review[J]. International Journal of Advanced Computer Science and Applications, 2023, 14(3): 313-328.
[17]	SHRIVASTAVA M, KUMAR S. A pragmatic and intelligent model for sarcasm detection in social media text[J]. Technology in Society, 2021, 64: 101489.
[18]	GUPTA S, SINGH R, SINGLA V. Emoticon and text sarcasm detection in sentiment analysis[C]// The 1st International Conference on Sustainable Technologies for Computational Intelligence. Cham: Springer, 2020: 1-10.
[19]	LIU J, TIAN S W, YU L, et al. Image-text fusion transformer network for sarcasm detection[J]. Multimedia Tools and Applications, 2024, 83(14): 41895-41909.
[20]	PAN H L, LIN Z, FU P, et al. Modeling intra and inter-modality incongruity for multi-modal sarcasm detection[EB/OL]. [2025-04-03]. https://aclanthology.org/2020.findings-emnlp.124/.
[21]	SANGWAN S, AKHTAR M S, BEHERA P, et al. I didn’t mean what I wrote! Exploring multimodality for sarcasm detection[C]// 2020 International Joint Conference on Neural Networks. New York: IEEE Press, 2020: 1-8.
[22]	CAI Y T, CAI H Y, WAN X J. Multi-modal sarcasm detection in twitter with hierarchical fusion model[EB/OL]. [2025-04-03]. https://aclanthology.org/P19-1239/.
[23]	LIANG B, LOU C W, LI X, et al. Multi-modal sarcasm detection via cross-modal graph convolutional network[EB/OL]. [2025-04-03]. https://aclanthology.org/2022.acl-long.124/.
[24]	LIU H, WANG W Y, LI H L. Towards multi-modal sarcasm detection via hierarchical congruity modeling with knowledge enhancement[EB/OL]. [2025-04-03]. https://aclanthology.org/2022.emnlp-main.333/.
[25]	MU D Q, LI T. Face anti-spoofing technology based on multi-modal fusion[J]. Journal of Graphics, 2020, 41(5): 750-756 (in Chinese).
[26]	SUN Y N, WEN Y H, SHU Y Z, et al. Multimodal emotion recognition with action features[J]. Journal of Graphics, 2022, 43(6): 1159-1169 (in Chinese).
[27]	YU Z, YU J, FAN J P, et al. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 1839-1848.
[28]	YU Z, YU J, CUI Y H, et al. Deep modular co-attention networks for visual question answering[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 6274-6283.
[29]	LU J S, YANG J W, BATRA D, et al. Hierarchical question-image co-attention for visual question answering[C]// The 30th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2016: 289-297.
[30]	HAMILTON W, YING Z, LESKOVEC J. Inductive representation learning on large graphs[C]// The 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017, 30: 1025-1035.
[31]	QIN L B, HUANG S J, CHEN Q G, et al. MMSD2.0: towards a reliable multi-modal sarcasm detection system[EB/OL]. [2025-04-03]. https://aclanthology.org/2023.findings-acl.689/.
[32]	KIM Y. Convolutional neural networks for sentence classification[EB/OL]. [2025-04-03]. https://aclanthology.org/D14-1181/.
[33]	GRAVES A, SCHMIDHUBER J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures[J]. Neural Networks, 2005, 18(5/6): 602-610.
[34]	XIONG T, ZHANG P R, ZHU H B, et al. Sarcasm detection with self-matching networks and low-rank bilinear pooling[C]// The World Wide Web Conference. New York: ACM, 2019: 2115-2124.
[35]	LIU Y H, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. (2019-07-26) [2025-04-03]. http://arxiv.org/abs/1907.11692.
[36]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[37]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. [2025-04-03]. https://openreview.net/pdf?id=YicbFdNTTy.
[38]	WU Q F, FANG W L, ZHONG W Y, et al. Dual-level adaptive incongruity-enhanced model for multimodal sarcasm detection[J]. Neurocomputing, 2025, 612: 128689.

Samples	Train	Validation	Test
All	19 812	2 410	2 409
Positive	9 572	1 042	1 037
Negative	10 240	1 368	1 372
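
As a quick sanity check on the dataset statistics in the table above, the short sketch below confirms that the positive and negative counts add up to the totals and computes the accuracy of a trivial majority-class predictor on each split. The helper name `majority_baseline` is hypothetical, and the baseline is included only to give the accuracy figures some context; it is not reported in the source.

```python
# Sample counts copied from the dataset table above.
splits = {
    "Train": {"positive": 9572, "negative": 10240, "all": 19812},
    "Validation": {"positive": 1042, "negative": 1368, "all": 2410},
    "Test": {"positive": 1037, "negative": 1372, "all": 2409},
}


def majority_baseline(counts: dict) -> float:
    """Accuracy obtained by always predicting the larger class."""
    return max(counts["positive"], counts["negative"]) / counts["all"]


for name, counts in splits.items():
    # The two classes should account for every sample in the split.
    assert counts["positive"] + counts["negative"] == counts["all"]
    share = counts["positive"] / counts["all"]
    print(f"{name}: positive share {share:.1%}, "
          f"majority baseline {majority_baseline(counts):.1%}")
```

All three splits contain somewhat more negative than positive samples, so a model must clear the majority baseline by a wide margin before its accuracy says anything about cross-modal understanding.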

Modality	Model	Acc	P	R	F1
Text	TextCNN [32]	71.61	64.62	75.22	69.52
Text	Bi-LSTM [33]	72.48	68.02	68.08	68.05
Text	SMSD [34]	73.56	68.45	71.55	69.97
Text	RoBERTa [35]	79.66	76.74	75.70	76.21
Image	ResNet [36]	65.50	61.17	54.39	57.58
Image	ViT [37]	72.02	65.26	74.83	69.72
Cross-modal	HFM [22]	70.57	64.84	69.05	66.88
Cross-modal	Att-BERT [20]	80.03	76.28	77.82	77.04
Cross-modal	CMGCN [23]	79.83	75.82	78.01	76.90
Cross-modal	HKE [24]	76.50	73.48	71.07	72.25
Cross-modal	DIP [1]	80.59	75.52	81.14	78.23
Cross-modal	DynRT [11]	70.37	63.02	75.15	68.55
Cross-modal	G²SAM [12]	79.43	72.04	85.20	78.04
Cross-modal	Multi-view CLIP [31]	85.64	80.33	88.24	84.10
Cross-modal	DAIE [38]	84.33	82.43	81.91	82.17
This work	GCPNet (ours)	86.06	83.74	87.15	85.41
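
The F1 column in the results table can be cross-checked against the P and R columns, since F1 is the harmonic mean of precision and recall. The sketch below does this for three rows copied from the table; the helper name `f1_from_pr` is an illustrative choice, and the check assumes the reported F1 is computed from the same precision and recall shown in the table.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2.0 * precision * recall / (precision + recall)


# (P, R, reported F1) copied from the results table above.
reported = {
    "Multi-view CLIP": (80.33, 88.24, 84.10),
    "DAIE": (82.43, 81.91, 82.17),
    "GCPNet": (83.74, 87.15, 85.41),
}

for name, (p, r, f1) in reported.items():
    computed = f1_from_pr(p, r)
    print(f"{name}: computed F1 {computed:.2f}, reported {f1:.2f}")
```

For these rows the recomputed values agree with the reported F1 to two decimals. The comparison also shows where GCPNet's gain comes from: its recall is slightly below that of Multi-view CLIP, but its higher precision lifts the harmonic mean.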