Journal of Graphics, 2026, Vol. 47, Issue (2): 286-295. DOI: 10.11996/JG.j.2095-302X.2026020286

• Image Processing and Computer Vision •

Cross-modal consistency detection via graph topological feature extraction

FANG Youjiang, WANG Shihao, ZHANG Liang, DUAN Keran, LIU Yue, WEI Xiaopeng, YANG Xin

  1. Key Laboratory of Social Computing and Cognitive Intelligence, Ministry of Education, Dalian University of Technology, Dalian Liaoning 116024, China
  • Received: 2025-06-03 Accepted: 2025-12-13 Online: 2026-04-30 Published: 2026-05-20
  • Contact: YANG Xin
  • Supported by:
    The Ministry of Education Humanities and Social Sciences Research Project Youth Fund Project (22YJCZH116)

Abstract:

With the rapid development of social media, large volumes of multimodal content spread widely during public opinion events, making automated public opinion monitoring a critical tool for social governance and early risk warning. Complex linguistic expressions such as sarcasm and metaphor are common in online discourse and are often marked by inconsistencies between the textual and visual modalities, which significantly complicates automatic detection. Existing cross-modal consistency detection methods are limited in how they structurally model unimodal and multimodal information and in how well they capture deep semantic correlations, which hinders an accurate grasp of real-world public opinion trends. To address these issues, a Graph-structure-aware Cross-modal Public Opinion Network (GCPNet) is proposed. First, the CLIP (Contrastive Language-Image Pretraining) model is used as the feature encoder, and fully connected graph topologies are constructed with text words and image patches as nodes; Graph Convolutional Networks (GCNs) are then applied to explicitly mine and enhance the semantic and structural correlations within the multimodal information. Second, a hierarchical interactive attention graph module is designed to improve global modeling and deep interaction in complex contexts through three stages: fine-grained cross-attention alignment, global adaptive gating fusion, and dynamic graph structure enhancement. Finally, an adaptive weighted fusion strategy dynamically integrates the unimodal structured features with the cross-modal interactive features. Experimental results on the public benchmark dataset MMSD2.0 show that GCPNet captures cross-modal consistency cues and identifies complex public opinion content such as sarcasm and metaphor, outperforming existing state-of-the-art methods in accuracy and robustness. This research provides a new methodological pathway and theoretical foundation for multimodal public opinion understanding and offers a practical tool for real-world opinion governance and social risk mitigation.
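
As a rough illustration of two steps named in the abstract (GCN propagation over a fully connected graph of token or patch nodes, and adaptive weighted fusion of unimodal and cross-modal features), a minimal sketch in plain Python follows. It is not the authors' implementation. It assumes an unweighted adjacency with self-loops, the standard symmetric GCN normalization, mean pooling over nodes, and softmax-normalized fusion scores; all function names, the toy feature values, and the placeholder `cross_vec` (standing in for the output of the attention module) are hypothetical.

```python
import math

def fully_connected_adjacency(n):
    """Adjacency of a fully connected graph on n nodes (self-loops included)."""
    return [[1.0] * n for _ in range(n)]

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}, as in standard GCN layers."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(adj_norm, feats, weight):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

def mean_pool(feats):
    """Average node features into one graph-level vector (an assumption)."""
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_weighted_fusion(vectors, scores):
    """Combine feature vectors with softmax-normalized scores.

    In a trained model the scores would be learnable; here they are fixed.
    """
    weights = softmax(scores)
    dim = len(vectors[0])
    fused = [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
    return fused, weights

# Toy node features standing in for CLIP token and patch embeddings.
text_nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]               # 3 words
image_nodes = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [0.0, 0.0]]  # 4 patches
identity = [[1.0, 0.0], [0.0, 1.0]]                             # stand-in for W

text_adj = normalize_adjacency(fully_connected_adjacency(len(text_nodes)))
image_adj = normalize_adjacency(fully_connected_adjacency(len(image_nodes)))

text_vec = mean_pool(gcn_layer(text_adj, text_nodes, identity))
image_vec = mean_pool(gcn_layer(image_adj, image_nodes, identity))
cross_vec = [0.0, 3.0]  # hypothetical placeholder for cross-modal features

fused, weights = adaptive_weighted_fusion(
    [text_vec, image_vec, cross_vec], scores=[0.0, 0.0, 0.0]
)
print(fused, weights)
```

Under these assumptions the normalized adjacency of a fully connected graph is uniform, so a single layer replaces every node with the transformed mean of all nodes. Structure-aware behaviour would therefore have to come from weighted or dynamically updated edges, which the abstract's "dynamic graph structure enhancement" stage may provide but which this sketch does not model.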

Key words: multimodal public opinion monitoring, graph topology extraction, cross-modal feature fusion, attention mechanism, multimodal fusion
