Journal of Graphics ›› 2021, Vol. 42 ›› Issue (3): 439-445. DOI: 10.11996/JG.j.2095-302X.2021030439
• Image Processing and Computer Vision •
Abstract: Convolutional neural networks (CNNs) have limited ability to capture temporal information in video action detection. To address this problem, we propose a model that fuses a non-local neural network: non-local blocks are combined with a 3D CNN to capture global dependencies between video frames. The model uses a two-stream architecture in which a 2D CNN takes single video frames as input and extracts spatial features, while a 3D CNN takes video frame sequences as input and extracts motion features. To further enrich contextual semantic information, an improved attention and channel fusion mechanism aggregates the features of the two networks, and the fused features are then used for frame-level detection. We evaluate and compare the method experimentally on the UCF101-24 and JHMDB datasets. The results show that our method can fully integrate spatial and temporal information and achieves high detection accuracy on video-based action detection tasks.
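As an illustration of how a non-local operation relates every space-time position to every other one, the following is a minimal sketch and not the authors' implementation. It assumes the embedded-Gaussian form (softmax over pairwise dot products) with a residual connection, and it omits the learned projections and the 3D CNN backbone entirely. The function names `softmax` and `non_local` and the flattened list-of-vectors input are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_local(features):
    """Parameter-free non-local operation with a residual connection.

    features: list of N feature vectors (each a list of C floats), one per
    space-time position, i.e. N = T * H * W after flattening a video tensor.
    Assumption: identity embeddings are used in place of learned projections.
    """
    out = []
    for x_i in features:
        # Pairwise similarity between position i and every position j.
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        weights = softmax(scores)
        # Weighted sum over all positions: the global response at position i.
        y_i = [
            sum(w * x_j[c] for w, x_j in zip(weights, features))
            for c in range(len(x_i))
        ]
        # Residual connection keeps the original local feature.
        out.append([x + y for x, y in zip(x_i, y_i)])
    return out


# Toy input: three positions with two channels each.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = non_local(toy)
```

Because every output position is a weighted sum over all input positions, the response at one frame can depend on features from any other frame, which is the global connection the abstract refers to. The pairwise comparison costs on the order of N² operations, so practical implementations operate on downsampled feature maps rather than raw pixels.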
Key words: action detection, non-local neural network, 3D convolution, attention mechanism 
CLC Number: TP 391
HUANG Wen-ming, YANG Mu-li, LAN Ru-shi, DENG Zhen-rong, LUO Xiao-nan. Action detection model fused with non-local neural network[J]. Journal of Graphics, 2021, 42(3): 439-445.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2021030439
http://www.txxb.com.cn/EN/Y2021/V42/I3/439