
Journal of Graphics (图学学报)

• Image Processing and Computer Vision •

  • Supported by: National Natural Science Foundation of China (U1804152, 61806180)

Human action recognition based on ResNeXt

  1. (School of Information Engineering, Zhengzhou University, Zhengzhou, Henan 450000, China)
  • Online: 2020-04-30 Published: 2020-05-15


Abstract: Human action recognition is one of the core research directions in computer vision, with applications in many scenarios. Deep convolutional neural networks have achieved great success in static image recognition and have gradually expanded into video content recognition, but their application there still faces great challenges. This paper proposes a deep neural network model based on ResNeXt for human action recognition in videos. The main contributions are as follows: ① The ResNeXt architecture was used in place of the various convolutional network structures used previously, and data in two modalities, RGB and optical flow, were used so that the model could fully exploit both the appearance and the temporal information in videos. ② An end-to-end video temporal segmentation strategy was applied to the ResNeXt model: each video was divided into K segments to model the long-range temporal structure of the video sequence, and the optimal value of K was determined through tests. This enables the model to better distinguish similar actions that share sub-actions, reducing the misjudgments that such shared sub-actions tend to cause. Tests on the widely used action recognition datasets UCF101 and HMDB51 show that the recognition accuracy of the proposed model and method surpasses that of several models and methods reported in the existing literature.

Key words: action recognition, ResNeXt, video temporal segmentation, data augmentation, multimodal
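The two core ideas of the abstract, sampling one snippet from each of K temporal segments and aggregating per-snippet scores (a segmental consensus, as in temporal segment networks), plus late fusion of RGB and optical-flow stream scores, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are ours, per-snippet scores are assumed to come from a trained ResNeXt backbone, and the equal-weight fusion is an assumption (stream weights are typically tuned).

```python
import random

def temporal_segments(num_frames, k):
    """Split frame indices [0, num_frames) into k equal-length segments and
    sample one snippet index from each segment (trailing remainder frames,
    if num_frames is not divisible by k, are simply unused)."""
    seg_len = num_frames // k
    return [random.randrange(i * seg_len, (i + 1) * seg_len) for i in range(k)]

def segmental_consensus(snippet_scores):
    """Average per-snippet class scores into one video-level score vector."""
    k = len(snippet_scores)
    num_classes = len(snippet_scores[0])
    return [sum(s[c] for s in snippet_scores) / k for c in range(num_classes)]

def fuse_streams(rgb_scores, flow_scores, w=0.5):
    """Late fusion of the two modality streams by a weighted average of
    their class scores (w = RGB weight; equal weighting is assumed here)."""
    return [w * r + (1 - w) * f for r, f in zip(rgb_scores, flow_scores)]
```

For example, with a 90-frame video and K = 3, `temporal_segments(90, 3)` draws one frame index from each of the ranges [0, 30), [30, 60), and [60, 90); scoring the snippet at each index with both streams, applying `segmental_consensus` per stream, and then `fuse_streams` yields the video-level prediction.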