A survey of video human action recognition based on deep learning

Journal of Graphics, 2023, Vol. 44, Issue (4): 625-639. DOI: 10.11996/JG.j.2095-302X.2023040625

BI Chun-yan¹,², LIU Yue¹,²

1. Beijing Mixed Reality and New Display Engineering Technology Research Center, Beijing 100081, China
2. School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China

Received: 2022-10-21  Accepted: 2023-04-01  Published: 2023-08-31  Online: 2023-08-16
Contact: LIU Yue (1968-), professor, Ph.D. His main research interests cover augmented reality, computer vision, etc. E-mail: liuyue@bit.edu.cn
About author:
BI Chun-yan (1995-), master student. Her main research interests cover augmented reality, computer vision, video action recognition, etc. E-mail: bichunyan_suda@163.com
Supported by:
National Natural Science Foundation of China (61960206007); Introducing Talents of Discipline to Universities (B18005)

Abstract:

With the rapid development of network multimedia technology and the continuous improvement of video capture devices, more and more videos are shared on online platforms, and video has become a pervasive part of daily life. Video understanding has consequently become one of the hot topics in computer vision research, and action recognition, as a primary task of video understanding, is of great research significance. Deep-learning-based methods for 2D image recognition and classification have made considerable progress, but video action recognition still faces major challenges. The reason is that a video has one more dimension than a 2D image, the temporal dimension: understanding actions such as walking, running, high jumping, and long jumping in a video requires not only the spatial semantic information available in 2D images but also temporal information. How to exploit the temporal information in video is therefore critical for action recognition. This paper first introduces the research background and development of action recognition and analyzes the challenges currently facing video action recognition. It then describes methods for temporal modeling and parameter optimization in detail and reviews the commonly used action recognition datasets and evaluation metrics. Finally, it outlines future research directions.

Key words: action recognition, video understanding, deep learning, convolutional neural network, computer vision
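
The abstract's central point, that spatial information alone cannot separate some actions, can be illustrated with a toy example. The sketch below is not a method from the survey: the one-dimensional "frames", the helper names `mean_frame`, `bright_pixel_path`, and `net_displacement`, and the two clips are all hypothetical, chosen only to show that an order-free summary of the frames is identical for a clip and its reversal, while an order-aware summary tells them apart.

```python
from typing import List

Frame = List[float]  # a toy "frame": a 1-D row of pixel intensities


def mean_frame(video: List[Frame]) -> Frame:
    """Average all frames pixel by pixel; the result ignores frame order."""
    n = len(video)
    return [sum(frame[i] for frame in video) / n for i in range(len(video[0]))]


def bright_pixel_path(video: List[Frame]) -> List[int]:
    """Index of the brightest pixel in each frame, in temporal order."""
    return [max(range(len(frame)), key=lambda i: frame[i]) for frame in video]


def net_displacement(video: List[Frame]) -> int:
    """Signed movement of the brightest pixel between the first and last frame."""
    path = bright_pixel_path(video)
    return path[-1] - path[0]


# A bright dot moving left to right across 4 pixels, and the same clip reversed.
move_right = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
move_left = list(reversed(move_right))

# Order-free (purely spatial) summary: identical for both clips.
same_appearance = mean_frame(move_right) == mean_frame(move_left)

# Order-aware summary: the sign separates the two motions.
print(same_appearance, net_displacement(move_right), net_displacement(move_left))
```

Both clips contain exactly the same set of frames, so any per-frame feature that is pooled without regard to order cannot distinguish them; only a feature that looks at how frames follow one another can.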

CLC number: TP391

BI Chun-yan, LIU Yue. A survey of video human action recognition based on deep learning[J]. Journal of Graphics, 2023, 44(4): 625-639.

Figures/Tables: 12

References: 102

[1]	CHEN W J, ZHANG E H. A review for human action recognition based on depth data[J]. Journal of Xi’an University of Technology, 2015, 31(3): 253-264, 250 (in Chinese).
[2]	DU Y T, CHEN F, XU W L, et al. A survey on the vision-based human motion recognition[J]. Acta Electronica Sinica, 2007, 35(1): 84-90 (in Chinese).
[3]	HU Q, QIN L, HUANG Q M. A survey on visual human action recognition[J]. Chinese Journal of Computers, 2013, 36(12): 2512-2524 (in Chinese).
[4]	LI R F, WANG L L, WANG K. A survey of human body action recognition[J]. Pattern Recognition and Artificial Intelligence, 2014, 27(1): 35-48 (in Chinese).
[5]	HUANG G F, LI Y. A survey of human action and pose recognition[J]. Computer Knowledge and Technology, 2013, 9(1): 133-135 (in Chinese).
[6]	LUO H L, WANG C J, LU F. Survey of video behavior recognition[J]. Journal on Communications, 2018, 39(6): 169-180 (in Chinese).
[7]	QIAN H F, YI J P, FU Y H. Review of human action recognition based on deep learning[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(3): 438-455 (in Chinese).
[8]	QIAN W X, YI Y. Summary of video recognition deep learning network[J]. Computer Science, 2022, 49(S2): 341-350 (in Chinese).
[9]	LUO H L, TONG K, KONG F S. The progress of human action recognition in videos based on deep learning: a review[J]. Acta Electronica Sinica, 2019, 47(5): 1162-1173 (in Chinese).
[10]	HUANG Q Q, ZHOU F Y, LIU M Z. Survey of human action recognition algorithms based on video[J]. Application Research of Computers, 2020, 37(11): 3213-3219 (in Chinese).
[11]	SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]// The 27th International Conference on Neural Information Processing Systems. New York: ACM, 2014: 568-576.
[12]	FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1933-1941.
[13]	FEICHTENHOFER C, PINZ A, WILDES R P. Spatiotemporal multiplier networks for video action recognition[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 4768-4777.
[14]	WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2016: 20-36.
[15]	IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]// The 32nd International Conference on International Conference on Machine Learning - Volume 37. New York:ACM, 2015: 448-456.
[16]	LAN Z Z, ZHU Y, HAUPTMANN A G, et al. Deep local video feature for action recognition[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2017: 1219-1225.
[17]	ZHOU B L, ANDONIAN A, OLIVA A, et al. Temporal relational reasoning in videos[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 803-818.
[18]	XU B H, YE H, ZHENG Y B, et al. Dense dilated network for video action recognition[J]. IEEE Transactions on Image Processing, 2019, 28(10): 4941-4953.
[19]	SHI X J, CHEN Z R, WANG H, et al. Convolutional LSTM network: a machine learning approach for precipitation nowcasting[C]// The 28th International Conference on Neural Information Processing Systems - Volume 1. New York:ACM, 2015: 802-810.
[20]	HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[21]	DONAHUE J, HENDRICKS L A, GUADARRAMA S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 2625-2634.
[22]	SUN L, JIA K, CHEN K, et al. Lattice long short-term memory for human action recognition[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2147-2156.
[23]	BACCOUCHE M, MAMALET F, WOLF C, et al. Sequential deep learning for human action recognition[C]// International Workshop on Human Behavior Understanding. Heidelberg: Springer, 2011: 29-39.
[24]	JI S W, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 35(1): 221-231.
[25]	TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 4489-4497.
[26]	CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 6299-6308.
[27]	KIM J, CHA S, WEE D, et al. Regularization on spatio-temporally smoothed feature for action recognition[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 12103-12112.
[28]	WANG X L, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7794-7803.
[29]	FEICHTENHOFER C, FAN H Q, MALIK J, et al. SlowFast networks for video recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 6202-6211.
[30]	XIAO F, LEE Y J, GRAUMAN K, et al. Audiovisual SlowFast networks for video recognition[EB/OL]. (2020-03-09) [2022-01-09]. https://doi.org/10.48550/arXiv.2001.08740.
[31]	FEICHTENHOFER C. X3D: expanding architectures for efficient video recognition[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 203-213.
[32]	ZHU S J, YANG T, MENDIETA M, et al. A3D: adaptive 3D networks for video action recognition[EB/OL]. (2020-11-24) [2022-01-09]. https://arxiv.org/abs/2011.12384.
[33]	DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D ConvNets: new architecture and transfer learning for video classification[EB/OL]. (2017-11-22) [2022-01-09]. https://arxiv.org/abs/1711.08200.
[34]	HE D L, ZHOU Z C, GAN C, et al. StNet: local and global spatial-temporal modeling for action recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 8401-8408.
[35]	QIU Z F, YAO T, NGO C W, et al. Learning spatio-temporal representation with local and global diffusion[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12056-12065.
[36]	TRAN D, WANG H, FEISZLI M, et al. Video classification with channel-separated convolutional networks[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5552-5561.
[37]	SUN L, JIA K, YEUNG D Y, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 4597-4605.
[38]	TRAN D, WANG H, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6450-6459.
[39]	XIE S N, SUN C, HUANG J, et al. Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 305-321.
[40]	ZOLFAGHARI M, SINGH K, BROX T. ECO: efficient convolutional network for online video understanding[C]// Computer Vision - ECCV 2018: 15th European Conference. New York: ACM, 2018: 695-712.
[41]	LI K C, LI X H, WANG Y L, et al. CT-net: channel tensorization network for video classification[EB/OL]. (2021-06-03) [2022-01-10]. https://arxiv.org/abs/2106.01603.
[42]	DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D convnets using temporal transition layer[C]// 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2018:1117-1121.
[43]	QIU Z F, YAO T, MEI T. Learning spatio-temporal representation with pseudo-3D residual networks[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 5534-5542.
[44]	DIBA A L, FAYYAZ M, SHARMA V, et al. Spatio-temporal channel correlation networks for action classification[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 284-299.
[45]	LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 7083-7093.
[46]	SHAO H, QIAN S J, LIU Y. Temporal interlacing network[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11966-11973.
[47]	JIANG B Y, WANG M M, GAN W H, et al. STM: SpatioTemporal and motion encoding for action recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 2000-2009.
[48]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[49]	LI Y, JI B, SHI X T, et al. TEA: temporal excitation and aggregation for action recognition[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 909-918.
[50]	HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7132-7141.
[51]	LIU Z Y, LUO D H, WANG Y B, et al. TEINet: towards an efficient architecture for video recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11669-11676.
[52]	LIU Z Y, WANG L M, WU W, et al. TAM: temporal adaptive module for video recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13708-13718.
[53]	WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks for action recognition in videos[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(11): 2740-2755.
[54]	NG J Y H, DAVIS L S. Temporal difference networks for video action recognition[C]// 2018 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2018: 1587-1596.
[55]	ZHAO Y, XIONG Y J, LIN D H. Recognize actions by disentangling components of dynamics[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6566-6575.
[56]	WANG L M, TONG Z, JI B, et al. TDN: temporal difference networks for efficient action recognition[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 1895-1904.
[57]	SCHULDT C, LAPTEV I, CAPUTO B. Recognizing human actions: a local SVM approach[C]// The 17th International Conference on Pattern Recognition. New York: IEEE Press, 2004: 32-36.
[58]	KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: a large video database for human motion recognition[C]// 2011 International Conference on Computer Vision. New York: IEEE Press, 2011: 2556-2563.
[59]	SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[EB/OL]. (2012-12-03) [2022-01-10]. https://arxiv.org/abs/1212.0402.
[60]	KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]// 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2014: 1725-1732.
[61]	HEILBRON F C, ESCORCIA V, GHANEM B, et al. ActivityNet: a large-scale video benchmark for human activity understanding[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 961-970.
[62]	ABU-EL-HAIJA S, KOTHARI N, LEE J, et al. YouTube-8M: a large-scale video classification benchmark[EB/OL]. (2016-09-27) [2022-01-10]. https://arxiv.org/abs/1609.08675.
[63]	SIGURDSSON G A, VAROL G, WANG X L, et al. Hollywood in homes: crowdsourcing data collection for activity understanding[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2016: 510-526.
[64]	KAY W, CARREIRA J, SIMONYAN K, et al. The kinetics human action video dataset[EB/OL]. (2016-09-27) [2022-01-10]. https://arxiv.org/abs/1705.06950.
[65]	CARREIRA J, NOLAND E, BANKI-HORVATH A, et al. A short note about kinetics-600[EB/OL]. (2018-08-03) [2022-01-10]. https://arxiv.org/abs/1808.01340.
[66]	CARREIRA J, NOLAND E, HILLIER C, et al. A short note on the kinetics-700 human action dataset[EB/OL]. (2022-10-17) [2022-01-10]. https://doi.org/10.48550/arXiv.1907.06987.
[67]	GOYAL R, KAHOU S E, MICHALSKI V, et al. The “something something” video database for learning and evaluating visual common sense[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 5842-5850.
[68]	GU C H, SUN C, ROSS D A, et al. AVA: a video dataset of spatio-temporally localized atomic visual actions[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6047-6056.
[69]	LI A, THOTAKURI M, ROSS D A, et al. The AVA-kinetics localized human actions video dataset[EB/OL]. (2020-05-20) [2022-01-10]. https://arxiv.org/abs/2005.00214.
[70]	MONFORT M, ANDONIAN A, ZHOU B L, et al. Moments in time dataset: one million videos for event understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 42(2): 502-508.
[71]	ZHAO H, TORRALBA A, TORRESANI L, et al. HACS: human action clips and segments dataset for recognition and temporal localization[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 8668-8678.
[72]	DIBA A L, FAYYAZ M, SHARMA V, et al. Large scale holistic video understanding[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2020: 593-610.
[73]	PIERGIOVANNI A, RYOO M S. AViD dataset: anonymized videos from diverse countries[C]// The 34th International Conference on Neural Information Processing Systems. New York: ACM, 2020: 16711-16721.
[74]	GOWDA S N, ROHRBACH M, SEVILLA-LARA L. SMART frame selection for action recognition[C]// 2020 AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 1451-1459.
[75]	IGOR L O B, VICTOR H C M, SCHWARTZ W R. Bubblenet: a disperse recurrent structure to recognize activities[C]// 2020 IEEE International Conference on Image Processing. New York: IEEE Press, 2020: 2216-2220.
[76]	STROUD J C, ROSS D A, SUN C, et al. D3D: distilled 3D networks for video action recognition[C]// 2020 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2020: 625-634.
[77]	HONG J, CHO B, HONG Y W, et al. Contextual action cues from camera sensor for multi-stream action recognition[J]. Sensors: Basel, Switzerland, 2019, 19(6): 1382.
[78]	ZHU Y, LAN Z Z, NEWSAM S, et al. Hidden two-stream convolutional networks for action recognition[C]// Asian Conference on Computer Vision. Perth: Springer, 2019: 363-378.
[79]	WANG L M, QIAO Y, TANG X O. Action recognition with trajectory-pooled deep-convolutional descriptors[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 4305-4314.
[80]	NG J Y H, HAUSKNECHT M, VIJAYANARASIMHAN S, et al. Beyond short snippets: deep networks for video classification[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 4694-4702.
[81]	ZHANG B W, WANG L M, WANG Z, et al. Real-time action recognition with enhanced motion vector CNNs[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 2718-2726.
[82]	TRAN D, RAY J, SHOU Z, et al. ConvNet architecture search for spatiotemporal feature learning[EB/OL]. (2017-08-16) [2022-01-10]. https://arxiv.org/abs/1708.05038.
[83]	NG J Y H, CHOI J, NEUMANN J, et al. ActionFlowNet: learning motion representation for action recognition[C]// 2018 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2018: 1616-1624.
[84]	YAN S, XIONG X H, ARNAB A, et al. Multiview transformers for video recognition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 3333-3343.
[85]	ZHANG B W, YU J H, FIFTY C, et al. Co-training transformer with videos and images improves action recognition[EB/OL]. (2021-12-14) [2022-01-10]. https://arxiv.org/abs/2112.07175.
[86]	WEI C, FAN H Q, XIE S N, et al. Masked feature prediction for self-supervised visual pre-training[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 14668-14678.
[87]	ARNAB A, DEHGHANI M, HEIGOLD G, et al. ViViT: a video vision transformer[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 6836-6846.
[88]	GHADIYARAM D, TRAN D, MAHAJAN D. Large-scale weakly-supervised pre-training for video action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12046-12055.
[89]	DU X Z, LI Y Q, CUI Y, et al. Revisiting 3D ResNets for video recognition[EB/OL]. (2021-09-03) [2022-01-10]. https://arxiv.org/abs/2109.01696.
[90]	HUANG G X, BORS A G. Busy-quiet video disentangling for video classification[C]// 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2022: 1341-1350.
[91]	LI Y W, LI Y, VASCONCELOS N. RESOUND: towards action recognition without representation bias[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 513-528.
[92]	GOYAL P, DOLLÁR P, GIRSHICK R, et al. Accurate, large minibatch sgd: training imagenet in 1 hour[EB/OL]. (2018-04-30) [2022-01-10]. https://doi.org/10.48550/arXiv.1706.02677.
[93]	LIN J, GAN C, HAN S. Training kinetics in 15 minutes: large-scale distributed training on videos[EB/OL]. (2019-12-07) [2022-01-10]. https://arxiv.org/abs/1910.00932.
[94]	HOWARD A, SANDLER M, CHEN B, et al. Searching for MobileNetV3[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 1314-1324.
[95]	BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[EB/OL]. [2022-01-20]. https://arxiv.org/abs/2102.05095v2.
[96]	ARNAB A, DEHGHANI M, HEIGOLD G, et al. ViViT: a video vision transformer[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 6836-6846.
[97]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. (2021-06-03) [2022-01-10]. https://arxiv.org/abs/2010.11929.
[98]	YANG J W, DONG X B, LIU L J, et al. Recurring the transformer for video action recognition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 14063-14073.
[99]	CHEN J W, HO C M. MM-ViT: multi-modal video transformer for compressed video action recognition[C]// 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2022: 1910-1921.
[100]	ROIG C, SARMIENTO M, VARAS D, et al. Multi-modal pyramid feature combination for human action recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision Workshop. New York: IEEE Press, 2019: 3742-3746.
[101]	SUN C, MYERS A, VONDRICK C, et al. VideoBERT: a joint model for video and language representation learning[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2020: 7463-7472.
[102]	WANG M M, XING J Z, LIU Y. ActionCLIP: a new paradigm for video action recognition[EB/OL]. (2021-09-17) [2022-01-10]. https://arxiv.org/abs/2109.08472.

Dataset	Year	Samples	Average duration	Action classes	Citations
KTH [57]	2004	2391	4 s	6	4484
HMDB51 [58]	2011	~7000	~5 s	51	3022
UCF101 [59]	2012	13320	~6 s	101	4001
Sports 1M [60]	2014	1000000	~5.5 min	487	6565
ActivityNet [61]	2015	27801	[5, 10] min	203	1488
YouTube8M [62]	2016	~8000000	229.6 s	3826	1001
Charades [63]	2016	9848	30.1 s	157	730
Kinetics 400 [64]	2017	306245	10 s	400	2029
Kinetics 600 [65]	2018	495547	10 s	600	206
Kinetics 700 [66]	2019	~650000	10 s	700	185
Sth-Sth V1	2017	108499	[2, 6] s	174	-
Sth-Sth V2 [67]	2017	220847	[2, 6] s	174	545
AVA [68]	2018	>392416	15 min	80	570
AVA-Kinetics [69]	2020	>238000	15 min, 10 s	80	40
MIT [70]	2018	~1000000	3 s	339	320
HACS Clips [71]	2019	~1500000	2 s	31	114
HVU [72]	2020	~572000	10 s	739	35
AViD [73]	2020	~450000	[3, 15] s	887	17

No.	Model	Year	Top-3	Extra training data
1	SMART [74]	2020	98.64	No
2	LGD-3D two-stream [35]	2019	98.20	No
3	BubbleNET [75]	2020	97.62	No
4	D3D + D3D [76]	2018	97.60	No
5	Multi-stream I3D [77]	2019	97.20	No
6	Hidden two-stream [78]	2017	97.10	No
7	TSN [14]	2016	94.20	No
8	Two-stream I3D [26]	2017	93.40	No
9	TDD+IDT [79]	2015	91.50	No
10	Two-stream+LSTM [80]	2015	88.60	No
11	P3D (ImageNet+Sports1M) [43]	2017	88.60	Yes
12	Two-Stream (ImageNet pretrained) [80]	2014	88.00	Yes
13	MV-CNN [81]	2016	86.40	No
14	Res3D [82]	2017	85.80	No
15	ActionFlowNet [83]	2016	83.90	No
16	C3D [25]	2014	82.30	No

No.	Model	Year	Acc@1	Acc@5	Extra training data
1	MTV-H (WT 60M) [84]	2022	89.1	98.2	Yes
2	CoVeR (JFT-3B) [85]	2021	87.2	97.5	Yes
3	MaskFeat (K600, MViT-L) [86]	2021	87.0	97.4	Yes
4	ViViT-H/14x2 (JFT) [87]	2021	84.9	95.8	Yes
5	Ir-CSN-152 (IG-65M [88]) [36]	2019	82.6	-	Yes
6	Ip-CSN-152 (IG-65M) [36]	2019	83.5	95.3	Yes
7	R(2+1)D (IG-65M) [36]	2019	81.3	95.1	Yes
8	X3D-XXL [31]	2020	80.4	94.6	No
9	R3D-RS-200 [89]	2021	80.4	94.4	No
10	SlowFast 16x8 (ResNet-101+NL) [29]	2018	79.8	-	No
11	TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only) [56]	2020	79.4	94.4	No
12	I3D+NL [28]	2017	77.7	93.3	No
13	BQN [90]	2020	77.3	93.2	Yes
14	TSM [45]	2018	74.7	-	No
15	R(2+1)D-RGB (Sports-1M pretrain) [38]	2017	74.3	91.4	Yes
16	R(2+1)D-Two-Stream [38]	2017	73.9	90.9	No
17	TSN [14]	2016	73.9	91.1	No
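
The Acc@1 and Acc@5 columns above are assumed here to follow the usual top-k accuracy convention: a sample counts as correct when its true class is among the k highest-scoring predictions. The following is a minimal sketch of that convention, not the evaluation code of any cited work; the function name `top_k_accuracy` and the scores and labels are made up for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Percentage of samples whose true label is among the k top-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices by descending score and keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Hypothetical scores for 4 clips over 6 classes (not taken from any benchmark).
scores = [
    [0.10, 0.60, 0.05, 0.05, 0.10, 0.10],  # true class 1: ranked first
    [0.30, 0.25, 0.20, 0.10, 0.10, 0.05],  # true class 2: ranked third
    [0.05, 0.05, 0.10, 0.10, 0.20, 0.50],  # true class 5: ranked first
    [0.40, 0.25, 0.15, 0.10, 0.07, 0.03],  # true class 5: ranked last
]
labels = [1, 2, 5, 5]

acc1 = top_k_accuracy(scores, labels, k=1)
acc5 = top_k_accuracy(scores, labels, k=5)
print(f"Acc@1 = {acc1:.1f}, Acc@5 = {acc5:.1f}")
```

Because every top-1 hit is also a top-5 hit, Acc@5 can never be lower than Acc@1 for the same model and test set, which matches the pattern in the table.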

Method	Pretraining	Top-3 / HMDB-51	Top-3 / UCF101	Top-5 / UCF101
Two-Stream I3D	ImageNet + Kinetics	80.7	98.0	88.8
Two-Stream I3D	ImageNet	66.4	93.4	88.8
Two-Stream I3D	Kinetics	80.9	97.8	88.8
R(2+1)D-Two-Stream	Kinetics	78.7	97.3	90.9
R(2+1)D-Two-Stream	Sports-1M	72.7	95.0	90.9
R(2+1)D-RGB	Kinetics	74.5	96.8	90.0
R(2+1)D-RGB	Sports-1M	66.6	93.6	90.0
R(2+1)D-Flow	Kinetics	76.4	95.5	87.2
R(2+1)D-Flow	Sports-1M	70.1	93.3	87.2
Two-Stream	ImageNet	59.4	88.0	-
