A survey of video human action recognition based on deep learning

doi:10.11996/JG.j.2095-302X.2023040625

Abstract

With the rapid advancement of network multimedia technology and the continuous improvement of video capture equipment, an increasing number of videos are shared on network platforms, and video has gradually become an integral part of daily life. Consequently, video understanding has become a hot spot of computer vision research, with action recognition being one of its pivotal tasks. Deep learning methods for 2D image recognition and classification have made significant strides, but video action recognition remains a formidable challenge. The reason is that videos differ from 2D images by an additional temporal dimension: understanding actions such as walking, running, high jumping, and long jumping requires not only the spatial semantic information that 2D images carry but also temporal information. Effectively utilizing the temporal information of videos is therefore critical for action recognition. This paper first introduces the research background and development process of action recognition, then analyzes the current challenges in video action recognition. It presents methods of temporal modeling and parameter optimization in detail, examines the commonly used action recognition datasets and evaluation metrics, and finally outlines future research directions in this field.

Key words: action recognition, video understanding, deep learning, convolutional neural network, computer vision

CLC Number: TP391

BI Chun-yan, LIU Yue. A survey of video human action recognition based on deep learning[J]. Journal of Graphics, 2023, 44(4): 625-639.

Figures/Tables 12

References 102

[1]	陈万军, 张二虎. 基于深度信息的人体动作识别研究综述[J]. 西安理工大学学报, 2015, 31(3): 253-264, 250.
	CHEN W J, ZHANG E H. A review for human action recognition based on depth data[J]. Journal of Xi’an University of Technology, 2015, 31(3): 253-264, 250 (in Chinese).
[2]	杜友田, 陈峰, 徐文立, 等. 基于视觉的人的运动识别综述[J]. 电子学报, 2007, 35(1): 84-90.
	DU Y T, CHEN F, XU W L, et al. A survey on the vision-based human motion recognition[J]. Acta Electronica Sinica, 2007, 35(1): 84-90 (in Chinese).
[3]	胡琼, 秦磊, 黄庆明. 基于视觉的人体动作识别综述[J]. 计算机学报, 2013, 36(12): 2512-2524.
	HU Q, QIN L, HUANG Q M. A survey on visual human action recognition[J]. Chinese Journal of Computers, 2013, 36(12): 2512-2524 (in Chinese).
[4]	李瑞峰, 王亮亮, 王珂. 人体动作行为识别研究综述[J]. 模式识别与人工智能, 2014, 27(1): 35-48.
	LI R F, WANG L L, WANG K. A survey of human body action recognition[J]. Pattern Recognition and Artificial Intelligence, 2014, 27(1): 35-48 (in Chinese).
[5]	黄国范, 李亚. 人体动作姿态识别综述[J]. 电脑知识与技术, 2013, 9(1): 133-135.
	HUANG G F, LI Y. A survey of human action and pose recognition[J]. Computer Knowledge and Technology, 2013, 9(1): 133-135 (in Chinese).
[6]	罗会兰, 王婵娟, 卢飞. 视频行为识别综述[J]. 通信学报, 2018, 39(6): 169-180.
	LUO H L, WANG C J, LU F. Survey of video behavior recognition[J]. Journal on Communications, 2018, 39(6): 169-180 (in Chinese).
[7]	钱慧芳, 易剑平, 付云虎. 基于深度学习的人体动作识别综述[J]. 计算机科学与探索, 2021, 15(3): 438-455.
	QIAN H F, YI J P, FU Y H. Review of human action recognition based on deep learning[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(3): 438-455 (in Chinese).
[8]	钱文祥, 衣杨. 视频识别深度学习网络综述[J]. 计算机科学, 2022, 49(S2): 341-350.
	QIAN W X, YI Y. Summary of video recognition deep learning network[J]. Computer Science, 2022, 49(S2): 341-350 (in Chinese).
[9]	罗会兰, 童康, 孔繁胜. 基于深度学习的视频中人体动作识别进展综述[J]. 电子学报, 2019, 47(5): 1162-1173.
	LUO H L, TONG K, KONG F S. The progress of human action recognition in videos based on deep learning: a review[J]. Acta Electronica Sinica, 2019, 47(5): 1162-1173 (in Chinese).
[10]	黄晴晴, 周风余, 刘美珍. 基于视频的人体动作识别算法综述[J]. 计算机应用研究, 2020, 37(11): 3213-3219.
	HUANG Q Q, ZHOU F Y, LIU M Z. Survey of human action recognition algorithms based on video[J]. Application Research of Computers, 2020, 37(11): 3213-3219 (in Chinese).
[11]	SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]// The 27th International Conference on Neural Information Processing Systems. New York: ACM, 2014: 568-576.
[12]	FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1933-1941.
[13]	FEICHTENHOFER C, PINZ A, WILDES R P. Spatiotemporal multiplier networks for video action recognition[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 4768-4777.
[14]	WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2016: 20-36.
[15]	IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]// The 32nd International Conference on Machine Learning - Volume 37. New York: ACM, 2015: 448-456.
[16]	LAN Z Z, ZHU Y, HAUPTMANN A G, et al. Deep local video feature for action recognition[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2017: 1219-1225.
[17]	ZHOU B L, ANDONIAN A, OLIVA A, et al. Temporal relational reasoning in videos[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 803-818.
[18]	XU B H, YE H, ZHENG Y B, et al. Dense dilated network for video action recognition[J]. IEEE Transactions on Image Processing: a Publication of the IEEE Signal Processing Society, 2019, 28(10): 4941-4953.
[19]	SHI X J, CHEN Z R, WANG H, et al. Convolutional LSTM network: a machine learning approach for precipitation nowcasting[C]// The 28th International Conference on Neural Information Processing Systems - Volume 1. New York: ACM, 2015: 802-810.
[20]	HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[21]	DONAHUE J, HENDRICKS L A, GUADARRAMA S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 2625-2634.
[22]	SUN L, JIA K, CHEN K, et al. Lattice long short-term memory for human action recognition[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2147-2156.
[23]	BACCOUCHE M, MAMALET F, WOLF C, et al. Sequential deep learning for human action recognition[C]// International Workshop on Human Behavior Understanding. Heidelberg: Springer, 2011: 29-39.
[24]	JI S W, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 35(1): 221-231.
[25]	TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 4489-4497.
[26]	CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 6299-6308.
[27]	KIM J, CHA S, WEE D, et al. Regularization on spatio-temporally smoothed feature for action recognition[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 12103-12112.
[28]	WANG X L, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7794-7803.
[29]	FEICHTENHOFER C, FAN H Q, MALIK J, et al. SlowFast networks for video recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 6202-6211.
[30]	XIAO F, LEE Y J, GRAUMAN K, et al. Audiovisual SlowFast networks for video recognition[EB/OL]. (2020-03-09) [2022-01-09]. https://doi.org/10.48550/arXiv.2001.08740.
[31]	FEICHTENHOFER C. X3D: expanding architectures for efficient video recognition[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 203-213.
[32]	ZHU S J, YANG T, MENDIETA M, et al. A3D: adaptive 3D networks for video action recognition[EB/OL]. (2020-11-24) [2022-01-09]. https://arxiv.org/abs/2011.12384.
[33]	DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D ConvNets: new architecture and transfer learning for video classification[EB/OL]. (2017-11-22) [2022-01-09]. https://arxiv.org/abs/1711.08200.
[34]	HE D L, ZHOU Z C, GAN C, et al. StNet: local and global spatial-temporal modeling for action recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 8401-8408.
[35]	QIU Z F, YAO T, NGO C W, et al. Learning spatio-temporal representation with local and global diffusion[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12056-12065.
[36]	TRAN D, WANG H, FEISZLI M, et al. Video classification with channel-separated convolutional networks[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5552-5561.
[37]	SUN L, JIA K, YEUNG D Y, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 4597-4605.
[38]	TRAN D, WANG H, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6450-6459.
[39]	XIE S N, SUN C, HUANG J, et al. Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 305-321.
[40]	ZOLFAGHARI M, SINGH K, BROX T. ECO: efficient convolutional network for online video understanding[C]// Computer Vision - ECCV 2018: 15th European Conference. New York: ACM, 2018: 695-712.
[41]	LI K C, LI X H, WANG Y L, et al. CT-net: channel tensorization network for video classification[EB/OL]. (2021-06-03) [2022-01-10]. https://arxiv.org/abs/2106.01603.
[42]	DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D convnets using temporal transition layer[C]// 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2018: 1117-1121.
[43]	QIU Z F, YAO T, MEI T. Learning spatio-temporal representation with pseudo-3D residual networks[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 5534-5542.
[44]	DIBA A L, FAYYAZ M, SHARMA V, et al. Spatio-temporal channel correlation networks for action classification[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 284-299.
[45]	LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 7083-7093.
[46]	SHAO H, QIAN S J, LIU Y. Temporal interlacing network[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11966-11973.
[47]	JIANG B Y, WANG M M, GAN W H, et al. STM: SpatioTemporal and motion encoding for action recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 2000-2009.
[48]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[49]	LI Y, JI B, SHI X T, et al. TEA: temporal excitation and aggregation for action recognition[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 909-918.
[50]	HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7132-7141.
[51]	LIU Z Y, LUO D H, WANG Y B, et al. TEINet: towards an efficient architecture for video recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11669-11676.
[52]	LIU Z Y, WANG L M, WU W, et al. TAM: temporal adaptive module for video recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13708-13718.
[53]	WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks for action recognition in videos[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(11): 2740-2755.
[54]	NG J Y H, DAVIS L S. Temporal difference networks for video action recognition[C]// 2018 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2018: 1587-1596.
[55]	ZHAO Y, XIONG Y J, LIN D H. Recognize actions by disentangling components of dynamics[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6566-6575.
[56]	WANG L M, TONG Z, JI B, et al. TDN: temporal difference networks for efficient action recognition[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 1895-1904.
[57]	SCHULDT C, LAPTEV I, CAPUTO B. Recognizing human actions: a local SVM approach[C]// The 17th International Conference on Pattern Recognition. New York: IEEE Press, 2004: 32-36.
[58]	KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: a large video database for human motion recognition[C]// 2011 International Conference on Computer Vision. New York: IEEE Press, 2011: 2556-2563.
[59]	SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[EB/OL]. (2012-12-03) [2022-01-10]. https://arxiv.org/abs/1212.0402.
[60]	KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]// 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2014: 1725-1732.
[61]	HEILBRON F C, ESCORCIA V, GHANEM B, et al. ActivityNet: a large-scale video benchmark for human activity understanding[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 961-970.
[62]	ABU-EL-HAIJA S, KOTHARI N, LEE J, et al. YouTube-8M: a large-scale video classification benchmark[EB/OL]. (2016-09-27) [2022-01-10]. https://arxiv.org/abs/1609.08675.
[63]	SIGURDSSON G A, VAROL G, WANG X L, et al. Hollywood in homes: crowdsourcing data collection for activity understanding[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2016: 510-526.
[64]	KAY W, CARREIRA J, SIMONYAN K, et al. The kinetics human action video dataset[EB/OL]. (2017-05-19) [2022-01-10]. https://arxiv.org/abs/1705.06950.
[65]	CARREIRA J, NOLAND E, BANKI-HORVATH A, et al. A short note about kinetics-600[EB/OL]. (2018-08-03) [2022-01-10]. https://arxiv.org/abs/1808.01340.
[66]	CARREIRA J, NOLAND E, HILLIER C, et al. A short note on the kinetics-700 human action dataset[EB/OL]. (2022-10-17) [2022-01-10]. https://doi.org/10.48550/arXiv.1907.06987.
[67]	GOYAL R, KAHOU S E, MICHALSKI V, et al. The “something something” video database for learning and evaluating visual common sense[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 5842-5850.
[68]	GU C H, SUN C, ROSS D A, et al. AVA: a video dataset of spatio-temporally localized atomic visual actions[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6047-6056.
[69]	LI A, THOTAKURI M, ROSS D A, et al. The AVA-kinetics localized human actions video dataset[EB/OL]. (2020-05-20) [2022-01-10]. https://arxiv.org/abs/2005.00214.
[70]	MONFORT M, ANDONIAN A, ZHOU B L, et al. Moments in time dataset: one million videos for event understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 42(2): 502-508.
[71]	ZHAO H, TORRALBA A, TORRESANI L, et al. HACS: human action clips and segments dataset for recognition and temporal localization[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 8668-8678.
[72]	DIBA A L, FAYYAZ M, SHARMA V, et al. Large scale holistic video understanding[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2020: 593-610.
[73]	PIERGIOVANNI A, RYOO M S. AViD dataset: anonymized videos from diverse countries[C]// The 34th International Conference on Neural Information Processing Systems. New York: ACM, 2020: 16711-16721.
[74]	GOWDA S N, ROHRBACH M, SEVILLA-LARA L. SMART frame selection for action recognition[C]// 2020 AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 1451-1459.
[75]	IGOR L O B, VICTOR H C M, SCHWARTZ W R. Bubblenet: a disperse recurrent structure to recognize activities[C]// 2020 IEEE International Conference on Image Processing. New York: IEEE Press, 2020: 2216-2220.
[76]	STROUD J C, ROSS D A, SUN C, et al. D3D: distilled 3D networks for video action recognition[C]// 2020 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2020: 625-634.
[77]	HONG J, CHO B, HONG Y W, et al. Contextual action cues from camera sensor for multi-stream action recognition[J]. Sensors (Basel, Switzerland), 2019, 19(6): 1382.
[78]	ZHU Y, LAN Z Z, NEWSAM S, et al. Hidden two-stream convolutional networks for action recognition[C]// Asian Conference on Computer Vision. Perth: Springer, 2019: 363-378.
[79]	WANG L M, QIAO Y, TANG X O. Action recognition with trajectory-pooled deep-convolutional descriptors[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 4305-4314.
[80]	NG J Y H, HAUSKNECHT M, VIJAYANARASIMHAN S, et al. Beyond short snippets: deep networks for video classification[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 4694-4702.
[81]	ZHANG B W, WANG L M, WANG Z, et al. Real-time action recognition with enhanced motion vector CNNs[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 2718-2726.
[82]	TRAN D, RAY J, SHOU Z, et al. ConvNet architecture search for spatiotemporal feature learning[EB/OL]. (2017-08-16) [2022-01-10]. https://arxiv.org/abs/1708.05038.
[83]	NG J Y H, CHOI J, NEUMANN J, et al. ActionFlowNet: learning motion representation for action recognition[C]// 2018 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2018: 1616-1624.
[84]	YAN S, XIONG X H, ARNAB A, et al. Multiview transformers for video recognition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 3333-3343.
[85]	ZHANG B W, YU J H, FIFTY C, et al. Co-training transformer with videos and images improves action recognition[EB/OL]. (2021-12-14) [2022-01-10]. https://arxiv.org/abs/2112.07175.
[86]	WEI C, FAN H Q, XIE S N, et al. Masked feature prediction for self-supervised visual pre-training[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 14668-14678.
[87]	ARNAB A, DEHGHANI M, HEIGOLD G, et al. ViViT: a video vision transformer[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 6836-6846.
[88]	GHADIYARAM D, TRAN D, MAHAJAN D. Large-scale weakly-supervised pre-training for video action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12046-12055.
[89]	DU X Z, LI Y Q, CUI Y, et al. Revisiting 3D ResNets for video recognition[EB/OL]. (2021-09-03) [2022-01-10]. https://arxiv.org/abs/2109.01696.
[90]	HUANG G X, BORS A G. Busy-quiet video disentangling for video classification[C]// 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2022: 1341-1350.
[91]	LI Y W, LI Y, VASCONCELOS N. RESOUND: towards action recognition without representation bias[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 513-528.
[92]	GOYAL P, DOLLÁR P, GIRSHICK R, et al. Accurate, large minibatch SGD: training ImageNet in 1 hour[EB/OL]. (2018-04-30) [2022-01-10]. https://doi.org/10.48550/arXiv.1706.02677.
[93]	LIN J, GAN C, HAN S. Training kinetics in 15 minutes: large-scale distributed training on videos[EB/OL]. (2019-12-07) [2022-01-10]. https://arxiv.org/abs/1910.00932.
[94]	HOWARD A, SANDLER M, CHEN B, et al. Searching for MobileNetV3[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 1314-1324.
[95]	BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[EB/OL]. [2022-01-20]. https://arxiv.org/abs/2102.05095v2.
[96]	ARNAB A, DEHGHANI M, HEIGOLD G, et al. ViViT: a video vision transformer[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 6836-6846.
[97]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. (2021-06-03) [2022-01-10]. https://arxiv.org/abs/2010.11929.
[98]	YANG J W, DONG X B, LIU L J, et al. Recurring the transformer for video action recognition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 14063-14073.
[99]	CHEN J W, HO C M. MM-ViT: multi-modal video transformer for compressed video action recognition[C]// 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2022: 1910-1921.
[100]	ROIG C, SARMIENTO M, VARAS D, et al. Multi-modal pyramid feature combination for human action recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision Workshop. New York: IEEE Press, 2019: 3742-3746.
[101]	SUN C, MYERS A, VONDRICK C, et al. VideoBERT: a joint model for video and language representation learning[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2020: 7463-7472.
[102]	WANG M M, XING J Z, LIU Y. ActionCLIP: a new paradigm for video action recognition[EB/OL]. (2021-09-17) [2022-01-10]. https://arxiv.org/abs/2109.08472.

Dataset	Year	Samples	Average duration	Action classes	Citations
KTH^[57]	2004	2391	4 s	6	4484
HMDB51^[58]	2011	~7000	~5 s	51	3022
UCF101^[59]	2012	13320	~6 s	101	4001
Sports 1M^[60]	2014	1000000	~5.5 m	487	6565
ActivityNet^[61]	2015	27801	[5, 10] m	203	1488
YouTube8M^[62]	2016	~8000000	229.6 s	3826	1001
Charades^[63]	2016	9848	30.1 s	157	730
Kinetics 400^[64]	2017	306245	10 s	400	2029
Kinetics 600^[65]	2018	495547	10 s	600	206
Kinetics 700^[66]	2019	~650000	10 s	700	185
Sth-Sth V1	2017	108499	[2, 6] s	174	-
Sth-Sth V2^[67]	2017	220847	[2, 6] s	174	545
AVA^[68]	2018	>392416	15 m	80	570
AVA-Kinetics^[69]	2020	>238000	15 m, 10 s	80	40
MIT^[70]	2018	~1000000	3 s	339	320
HACS Clips^[71]	2019	~1500000	2 s	31	114
HVU^[72]	2020	~572000	10 s	739	35
AViD^[73]	2020	~450000	[3, 15] s	887	17

No.	Model	Year	Top-3 (%)	Extra training data
1	SMART^[74]	2020	98.64	F
2	LGD-3D two-stream^[35]	2019	98.20	F
3	BubbleNET^[75]	2020	97.62	F
4	D3D + D3D^[76]	2018	97.60	F
5	Multi-stream I3D^[77]	2019	97.20	F
6	Hidden two-stream^[78]	2017	97.10	F
7	TSN^[14]	2016	94.20	F
8	Two-stream I3D^[26]	2017	93.40	F
9	TDD+IDT^[79]	2015	91.50	F
10	Two-stream+LSTM^[80]	2015	88.60	F
11	P3D(ImageNet+Sports1M)^[43]	2017	88.60	R
12	Two-Stream(ImageNet pretrained)^[80]	2014	88.00	R
13	MV-CNN^[81]	2016	86.40	F
14	Res3D^[82]	2017	85.80	F
15	ActionFlowNet^[83]	2016	83.90	F
16	C3D^[25]	2014	82.30	F

No.	Model	Year	Acc@1 (%)	Acc@5 (%)	Extra training data
1	MTV-H(WT 60M)^[84]	2022	89.1	98.2	R
2	CoWeR(JFT-3B)^[85]	2021	87.2	97.5	R
3	MaskFeat(K600,MViT-L)^[86]	2021	87.0	97.4	R
4	ViViT-H/14x2(JFT)^[87]	2021	84.9	95.8	R
5	Ir-CSN-152(IG-65M^[88])^[36]	2019	82.6	-	R
6	Ip-CSN-152(IG-65M)^[36]	2019	83.5	95.3	R
7	R(2+1)D(IG-65M)^[36]	2019	81.3	95.1	R
8	X3D-XXL^[31]	2020	80.4	94.6	F
9	R3D-RS-200^[89]	2021	80.4	94.4	F
10	SlowFast 16x8 (ResNet-101+NL)^[29]	2018	79.8	-	F
11	TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only)^[56]	2020	79.4	94.4	F
12	I3D+NL^[28]	2017	77.7	93.3	F
13	BQN^[90]	2020	77.3	93.2	R
14	TSM^[45]	2018	74.7	-	F
15	R(2+1)D-RGB (Sports-1M pretrain)^[38]	2017	74.3	91.4	R
16	R(2+1)D-Two-Stream^[38]	2017	73.9	90.9	F
17	TSN^[14]	2016	73.9	91.1	F
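
The Acc@1 and Acc@5 columns above are top-k accuracies: a clip counts as correct when its ground-truth class is among the k highest-scoring predicted classes. The following is a minimal sketch of that computation in plain Python, not the evaluation code used by any of the cited works. The function name `top_k_accuracy` and the toy scores and labels are assumptions made for illustration only.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int) -> float:
    """Return the percentage of samples whose true label is among
    the k classes with the highest predicted scores."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equal in length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example (hypothetical data): 4 clips, 3 action classes.
scores = [
    [0.1, 0.7, 0.2],  # true class 1: ranked first
    [0.5, 0.3, 0.2],  # true class 1: ranked second
    [0.2, 0.3, 0.5],  # true class 0: ranked third
    [0.6, 0.3, 0.1],  # true class 0: ranked first
]
labels = [1, 1, 0, 0]

acc1 = top_k_accuracy(scores, labels, k=1)
acc2 = top_k_accuracy(scores, labels, k=2)
print(f"Acc@1 = {acc1:.1f}%, Acc@2 = {acc2:.1f}%")
```

On this toy input two of the four clips are correct at k=1 and three at k=2, so Acc@k can only grow as k increases, which is why Acc@5 is never lower than Acc@1 in the table.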