Journal of Graphics ›› 2023, Vol. 44 ›› Issue (4): 625-639. DOI: 10.11996/JG.j.2095-302X.2023040625
• Review •
A survey of video human action recognition based on deep learning
BI Chun-yan1,2, LIU Yue1,2
Received: 2022-10-21
Accepted: 2023-04-01
Online: 2023-08-31
Published: 2023-08-16
Contact: LIU Yue (1968-), professor, Ph.D. His main research interests cover augmented reality, computer vision, etc.
About author: BI Chun-yan (1995-), master student. Her main research interests cover augmented reality, computer vision and video action recognition, etc. E-mail: bichunyan_suda@163.com
BI Chun-yan, LIU Yue. A survey of video human action recognition based on deep learning[J]. Journal of Graphics, 2023, 44(4): 625-639.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2023040625
| Dataset | Year | Samples | Average duration | Action classes | Citations |
|---|---|---|---|---|---|
| KTH [57] | 2004 | 2391 | 4 s | 6 | 4484 |
| HMDB51 [58] | 2011 | ~7000 | ~5 s | 51 | 3022 |
| UCF101 [59] | 2012 | 13320 | ~6 s | 101 | 4001 |
| Sports-1M [60] | 2014 | 1000000 | ~5.5 min | 487 | 6565 |
| ActivityNet [61] | 2015 | 27801 | [5, 10] min | 203 | 1488 |
| YouTube-8M [62] | 2016 | ~8000000 | 229.6 s | 3826 | 1001 |
| Charades [63] | 2016 | 9848 | 30.1 s | 157 | 730 |
| Kinetics 400 [64] | 2017 | 306245 | 10 s | 400 | 2029 |
| Kinetics 600 [65] | 2018 | 495547 | 10 s | 600 | 206 |
| Kinetics 700 [66] | 2019 | ~650000 | 10 s | 700 | 185 |
| Sth-Sth V1 | 2017 | 108499 | [2, 6] s | 174 | - |
| Sth-Sth V2 [67] | 2017 | 220847 | [2, 6] s | 174 | 545 |
| AVA [68] | 2018 | >392416 | 15 min | 80 | 570 |
| AVA-Kinetics [69] | 2020 | >238000 | 15 min / 10 s | 80 | 40 |
| MIT [70] | 2018 | ~1000000 | 3 s | 339 | 320 |
| HACS Clips [71] | 2019 | ~1500000 | 2 s | 31 | 114 |
| HVU [72] | 2020 | ~572000 | 10 s | 739 | 35 |
| AViD [73] | 2020 | ~450000 | [3, 15] s | 887 | 17 |
Table 1 List of video action recognition datasets
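Most of the clip-level benchmarks listed in Table 1 are consumed in the same basic way: short fixed-length clips are decoded from each video and paired with a single action label. The following minimal PyTorch sketch illustrates that pattern for UCF101 using torchvision's built-in `UCF101` dataset class; the local paths `data/UCF-101` and `data/ucfTrainTestlist` are placeholder assumptions, and the videos and official split files must be obtained separately.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets

# Placeholder paths; the UCF101 videos and the official train/test split
# files are assumed to have been downloaded here beforehand.
VIDEO_ROOT = "data/UCF-101"
SPLIT_ROOT = "data/ucfTrainTestlist"

# Each sample is a 16-frame clip returned as a (T, H, W, C) uint8 tensor
# (video decoding may require the PyAV backend to be installed).
train_set = datasets.UCF101(
    root=VIDEO_ROOT,
    annotation_path=SPLIT_ROOT,
    frames_per_clip=16,
    step_between_clips=16,
    fold=1,
    train=True,
)

def collate_clips(batch):
    # Drop the audio track and convert clips to float (N, C, T, H, W),
    # the layout expected by most 3D-CNN backbones.
    videos = torch.stack([clip.permute(3, 0, 1, 2).float() / 255.0
                          for clip, _, _ in batch])
    labels = torch.tensor([label for _, _, label in batch])
    return videos, labels

loader = DataLoader(train_set, batch_size=8, shuffle=True,
                    collate_fn=collate_clips)
```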
| No. | Model | Year | Accuracy (%) | Extra training data |
|---|---|---|---|---|
| 1 | SMART | 2020 | 98.64 | F |
| 2 | LGD-3D two-stream | 2019 | 98.20 | F |
| 3 | BubbleNET | 2020 | 97.62 | F |
| 4 | D3D + D3D | 2018 | 97.60 | F |
| 5 | Multi-stream I3D | 2019 | 97.20 | F |
| 6 | Hidden two-stream | 2017 | 97.10 | F |
| 7 | TSN | 2016 | 94.20 | F |
| 8 | Two-stream I3D | 2017 | 93.40 | F |
| 9 | TDD + IDT | 2015 | 91.50 | F |
| 10 | Two-stream + LSTM | 2015 | 88.60 | F |
| 11 | P3D (ImageNet + Sports-1M) | 2017 | 88.60 | R |
| 12 | Two-Stream (ImageNet pretrained) | 2014 | 88.00 | R |
| 13 | MV-CNN | 2016 | 86.40 | F |
| 14 | Res3D | 2017 | 85.80 | F |
| 15 | ActionFlowNet | 2016 | 83.90 | F |
| 16 | C3D | 2014 | 82.30 | F |
Table 2 Comparison of recognition accuracy of different models on the UCF101 dataset
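The video-level accuracies in Table 2 are typically obtained by averaging per-clip predictions: several clips (and often several spatial crops) are sampled from each test video, scored independently, and the averaged class scores decide the video label. The sketch below shows only that averaging step, assuming `model` maps a batch of clips to class logits; the exact number of clips and crops varies between the listed methods.

```python
import torch

@torch.no_grad()
def video_level_prediction(model, clips):
    """Predict one label for a whole video from several sampled clips.

    `clips` is a tensor of shape (num_clips, C, T, H, W). Per-clip softmax
    scores are averaged before taking the arg-max, which is the common
    multi-clip evaluation protocol behind video-level accuracy numbers.
    """
    model.eval()
    scores = torch.softmax(model(clips), dim=1)   # (num_clips, num_classes)
    return scores.mean(dim=0).argmax().item()     # video-level class index
```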
| No. | Model | Year | Acc@1 | Acc@5 | Extra training data |
|---|---|---|---|---|---|
| 1 | MTV-H (WT 60M) | 2022 | 89.1 | 98.2 | R |
| 2 | CoVeR (JFT-3B) | 2021 | 87.2 | 97.5 | R |
| 3 | MaskFeat (K600, MViT-L) | 2021 | 87.0 | 97.4 | R |
| 4 | ViViT-H/14x2 (JFT) | 2021 | 84.9 | 95.8 | R |
| 5 | ir-CSN-152 (IG-65M) | 2019 | 82.6 | - | R |
| 6 | ip-CSN-152 (IG-65M) | 2019 | 83.5 | 95.3 | R |
| 7 | R(2+1)D (IG-65M) | 2019 | 81.3 | 95.1 | R |
| 8 | X3D-XXL | 2020 | 80.4 | 94.6 | F |
| 9 | R3D-RS-200 | 2021 | 80.4 | 94.4 | F |
| 10 | SlowFast 16x8 (ResNet-101 + NL) | 2018 | 79.8 | - | F |
| 11 | TDN-ResNet101 (ensemble, ImageNet pretrained, RGB only) | 2020 | 79.4 | 94.4 | F |
| 12 | I3D + NL | 2017 | 77.7 | 93.3 | F |
| 13 | BQN | 2020 | 77.3 | 93.2 | R |
| 14 | TSM | 2018 | 74.7 | - | F |
| 15 | R(2+1)D-RGB (Sports-1M pretrained) | 2017 | 74.3 | 91.4 | R |
| 16 | R(2+1)D two-stream | 2017 | 73.9 | 90.9 | F |
| 17 | TSN | 2016 | 73.9 | 91.1 | F |
Table 3 Comparison of recognition accuracy of different models on the Kinetics 400 dataset
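The Acc@1 and Acc@5 columns in Table 3 are top-k accuracies: a prediction counts as correct if the ground-truth class appears among the k highest-scoring classes. A small self-contained sketch of that metric follows; the randomly generated scores over 400 classes (matching Kinetics 400) are toy input only.

```python
import torch

def topk_accuracy(logits, targets, ks=(1, 5)):
    """Return {k: Acc@k} for a batch of class scores and integer labels."""
    max_k = max(ks)
    _, pred = logits.topk(max_k, dim=1)        # (N, max_k) top class indices
    hits = pred.eq(targets.view(-1, 1))        # (N, max_k) boolean matches
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}

# Toy usage: random scores for 32 samples over the 400 Kinetics classes,
# so both accuracies come out near chance level.
logits = torch.randn(32, 400)
targets = torch.randint(0, 400, (32,))
print(topk_accuracy(logits, targets))
```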
| Method | Pretraining | Top-3 / HMDB-51 | Top-3 / UCF101 | Top-5 / UCF101 |
|---|---|---|---|---|
| Two-Stream I3D | ImageNet + Kinetics | 80.7 | 98.0 | 88.8 |
| Two-Stream I3D | ImageNet | 66.4 | 93.4 | 88.8 |
| Two-Stream I3D | Kinetics | 80.9 | 97.8 | 88.8 |
| R(2+1)D two-stream | Kinetics | 78.7 | 97.3 | 90.9 |
| R(2+1)D two-stream | Sports-1M | 72.7 | 95.0 | 90.9 |
| R(2+1)D-RGB | Kinetics | 74.5 | 96.8 | 90.0 |
| R(2+1)D-RGB | Sports-1M | 66.6 | 93.6 | 90.0 |
| R(2+1)D-Flow | Kinetics | 76.4 | 95.5 | 87.2 |
| R(2+1)D-Flow | Sports-1M | 70.1 | 93.3 | 87.2 |
| Two-Stream | ImageNet | 59.4 | 88.0 | - |
Table 4 Comparison of recognition accuracy of different models on the UCF101 and HMDB-51 datasets
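Table 4 illustrates how strongly the pretraining source (ImageNet, Sports-1M, Kinetics) affects downstream accuracy on UCF101 and HMDB-51. A common transfer recipe is sketched below with a Kinetics-400-pretrained 3D ResNet-18 from torchvision standing in for the backbones in the table (it is not one of them): the classification head is replaced for the 101 UCF101 classes and fine-tuned with a larger learning rate than the pretrained layers; the optimizer settings are illustrative assumptions, not values taken from the compared papers.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

# Load a 3D ResNet-18 pretrained on Kinetics-400 and swap its classifier
# head for the 101 classes of UCF101.
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
model.fc = nn.Linear(model.fc.in_features, 101)

# Fine-tune the randomly initialized head with a larger learning rate than
# the pretrained backbone (illustrative values only).
head_params = list(model.fc.parameters())
backbone_params = [p for name, p in model.named_parameters()
                   if not name.startswith("fc.")]
optimizer = torch.optim.SGD(
    [{"params": backbone_params, "lr": 1e-3},
     {"params": head_params, "lr": 1e-2}],
    momentum=0.9,
    weight_decay=1e-4,
)
```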
[1] | CHEN W J, ZHANG E H. A review for human action recognition based on depth data[J]. Journal of Xi'an University of Technology, 2015, 31(3): 253-264, 250 (in Chinese). |
[2] | DU Y T, CHEN F, XU W L, et al. A survey on the vision-based human motion recognition[J]. Acta Electronica Sinica, 2007, 35(1): 84-90 (in Chinese). |
[3] | HU Q, QIN L, HUANG Q M. A survey on visual human action recognition[J]. Chinese Journal of Computers, 2013, 36(12): 2512-2524 (in Chinese). |
[4] | LI R F, WANG L L, WANG K. A survey of human body action recognition[J]. Pattern Recognition and Artificial Intelligence, 2014, 27(1): 35-48 (in Chinese). |
[5] | HUANG G F, LI Y. A survey of human action and pose recognition[J]. Computer Knowledge and Technology, 2013, 9(1): 133-135 (in Chinese). |
[6] | LUO H L, WANG C J, LU F. Survey of video behavior recognition[J]. Journal on Communications, 2018, 39(6): 169-180 (in Chinese). |
[7] | QIAN H F, YI J P, FU Y H. Review of human action recognition based on deep learning[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(3): 438-455 (in Chinese). |
[8] | QIAN W X, YI Y. Summary of video recognition deep learning network[J]. Computer Science, 2022, 49(S2): 341-350 (in Chinese). |
[9] | LUO H L, TONG K, KONG F S. The progress of human action recognition in videos based on deep learning: a review[J]. Acta Electronica Sinica, 2019, 47(5): 1162-1173 (in Chinese). |
[10] | HUANG Q Q, ZHOU F Y, LIU M Z. Survey of human action recognition algorithms based on video[J]. Application Research of Computers, 2020, 37(11): 3213-3219 (in Chinese). |
[11] | SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]// The 27th International Conference on Neural Information Processing Systems. New York: ACM, 2014: 568-576. |
[12] | FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 1933-1941. |
[13] | FEICHTENHOFER C, PINZ A, WILDES R P. Spatiotemporal multiplier networks for video action recognition[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 4768-4777. |
[14] | WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2016: 20-36. |
[15] | IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]// The 32nd International Conference on International Conference on Machine Learning - Volume 37. New York:ACM, 2015: 448-456. |
[16] | LAN Z Z, ZHU Y, HAUPTMANN A G, et al. Deep local video feature for action recognition[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2017: 1219-1225. |
[17] | ZHOU B L, ANDONIAN A, OLIVA A, et al. Temporal relational reasoning in videos[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 803-818. |
[18] | XU B H, YE H, ZHENG Y B, et al. Dense dilated network for video action recognition[J]. IEEE Transactions on Image Processing, 2019, 28(10): 4941-4953. |
[19] | SHI X J, CHEN Z R, WANG H, et al. Convolutional LSTM network: a machine learning approach for precipitation nowcasting[C]// The 28th International Conference on Neural Information Processing Systems - Volume 1. New York:ACM, 2015: 802-810. |
[20] | HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780. |
[21] | DONAHUE J, HENDRICKS L A, GUADARRAMA S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 2625-2634. |
[22] | SUN L, JIA K, CHEN K, et al. Lattice long short-term memory for human action recognition[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2147-2156. |
[23] | BACCOUCHE M, MAMALET F, WOLF C, et al. Sequential deep learning for human action recognition[C]// International Workshop on Human Behavior Understanding. Heidelberg: Springer, 2011: 29-39. |
[24] | JI S W, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 35(1): 221-231. |
[25] | TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 4489-4497. |
[26] | CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 6299-6308. |
[27] | KIM J, CHA S, WEE D, et al. Regularization on spatio-temporally smoothed feature for action recognition[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 12103-12112. |
[28] | WANG X L, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7794-7803. |
[29] | FEICHTENHOFER C, FAN H Q, MALIK J, et al. SlowFast networks for video recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 6202-6211. |
[30] | XIAO F, LEE Y J, GRAUMAN K, et al. Audiovisual SlowFast networks for video recognition[EB/OL]. (2020-03-09) [2022-01-09]. https://doi.org/10.48550/arXiv.2001.08740. |
[31] | FEICHTENHOFER C. X3D: expanding architectures for efficient video recognition[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 203-213. |
[32] | ZHU S J, YANG T, MENDIETA M, et al. A3D: adaptive 3D networks for video action recognition[EB/OL]. (2020-11-24) [2022-01-09]. https://arxiv.org/abs/2011.12384. |
[33] | DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D ConvNets: new architecture and transfer learning for video classification[EB/OL]. (2017-11-22) [2022-01-09]. https://arxiv.org/abs/1711.08200. |
[34] | HE D L, ZHOU Z C, GAN C, et al. StNet: local and global spatial-temporal modeling for action recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 8401-8408. |
[35] | QIU Z F, YAO T, NGO C W, et al. Learning spatio-temporal representation with local and global diffusion[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12056-12065. |
[36] | TRAN D, WANG H, FEISZLI M, et al. Video classification with channel-separated convolutional networks[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 5552-5561. |
[37] | SUN L, JIA K, YEUNG D Y, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]// 2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 4597-4605. |
[38] | TRAN D, WANG H, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6450-6459. |
[39] | XIE S N, SUN C, HUANG J, et al. Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 305-321. |
[40] | ZOLFAGHARI M, SINGH K, BROX T. ECO: efficient convolutional network for online video understanding[C]// Computer Vision - ECCV 2018: 15th European Conference. New York: ACM, 2018: 695-712. |
[41] | LI K C, LI X H, WANG Y L, et al. CT-net: channel tensorization network for video classification[EB/OL]. (2021-06-03) [2022-01-10]. https://arxiv.org/abs/2106.01603. |
[42] | DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D convnets using temporal transition layer[C]// 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2018:1117-1121. |
[43] | QIU Z F, YAO T, MEI T. Learning spatio-temporal representation with pseudo-3D residual networks[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 5534-5542. |
[44] | DIBA A L, FAYYAZ M, SHARMA V, et al. Spatio-temporal channel correlation networks for action classification[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 284-299. |
[45] | LIN J, GAN C, HAN S. Tsm: temporal shift module for efficient video understanding[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 7083-7093. |
[46] | SHAO H, QIAN S J, LIU Y. Temporal interlacing network[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11966-11973. |
[47] | JIANG B Y, WANG M M, GAN W H, et al. STM: SpatioTemporal and motion encoding for action recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 2000-2009. |
[48] | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778. |
[49] | LI Y, JI B, SHI X T, et al. TEA: temporal excitation and aggregation for action recognition[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern. New York: IEEE Press, 2020: 909-918. |
[50] | HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7132-7141. |
[51] | LIU Z Y, LUO D H, WANG Y B, et al. TEINet: towards an efficient architecture for video recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11669-11676. |
[52] | LIU Z Y, WANG L M, WU W, et al. TAM: temporal adaptive module for video recognition[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 13708-13718. |
[53] | WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks for action recognition in videos[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(11): 2740-2755. |
[54] | NG J Y H, DAVIS L S. Temporal difference networks for video action recognition[C]// 2018 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2018: 1587-1596. |
[55] | ZHAO Y, XIONG Y J, LIN D H. Recognize actions by disentangling components of dynamics[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6566-6575. |
[56] | WANG L M, TONG Z, JI B, et al. TDN: temporal difference networks for efficient action recognition[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 1895-1904. |
[57] | SCHULDT C, LAPTEV I, CAPUTO B. Recognizing human actions: a local SVM approach[C]// The 17th International Conference on Pattern Recognition. New York: IEEE Press, 2004: 32-36. |
[58] | KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: a large video database for human motion recognition[C]// 2011 International Conference on Computer Vision. New York: IEEE Press, 2011: 2556-2563. |
[59] | SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[EB/OL]. (2012-12-03) [2022-01-10]. https://arxiv.org/abs/1212.0402. |
[60] | KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]// 2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2014: 1725-1732. |
[61] | HEILBRON F C, ESCORCIA V, GHANEM B, et al. ActivityNet: a large-scale video benchmark for human activity understanding[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 961-970. |
[62] | ABU-EL-HAIJA S, KOTHARI N, LEE J, et al. YouTube-8M: a large-scale video classification benchmark[EB/OL]. (2016-09-27) [2022-01-10]. https://arxiv.org/abs/1609.08675. |
[63] | SIGURDSSON G A, VAROL G, WANG X L, et al. Hollywood in homes: crowdsourcing data collection for activity understanding[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2016: 510-526. |
[64] | KAY W, CARREIRA J, SIMONYAN K, et al. The kinetics human action video dataset[EB/OL]. (2017-05-19) [2022-01-10]. https://arxiv.org/abs/1705.06950. |
[65] | CARREIRA J, NOLAND E, BANKI-HORVATH A, et al. A short note about kinetics-600[EB/OL]. (2018-08-03) [2022-01-10]. https://arxiv.org/abs/1808.01340. |
[66] | CARREIRA J, NOLAND E, HILLIER C, et al. A short note on the kinetics-700 human action dataset[EB/OL]. (2022-10-17) [2022-01-10]. https://doi.org/10.48550/arXiv.1907.06987. |
[67] | GOYAL R, KAHOU S E, MICHALSKI V, et al. The “something something” video database for learning and evaluating visual common sense[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 5842-5850. |
[68] | GU C H, SUN C, ROSS D A, et al. AVA: a video dataset of spatio-temporally localized atomic visual actions[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6047-6056. |
[69] | LI A, THOTAKURI M, ROSS D A, et al. The AVA-kinetics localized human actions video dataset[EB/OL]. (2020-05-20) [2022-01-10]. https://arxiv.org/abs/2005.00214. |
[70] | MONFORT M, ANDONIAN A, ZHOU B L, et al. Moments in time dataset: one million videos for event understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 42(2): 502-508. |
[71] | ZHAO H, TORRALBA A, TORRESANI L, et al. HACS: human action clips and segments dataset for recognition and temporal localization[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 8668-8678. |
[72] | DIBA A L, FAYYAZ M, SHARMA V, et al. Large scale holistic video understanding[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2020: 593-610. |
[73] | PIERGIOVANNI A, RYOO M S. AViD dataset: anonymized videos from diverse countries[C]// The 34th International Conference on Neural Information Processing Systems. New York: ACM, 2020: 16711-16721. |
[74] | GOWDA S N, ROHRBACH M, SEVILLA-LARA L. SMART frame selection for action recognition[C]// 2020 AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 1451-1459. |
[75] | BASTOS I L O, MELO V H C, SCHWARTZ W R. BubbleNET: a disperse recurrent structure to recognize activities[C]// 2020 IEEE International Conference on Image Processing. New York: IEEE Press, 2020: 2216-2220. |
[76] | STROUD J C, ROSS D A, SUN C, et al. D3D: distilled 3D networks for video action recognition[C]// 2020 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2020: 625-634. |
[77] | HONG J, CHO B, HONG Y W, et al. Contextual action cues from camera sensor for multi-stream action recognition[J]. Sensors, 2019, 19(6): 1382. |
[78] | ZHU Y, LAN Z Z, NEWSAM S, et al. Hidden two-stream convolutional networks for action recognition[C]// Asian Conference on Computer Vision. Perth: Springer, 2019: 363-378. |
[79] | WANG L M, QIAO Y, TANG X O. Action recognition with trajectory-pooled deep-convolutional descriptors[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 4305-4314. |
[80] | NG J Y H, HAUSKNECHT M, VIJAYANARASIMHAN S, et al. Beyond short snippets: deep networks for video classification[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 4694-4702. |
[81] | ZHANG B W, WANG L M, WANG Z, et al. Real-time action recognition with enhanced motion vector CNNs[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 2718-2726. |
[82] | TRAN D, RAY J, SHOU Z, et al. ConvNet architecture search for spatiotemporal feature learning[EB/OL]. (2017-08-16) [2022-01-10]. https://arxiv.org/abs/1708.05038. |
[83] | NG J Y H, CHOI J, NEUMANN J, et al. ActionFlowNet: learning motion representation for action recognition[C]// 2018 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2018: 1616-1624. |
[84] | YAN S, XIONG X H, ARNAB A, et al. Multiview transformers for video recognition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 3333-3343. |
[85] | ZHANG B W, YU J H, FIFTY C, et al. Co-training transformer with videos and images improves action recognition[EB/OL]. (2021-12-14) [2022-01-10]. https://arxiv.org/abs/2112.07175. |
[86] | WEI C, FAN H Q, XIE S N, et al. Masked feature prediction for self-supervised visual pre-training[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 14668-14678. |
[87] | ARNAB A, DEHGHANI M, HEIGOLD G, et al. ViViT: a video vision transformer[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 6836-6846. |
[88] | GHADIYARAM D, TRAN D, MAHAJAN D. Large-scale weakly-supervised pre-training for video action recognition[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 12046-12055. |
[89] | DU X Z, LI Y Q, CUI Y, et al. Revisiting 3D ResNets for video recognition[EB/OL]. (2021-09-03) [2022-01-10]. https://arxiv.org/abs/2109.01696. |
[90] | HUANG G X, BORS A G. Busy-quiet video disentangling for video classification[C]// 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2022: 1341-1350. |
[91] | LI Y W, LI Y, VASCONCELOS N. RESOUND: towards action recognition without representation bias[C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 513-528. |
[92] | GOYAL P, DOLLÁR P, GIRSHICK R, et al. Accurate, large minibatch sgd: training imagenet in 1 hour[EB/OL]. (2018-04-30) [2022-01-10]. https://doi.org/10.48550/arXiv.1706.02677. |
[93] | LIN J, GAN C, HAN S. Training kinetics in 15 minutes: large-scale distributed training on videos[EB/OL]. (2019-12-07) [2022-01-10]. https://arxiv.org/abs/1910.00932. |
[94] | HOWARD A, SANDLER M, CHEN B, et al. Searching for MobileNetV3[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 1314-1324. |
[95] | BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[EB/OL]. [2022-01-20]. https://arxiv.org/abs/2102.05095v2. |
[96] | ARNAB A, DEHGHANI M, HEIGOLD G, et al. ViViT: a video vision transformer[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 6836-6846. |
[97] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. (2021-06-03) [2022-01-10]. https://arxiv.org/abs/2010.11929. |
[98] | YANG J W, DONG X B, LIU L J, et al. Recurring the transformer for video action recognition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 14063-14073. |
[99] | CHEN J W, HO C M. MM-ViT: multi-modal video transformer for compressed video action recognition[C]// 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2022: 1910-1921. |
[100] | ROIG C, SARMIENTO M, VARAS D, et al. Multi-modal pyramid feature combination for human action recognition[C]// 2019 IEEE/CVF International Conference on Computer Vision Workshop. New York: IEEE Press, 2019: 3742-3746. |
[101] | SUN C, MYERS A, VONDRICK C, et al. VideoBERT: a joint model for video and language representation learning[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2020: 7463-7472. |
[102] | WANG M M, XING J Z, LIU Y. ActionCLIP: a new paradigm for video action recognition[EB/OL]. (2021-09-17) [2022-01-10]. https://arxiv.org/abs/2109.08472. |