Journal of Graphics ›› 2023, Vol. 44 ›› Issue (1): 104-111. DOI: 10.11996/JG.j.2095-302X.2023010104
HUANG Zhi-yong, HAN Sha-sha, CHEN Zhi-jun, YAO Yu, XIONG Biao, MA Kai
Received:
2022-06-17
Revised:
2022-07-07
Online:
2023-10-31
Published:
2023-02-16
About author:
HUANG Zhi-yong (1979-), associate professor, Ph.D. His main research interests include computer vision and computer graphics. E-mail: hzy@hzy.org.cn
Supported by:
Abstract:
In semi-supervised segmentation tasks, the one-shot video object segmentation (OSVOS) method separates the foreground object in subsequent frames from the video, guided by the object mask annotated in the first frame. Although it achieves impressive segmentation results, it is unsuitable for cases where the appearance of the foreground object changes significantly or closely resembles the background. To address these problems, an imitation U-shaped network structure for video object segmentation is proposed. An attention mechanism is inserted between the encoder and the decoder of this network to build associations between feature maps and produce global semantic information. Meanwhile, the loss function is optimized to further alleviate the inter-class imbalance and improve the robustness of the model. In addition, multi-scale prediction is combined with a fully connected conditional random field (FC/Dense CRF) to improve the smoothness of the edges of the segmentation results. Extensive experiments on the challenging DAVIS 2016 dataset show that the proposed method achieves competitive segmentation results compared with other state-of-the-art methods.
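The abstract's class-imbalance weighting is not spelled out on this page, but a common form of class-balanced binary cross-entropy (in the spirit of the boundary-detection loss of HED [37]) can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact loss; the function name and weighting scheme are assumptions:

```python
import numpy as np

def balanced_bce(pred, mask, eps=1e-7):
    """Class-balanced binary cross-entropy (illustrative sketch).

    pred: predicted foreground probabilities in (0, 1)
    mask: binary ground-truth mask (1 = foreground, 0 = background)
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    # beta is the background fraction: it up-weights the scarce
    # foreground pixels and down-weights the abundant background,
    # countering the inter-class imbalance.
    beta = 1.0 - mask.sum() / mask.size
    loss = -(beta * mask * np.log(pred)
             + (1.0 - beta) * (1.0 - mask) * np.log(1.0 - pred))
    return loss.mean()
```

With a confident correct prediction the loss is near zero, while a confidently wrong prediction is penalized heavily, with the foreground term dominating despite its small pixel count.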
CLC number:
HUANG Zhi-yong, HAN Sha-sha, CHEN Zhi-jun, YAO Yu, XIONG Biao, MA Kai. An imitation U-shaped network for video object segmentation[J]. Journal of Graphics, 2023, 44(1): 104-111.
| Method | OL | J&F Mean[%] | J Mean[%] | F Mean[%] |
|---|---|---|---|---|
| OSMN | × | 73.45 | 74.00 | 72.90 |
| FAVOS | × | 80.95 | 82.40 | 79.50 |
| RGMP | × | 81.75 | 81.50 | 82.00 |
| FEELVOS | × | 81.65 | 81.10 | 82.20 |
| CRVOS | × | 81.60 | 82.20 | 81.00 |
| SAT | × | 83.10 | 82.60 | 83.60 |
| RANet | × | 85.50 | 85.50 | 85.40 |
| MaskTrack | √ | 77.55 | 79.70 | 75.40 |
| OSVOS | √ | 80.20 | 79.80 | 80.60 |
| FRTMVOS | √ | 83.50 | - | - |
| LucidTracker | √ | 83.60 | 84.80 | 82.30 |
| STCNN | √ | 83.80 | 83.80 | 83.80 |
| OnAVOS | √ | 84.95 | 85.70 | 84.20 |
| PReMVOS | √ | 86.75 | 84.90 | 88.60 |
| CINM | √ | 84.20 | 83.40 | 85.00 |
| MHPVOS | √ | 88.55 | 87.60 | 89.50 |
| Ours | √ | 87.07 | 86.26 | 87.88 |
Table 1 Results compared with the state-of-the-art methods
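For reference, the J Mean column in Table 1 is the DAVIS region similarity: the mean intersection-over-union between predicted and ground-truth masks (F Mean is the analogous boundary F-measure). A minimal sketch of J for a single frame:

```python
import numpy as np

def region_similarity(pred, gt):
    """DAVIS region similarity J: the Jaccard index (IoU)
    between a predicted binary mask and the ground truth."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: treat as perfect agreement
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```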
Fig. 5 Comparison of qualitative results ((a) Ours; (b) MHPVOS; (c) CINM; (d) FEELVOS; (e) FAVOS; (f) OSVOS; (g) MaskTrack; (h) LucidTracker; (i) Ground truth)
| Method | J Mean | F Mean |
|---|---|---|
| Ours | 86.26 | 87.88 |
| DA | 84.29 | 84.58 |
| Dense CRF | 74.03 | 74.56 |
Table 2 Ablation experiments on the DAVIS 2016 validation dataset (%)
| Method | J Mean | F Mean |
|---|---|---|
| Ours | 74.03 | 74.56 |
| Original | 73.79 | 73.35 |
Table 3 Ablation experiment on loss function (%)
Fig. 7 Qualitative results of paragliding launch and kite surf video sequences ((a) Frame 8; (b) Frame 18; (c) Frame 28)
[1] | CAELLES S, MANINIS K K, PONT-TUSET J, et al. One-shot video object segmentation[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 5320-5329. |
[2] | JAIN S D, XIONG B, GRAUMAN K. FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2117-2126. |
[3] | KHOREVA A, BENENSON R, ILG E, et al. Lucid data dreaming for video object segmentation[J]. International Journal of Computer Vision, 2019, 127(9): 1175-1197. |
[4] | PERAZZI F, KHOREVA A, BENENSON R, et al. Learning video object segmentation from static images[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3491-3500. |
[5] | KINGMA D, BA J. Adam: a method for stochastic optimization[EB/OL]. (2014-12-22) [2022-01-30].https://arxiv.org/abs/1412.6980. |
[6] | HELD D, THRUN S, SAVARESE S. Learning to track at 100 FPS with deep regression networks[M]//Computer Vision - ECCV 2016. Cham: Springer International Publishing, 2016: 749-765. |
[7] | NAM H, HAN B. Learning multi-domain convolutional neural networks for visual tracking[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 4293-4302. |
[8] | VOIGTLAENDER P, LEIBE B. Online adaptation of convolutional neural networks for video object segmentation[C]//The British Machine Vision Conference 2017. Durham University: British Machine Vision Association, 2017: 1-13. |
[9] | GRIFFIN B A, CORSO J J. BubbleNets: learning to select the guidance frame in video object segmentation by deep sorting frames[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 8906-8915. |
[10] | SHARIR G, SMOLYANSKY E, FRIEDMAN I. Video object segmentation using tracked object proposals[EB/OL]. [2022-01-03].https://arxiv.org/abs/1707.06545. |
[11] | CHEN L C, PAPANDREOU G, KOKKINOS I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[EB/OL]. [2022-01-03]. https://arxiv.org/abs/1412.7062. |
[12] | HU Y T, HUANG J B, SCHWING A. MaskRNN: instance level video object segmentation[C]//Neural Information Processing Systems. California: MIT Press, 2017: 325-334. |
[13] | MÄRKI N, PERAZZI F, WANG O, et al. Bilateral space video segmentation[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 743-751. |
[14] | JAMPANI V, GADDE R, GEHLER P V. Video propagation networks[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3154-3164. |
[15] | CHENG J C, TSAI Y H, HUNG W C, et al. Fast and accurate online video object segmentation via tracking parts[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7415-7424. |
[16] | YANG L J, WANG Y R, XIONG X H, et al. Efficient video object segmentation via network modulation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6499-6507. |
[17] | XIAO H X, FENG J S, LIN G S, et al. MoNet: deep motion exploitation for video object segmentation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 1140-1148. |
[18] | LUITEN J, VOIGTLAENDER P, LEIBE B. PReMVOS: proposal-generation, refinement and merging for video object segmentation[M]//Computer Vision - ACCV 2018. Cham: Springer International Publishing, 2018: 565-580. |
[19] | HU Y T, HUANG J B, SCHWING A G. VideoMatch: matching based video object segmentation[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 56-73. |
[20] | VOIGTLAENDER P, CHAI Y N, SCHROFF F, et al. FEELVOS: fast end-to-end embedding learning for video object segmentation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 9473-9482. |
[21] | OH S W, LEE J Y, SUNKAVALLI K, et al. Fast video object segmentation by reference-guided mask propagation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7376-7385. |
[22] | OH S W, LEE J Y, XU N, et al. Video object segmentation using space-time memory networks[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 9226-9235. |
[23] | JOHNANDER J, DANELLJAN M, BRISSMAN E, et al. A generative appearance model for end-to-end video object segmentation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 8945-8954. |
[24] | LIN H J, QI X J, JIA J Y. AGSS-VOS: attention guided single-shot video object segmentation[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 3948-3956. |
[25] | ZENG X H, LIAO R J, GU L, et al. DMM-net: differentiable mask-matching network for video object segmentation[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 3929-3938. |
[26] | WANG Z Q, XU J, LIU L, et al. RANet: ranking attention network for fast video object segmentation[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 3978-3987. |
[27] | RONNEBERGER O, FISCHER P, BROX T. U-net: convolutional networks for biomedical image segmentation[M]// Lecture Notes in Computer Science. Cham: Springer International Publishing, 2015: 234-241. |
[28] | BADRINARAYANAN V, KENDALL A, CIPOLLA R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495. |
[29] | CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation[EB/OL]. (2017-06-17) [2021-12-05].https://arxiv.org/abs/1706.05587. |
[30] | CHEN L C, ZHU Y K, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 801-818. |
[31] | HOWARD A, SANDLER M, CHEN B, et al. Searching for MobileNetV3[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 1314-1324. |
[32] | SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2014-09-04) [2022-01-10].https://arxiv.org/abs/1409.1556. |
[33] | FU J, LIU J, TIAN H J, et al. Dual attention network for scene segmentation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3146-3154. |
[34] | SHELHAMER E, LONG J, DARRELL T. Fully convolutional networks for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 640-651. |
[35] | SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 1-9. |
[36] | KRÄHENBÜHL P, KOLTUN V.Efficient inference in fully connected CRFs with Gaussian edge potentials[C]// The 24th International Conference on Neural Information Processing Systems. New York: ACM, 2012: 109-117. |
[37] | XIE S N, TU Z W. Holistically-nested edge detection[C]//2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 1395-1403. |
[38] | MANINIS K K, PONT-TUSET J, ARBELÁEZ P, et al. Convolutional oriented boundaries: from image segmentation to high-level tasks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 819-833. |
[39] | MANINIS K K, PONT-TUSET J, ARBELÁEZ P, et al. Deep retinal image understanding[M]//Medical Image Computing and Computer-Assisted Intervention - MICCAI 2016. Cham: Springer International Publishing, 2016: 140-148. |
[40] | PERAZZI F, PONT-TUSET J, MCWILLIAMS B, et al. A benchmark dataset and evaluation methodology for video object segmentation[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 724-732. |
[41] | PONT-TUSET J, PERAZZI F, CAELLES S, et al. The 2017 DAVIS challenge on video object segmentation[EB/OL]. (2017-04-03) [2022-01-10].https://arxiv.org/abs/1704.00675. |
[42] | CHENG J C, TSAI Y H, HUNG W C, et al. Fast and accurate online video object segmentation via tracking parts[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7415-7424. |
[43] | CHO S, CHO M, CHUNG T Y, et al. Crvos: clue refining network for video object segmentation[C]//2020 IEEE International Conference on Image Processing. New York: IEEE Press, 2020: 2301-2305. |
[44] | CHEN X, LI Z X, YUAN Y, et al. State-aware tracker for real-time video object segmentation[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 9384-9393. |
[45] | ROBINSON A, JÄREMO LAWIN F, DANELLJAN M, et al. Learning fast and robust target models for video object segmentation[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 7406-7415. |
[46] | XU K, WEN L Y, LI G R, et al. Spatiotemporal CNN for video object segmentation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 1379-1388. |
[47] | BAO L C, WU B Y, LIU W. CNN in MRF: video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 5977-5986. |
[48] | XU S J, LIU D Z, BAO L C, et al. MHP-VOS: multiple hypotheses propagation for video object segmentation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 314-323. |