Journal of Graphics ›› 2023, Vol. 44 ›› Issue (1): 104-111. DOI: 10.11996/JG.j.2095-302X.2023010104
HUANG Zhi-yong, HAN Sha-sha, CHEN Zhi-jun, YAO Yu, XIONG Biao, MA Kai
Received:
2022-06-17
Revised:
2022-07-07
Online:
2023-10-31
Published:
2023-02-16
About author:
HUANG Zhi-yong (1979-), associate professor, Ph.D. His main research interests include computer vision and computer graphics. E-mail: hzy@hzy.org.cn
Supported by:
Abstract:
In semi-supervised segmentation tasks, the one-shot video object segmentation (OSVOS) method separates the foreground object in subsequent frames from the video, guided by the object mask annotated in the first frame. Although it achieves impressive segmentation results, it is unsuitable for cases where the appearance of the foreground object changes significantly or closely resembles the background. To address these problems, an imitation U-shaped network structure for video object segmentation is proposed. An attention mechanism is inserted between the encoder and the decoder of this network to build associations between feature maps and produce global semantic information. Meanwhile, the loss function is optimized to further alleviate the inter-class imbalance and improve the robustness of the model. In addition, multi-scale prediction is combined with a fully connected conditional random field (FC/Dense CRF) to improve the smoothness of the edges of the segmentation results. Extensive experiments on the challenging DAVIS 2016 dataset show that the proposed method achieves competitive segmentation results compared with other state-of-the-art methods.
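The abstract's class-imbalance weighting is not spelled out on this page, but a common form of class-balanced binary cross-entropy (in the spirit of the boundary-detection loss of HED [37]) can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact loss; the function name and weighting scheme are assumptions:

```python
import numpy as np

def balanced_bce(pred, mask, eps=1e-7):
    """Class-balanced binary cross-entropy (illustrative sketch).

    pred: predicted foreground probabilities in (0, 1)
    mask: binary ground-truth mask (1 = foreground, 0 = background)
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    # beta is the background fraction: it up-weights the scarce
    # foreground pixels and down-weights the abundant background,
    # countering the inter-class imbalance.
    beta = 1.0 - mask.sum() / mask.size
    loss = -(beta * mask * np.log(pred)
             + (1.0 - beta) * (1.0 - mask) * np.log(1.0 - pred))
    return loss.mean()
```

With a confident correct prediction the loss is near zero, while a confidently wrong prediction is penalized heavily, with the foreground term dominating despite its small pixel count.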
CLC number:
HUANG Zhi-yong, HAN Sha-sha, CHEN Zhi-jun, YAO Yu, XIONG Biao, MA Kai. An imitation U-shaped network for video object segmentation[J]. Journal of Graphics, 2023, 44(1): 104-111.
| Method | OL | J&F Mean[%] | J Mean[%] | F Mean[%] |
|---|---|---|---|---|
| OSMN | × | 73.45 | 74.00 | 72.90 |
| FAVOS | × | 80.95 | 82.40 | 79.50 |
| RGMP | × | 81.75 | 81.50 | 82.00 |
| FEELVOS | × | 81.65 | 81.10 | 82.20 |
| CRVOS | × | 81.60 | 82.20 | 81.00 |
| SAT | × | 83.10 | 82.60 | 83.60 |
| RANet | × | 85.50 | 85.50 | 85.40 |
| MaskTrack | √ | 77.55 | 79.70 | 75.40 |
| OSVOS | √ | 80.20 | 79.80 | 80.60 |
| FRTMVOS | √ | 83.50 | - | - |
| LucidTracker | √ | 83.60 | 84.80 | 82.30 |
| STCNN | √ | 83.80 | 83.80 | 83.80 |
| OnAVOS | √ | 84.95 | 85.70 | 84.20 |
| PReMVOS | √ | 86.75 | 84.90 | 88.60 |
| CINM | √ | 84.20 | 83.40 | 85.00 |
| MHPVOS | √ | 88.55 | 87.60 | 89.50 |
| Ours | √ | 87.07 | 86.26 | 87.88 |
Table 1 Results compared with the state-of-the-art methods
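For reference, the J Mean column in Table 1 is the DAVIS region similarity: the mean intersection-over-union between predicted and ground-truth masks (F Mean is the analogous boundary F-measure). A minimal sketch of J for a single frame:

```python
import numpy as np

def region_similarity(pred, gt):
    """DAVIS region similarity J: the Jaccard index (IoU)
    between a predicted binary mask and the ground truth."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: treat as perfect agreement
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```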
Fig. 5 Comparison of qualitative results ((a) Ours; (b) MHPVOS; (c) CINM; (d) FEELVOS; (e) FAVOS; (f) OSVOS; (g) MaskTrack; (h) LucidTracker; (i) Ground truth)
| Method | J Mean | F Mean |
|---|---|---|
| Ours | 86.26 | 87.88 |
| DA | 84.29 | 84.58 |
| Dense CRF | 74.03 | 74.56 |
Table 2 Ablation experiments on the DAVIS 2016 validation dataset (%)
| Method | J Mean | F Mean |
|---|---|---|
| Ours | 74.03 | 74.56 |
| Original | 73.79 | 73.35 |
Table 3 Ablation experiment on loss function (%)
Fig. 7 Qualitative results of paragliding launch and kite surf video sequences ((a) Frame 8; (b) Frame 18; (c) Frame 28)
[1] | CAELLES S, MANINIS K K, PONT-TUSET J, et al. One-shot video object segmentation[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 5320-5329. |
[2] | JAIN S D, XIONG B, GRAUMAN K. FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2117-2126. |
[3] | KHOREVA A, BENENSON R, ILG E, et al. Lucid data dreaming for video object segmentation[J]. International Journal of Computer Vision, 2019, 127(9): 1175-1197. |
[4] | PERAZZI F, KHOREVA A, BENENSON R, et al. Learning video object segmentation from static images[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3491-3500. |
[5] | KINGMA D, BA J. Adam: a method for stochastic optimization[EB/OL]. (2014-12-22) [2022-01-30].https://arxiv.org/abs/1412.6980. |
[6] | HELD D, THRUN S, SAVARESE S. Learning to track at 100 FPS with deep regression networks[M]//Computer Vision - ECCV 2016. Cham: Springer International Publishing, 2016: 749-765. |
[7] | NAM H, HAN B. Learning multi-domain convolutional neural networks for visual tracking[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 4293-4302. |
[8] | VOIGTLAENDER P, LEIBE B. Online adaptation of convolutional neural networks for video object segmentation[C]//The British Machine Vision Conference 2017. Durham University: British Machine Vision Association, 2017: 1-13. |
[9] | GRIFFIN B A, CORSO J J. BubbleNets: learning to select the guidance frame in video object segmentation by deep sorting frames[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 8906-8915. |
[10] | SHARIR G, SMOLYANSKY E, FRIEDMAN I. Video object segmentation using tracked object proposals[EB/OL]. [2022-01-03].https://arxiv.org/abs/1707.06545. |
[11] | CHEN L C, PAPANDREOU G, KOKKINOS I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[EB/OL]. [2022-01-03]. https://arxiv.org/abs/1412.7062. |
[12] | HU Y T, HUANG J B, SCHWING A. MaskRNN: instance level video object segmentation[C]//Neural Information Processing Systems. California: MIT Press, 2017: 325-334. |
[13] | MÄRKI N, PERAZZI F, WANG O, et al. Bilateral space video segmentation[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 743-751. |
[14] | JAMPANI V, GADDE R, GEHLER P V. Video propagation networks[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3154-3164. |
[15] | CHENG J C, TSAI Y H, HUNG W C, et al. Fast and accurate online video object segmentation via tracking parts[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7415-7424. |
[16] | YANG L J, WANG Y R, XIONG X H, et al. Efficient video object segmentation via network modulation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6499-6507. |
[17] | XIAO H X, FENG J S, LIN G S, et al. MoNet: deep motion exploitation for video object segmentation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 1140-1148. |
[18] | LUITEN J, VOIGTLAENDER P, LEIBE B. PReMVOS: proposal-generation, refinement and merging for video object segmentation[M]//Computer Vision - ACCV 2018. Cham: Springer International Publishing, 2018: 565-580. |
[19] | HU Y T, HUANG J B, SCHWING A G. VideoMatch: matching based video object segmentation[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 56-73. |
[20] | VOIGTLAENDER P, CHAI Y N, SCHROFF F, et al. FEELVOS: fast end-to-end embedding learning for video object segmentation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 9473-9482. |
[21] | OH S W, LEE J Y, SUNKAVALLI K, et al. Fast video object segmentation by reference-guided mask propagation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7376-7385. |
[22] | OH S W, LEE J Y, XU N, et al. Video object segmentation using space-time memory networks[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 9226-9235. |
[23] | JOHNANDER J, DANELLJAN M, BRISSMAN E, et al. A generative appearance model for end-to-end video object segmentation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 8945-8954. |
[24] | LIN H J, QI X J, JIA J Y. AGSS-VOS: attention guided single-shot video object segmentation[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 3948-3956. |
[25] | ZENG X H, LIAO R J, GU L, et al. DMM-net: differentiable mask-matching network for video object segmentation[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 3929-3938. |
[26] | WANG Z Q, XU J, LIU L, et al. RANet: ranking attention network for fast video object segmentation[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 3978-3987. |
[27] | RONNEBERGER O, FISCHER P, BROX T. U-net: convolutional networks for biomedical image segmentation[M]// Lecture Notes in Computer Science. Cham: Springer International Publishing, 2015: 234-241. |
[28] | BADRINARAYANAN V, KENDALL A, CIPOLLA R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495. |
[29] | CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation[EB/OL]. (2017-06-17) [2021-12-05].https://arxiv.org/abs/1706.05587. |
[30] | CHEN L C, ZHU Y K, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 801-818. |
[31] | HOWARD A, SANDLER M, CHEN B, et al. Searching for MobileNetV3[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 1314-1324. |
[32] | SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2014-09-04) [2022-01-10].https://arxiv.org/abs/1409.1556. |
[33] | FU J, LIU J, TIAN H J, et al. Dual attention network for scene segmentation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3146-3154. |
[34] | SHELHAMER E, LONG J, DARRELL T. Fully convolutional networks for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 640-651. |
[35] | SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 1-9. |
[36] | KRÄHENBÜHL P, KOLTUN V.Efficient inference in fully connected CRFs with Gaussian edge potentials[C]// The 24th International Conference on Neural Information Processing Systems. New York: ACM, 2012: 109-117. |
[37] | XIE S N, TU Z W. Holistically-nested edge detection[C]//2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 1395-1403. |
[38] | MANINIS K K, PONT-TUSET J, ARBELÁEZ P, et al. Convolutional oriented boundaries: from image segmentation to high-level tasks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 819-833. |
[39] | MANINIS K K, PONT-TUSET J, ARBELÁEZ P, et al. Deep retinal image understanding[M]//Medical Image Computing and Computer-Assisted Intervention - MICCAI 2016. Cham: Springer International Publishing, 2016: 140-148. |
[40] | PERAZZI F, PONT-TUSET J, MCWILLIAMS B, et al. A benchmark dataset and evaluation methodology for video object segmentation[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 724-732. |
[41] | PONT-TUSET J, PERAZZI F, CAELLES S, et al. The 2017 DAVIS challenge on video object segmentation[EB/OL]. (2017-04-03) [2022-01-10].https://arxiv.org/abs/1704.00675. |
[42] | CHENG J C, TSAI Y H, HUNG W C, et al. Fast and accurate online video object segmentation via tracking parts[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7415-7424. |
[43] | CHO S, CHO M, CHUNG T Y, et al. Crvos: clue refining network for video object segmentation[C]//2020 IEEE International Conference on Image Processing. New York: IEEE Press, 2020: 2301-2305. |
[44] | CHEN X, LI Z X, YUAN Y, et al. State-aware tracker for real-time video object segmentation[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 9384-9393. |
[45] | ROBINSON A, JÄREMO LAWIN F, DANELLJAN M, et al. Learning fast and robust target models for video object segmentation[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 7406-7415. |
[46] | XU K, WEN L Y, LI G R, et al. Spatiotemporal CNN for video object segmentation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 1379-1388. |
[47] | BAO L C, WU B Y, LIU W. CNN in MRF: video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 5977-5986. |
[48] | XU S J, LIU D Z, BAO L C, et al. MHP-VOS: multiple hypotheses propagation for video object segmentation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 314-323. |