An imitation U-shaped network for video object segmentation

doi:10.11996/JG.j.2095-302X.2023010104

Abstract

Among semi-supervised video object segmentation methods, one-shot video object segmentation (OSVOS) uses the annotated object mask of the first frame to guide the segmentation of foreground objects in subsequent frames. Despite its impressive results, this method is not well suited to cases where the appearance of the foreground object changes significantly or where the foreground and background look similar. To address these problems, an imitation U-shaped network for video object segmentation is proposed. An attention mechanism is added between the encoder and decoder of the network to establish associations between feature maps and generate global semantic information. The loss function is also optimized to mitigate class imbalance and improve the robustness of the model. In addition, multi-scale prediction is combined with a fully connected conditional random field (FC/Dense CRF) to smooth the edges of the segmentation results. Extensive experiments on the challenging DAVIS 2016 dataset show that the proposed method achieves segmentation results competitive with state-of-the-art methods.

Key words: semi-supervised video object segmentation, attention mechanism, loss function, multi-scale feature

CLC Number: TP391

HUANG Zhi-yong, HAN Sha-sha, CHEN Zhi-jun, YAO Yu, XIONG Biao, MA Kai. An imitation U-shaped network for video object segmentation[J]. Journal of Graphics, 2023, 44(1): 104-111.

Figures/Tables 10

Fig. 1 Comparison of quantitative methods on the DAVIS 2016 validation dataset

Fig. 2 Network structure

Fig. 3 The process of integrating dual attention mechanism into convolutional neural networks

Table 1 Results compared with the state-of-the-art methods

Method	Online learning (OL)	J&F Mean (%)	J Mean (%)	F Mean (%)
OSMN^[16]	×	73.45	74.00	72.90
FAVOS^[42]	×	80.95	82.40	79.50
RGMP^[21]	×	81.75	81.50	82.00
FEELVOS^[20]	×	81.65	81.10	82.20
CRVOS^[43]	×	81.60	82.20	81.00
SAT^[44]	×	83.10	82.60	83.60
RANet^[26]	×	85.50	85.50	85.40
MaskTrack^[4]	√	77.55	79.70	75.40
OSVOS^[1]	√	80.20	79.80	80.60
FRTMVOS^[45]	√	83.50	-	-
LucidTracker^[3]	√	83.60	84.80	82.30
STCNN^[46]	√	83.80	83.80	83.80
OnAVOS^[8]	√	84.95	85.70	84.20
PReMVOS^[18]	√	86.75	84.90	88.60
CINM^[47]	√	84.20	83.40	85.00
MHPVOS^[48]	√	88.55	87.60	89.50
Ours	√	87.07	86.26	87.88
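
As a reading aid for the metrics in the table above: in the DAVIS benchmark, region similarity J is the intersection-over-union (Jaccard index) between a predicted mask and the ground-truth mask, and for the "Ours" row the J&F Mean equals the average of J Mean and F Mean. The following is a minimal sketch only. The function names `jaccard` and `jf_mean`, the list-of-lists mask representation, and the convention of returning 1.0 when both masks are empty are illustrative assumptions, not taken from the paper.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks.

    Masks are given as equally sized lists of rows with 0/1 entries
    (an illustrative representation, not the paper's data format).
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def jf_mean(j_mean, f_mean):
    """Average of the region (J) and contour (F) mean scores."""
    return (j_mean + f_mean) / 2


if __name__ == "__main__":
    pred = [[0, 1, 1],
            [0, 1, 0]]
    gt = [[0, 1, 0],
          [0, 1, 1]]
    print(jaccard(pred, gt))                 # 2 shared pixels out of 4 in the union
    print(round(jf_mean(86.26, 87.88), 2))   # J Mean and F Mean from the "Ours" row
```

The contour accuracy F is not sketched here because it additionally requires boundary extraction and matching.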

Fig. 4 Regional similarity J comparison between video sequences of each method

Fig. 5 Comparison of qualitative results ((a) Ours; (b) MHPVOS; (c) CINM; (d) FEELVOS; (e) FAVOS; (f) OSVOS; (g) MaskTrack; (h) LucidTracker; (i) Ground truth)

Table 2 Ablation experiments on the DAVIS 2016 validation dataset (%)

Method	J Mean	F Mean
Ours	86.26	87.88
DA	84.29	84.58
Dense CRF	74.03	74.56

Table 3 Ablation experiment on loss function (%)

Method	J Mean	F Mean
Ours	74.03	74.56
Original	73.79	73.35

Fig. 6 Qualitative results of loss function optimization ((a) Ground truth; (b) Original; (c) Ours)

Fig. 7 Qualitative results of paragliding launch and kite surf video sequences ((a) Frame 8; (b) Frame 18; (c) Frame 28)

References 48

[1]	CAELLES S, MANINIS K K, PONT-TUSET J, et al. One-shot video object segmentation[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 5320-5329.
[2]	JAIN S D, XIONG B, GRAUMAN K. FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2117-2126.
[3]	KHOREVA A, BENENSON R, ILG E, et al. Lucid data dreaming for video object segmentation[J]. International Journal of Computer Vision, 2019, 127(9): 1175-1197.
[4]	PERAZZI F, KHOREVA A, BENENSON R, et al. Learning video object segmentation from static images[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3491-3500.
[5]	KINGMA D, BA J. Adam: a method for stochastic optimization[EB/OL]. (2014-12-22) [2022-01-30]. https://arxiv.org/abs/1412.6980.
[6]	HELD D, THRUN S, SAVARESE S. Learning to track at 100 FPS with deep regression networks[M]//Computer Vision - ECCV 2016. Cham: Springer International Publishing, 2016: 749-765.
[7]	NAM H, HAN B. Learning multi-domain convolutional neural networks for visual tracking[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 4293-4302.
[8]	VOIGTLAENDER P, LEIBE B. Online adaptation of convolutional neural networks for video object segmentation[C]//The British Machine Vision Conference 2017. Durham University: British Machine Vision Association, 2017: 1-13.
[9]	GRIFFIN B A, CORSO J J. BubbleNets: learning to select the guidance frame in video object segmentation by deep sorting frames[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 8906-8915.
[10]	SHARIR G, SMOLYANSKY E, FRIEDMAN I. Video object segmentation using tracked object proposals[EB/OL]. [2022-01-03]. https://arxiv.org/abs/1707.06545.
[11]	CHEN L C, PAPANDREOU G, KOKKINOS I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[EB/OL]. [2022-01-03]. https://arxiv.org/abs/1412.7062.
[12]	HU Y T, HUANG J B, SCHWING A. MaskRNN: instance level video object segmentation[C]//Neural Information Processing Systems. California: MIT Press, 2017: 325-334.
[13]	MÄRKI N, PERAZZI F, WANG O, et al. Bilateral space video segmentation[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 743-751.
[14]	JAMPANI V, GADDE R, GEHLER P V. Video propagation networks[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 3154-3164.
[15]	CHENG J C, TSAI Y H, HUNG W C, et al. Fast and accurate online video object segmentation via tracking parts[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7415-7424.
[16]	YANG L J, WANG Y R, XIONG X H, et al. Efficient video object segmentation via network modulation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 6499-6507.
[17]	XIAO H X, FENG J S, LIN G S, et al. MoNet: deep motion exploitation for video object segmentation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 1140-1148.
[18]	LUITEN J, VOIGTLAENDER P, LEIBE B. PReMVOS: proposal-generation, refinement and merging for video object segmentation[M]//Computer Vision - ACCV 2018. Cham: Springer International Publishing, 2018: 565-580.
[19]	HU Y T, HUANG J B, SCHWING A G. VideoMatch: matching based video object segmentation[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 56-73.
[20]	VOIGTLAENDER P, CHAI Y N, SCHROFF F, et al. FEELVOS: fast end-to-end embedding learning for video object segmentation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 9473-9482.
[21]	OH S W, LEE J Y, SUNKAVALLI K, et al. Fast video object segmentation by reference-guided mask propagation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7376-7385.
[22]	OH S W, LEE J Y, XU N, et al. Video object segmentation using space-time memory networks[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 9226-9235.
[23]	JOHNANDER J, DANELLJAN M, BRISSMAN E, et al. A generative appearance model for end-to-end video object segmentation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 8945-8954.
[24]	LIN H J, QI X J, JIA J Y. AGSS-VOS: attention guided single-shot video object segmentation[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 3948-3956.
[25]	ZENG X H, LIAO R J, GU L, et al. DMM-net: differentiable mask-matching network for video object segmentation[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 3929-3938.
[26]	WANG Z Q, XU J, LIU L, et al. RANet: ranking attention network for fast video object segmentation[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 3978-3987.
[27]	RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[M]//Lecture Notes in Computer Science. Cham: Springer International Publishing, 2015: 234-241.
[28]	BADRINARAYANAN V, KENDALL A, CIPOLLA R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495.
[29]	CHEN L C, PAPANDREOU G, SCHROFF F, et al. Rethinking atrous convolution for semantic image segmentation[EB/OL]. (2017-06-17) [2021-12-05]. https://arxiv.org/abs/1706.05587.
[30]	CHEN L C, ZHU Y K, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 801-818.
[31]	HOWARD A, SANDLER M, CHEN B, et al. Searching for MobileNetV3[C]//2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 1314-1324.
[32]	SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2014-09-04) [2022-01-10]. https://arxiv.org/abs/1409.1556.
[33]	FU J, LIU J, TIAN H J, et al. Dual attention network for scene segmentation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3146-3154.
[34]	SHELHAMER E, LONG J, DARRELL T. Fully convolutional networks for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 640-651.
[35]	SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 1-9.
[36]	KRÄHENBÜHL P, KOLTUN V. Efficient inference in fully connected CRFs with Gaussian edge potentials[C]//The 24th International Conference on Neural Information Processing Systems. New York: ACM, 2012: 109-117.
[37]	XIE S N, TU Z W. Holistically-nested edge detection[C]//2015 IEEE International Conference on Computer Vision. New York: IEEE Press, 2015: 1395-1403.
[38]	MANINIS K K, PONT-TUSET J, ARBELÁEZ P, et al. Convolutional oriented boundaries: from image segmentation to high-level tasks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 819-833.
[39]	MANINIS K K, PONT-TUSET J, ARBELÁEZ P, et al. Deep retinal image understanding[M]//Medical Image Computing and Computer-Assisted Intervention - MICCAI 2016. Cham: Springer International Publishing, 2016: 140-148.
[40]	PERAZZI F, PONT-TUSET J, MCWILLIAMS B, et al. A benchmark dataset and evaluation methodology for video object segmentation[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 724-732.
[41]	PONT-TUSET J, PERAZZI F, CAELLES S, et al. The 2017 DAVIS challenge on video object segmentation[EB/OL]. (2017-04-03) [2022-01-10]. https://arxiv.org/abs/1704.00675.
[42]	CHENG J C, TSAI Y H, HUNG W C, et al. Fast and accurate online video object segmentation via tracking parts[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7415-7424.
[43]	CHO S, CHO M, CHUNG T Y, et al. CRVOS: clue refining network for video object segmentation[C]//2020 IEEE International Conference on Image Processing. New York: IEEE Press, 2020: 2301-2305.
[44]	CHEN X, LI Z X, YUAN Y, et al. State-aware tracker for real-time video object segmentation[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 9384-9393.
[45]	ROBINSON A, JÄREMO LAWIN F, DANELLJAN M, et al. Learning fast and robust target models for video object segmentation[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 7406-7415.
[46]	XU K, WEN L Y, LI G R, et al. Spatiotemporal CNN for video object segmentation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 1379-1388.
[47]	BAO L C, WU B Y, LIU W. CNN in MRF: video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 5977-5986.
[48]	XU S J, LIU D Z, BAO L C, et al. MHP-VOS: multiple hypotheses propagation for video object segmentation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 314-323.