A deep architecture for reciprocal object detection and instance segmentation

doi:10.11996/JG.j.2095-302X.2024040745

Abstract

Object detection and instance segmentation are two fundamental and closely related tasks in computer vision, yet most previous work has not fully explored the relationship between them. To address this, we present the reciprocal object detection and instance segmentation network (RDSNet), a novel deep architecture. To let the two tasks benefit each other, we design a two-stream structure that jointly learns feature representations at the object level (i.e., bounding boxes) and the pixel level (i.e., instance masks), so that the two streams encode object- and pixel-level information respectively. Three new modules handle the interaction between the streams, allowing object-level information to assist instance segmentation and pixel-level information to assist object detection. Specifically, a correlation module measures the similarity between object- and pixel-level features, which promotes consistency among features belonging to the same object and consequently improves the accuracy of instance masks. A cropping module introduces instance awareness and translation variance into pixel-level perception, which helps distinguish different instances and reduces background noise. To further refine the alignment between bounding boxes and their corresponding objects, a mask-based boundary refinement module (MBRM) fuses bounding boxes with instance masks and can thereby correct bounding-box errors with the help of the masks. Extensive experiments and comparisons on the COCO dataset demonstrate the effectiveness and efficiency of RDSNet. In addition, integrating the mask scoring strategy into MBRM further improves the performance of RDSNet and allows object detection to benefit from instance segmentation in a new way.
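
The interaction between the two streams can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, tensor shapes, the dot-product similarity, and the zero threshold are all assumptions made for illustration. In particular, `box_from_mask` is only a naive stand-in that takes the tightest box around a mask; it shows the direction of information flow (mask to box) and is not the MBRM described in the paper.

```python
import numpy as np


def correlate(object_reps, pixel_reps):
    """Similarity between each object embedding and every pixel embedding.

    object_reps: (N, d) array, one hypothetical embedding per detected object.
    pixel_reps:  (d, H, W) array, one hypothetical embedding per pixel.
    Returns an (N, H, W) array of dot-product scores (assumed similarity).
    """
    d, h, w = pixel_reps.shape
    scores = object_reps @ pixel_reps.reshape(d, h * w)
    return scores.reshape(-1, h, w)


def crop_to_boxes(scores, boxes, fill=-np.inf):
    """Suppress scores outside each object's box (x1, y1, x2, y2), x2/y2 exclusive."""
    _, h, w = scores.shape
    ys = np.arange(h)[None, :, None]
    xs = np.arange(w)[None, None, :]
    x1, y1, x2, y2 = (boxes[:, i][:, None, None] for i in range(4))
    inside = (xs >= x1) & (xs < x2) & (ys >= y1) & (ys < y2)
    return np.where(inside, scores, fill)


def box_from_mask(mask):
    """Tightest box around a binary mask (naive stand-in, NOT the paper's MBRM)."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)


# Toy data: a 2x2 object at rows/cols 1..2 plus one stray pixel at (3, 3).
pixel_reps = np.full((2, 4, 4), -1.0)
pixel_reps[0, 1:3, 1:3] = 1.0
pixel_reps[0, 3, 3] = 1.0
object_reps = np.array([[1.0, 0.0]])
boxes = np.array([[0, 0, 3, 3]])  # detected box, excludes row/col 3

scores = correlate(object_reps, pixel_reps)
mask = crop_to_boxes(scores, boxes) > 0  # assumed threshold at zero
uncropped_box = box_from_mask(scores[0] > 0)
refined_box = box_from_mask(mask[0])
```

In this toy case the stray pixel outside the detected box is removed by cropping, so the box recovered from the cropped mask hugs the object, whereas the box recovered from the uncropped scores is stretched by the stray response.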

Key words: object detection, instance segmentation, reciprocal relation, feature representation, boundary refinement

CLC Number:

TP391

GONG Yongchao, SHEN Xukun. A deep architecture for reciprocal object detection and instance segmentation[J]. Journal of Graphics, 2024, 45(4): 745-759.

Figures/Tables 18

References 40

[1]	REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[2]	LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 318-327.
[3]	DAI J F, LI Y, HE K M, et al. R-FCN: object detection via region-based fully convolutional networks[C]// The 30th International Conference on Neural Information Processing Systems. New York: ACM, 2016: 379-387.
[4]	HE K M, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[C]// 2017 International Conference on Computer Vision. New York: IEEE Press, 2017: 2980-2988.
[5]	DAI J F, HE K M, SUN J. Instance-aware semantic segmentation via multi-task network cascades[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 3150-3158.
[6]	LIU S, QI L, QIN H F, et al. Path aggregation network for instance segmentation[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 8759-8768.
[7]	FU C Y, SHVETS M, BERG A C. RetinaMask: learning to predict masks improves state-of-the-art single-shot detection for free[EB/OL]. [2023-10-18]. http://arxiv.org/abs/1901.03353.
[8]	CHEN H, SUN K Y, TIAN Z, et al. BlendMask: top-down meets bottom-up for instance segmentation[C]// 2020 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 8570-8578.
[9]	LEE Y W, PARK J. CenterMask: real-time anchor-free instance segmentation[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 2594-2603.
[10]	BOLYA D, ZHOU C, XIAO F Y, et al. YOLACT: real-time instance segmentation[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 9160-9169.
[11]	TIAN Z, SHEN C H, CHEN H. Conditional convolutions for instance segmentation[C]// European Conference on Computer Vision. Cham: Springer, 2020: 282-298.
[12]	LI Y, QI H Z, DAI J F, et al. Fully convolutional instance-aware semantic segmentation[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 4438-4446.
[13]	WANG X L, KONG T, SHEN C H, et al. SOLO: segmenting objects by locations[M]//Computer Vision-ECCV 2020. Cham: Springer, 2020: 649-665.
[14]	CHEN X L, WANG P, CHENG G, et al. TensorMask: surpassing pixel-level encoding for instance segmentation[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 2061-2069.
[15]	DE BRABANDERE B, NEVEN D, VAN GOOL L. Semantic instance segmentation with a discriminative loss function[EB/OL]. [2023-10-18]. http://arxiv.org/abs/1708.02551.
[16]	FATHI A, WOJNA Z, RATHOD V, et al. Semantic instance segmentation via deep metric learning[EB/OL]. [2023-10-18]. http://arxiv.org/abs/1703.10277.
[17]	WANG S R, GONG Y C, XING J L, et al. RDSNet: a new deep architecture for reciprocal object detection and instance segmentation[C]// Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 12208-12215.
[18]	LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]// Computer Vision - ECCV 2014. Cham: Springer, 2014: 740-755.
[19]	CORDTS M, OMRAN M, RAMOS S, et al. The cityscapes dataset for semantic urban scene understanding[C]// 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 3213-3223.
[20]	LIN T Y, DOLLÁR P, GIRSHICK R B, et al. Feature pyramid networks for object detection[C]// 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 2117-2125.
[21]	LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 3431-3440.
[22]	SCHROFF F, KALENICHENKO D, PHILBIN J. FaceNet: a unified embedding for face recognition and clustering[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 815-823.
[23]	SHRIVASTAVA A, GUPTA A, GIRSHICK R. Training region-based object detectors with online hard example mining[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 761-769.
[24]	CHEN K, WANG J Q, PANG J M, et al. MMDetection: open MMLab detection toolbox and benchmark[EB/OL]. [2023-10-18]. http://arxiv.org/abs/1906.07155.
[25]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 770-778.
[26]	KIRILLOV A, GIRSHICK R, HE K M, et al. Panoptic feature pyramid networks[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 6392-6401.
[27]	HUANG Z J, HUANG L C, GONG Y C, et al. Mask scoring R-CNN[C]// 2019 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 6402-6411.
[28]	BOLYA D, ZHOU C, XIAO F Y, et al. YOLACT++: better real-time instance segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(2): 1108-1121.
[29]	XIE E Z, SUN P Z, SONG X G, et al. PolarMask: single shot instance segmentation with polar representation[EB/OL]. [2023-10-18]. http://arxiv.org/abs/1909.13226.
[30]	ZHANG R F, TIAN Z, SHEN C H, et al. Mask encoding for single shot instance segmentation[EB/OL]. [2023-10-18]. http://arxiv.org/abs/2003.11712.
[31]	CHEN Y T, HAN C X, WANG N Y, et al. Revisiting feature alignment for one-stage object detection[EB/OL]. [2023-10-18]. http://arxiv.org/abs/1908.01570.
[32]	CAI Z W, VASCONCELOS N. Cascade R-CNN: delving into high quality object detection[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 6154-6162.
[33]	CHEN K, PANG J M, WANG J Q, et al. Hybrid task cascade for instance segmentation[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 4974-4983.
[34]	REDMON J, FARHADI A. YOLOv3: an incremental improvement[EB/OL]. [2023-10-18]. http://arxiv.org/abs/1804.02767.
[35]	ZHANG S F, WEN L Y, BIAN X, et al. Single-shot refinement neural network for object detection[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 4203-4212.
[36]	LAW H, DENG J. CornerNet: detecting objects as paired keypoints[M]//Computer Vision-ECCV 2018. Cham: Springer, 2018: 765-781.
[37]	NEWELL A, YANG K Y, DENG J. Stacked hourglass networks for human pose estimation[M]//Computer Vision-ECCV 2016. Cham: Springer, 2016: 483-499.
[38]	LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector[M]//Computer Vision-ECCV 2016. Cham: Springer, 2016: 21-37.
[39]	LIANG Z X, WANG X B, HE T, et al. Research and implementation of instance segmentation and edge optimization algorithm[J]. Journal of Graphics, 2020, 41(6): 939-946 (in Chinese).
[40]	CUI Z D, LI Z M, YANG S L, et al. 3D object detection based on semantic segmentation guidance[J]. Journal of Graphics, 2022, 43(6): 1134-1142 (in Chinese).

Type	Method	Input size	FPS	AP^m	AP₅₀^m	AP₇₅^m	AP_S^m	AP_M^m	AP_L^m
Two-stage	Mask R-CNN^[4]	800	9.5 (V)	36.2	58.3	38.6	16.7	38.8	51.5
	MS R-CNN^[27]	800	9.1 (V)	37.4	57.9	40.4	17.3	39.5	53.0
	RetinaMask^[7]	800	6.0 (V)	34.7	55.4	36.9	14.3	36.7	50.5
One-stage	FCIS^[12]	600	6.6 (P)	29.2	49.5	-	7.1	31.3	50.0
	YOLACT^[10]	550	33.0 (P)	29.8	48.5	31.2	9.9	31.3	47.7
	YOLACT++^[28]	550	27.0 (P)	34.6	53.8	36.9	11.9	36.8	55.1
	PolarMask^[29]	550	23.9 (P)	30.4	51.9	31.0	13.4	32.4	42.8
	RDSNet	550	32.0 (P)	32.1	53.0	33.4	11.0	33.8	51.0
	MEInst^[30]	800	12.8 (P)	33.9	56.2	35.4	19.8	36.1	42.3
	SOLO^[13]	800	10.4 (V)	37.8	59.5	40.4	16.4	40.6	54.2
	TensorMask^[14]	800	2.6 (V)	37.3	59.5	39.5	17.5	39.3	51.6
	RDSNet (baseline [2])	800	8.8 (V)	36.4	57.9	39.0	16.4	39.5	51.6
	RDSNet (baseline [31])	800	7.5 (P)	37.5	59.3	40.4	16.9	40.5	53.0
	RDSNet+ (baseline [2])	800	8.8 (V)	37.2	59.1	40.2	16.8	41.2	52.8
	RDSNet+ (baseline [31])	800	7.5 (P)	38.5	60.4	41.8	17.3	41.8	54.3
Upper bound	RDSNet (with ground-truth boxes)	800	-	58.7	68.5	63.1	49.2	59.0	75.4

Type	Method		Input size	Backbone	FPS	AP^bb	AP₅₀^bb	AP₇₅^bb	AP_S^bb	AP_M^bb	AP_L^bb
Two-stage	Mask R-CNN^[4]		800	R-101	9.5 (V)	39.7	61.6	43.2	23.0	43.2	49.7
	Cascade R-CNN^[32]		800	R-101	6.8 (V)	43.1	61.5	46.9	24.0	45.9	55.4
	HTC^[33]		800	R-101	4.1 (V)	45.1	64.3	49.0	25.2	48.0	58.2
One-stage	YOLOv3^[34]		608	D-53	19.8 (P)	33.0	57.9	34.3	18.3	35.4	41.9
	RefineDet^[35]		512	R-101	9.1 (P)	36.4	57.5	39.5	16.6	39.9	51.4
	CornerNet^[36]		512	H-104	4.4 (P)	40.5	57.8	45.3	20.8	44.8	56.7
	RDSNet	Baseline^[2]	800	R-101	10.9 (V)	38.1	58.5	40.8	21.2	41.5	48.2
		w/o MBRM			8.8 (V)	39.4	60.1	42.5	22.1	42.6	49.9
		with MBRM			8.5 (V)	40.3	60.1	43.0	22.1	43.5	51.5
		Baseline^[31]	800	R-101	9.1 (P)	42.0	62.4	46.5	24.6	44.8	53.3
		w/o MBRM			7.5 (P)	42.3	62.5	46.8	24.7	44.8	53.5
		with MBRM			7.3 (P)	43.2	63.7	48.0	25.0	45.2	56.1
	RDSNet+	with [2] as detector	800	R-101	8.4 (V)	41.4	60.9	44.3	22.5	44.0	52.4
	RDSNet+	with [31] as detector	800	R-101	7.2 (P)	44.3	64.1	49.2	25.3	45.9	56.8

No.	Method	Module	TE	OHEM	IE	FPS	AP^m
1	YOLACT^[10]	LC				33	29.9
2	RDSNet_s	Corr				32	31.0 (+1.1)
3					√		30.0
4			√				30.7
5				√			31.2
6			√		√		30.8
7			√	√			31.6
8			√	√	√		31.8 (+1.9)
9	RDSNet_f	Corr	√		√	29	28.8
10	RDSNet_f	Corr	√	√	√	29	28.5