
Journal of Graphics, 2025, Vol. 46, Issue 5: 969-979. DOI: 10.11996/JG.j.2095-302X.2025050969

• Image Processing and Computer Vision •

SAM2-based multi-objective automatic segmentation method for laparoscopic surgery

LIU Cheng1,2, ZHANG Jiayi1,2,3, YUAN Feng1,2, ZHANG Rui1,2,3, GAO Xin2,3

  1. School of Biomedical Engineering (Suzhou), Department of Life Sciences and Medicine, University of Science and Technology of China, Suzhou, Jiangsu 215163, China
    2. Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou, Jiangsu 215163, China
    3. Jinan Guoke Medical Engineering and Technology Development Co., Ltd., Jinan, Shandong 250101, China
  • Received: 2025-06-26; Accepted: 2025-08-12; Online: 2025-10-30; Published: 2025-09-10
  • Contact: GAO Xin
  • About author:

    LIU Cheng (2001-), master's student. His main research interest covers surgical navigation. E-mail: 1011948636@qq.com

  • Supported by:
    National Natural Science Foundation of China (82372052, 82402373); Science Foundation of Shandong (ZR2022QF071, ZR2022QF099); Taishan Industrial Experts Program (tscx202312131)

Abstract:

Automatic segmentation of laparoscopic surgical scenes is critical for enabling surgical robots to perform autonomous operations. The task faces three major challenges: surgical targets with highly similar textures and blurred boundaries, which make accurate segmentation difficult; large scale differences, which hinder the synchronous segmentation of multiple targets; and intraoperative interference, such as motion artifacts and smoke occlusion, which degrades segmentation completeness. To address these challenges, a multi-objective automatic segmentation method for laparoscopic surgery (SAM2-MSNet), built on the large vision model SAM2, was proposed. The network employed a LoRA+ fine-tuning strategy to optimize SAM2's image encoder, enabling efficient adaptation to the texture features of laparoscopic images. A cross-scale feature synchronous extraction module was designed to segment multi-scale targets accurately, and a global feature-relationship perception module was constructed to strengthen robustness against interference such as motion artifacts and smoke occlusion. In addition, a pseudo-label-assisted supervision mechanism driven by histograms of oriented gradients (HOG) markedly improved the accuracy of target edge segmentation. Experimental results showed that SAM2-MSNet achieved a mean intersection over union (mIoU) of 70.2%/69.6% and a mean Dice coefficient (mDice) of 78.5%/75.0% on the Endovis2018 and AutoLaparo datasets, respectively. With inference speed comparable to that of SAM2-UNet (23 vs. 25 frames per second), segmentation accuracy improved by 3.0%/6.7% (mIoU) and 2.8%/6.8% (mDice). This work enables high-precision automatic segmentation of laparoscopic surgical scenes, providing a solid technical foundation for the autonomous operation of surgical robots.
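To make the fine-tuning strategy concrete, the sketch below shows one plausible LoRA+-style adaptation of a frozen encoder projection in PyTorch: a trainable low-rank update B·A is added to a frozen linear layer, and the B matrix is trained with a larger learning rate than A, as LoRA+ prescribes. The names (LoRALinear, loraplus_param_groups), the rank, and the 16x learning-rate ratio are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Wraps a frozen linear layer with a trainable low-rank update: W x + (B A) x.
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # pretrained weights stay frozen
            self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

    def loraplus_param_groups(model: nn.Module, lr: float = 1e-4, lr_ratio: float = 16.0):
        # LoRA+ assigns the B matrices a larger learning rate than the A matrices.
        a_params = [p for n, p in model.named_parameters() if "lora_A" in n]
        b_params = [p for n, p in model.named_parameters() if "lora_B" in n]
        return [{"params": a_params, "lr": lr},
                {"params": b_params, "lr": lr * lr_ratio}]

    # Example: adapt one 768-dimensional attention projection and build the optimizer.
    layer = LoRALinear(nn.Linear(768, 768), rank=8)
    optimizer = torch.optim.AdamW(loraplus_param_groups(layer), weight_decay=0.01)

Because only the low-rank matrices receive gradients, this keeps the number of trainable encoder parameters small while still adapting SAM2's features to laparoscopic textures.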
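Likewise, the following is a minimal sketch of how HOG features could drive edge pseudo-labels, assuming scikit-image; the cell size, the above-median thresholding rule, and the function name are hypothetical reading aids, not the paper's published settings.

    import numpy as np
    from skimage.color import rgb2gray
    from skimage.feature import hog

    def hog_edge_pseudo_label(frame: np.ndarray, cell: int = 8) -> np.ndarray:
        # Per-cell HOG energy, thresholded into a coarse binary edge map.
        gray = rgb2gray(frame)
        features = hog(gray, orientations=9, pixels_per_cell=(cell, cell),
                       cells_per_block=(1, 1), feature_vector=False)
        # features has shape (n_cells_row, n_cells_col, 1, 1, 9); sum the orientation bins
        energy = features.reshape(features.shape[0], features.shape[1], -1).sum(axis=-1)
        # cells with above-median gradient energy become positive pseudo-labels
        mask = (energy > np.median(energy)).astype(np.float32)
        # expand each cell back to pixel resolution (output may be slightly smaller
        # than the frame if its sides are not multiples of the cell size)
        return np.kron(mask, np.ones((cell, cell), dtype=np.float32))

    # Example: a pseudo-label map for a synthetic 256x256 RGB frame.
    label = hog_edge_pseudo_label(np.random.rand(256, 256, 3))

A map of this kind can serve as an auxiliary supervision signal that concentrates the loss near high-gradient regions, which is one way boundary-focused pseudo-labels sharpen edge segmentation.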

Key words: laparoscopic surgical scene segmentation, large vision model, cross-scale feature synchronous extraction, global feature-relationship perception, pseudo-label-assisted supervision
