Journal of Graphics ›› 2025, Vol. 46 ›› Issue (2): 322-331.DOI: 10.11996/JG.j.2095-302X.2025020322
• Computer Graphics and Virtual Reality •
BPA-SAM: box prompt augmented SAM for traditional Chinese realistic painting
ZHANG Tiansheng1, ZHU Minfeng2, REN Yiwen3, WANG Chenhan3, ZHANG Lidong3, ZHANG Wei4, CHEN Wei1
Received: 2024-10-28
Accepted: 2024-12-13
Online: 2025-04-30
Published: 2025-04-24
Contact: ZHU Minfeng
About author: ZHANG Tiansheng (1998-), master student. His main research interest is computer vision. E-mail: 22221302@zju.edu.cn
ZHANG Tiansheng, ZHU Minfeng, REN Yiwen, WANG Chenhan, ZHANG Lidong, ZHANG Wei, CHEN Wei. BPA-SAM: box prompt augmented SAM for traditional Chinese realistic painting[J]. Journal of Graphics, 2025, 46(2): 322-331.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2025020322
| Category | Instances |
| --- | --- |
| Flower | 1294 |
| Bird | 435 |
| Fish | 31 |
| Insect | 214 |
| Seal | 1299 |
| Total | 3273 |
Table 1 Number of foreground category instances in SegTCRP
Fig. 3 Experimental results of seven algorithms on SegTCRP ((a) Input images; (b) U-Net; (c) FCN; (d) PSPNet; (e) DeepLabV3+; (f) SegFormer; (g) SAM; (h) SAM-LoRA; (i) Ground truth)
| Model | Flower | Bird | Insect | Seal | Average |
| --- | --- | --- | --- | --- | --- |
| U-Net | 65.57 | 80.13 | 65.10 | 82.26 | 73.27 |
| FCN | 54.23 | 57.93 | 58.06 | 76.46 | 61.82 |
| PSPNet | 53.91 | 69.92 | 57.68 | 60.49 | 60.51 |
| DeepLabV3+ | 64.11 | 74.16 | 72.59 | 76.43 | 71.82 |
| SegFormer | 65.83 | 76.24 | 73.41 | 77.52 | 73.25 |
| SAM | 83.51 | 83.26 | 89.44 | 87.52 | 85.93 |
| SAM-LoRA | 90.38 | 86.52 | 92.32 | 95.10 | 91.08 |
Table 2 Comparison of segmentation accuracy of different models on SegTCRP/%
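The SAM-LoRA row in Table 2 is SAM fine-tuned with low-rank adaptation [20] on its image encoder. As a minimal sketch of the mechanism (the rank r, scaling alpha, and choice of wrapped layers here are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # freeze pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)    # down-projection A
        self.lora_b = nn.Linear(r, base.out_features, bias=False)   # up-projection B
        nn.init.zeros_(self.lora_b.weight)                          # update starts at zero

        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

In ViT-style encoders such wrappers are typically applied to the query and value projections of each attention block, so only the small A/B matrices are trained while the pretrained backbone stays frozen.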
| Threshold/% | Foreground points | Background points | Random | Maximum entropy | Farthest distance |
| --- | --- | --- | --- | --- | --- |
| 5 | 2 | 0 | 92.65 | 92.73 | 92.83 |
| 5 | 0 | 2 | 91.69 | 92.57 | 92.81 |
| 5 | 2 | 2 | 91.74 | 92.37 | 92.46 |
| 15 | 2 | 0 | 92.66 | 92.91 | 92.85 |
| 15 | 0 | 2 | 91.06 | 92.55 | 92.70 |
| 15 | 2 | 2 | 92.72 | 92.17 | 92.31 |
| 25 | 2 | 0 | 92.42 | 92.64 | 92.48 |
| 25 | 0 | 2 | 91.67 | 92.58 | 92.74 |
| 25 | 2 | 2 | 91.93 | 92.03 | 91.81 |
Table 3 Comparison of segmentation accuracy of different point prompt generation strategies in BPA-SAM
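Table 3 compares three ways of turning an initial prediction into auxiliary point prompts. A sketch of the three selectors, assuming candidates are pixels drawn from the threshold band around the initial mask (the function name and candidate construction are hypothetical; only the three strategies mirror the table columns):

```python
import numpy as np

def select_prompt_points(prob: np.ndarray, candidates: np.ndarray,
                         k: int, strategy: str) -> np.ndarray:
    """Pick k prompt points (row, col) from candidate pixels.

    prob       -- per-pixel foreground probability map, shape (H, W)
    candidates -- (N, 2) array of candidate coordinates
    """
    rng = np.random.default_rng(0)
    if strategy == "random":
        return candidates[rng.choice(len(candidates), size=k, replace=False)]
    if strategy == "max_entropy":
        p = prob[candidates[:, 0], candidates[:, 1]].clip(1e-6, 1 - 1e-6)
        entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))  # binary entropy
        return candidates[np.argsort(entropy)[-k:]]           # most uncertain pixels
    if strategy == "farthest":
        picked = [candidates[rng.integers(len(candidates))]]
        while len(picked) < k:                                # greedy farthest-point sampling
            dists = np.min(
                [np.linalg.norm(candidates - p, axis=1) for p in picked], axis=0)
            picked.append(candidates[int(np.argmax(dists))])
        return np.stack(picked)
    raise ValueError(f"unknown strategy: {strategy}")
```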
Fig. 4 Selection results of the three selection strategies at a 15% threshold ((a) Random strategy; (b) Maximum entropy strategy; (c) Farthest distance strategy)
Fig. 5 Segmentation results of BPA-SAM under the three selection strategies with two foreground points at a 15% threshold ((a) Input images; (b) SAM-LoRA; (c) Random selection; (d) Maximum entropy selection; (e) Farthest distance selection; (f) Ground truth)
Fig. 6 Selection results of the farthest distance strategy under different thresholds ((a) 5% threshold; (b) 15% threshold; (c) 25% threshold)
| Threshold/% | Foreground points | Background points | Random | Maximum entropy | Farthest distance |
| --- | --- | --- | --- | --- | --- |
| 5 | 2 | 0 | 87.63 | 86.98 | 86.31 |
| 5 | 0 | 2 | 81.96 | 79.28 | 80.39 |
| 5 | 2 | 2 | 86.33 | 85.04 | 85.47 |
| 15 | 2 | 0 | 85.34 | 86.95 | 86.87 |
| 15 | 0 | 2 | 81.02 | 78.76 | 82.63 |
| 15 | 2 | 2 | 83.98 | 85.77 | 85.41 |
| 25 | 2 | 0 | 86.06 | 86.53 | 85.76 |
| 25 | 0 | 2 | 82.62 | 79.48 | 82.69 |
| 25 | 2 | 2 | 84.64 | 85.82 | 84.49 |
Table 4 Comparison of segmentation accuracy of point prompt generation strategies on SAM
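Table 4 applies the same strategies to the unadapted SAM. For context, feeding a box prompt plus augmented points through the official segment-anything predictor looks roughly as follows (checkpoint path, image, and coordinates are placeholders):

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Backbone choice and checkpoint path are placeholders.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for a painting crop
predictor.set_image(image)

# Box prompt augmented with points: label 1 = foreground, 0 = background.
point_coords = np.array([[240, 180], [260, 210], [120, 60], [400, 300]])
point_labels = np.array([1, 1, 0, 0])
box = np.array([100, 50, 420, 330])  # x0, y0, x1, y1

masks, scores, _ = predictor.predict(
    point_coords=point_coords, point_labels=point_labels,
    box=box, multimask_output=False)
```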
| Method | DSC |
| --- | --- |
| Image encoder | 91.08 |
| Mask decoder | 86.37 |
| Image encoder + mask decoder | 89.67 |
Table 5 Comparison of segmentation accuracy of different fine-tuning methods/%
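Table 5 reports DSC in percent. For reference, the standard Dice similarity coefficient over binary masks (the textbook definition, not code from the paper):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice similarity coefficient (DSC) between two binary masks, as a percentage."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 100.0 * (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)
```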
| Threshold/% | Foreground points | Random | Maximum entropy | Farthest distance |
| --- | --- | --- | --- | --- |
| 5 | 4 | 91.19 | 91.49 | 91.27 |
| 5 | 6 | 90.55 | 90.98 | 90.51 |
| 5 | 8 | 89.94 | 90.96 | 90.23 |
| 15 | 4 | 91.72 | 91.94 | 91.12 |
| 15 | 6 | 90.35 | 90.27 | 90.33 |
| 15 | 8 | 89.06 | 90.12 | 90.14 |
Table 6 Comparison of segmentation accuracy of three strategies for multiple foreground prompt points
[1] ZHANG W, KAM-KWAI W, CHEN Y T, et al. ScrollTimes: tracing the provenance of paintings as a window into history[J]. IEEE Transactions on Visualization and Computer Graphics, 2024, 30(6): 2981-2994.
[2] SHI H Z, XU D, HE K J, et al. Contrastive learning for a single historical painting's blind super-resolution[J]. Visual Informatics, 2021, 5(4): 81-88.
[3] LI M, WANG Y, XU Y Q. Computing for Chinese cultural heritage[J]. Visual Informatics, 2022, 6(1): 1-13.
[4] HUANG L L, PENG J F, ZHANG R M, et al. Learning deep representations for semantic image parsing: a comprehensive overview[J]. Frontiers of Computer Science, 2018, 12(5): 840-857.
[5] KIRILLOV A, MINTUN E, RAVI N, et al. Segment anything[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 3992-4003.
[6] WANG X L, ZHANG X S, CAO Y, et al. SegGPT: towards segmenting everything in context[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 1130-1140.
[7] WU J D, JI W, LIU Y P, et al. Medical SAM adapter: adapting segment anything model for medical image segmentation[EB/OL]. [2024-04-19]. https://arxiv.org/abs/2304.12620.
[8] CHENG J L, YE J, DENG Z Y, et al. SAM-Med2D[EB/OL]. [2024-04-19]. https://arxiv.org/abs/2308.16184.
[9] ZHANG K D, LIU D. Customized segment anything model for medical image segmentation[EB/OL]. [2024-04-19]. https://arxiv.org/abs/2304.13785.
[10] SULTAN R I, LI C Y, ZHU H, et al. GeoSAM: fine-tuning SAM with sparse and dense visual prompting for automated segmentation of mobility infrastructure[EB/OL]. [2024-04-19]. https://arxiv.org/abs/2311.11319.
[11] CHEN K Y, LIU C Y, CHEN H, et al. RSPrompter: learning to prompt for remote sensing instance segmentation based on visual foundation model[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 4701117.
[12] WANG C M, WANG R J. Image-based color ink diffusion rendering[J]. IEEE Transactions on Visualization and Computer Graphics, 2007, 13(2): 235-246.
[13] CHEN X J, LIU Q H, CHEN Y H, et al. ColorNetVis: an interactive color network analysis system for exploring the color composition of traditional Chinese painting[J]. IEEE Transactions on Visualization and Computer Graphics, 2024, 30(6): 2916-2928.
[14] COHEN N, NEWMAN Y, SHAMIR A. Semantic segmentation in art paintings[J]. Computer Graphics Forum, 2022, 41(2): 261-275.
[15] JIANG S Q, HUANG Q M, YE Q X, et al. An effective method to detect and categorize digitized traditional Chinese paintings[J]. Pattern Recognition Letters, 2006, 27(7): 734-746.
[16] SUN M J, ZHANG D, WANG Z, et al. Monte Carlo convex hull model for classification of traditional Chinese paintings[J]. Neurocomputing, 2016, 171: 788-797.
[17] WANG Z, LU D Y, ZHANG D, et al. Fake modern Chinese painting identification based on spectral-spatial feature fusion on hyperspectral image[J]. Multidimensional Systems and Signal Processing, 2016, 27(4): 1031-1044.
[18] REN T H, LIU S L, ZENG A L, et al. Grounded SAM: assembling open-world models for diverse visual tasks[EB/OL]. [2024-04-19]. https://arxiv.org/abs/2401.14159.
[19] WANG H X, VASU P K A, FAGHRI F, et al. SAM-CLIP: merging vision foundation models towards semantic and spatial understanding[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. New York: IEEE Press, 2024: 3635-3647.
[20] HU E J, SHEN Y L, WALLIS P, et al. LoRA: low-rank adaptation of large language models[EB/OL]. [2024-04-19]. https://arxiv.org/abs/2106.09685.
[21] MAO Y R, GE Y H, FAN Y J, et al. A survey on LoRA of large language models[J]. Frontiers of Computer Science, 2025, 19(7): 197605.
[22] GAO F, NIE J, HUANG L, et al. Traditional Chinese painting classification based on painting techniques[J]. Chinese Journal of Computers, 2017, 40(12): 2871-2882 (in Chinese).
[23] SHENG J C, LI Y Z. Learning artistic objects for improved classification of Chinese paintings[J]. Journal of Image and Graphics, 2018, 23(8): 1193-1206 (in Chinese).
[24] HU Q Y, ZHOU W L, PENG X L, et al. DRANet: a semantic segmentation network for Chinese landscape paintings[J]. Digital Signal Processing, 2024, 147: 104427.
[25] MA J, HE Y T, LI F F, et al. Segment anything in medical images[J]. Nature Communications, 2024, 15(1): 654.
[26] CHEN T R, ZHU L Y, DING C T, et al. SAM fails to segment anything? SAM-Adapter: adapting SAM in underperformed scenes: camouflage, shadow, medical image segmentation, and more[EB/OL]. [2024-04-19]. https://arxiv.org/pdf/2304.09148.
[27] JULKA S, GRANITZER M. Knowledge distillation with segment anything (SAM) model for planetary geological mapping[C]// The 9th International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer, 2023: 68-77.
[28] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]// The 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2015: 234-241.
[29] DAI H X, MA C, LIU Z L, et al. SAMAug: point prompt augmentation for segment anything model[EB/OL]. [2024-04-19]. https://arxiv.org/pdf/2307.01187.
[30] TANG F, DONG W M, MENG Y P, et al. Animated construction of Chinese brush paintings[J]. IEEE Transactions on Visualization and Computer Graphics, 2018, 24(12): 3019-3031.
[31] LAI Y C, CHEN B A, CHEN K W, et al. Data-driven NPR illustrations of natural flows in Chinese painting[J]. IEEE Transactions on Visualization and Computer Graphics, 2017, 23(12): 2535-2549.
[32] CHEN Y T, ZHANG W, TAN S W, et al. Visualization comparison of historical figures cohorts[J]. Journal of Graphics, 2023, 44(6): 1227-1238 (in Chinese).
[33] WANG S J, FENG Y C J, ZHU H, et al. TCPVis: visual analysis system of traditional Chinese painting school based on six principles of Chinese painting[J]. Journal of Graphics, 2024, 45(1): 209-218 (in Chinese).
[34] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. [2024-04-19]. https://arxiv.org/abs/2010.11929.
[35] XIE E Z, WANG W H, YU Z D, et al. SegFormer: simple and efficient design for semantic segmentation with transformers[EB/OL]. [2024-08-28]. https://proceedings.neurips.cc/paper_files/paper/2021/file/64f1f27bf1b4ec22924fd0acb550c235-Supplemental.pdf.
[36] EVERINGHAM M, ESLAMI S M A, VAN GOOL L, et al. The PASCAL visual object classes challenge: a retrospective[J]. International Journal of Computer Vision, 2015, 111(1): 98-136.
[37] LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2015: 3431-3440.
[38] ZHAO H S, SHI J P, QI X J, et al. Pyramid scene parsing network[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2017: 6230-6239.
[39] CHEN L C, ZHU Y K, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]// The 15th European Conference on Computer Vision. Cham: Springer, 2018: 833-841.