
Journal of Graphics ›› 2024, Vol. 45 ›› Issue (4): 683-695. DOI: 10.11996/JG.j.2095-302X.2024040683

• Image Processing and Computer Vision •

Automatic portrait matting model based on semantic guidance

CHENG Yan1,4, YAN Zhihang2,4, LAI Jianming2,4, WANG Guixi2,4, ZHONG Linhui3,4

  1. School of Software, Jiangxi Normal University, Nanchang Jiangxi 330022, China
    2. School of Digital Industry, Jiangxi Normal University, Shangrao Jiangxi 334000, China
    3. School of Computer Information and Engineering, Jiangxi Normal University, Nanchang Jiangxi 330022, China
    4. Key Laboratory of Intelligent Information Processing and Emotional Computing of Jiangxi Province, Nanchang Jiangxi 330022, China
  • Received: 2024-02-27  Accepted: 2024-05-10  Published: 2024-08-31  Online: 2024-09-03
  • First author: CHENG Yan (1976-), professor, Ph.D. Her main research interests include artificial intelligence and image processing. E-mail: chyan88888@jxnu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (62167006); National Natural Science Foundation of China (61967011); Provincial Key Laboratory Project of the Jiangxi Science and Technology Innovation Base Plan (2024SSY03131); Natural Science Foundation of Jiangxi Province (20212BAB202017); Jiangxi Province 03 Special Project and 5G Program (20212ABC03A22); Leading Talent Project of the Jiangxi Province Training Plan for Academic and Technical Leaders in Major Disciplines (20213BCJL22047)


Abstract:

To address the semantic discrimination errors and blurred matting details of existing portrait matting methods, an automatic portrait matting model based on semantic guidance was proposed. First, the hybrid CNN-Transformer architecture EMO was introduced for feature encoding. Then, the semantic segmentation decoding branch applied a multi-scale hybrid attention module to the top-level encoded features, enhancing multi-scale representation and pixel-level discrimination. Next, a feature enhancement module was employed to fuse high-level features, facilitating the flow of high-level semantic information into the shallow layers of the network. Meanwhile, the aggregation guidance module in the detail matting decoding branch aggregated features coming from different branches and used the aggregated features to better guide the network in extracting shallow features, thereby improving the accuracy of edge detail matting. Experiments on three datasets demonstrated that the proposed method achieved the best performance among the compared methods while significantly reducing the number of parameters and the computational complexity, confirming its competitiveness.
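The abstract describes a shared encoder feeding two decoding branches (semantic segmentation and detail matting) whose outputs are combined into the final alpha matte. As a rough illustration only, the following PyTorch sketch shows one way such a two-branch design can be wired together; the EMO encoder, multi-scale hybrid attention, feature enhancement, and aggregation guidance modules are replaced here by plain convolutional stand-ins, and all module names, channel widths, and the fusion rule are assumptions made for this example rather than the authors' implementation.

# Minimal, illustrative sketch (not the authors' code) of a shared-encoder,
# two-branch portrait matting model: a semantic branch predicts a coarse
# foreground/background/unknown map, a detail branch refines the alpha near
# the boundary, and the two are fused into the final matte.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for the EMO CNN-Transformer encoder: returns multi-scale features."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        f1 = self.stage1(x)   # 1/2 resolution, shallow detail features
        f2 = self.stage2(f1)  # 1/4 resolution
        f3 = self.stage3(f2)  # 1/8 resolution, high-level semantic features
        return f1, f2, f3

class TwoBranchMatting(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = TinyEncoder()
        # Semantic branch: coarse trimap-like logits (background / unknown / foreground).
        self.semantic_head = nn.Conv2d(64, 3, 1)
        # Detail branch: shallow features concatenated with upsampled high-level
        # features, standing in for high-level guidance of the detail decoder.
        self.detail_head = nn.Sequential(
            nn.Conv2d(16 + 64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1))

    def forward(self, image):
        f1, f2, f3 = self.encoder(image)
        semantic = self.semantic_head(f3)
        guide = F.interpolate(f3, size=f1.shape[2:], mode="bilinear", align_corners=False)
        detail = self.detail_head(torch.cat([f1, guide], dim=1))
        # Upsample both predictions to the input resolution before fusing.
        semantic = F.interpolate(semantic, size=image.shape[2:], mode="bilinear", align_corners=False)
        detail = F.interpolate(detail, size=image.shape[2:], mode="bilinear", align_corners=False)
        prob = semantic.softmax(dim=1)  # channels: [background, unknown, foreground]
        # Foreground probability plus the detail alpha inside the uncertain band.
        alpha = prob[:, 2:3] + prob[:, 1:2] * torch.sigmoid(detail)
        return alpha.clamp(0, 1)

if __name__ == "__main__":
    model = TwoBranchMatting()
    alpha = model(torch.randn(1, 3, 256, 256))
    print(alpha.shape)  # torch.Size([1, 1, 256, 256])

In a design of this kind, the semantic branch settles what is foreground and background, while the detail branch only has to resolve the narrow uncertain band around the portrait boundary, which matches the division of labor between the two decoding branches described in the abstract.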

Key words: portrait matting, semantic guidance, multi-scale, feature enhancement, aggregation guidance

CLC number: