Journal of Graphics ›› 2025, Vol. 46 ›› Issue (4): 837-846. DOI: 10.11996/JG.j.2095-302X.2025040837
LIAO Guoqiong1,2, HUANG Longjie1, LI Qingxin2, GU Yong3, LI Haibo1,4
Received:
2024-10-12
Revised:
2025-02-18
Published:
2025-08-30
Online:
2025-08-11
First author:
LIAO Guoqiong (1969-), male, professor, Ph.D. His main research interest is human-computer interaction. E-mail: liaoguoqiong@163.com
Abstract:
Accurate reconstruction of two-hand meshes is a crucial step toward natural human-computer interaction, yet the task remains highly challenging due to mutual occlusion between the hands, the difficulty of collecting interacting-hand datasets outdoors, and interference from complex lighting environments. Most existing work performs well only in laboratory-like settings with little environmental interference, while reconstruction quality in complex lighting scenes remains poor. To address these problems, an adaptive hand reconstruction network for monocular visible-light environments is proposed. Introducing single-hand detection boxes and weakly supervising the model with a 2D dataset of complex lighting scenes enable it to generalize to such scenes; the designed two-hand feature interactor effectively establishes long-range dependencies between left- and right-hand features, alleviating the lack of interaction cues in single-hand detection boxes; and an adaptive fusion strategy is designed to effectively merge interaction features with single-hand features, enhancing the model's robustness. Experimental results show that the proposed method achieves the best performance on the HIC dataset, which contains multiple complex lighting scenes.
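The two-hand feature interactor described in the abstract is an attention-based module (TFormer; its depth is ablated in Table 4) that lets the features of each hand attend to those of the other. Below is a minimal sketch of such cross-hand attention, assuming PyTorch; the module and variable names (TwoHandFeatureInteractor, feat_r, feat_l) and the weight-sharing choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TwoHandFeatureInteractor(nn.Module):
    """Sketch: exchange information between per-hand features via cross-attention."""
    def __init__(self, dim=256, heads=8, layers=6):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(layers))

    def forward(self, feat_r, feat_l):
        # feat_r, feat_l: (B, N, dim) token features from the right/left hand crops
        for attn, norm in zip(self.attns, self.norms):
            # each hand queries the other, establishing long-range cross-hand dependencies
            r2l, _ = attn(feat_r, feat_l, feat_l)
            l2r, _ = attn(feat_l, feat_r, feat_r)
            feat_r = norm(feat_r + r2l)
            feat_l = norm(feat_l + l2r)
        return feat_r, feat_l

# Example: two 7x7 feature maps flattened to 49 tokens of width 256
interactor = TwoHandFeatureInteractor()
fr, fl = interactor(torch.randn(2, 49, 256), torch.randn(2, 49, 256))
```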
LIAO Guoqiong, HUANG Longjie, LI Qingxin, GU Yong, LI Haibo. Adaptive two-hand reconstruction network for monocular visible light environments[J]. Journal of Graphics, 2025, 46(4): 837-846.
Fig. 1 Mesh reconstruction by existing methods during two-hand interaction ((a) Input image; (b) Modeling results)
Table 1 Comparison of different modules in the adaptive hand reconstruction network

| Flip left hand | Two-hand feature interactor | MPVPE (Single) | MPVPE (Two) | MPVPE (All) | MRRPE |
|---|---|---|---|---|---|
|  |  | 13.23 | 14.05 | 13.68 | 36.14 |
| √ |  | 12.86 | 13.75 | 13.28 | 33.78 |
| √ | √ | 9.77 | 12.39 | 11.80 | 26.03 |
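For reading the tables: MPVPE and MRRPE are the standard metrics of the interacting-hands literature, reported in mm. Their usual definitions, assumed here since the page does not restate them, are the mean per-vertex position error of the recovered mesh after root alignment, and the mean relative-root position error, i.e., the error of the left-hand root position expressed relative to the right-hand root:

```latex
% Usual definitions (an assumption; alignment details may differ in the paper):
% V_i, \hat{V}_i: ground-truth and predicted mesh vertices after root alignment;
% r_L, r_R: ground-truth left/right root (wrist) positions; \hat{r}_L, \hat{r}_R: predictions.
\mathrm{MPVPE} = \frac{1}{N} \sum_{i=1}^{N} \bigl\| \hat{V}_i - V_i \bigr\|_2, \qquad
\mathrm{MRRPE} = \bigl\| (\hat{r}_L - \hat{r}_R) - (r_L - r_R) \bigr\|_2
```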
Fig. 6 Comparison of hand flip strategies ((a) Original image; (b) With the flip strategy; (c) Without the flip strategy)
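The flip strategy compared in Fig. 6 follows a common trick: mirror the left-hand crop horizontally so that a single right-hand network serves both hands, then mirror the predicted mesh back by negating its x coordinates. A minimal sketch under that assumption (function and variable names are illustrative, not the authors' code):

```python
import torch

def reconstruct_both_hands(model, right_crop, left_crop):
    # right_crop, left_crop: (B, 3, H, W) single-hand image crops
    left_as_right = torch.flip(left_crop, dims=[-1])      # horizontal mirror
    verts_r = model(right_crop)                           # (B, V, 3) mesh vertices
    verts_l = model(left_as_right)
    # un-mirror: negate x to restore left-hand chirality
    verts_l = verts_l * verts_l.new_tensor([-1.0, 1.0, 1.0])
    return verts_r, verts_l
```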
Table 2 Parameter count comparison between the multi-scale residual backbone and ResNet50

| Network | Box IoU | Params/M |
|---|---|---|
| ResNet50 | 86.35 | 25.60 |
| Multi-scale residual backbone | 84.98 | 8.63 |
Table 3 Comparison of adaptive fusion mechanisms in the two-hand feature interactor

| Adapt fused features to each hand | Adaptive fusion of interaction and single-hand features | MPVPE (Single) | MPVPE (Two) | MPVPE (All) | MRRPE |
|---|---|---|---|---|---|
|  |  | 12.67 | 13.65 | 13.57 | 30.28 |
| √ |  | 11.23 | 12.98 | 13.01 | 29.24 |
| √ | √ | 9.77 | 12.39 | 11.80 | 26.03 |
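The two switches ablated in Table 3 suggest a two-step gated design: the fused interaction feature is first adapted to each hand, and is then blended with that hand's single-hand feature through learned weights rather than plain addition. One plausible sketch of such adaptive fusion (layer choices and names are assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Sketch: per-channel gate deciding how much interaction feature to inject."""
    def __init__(self, dim=256):
        super().__init__()
        self.adapt = nn.Linear(dim, dim)  # adapt the shared interaction feature to one hand
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, single_feat, inter_feat):
        # single_feat, inter_feat: (B, N, dim)
        inter_feat = self.adapt(inter_feat)
        g = self.gate(torch.cat([single_feat, inter_feat], dim=-1))
        return g * inter_feat + (1.0 - g) * single_feat
```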
Table 4 Relationship between the number of stacked attention layers, hand estimation error, and parameter count in TFormer

| Attention layers | Params/M | MPVPE (Single) | MPVPE (Two) | MPVPE (All) | MRRPE |
|---|---|---|---|---|---|
| 4 | 58.91 | 10.03 | 12.72 | 12.31 | 26.90 |
| 6 | 65.20 | 9.77 | 12.39 | 11.80 | 26.03 |
| 8 | 73.41 | 9.73 | 12.31 | 11.74 | 26.01 |
| 12 | 81.69 | 9.68 | 12.28 | 11.71 | 25.98 |
Table 5 Error comparison on the HIC dataset/mm

| Method | MPVPE (Single) | MPVPE (Two) | MPVPE (All) | MRRPE |
|---|---|---|---|---|
| EANet | 29.18 | 32.66 | 30.68 | 76.82 |
| Keypoint | 46.96 | 42.39 | 45.12 | 127.31 |
| InterWild | 15.53 | 15.98 | 15.83 | 30.39 |
| IntagHand | - | 50.13 | - | - |
| AHRNet | 15.12 | 15.59 | 15.32 | 30.02 |
Table 6 Error comparison on the InterHand2.6M test set/mm

| Method | MPVPE (Single) | MPVPE (Two) | MPVPE (All) | MRRPE |
|---|---|---|---|---|
| EANet | 8.61 | 10.23 | 9.72 | 31.29 |
| Keypoint | 12.16 | 15.01 | 13.54 | 32.96 |
| InterWild | 10.09 | 12.46 | 11.91 | 27.71 |
| IntagHand | - | 9.48 | - | - |
| AHRNet | 9.77 | 12.39 | 11.80 | 26.03 |
Table 7 Impact of introducing a real-world dataset into training on laboratory-scene performance

| IH2.6M | COCO | MPVPE (Single) | MPVPE (Two) | MPVPE (All) | MRRPE |
|---|---|---|---|---|---|
| √ |  | 8.53 | 10.77 | 10.25 | 25.33 |
| √ | √ | 9.77 | 12.39 | 11.80 | 26.03 |
Fig. 8 Comparison of hand modeling in real-world scenarios ((a) Severe occlusion; (b) Complex hand poses)
Fig. 9 Comparison of rendered hand meshes against complex backgrounds ((a) Original image; (b) Ours; (c) EANet; (d) IntagHand)
[1] BI C Y, LIU Y. A survey of video human action recognition based on deep learning[J]. Journal of Graphics, 2023, 44(4): 625-639 (in Chinese).
[2] HUANG Y W, LIN Z Q, ZHANG J, et al. Lightweight human pose estimation algorithm combined with coordinate Transformer[J]. Journal of Graphics, 2024, 45(3): 516-527 (in Chinese).
[3] HAO S, ZHAO X S, MA X, et al. Multi-class defect target detection method for transmission lines based on TR-YOLOv5[J]. Journal of Graphics, 2023, 44(4): 667-676 (in Chinese).
[4] CHEN L J, LIN S Y, XIE Y S, et al. MVHM: a large-scale multi-view hand mesh benchmark for accurate 3D hand pose estimation[C]// 2021 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2021: 836-845.
[5] KHALEGHI L, SEPAS-MOGHADDAM A, MARSHALL J, et al. Multiview video-based 3-D hand pose estimation[J]. IEEE Transactions on Artificial Intelligence, 2023, 4(4): 896-909.
[6] XUE H W, WANG M L. Hand reconstruction incorporating biomechanical constraints and multi-modal data[J]. Journal of Graphics, 2023, 44(4): 794-800 (in Chinese).
[7] REHG J M, KANADE T. DigitEyes: vision-based hand tracking for human-computer interaction[C]// 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects. New York: IEEE Press, 1994: 16-22.
[8] STENGER B, THAYANANTHAN A, TORR P H S, et al. Model-based hand tracking using a hierarchical Bayesian filter[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(9): 1372-1384.
[9] CAO Z, RADOSAVOVIC I, KANAZAWA A, et al. Reconstructing hand-object interactions in the wild[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 12397-12406.
[10] GRADY P, TANG C C, TWIGG C D, et al. ContactOpt: optimizing contact to improve grasps[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 1471-1481.
[11] LIU S W, JIANG H W, XU J R, et al. Semi-supervised 3D hand-object poses estimation with interactions in time[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 14682-14692.
[12] CAI Y J, GE L H, CAI J F, et al. Weakly-supervised 3D hand pose estimation from monocular RGB images[C]// The 15th European Conference on Computer Vision. Cham: Springer, 2018: 678-694.
[13] ZIMMERMANN C, BROX T. Learning to estimate 3D hand pose from single RGB images[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 4913-4921.
[14] ROMERO J, TZIONAS D, BLACK M J. Embodied hands: modeling and capturing hands and bodies together[J]. ACM Transactions on Graphics, 2017, 36(6): 245.
[15] BOUKHAYMA A, DE BEM R, TORR P H S. 3D hand shape and pose from images in the wild[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 10835-10844.
[16] GU J X, WANG Z H, KUEN J, et al. Recent advances in convolutional neural networks[J]. Pattern Recognition, 2018, 77: 354-377.
[17] ZHANG B W, WANG Y G, DENG X M, et al. Interacting two-hand 3D pose and shape reconstruction from single color image[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 11334-11343.
[18] REN Z, YUAN J S, MENG J J, et al. Robust part-based hand gesture recognition using Kinect sensor[J]. IEEE Transactions on Multimedia, 2013, 15(5): 1110-1120.
[19] MUELLER F, BERNARD F, SOTNYCHENKO O, et al. GANerated hands for real-time 3D hand tracking from monocular RGB[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 49-59.
[20] DIBRA E, WOLF T, OZTIRELI C, et al. How to refine 3D hand pose estimation from unlabelled depth data?[C]// 2017 International Conference on 3D Vision (3DV). New York: IEEE Press, 2017: 135-144.
[21] LI M C, AN L, ZHANG H W, et al. Interacting attention graph for single image two-hand reconstruction[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 2751-2760.
[22] PARK J, JUNG D S, MOON G, et al. Extract-and-adaptation network for 3D interacting hand mesh recovery[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 4202-4211.
[23] ESHRATIFAR A E, ESMAILI A, PEDRAM M. BottleNet: a deep learning architecture for intelligent mobile cloud computing services[C]// 2019 IEEE/ACM International Symposium on Low Power Electronics and Design. New York: IEEE Press, 2019: 1-6.
[24] LIN F Q, WILHELM C, MARTINEZ T. Two-hand global 3D pose estimation using monocular RGB[C]// 2021 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2021: 2372-2380.
[25] HUANG H B, ZHOU X Q, CAO J, et al. Vision transformer with super token sampling[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 22690-22699.
[26] MOON G, YU S, WEN H, et al. InterHand2.6M: a dataset and baseline for 3D interacting hand pose estimation from a single RGB image[EB/OL]. [2024-06-07]. https://dblp.org/rec/journals/corr/abs-2008-09309.html.
[27] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]// The 13th European Conference on Computer Vision. Cham: Springer, 2014: 740-755.
[28] TZIONAS D, BALLAN L, SRIKANTHA A, et al. Capturing hands in action using discriminative salient points and physics simulation[J]. International Journal of Computer Vision, 2016, 118(2): 172-193.
[29] HAMPALI S, SARKAR S D, RAD M, et al. Keypoint transformer: solving joint identification in challenging hands and object interactions for accurate 3D pose estimation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 11080-11090.
[30] MOON G. Bringing inputs to shared domains for 3D interacting hands recovery in the wild[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 17028-17037.