Journal of Graphics ›› 2025, Vol. 46 ›› Issue (4): 837-846.DOI: 10.11996/JG.j.2095-302X.2025040837
• Computer Graphics and Virtual Reality •

Adaptive two-hand reconstruction network for monocular visible light environments

LIAO Guoqiong1,2, HUANG Longjie1, LI Qingxin2, GU Yong3, LI Haibo1,4
Received: 2024-10-12 | Revised: 2025-02-18 | Online: 2025-08-30 | Published: 2025-08-11

First author: LIAO Guoqiong (1969-), professor, Ph.D. His main research interests cover human-computer interaction. E-mail: liaoguoqiong@163.com
LIAO Guoqiong, HUANG Longjie, LI Qingxin, GU Yong, LI Haibo. Adaptive two-hand reconstruction network for monocular visible light environments[J]. Journal of Graphics, 2025, 46(4): 837-846.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2025040837
Table 1 Comparison of different modules in the adaptive hand reconstruction network

| Flip left hand | Two-hand feature interactor | MPVPE (Single) | MPVPE (Two) | MPVPE (All) | MRRPE |
|---|---|---|---|---|---|
| | | 13.23 | 14.05 | 13.68 | 36.14 |
| √ | | 12.86 | 13.75 | 13.28 | 33.78 |
| √ | √ | 9.77 | 12.39 | 11.80 | 26.03 |
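The MPVPE and MRRPE columns report, presumably, the mean per-vertex position error of the recovered hand meshes and the mean relative-root position error between the two hands, both in millimetres. A minimal sketch of how such metrics are typically computed (the function names and the root-alignment assumption are ours, not from the paper):

```python
import math

def mpvpe(pred_verts, gt_verts):
    """Mean per-vertex position error: average Euclidean distance (mm)
    between predicted and ground-truth mesh vertices, assuming both
    meshes are already aligned at the hand root joint."""
    assert len(pred_verts) == len(gt_verts)
    return sum(math.dist(p, g) for p, g in zip(pred_verts, gt_verts)) / len(pred_verts)

def mrrpe(pred_roots, gt_roots):
    """Mean relative-root position error: error (mm) of the right-hand
    root position expressed relative to the left-hand root, i.e. how
    well the two hands are placed relative to each other.
    Each argument is a (left_root, right_root) pair of 3D points."""
    (pl, pr), (gl, gr) = pred_roots, gt_roots
    pred_rel = [r - l for r, l in zip(pr, pl)]
    gt_rel = [r - l for r, l in zip(gr, gl)]
    return math.dist(pred_rel, gt_rel)
```

For example, a prediction whose relative right-hand root is off by (3, 4, 0) mm yields an MRRPE of 5 mm.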
Table 2 Comparison of parameter quantities between the multi-scale residual backbone network and ResNet50

| Network | Box IoU | Parameters/M |
|---|---|---|
| ResNet50 | 86.35 | 25.60 |
| Multi-scale residual backbone network | 84.98 | 8.63 |
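The Box IoU column in Table 2 presumably measures hand-detection quality as the intersection-over-union between predicted and ground-truth bounding boxes. A minimal sketch of the standard computation for axis-aligned boxes (the `(x1, y1, x2, y2)` corner convention is our assumption):

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as
    (x1, y1, x2, y2) corners with x1 <= x2 and y1 <= y2."""
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A table value of 86.35 would then be this ratio averaged over the test set, expressed as a percentage.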
Table 3 Comparison of adaptive fusion mechanisms for the two-hand feature interactor

| Fused features adapted to each hand | Adaptive fusion of interaction and single-hand features | MPVPE (Single) | MPVPE (Two) | MPVPE (All) | MRRPE |
|---|---|---|---|---|---|
| | | 12.67 | 13.65 | 13.57 | 30.28 |
| √ | | 11.23 | 12.98 | 13.01 | 29.24 |
| √ | √ | 9.77 | 12.39 | 11.80 | 26.03 |
Table 4 Relationship between the number of stacked attention layers, hand estimation error, and parameter quantity in TFormer

| Attention layers | Model parameters/M | MPVPE (Single) | MPVPE (Two) | MPVPE (All) | MRRPE |
|---|---|---|---|---|---|
| 4 | 58.91 | 10.03 | 12.72 | 12.31 | 26.90 |
| 6 | 65.20 | 9.77 | 12.39 | 11.80 | 26.03 |
| 8 | 73.41 | 9.73 | 12.31 | 11.74 | 26.01 |
| 12 | 81.69 | 9.68 | 12.28 | 11.71 | 25.98 |
Table 5 Error comparison on the HIC dataset/mm

| Method | MPVPE (Single) | MPVPE (Two) | MPVPE (All) | MRRPE |
|---|---|---|---|---|
| EANet | 29.18 | 32.66 | 30.68 | 76.82 |
| Keypoint | 46.96 | 42.39 | 45.12 | 127.31 |
| InterWild | 15.53 | 15.98 | 15.83 | 30.39 |
| IntagHand | - | 50.13 | - | - |
| AHRNet | 15.12 | 15.59 | 15.32 | 30.02 |
Table 6 Error comparison on the InterHand2.6M test set/mm

| Method | MPVPE (Single) | MPVPE (Two) | MPVPE (All) | MRRPE |
|---|---|---|---|---|
| EANet | 8.61 | 10.23 | 9.72 | 31.29 |
| Keypoint | 12.16 | 15.01 | 13.54 | 32.96 |
| InterWild | 10.09 | 12.46 | 11.91 | 27.71 |
| IntagHand | - | 9.48 | - | - |
| AHRNet | 9.77 | 12.39 | 11.80 | 26.03 |
Table 7 Comparison of the impact of introducing real-world datasets for training on laboratory scenarios

| IH2.6M | COCO | MPVPE (Single) | MPVPE (Two) | MPVPE (All) | MRRPE |
|---|---|---|---|---|---|
| √ | | 8.53 | 10.77 | 10.25 | 25.33 |
| √ | √ | 9.77 | 12.39 | 11.80 | 26.03 |
[1] | BI C Y, LIU Y. A survey of video human action recognition based on deep learning[J]. Journal of Graphics, 2023, 44(4): 625-639 (in Chinese). |
[2] | HUANG Y W, LIN Z Q, ZHANG J, et al. Lightweight human pose estimation algorithm combined with coordinate Transformer[J]. Journal of Graphics, 2024, 45(3): 516-527 (in Chinese). |
[3] | HAO S, ZHAO X S, MA X, et al. Multi-class defect target detection method for transmission lines based on TR-YOLOv5[J]. Journal of Graphics, 2023, 44(4): 667-676 (in Chinese). |
[4] | CHEN L J, LIN S Y, XIE Y S, et al. MVHM: a large-scale multi-view hand mesh benchmark for accurate 3D hand pose estimation[C]// 2021 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2021: 836-845. |
[5] | KHALEGHI L, SEPAS-MOGHADDAM A, MARSHALL J, et al. Multiview video-based 3-D hand pose estimation[J]. IEEE Transactions on Artificial Intelligence, 2023, 4(4): 896-909. |
[6] |
薛皓玮, 王美丽. 融合生物力学约束与多模态数据的手部重建[J]. 图学学报, 2023, 44(4): 794-800.
DOI |
XUE H W, WANG M L. Hand reconstruction incorporating biomechanical constraints and multi-modal data[J]. Journal of Graphics, 2023, 44(4): 794-800 (in Chinese). | |
[7] | REHG J M, KANADE T. DigitEyes: vision-based hand tracking for human-computer interaction[C]// 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects. New York: IEEE Press, 1994: 16-22. |
[8] | STENGER B, THAYANANTHAN A, TORR P H S, et al. Model-based hand tracking using a hierarchical Bayesian filter[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(9): 1372-1384. |
[9] | CAO Z, RADOSAVOVIC I, KANAZAWA A, et al. Reconstructing hand-object interactions in the wild[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 12397-12406. |
[10] | GRADY P, TANG C C, TWIGG C D, et al. ContactOpt: optimizing contact to improve grasps[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 1471-1481. |
[11] | LIU S W, JIANG H W, XU J R, et al. Semi-supervised 3D hand-object poses estimation with interactions in time[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 14682-14692. |
[12] | CAI Y J, GE L H, CAI J F, et al. Weakly-supervised 3D hand pose estimation from monocular RGB images[C]// The 15th European Conference on Computer Vision. Cham: Springer, 2018: 678-694. |
[13] | ZIMMERMANN C, BROX T. Learning to estimate 3D hand pose from single RGB images[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 4913-4921. |
[14] | ROMERO J, TZIONAS D, BLACK M J. Embodied hands: modeling and capturing hands and bodies together[J]. ACM Transactions on Graphics, 2017, 36(6): 245. |
[15] | BOUKHAYMA A, DE BEM R, TORR P H S. 3D hand shape and pose from images in the wild[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 10835-10844. |
[16] | GU J X, WANG Z H, KUEN J, et al. Recent advances in convolutional neural networks[J]. Pattern Recognition, 2018, 77: 354-377. |
[17] | ZHANG B W, WANG Y G, DENG X M, et al. Interacting two-hand 3D pose and shape reconstruction from single color image[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 11334-11343. |
[18] | REN Z, YUAN J S, MENG J J, et al. Robust part-based hand gesture recognition using Kinect sensor[J]. IEEE Transactions on Multimedia, 2013, 15(5): 1110-1120. |
[19] | MUELLER F, BERNARD F, SOTNYCHENKO O, et al. GANerated hands for real-time 3D hand tracking from monocular RGB[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 49-59. |
[20] | DIBRA E, WOLF T, OZTIRELI C, et al. How to refine 3D hand pose estimation from unlabelled depth data?[C]//2017 International Conference on 3D Vision (3DV). New York: IEEE Press, 2017: 135-144. |
[21] | LI M C, AN L, ZHANG H W, et al. Interacting attention graph for single image two-hand reconstruction[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 2751-2760. |
[22] | PARK J, JUNG D S, MOON G, et al. Extract-and-adaptation network for 3D interacting hand mesh recovery[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 4202-4211. |
[23] | ESHRATIFAR A E, ESMAILI A, PEDRAM M. BottleNet: a deep learning architecture for intelligent mobile cloud computing services[C]// 2019 IEEE/ACM International Symposium on Low Power Electronics and Design. New York: IEEE Press, 2019: 1-6. |
[24] | LIN F Q, WILHELM C, MARTINEZ T. Two-hand global 3D pose estimation using monocular RGB[C]// 2021 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2021: 2372-2380. |
[25] | HUANG H B, ZHOU X Q, CAO J, et al. Vision transformer with super token sampling[C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 22690-22699. |
[26] | MOON G, YU S, WEN H, et al. InterHand2.6M: a dataset and baseline for 3D interacting hand pose estimation from a single RGB image[EB/OL]. [2024-06-07]. https://dblp.org/rec/journals/corr/abs-2008-09309.html. |
[27] | LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]// The 13th European Conference on Computer Vision. Cham: Springer, 2014: 740-755. |
[28] | TZIONAS D, BALLAN L, SRIKANTHA A, et al. Capturing hands in action using discriminative salient points and physics simulation[J]. International Journal of Computer Vision, 2016, 118(2): 172-193. |
[29] | HAMPALI S, SARKAR S D, RAD M, et al. Keypoint transformer: solving joint identification in challenging hands and object interactions for accurate 3D pose estimation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 11080-11090. |
[30] | MOON G. Bringing inputs to shared domains for 3D interacting hands recovery in the wild[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 17028-17037. |