3D piece-wise planar reconstruction from a single indoor image based on self-augmented attention mechanism

doi:10.11996/JG.j.2095-302X.2024030464

Abstract

Piece-wise planar 3D reconstruction of indoor scenes using convolutional neural networks (CNNs) has become an active topic in indoor scene modeling. However, planar and non-planar elements are often intertwined, so the network tends to extract non-planar information mixed in with planar features, which degrades the final segmentation accuracy. Moreover, the planes in indoor scenes differ greatly in scale, which causes pronounced class imbalance and makes small-scale plane instances prone to distortion. To address these challenges, this paper proposes a multi-scale feature fusion network based on self-enhanced attention for piece-wise planar 3D reconstruction. The network automatically learns planar features in the scene and fuses feature information from different scales, thereby improving the accuracy of plane instance segmentation. In addition, each pixel in a plane instance is assigned its own weight, with larger weights for the edge pixels of small-scale planes, which further strengthens the channel representation of small-scale plane segments. Finally, a new loss function that combines balanced cross-entropy loss and Dice loss is used to train the model, further improving plane segmentation accuracy. Extensive experiments show that the proposed algorithm improves plane recall and segmentation accuracy, yielding more accurate piece-wise planar 3D reconstructions of indoor scenes.
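The combined loss mentioned in the abstract can be illustrated with a minimal sketch. This page does not give the paper's exact formulation, so the code below assumes the commonly used definitions: a class-balanced binary cross-entropy in which the weight `beta` is the share of negative pixels, and a Dice loss with a smoothing constant. The function names, the `smooth` and `lam` parameters, and the per-instance binary-mask setting are illustrative assumptions, not the authors' implementation.

```python
import math

def balanced_bce(probs, targets, eps=1e-7):
    """Class-balanced binary cross-entropy over a flattened mask (assumed form)."""
    n = len(targets)
    pos = sum(targets)
    # beta is the share of negative pixels, so the rarer positive class
    # (e.g. a small plane) receives the larger weight.
    beta = (n - pos) / n
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= beta * t * math.log(p) + (1.0 - beta) * (1 - t) * math.log(1.0 - p)
    return total / n

def dice_loss(probs, targets, smooth=1.0):
    """Dice loss: 1 - (2 * overlap + smooth) / (sum of both masks + smooth)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)

def combined_loss(probs, targets, lam=1.0):
    """Sum of the two terms; lam is a hypothetical trade-off weight."""
    return balanced_bce(probs, targets) + lam * dice_loss(probs, targets)

if __name__ == "__main__":
    target = [1, 0, 0, 0]            # one positive pixel out of four
    good = [0.99, 0.01, 0.01, 0.01]  # close to the target
    bad = [0.01, 0.99, 0.99, 0.99]   # inverted prediction
    print(combined_loss(good, target), combined_loss(bad, target))
```

The cross-entropy term penalizes per-pixel errors, while the Dice term depends on the overlap ratio and is therefore less sensitive to how many pixels a plane covers, which is one common motivation for combining the two under class imbalance.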

Key words: deep learning, piece-wise planar reconstruction, multi-scale fusion, enhanced attention, self-attention

CLC Number: TP183

ZHU Guanghui, MIAO Jun, HU Hongli, SHEN Ji, DU Ronghua. 3D piece-wise planar reconstruction from a single indoor image based on self-augmented attention mechanism[J]. Journal of Graphics, 2024, 45(3): 464-471.

Figures/Tables 9

Fig. 1 Algorithm workflow

Fig. 2 Self-enhancing attention module

Fig. 3 Reconstruction results ((a) PlaneNet; (b) PlaneAE; (c) PlaneRecNet; (d) InvPT; (e) Ours)

Fig. 4 Examples of partial segmentation results ((a) Original image; (b) PlaneNet; (c) PlaneAE; (d) PlaneRecNet; (e) InvPT; (f) Ours)

Table 1 Comparison of depth accuracy based on NYU-V2 dataset

Method	Rel↓	Rel(sqr)↓	log₁₀↓	RMSE_lin↓	RMSE_log↓	δ<1.25↑	δ<1.25²↑	δ<1.25³↑
PlaneNet	0.142	0.107	0.060	0.514	0.179	81.2	95.7	98.9
PlaneAE	0.141	0.107	0.061	0.529	0.184	81.0	95.7	99.0
PlaneRecNet	0.138	0.099	0.058	0.512	0.179	82.0	95.9	99.0
InvPT	0.136	0.099	0.059	0.518	0.179	82.0	95.9	99.0
Ours	0.134	0.098	0.058	0.506	0.177	82.1	96.2	99.0
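The column names in Table 1 match the depth-error metrics widely used on NYU-V2. As a sketch, assuming those standard definitions (this page does not spell them out), they can be computed as below. The function name `depth_metrics` and the dictionary keys are illustrative. The threshold accuracies are returned as fractions, whereas the table lists percentages.

```python
import math

def depth_metrics(pred, gt):
    """Standard single-image depth metrics over paired positive depths (assumed definitions)."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n          # Rel
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n         # Rel(sqr)
    log10 = sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n
    rmse_lin = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    # Threshold accuracy: share of pixels whose ratio max(p/g, g/p) is below 1.25^k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    delta = {k: sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)}
    return {
        "rel": abs_rel, "rel_sqr": sq_rel, "log10": log10,
        "rmse_lin": rmse_lin, "rmse_log": rmse_log,
        "delta1": delta[1], "delta2": delta[2], "delta3": delta[3],
    }

if __name__ == "__main__":
    predicted = [1.0, 2.2, 3.1, 4.5]   # made-up depths in metres
    reference = [1.0, 2.0, 3.0, 5.0]
    for name, value in depth_metrics(predicted, reference).items():
        print(f"{name}: {value:.4f}")
```

Lower is better for the five error columns and higher is better for the three threshold columns, which is what the arrows in the table header indicate.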

Table 2 Comparison of segmentation accuracy based on NYU-V2 dataset

Method	$AP^{m}$	$AP^{m}_{50}$	$AP^{m}_{75}$	$AP^{b}$	$AP^{b}_{50}$	$AP^{b}_{75}$
PlaneAE	45.40	57.62	46.26	49.52	61.77	49.83
PlaneRCNN	62.71	75.18	69.26	65.13	76.27	71.09
PlaneRecNet	63.88	78.95	73.17	64.93	79.00	69.80
InvPT	64.22	79.05	72.60	64.86	78.82	71.36
Ours	65.11	79.98	65.62	66.24	81.10	72.37

Fig. 5 ScanNet dataset ((a) Planar recall; (b) Pixel recall)

Table 3 Comparison of segmentation accuracy in ablation experiments

Method	RI	VI	SC
Original network	0.888	1.380	0.519
SAA	0.893	1.322	0.546
L_ED	0.891	1.361	0.530
SAA+L_ED	0.911	1.304	0.551
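Two of the segmentation metrics in Table 3, the Rand index (RI, higher is better) and the variation of information (VI, lower is better), can be sketched as follows. This assumes the textbook definitions over flattened per-pixel label lists and uses the natural logarithm for VI; the page does not state which log base or evaluation protocol the paper uses, and the function names are illustrative. Segmentation covering (SC) is not included in this sketch.

```python
import math
from collections import Counter

def rand_index(a, b):
    """Share of pixel pairs on which two labelings agree (same/same or different/different)."""
    n = len(a)
    total_pairs = n * (n - 1) // 2

    def pairs_within(counts):
        # Number of pixel pairs that fall into the same group.
        return sum(c * (c - 1) // 2 for c in counts.values())

    same_both = pairs_within(Counter(zip(a, b)))
    same_a = pairs_within(Counter(a))
    same_b = pairs_within(Counter(b))
    # Agreements = pairs together in both + pairs apart in both.
    agree = total_pairs + 2 * same_both - same_a - same_b
    return agree / total_pairs

def variation_of_information(a, b):
    """VI = 2 * H(A, B) - H(A) - H(B), in nats."""
    n = len(a)

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    return 2 * entropy(Counter(zip(a, b))) - entropy(Counter(a)) - entropy(Counter(b))

if __name__ == "__main__":
    predicted = [0, 0, 1, 1, 2, 2]   # made-up plane labels per pixel
    reference = [0, 0, 1, 1, 1, 2]
    print(rand_index(predicted, reference))
    print(variation_of_information(predicted, reference))
```

Both measures are invariant to how the plane labels are numbered, so a prediction that swaps label IDs but groups pixels identically still scores RI = 1 and VI = 0.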

Table 4 Comparison of segmentation accuracy using different numbers of self-attention enhancement modules

Number	RI	VI	SC
1	0.890	1.378	0.530
2	0.895	1.322	0.546
3	0.911	1.304	0.551

References 24

[1]	LIU C, YANG J M, CEYLAN D, et al. PlaneNet: piece-wise planar reconstruction from a single RGB image[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 2579-2588.
[2]	YANG F T, ZHOU Z H. Recovering 3D planes from a single image via convolutional neural networks[C]// European Conference on Computer Vision. Cham: Springer, 2018: 87-103.
[3]	YE H R, XU D. Inverted pyramid multi-task transformer for dense scene understanding[C]// European Conference on Computer Vision. Cham: Springer, 2022: 514-530.
[4]	YU Z H, ZHENG J, LIAN D Z, et al. Single-image piece-wise planar 3D reconstruction via associative embedding[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 1029-1037.
[5]	LIU C, KIM K, GU J W, et al. PlaneRCNN: 3D plane detection and reconstruction from a single image[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 4445-4454.
[6]	HE K M, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2980-2988.
[7]	QIAN Y M, FURUKAWA Y. Learning pairwise inter-plane relations for piecewise planar reconstruction[C]// European Conference on Computer Vision. Cham: Springer, 2020: 330-345.
[8]	XI W J, CHEN X J. Reconstructing piecewise planar scenes with multi-view regularization[J]. Computational Visual Media, 2019, 5(4): 337-345.
[9]	XIE Y X, RAMBACH J, SHU F W, et al. PlaneSegNet: fast and robust plane estimation using a single-stage instance segmentation CNN[C]// 2021 IEEE International Conference on Robotics and Automation. New York: IEEE Press, 2021: 13574-13580.
[10]	TAN B, XUE N, BAI S, et al. PlaneTR: structure-guided transformers for 3D plane recovery[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 4166-4175.
[11]	ZHANG W D, ZHANG Y M, SONG R, et al. 3D layout estimation via weakly supervised learning of plane parameters from 2D segmentation[J]. IEEE Transactions on Image Processing, 2022, 31: 868-879.
[12]	LIU J C, JI P, BANSAL N, et al. PlaneMVS: 3D plane reconstruction from multi-view stereo[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 8655-8665.
[13]	JIANG R X, MIAO J, CHU J, et al. 3D planar reconstruction of urban scene from single image based on multi-scale focusing network[J]. Journal of Chinese Computer Systems, 2022, 43(8): 1718-1724 (in Chinese).
[14]	XIE Y X, SHU F W, RAMBACH J, et al. PlaneRecNet: multi-task learning with cross-task consistency for piece-wise plane detection and reconstruction from a single RGB image[EB/OL]. (2021-10-21) [2023-08-22]. http://arxiv.org/abs/2110.11219.
[15]	HERMANN K M, KOČISKÝ T, GREFENSTETTE E, et al. Teaching machines to read and comprehend[EB/OL]. (2015-11-19) [2023-07-18]. http://arxiv.org/abs/1506.03340.
[16]	GAO Z L, XIE J T, WANG Q L, et al. Global second-order pooling convolutional networks[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3019-3028.
[17]	LEE H, KIM H E, NAM H. SRM: a style-based recalibration module for convolutional neural networks[C]// 2019 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2019: 1854-1862.
[18]	HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 7132-7141.
[19]	MNIH V, HEESS N, GRAVES A, et al. Recurrent models of visual attention[EB/OL]. (2014-06-24) [2023-06-18]. http://arxiv.org/abs/1406.6247.
[20]	HU J, SHEN L, ALBANIE S, et al. Gather-excite: exploiting feature context in convolutional neural networks[EB/OL]. (2019-01-12) [2023-07-23]. http://arxiv.org/abs/1810.12348.
[21]	WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]// European Conference on Computer Vision. Cham: Springer, 2018: 3-19.
[22]	WANG Z W, SHE Q, SMOLIC A. ACTION-net: multipath excitation for action recognition[C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2021: 13209-13218.
[23]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[EB/OL]. (2017-06-12) [2023-07-13]. http://arxiv.org/abs/1706.03762.
[24]	YANG B S, WANG L Y, WONG D, et al. Convolutional self-attention networks[EB/OL]. [2023-07-13]. http://arxiv.org/abs/1904.03107.