Journal of Graphics ›› 2023, Vol. 44 ›› Issue (5): 899-906.DOI: 10.11996/JG.j.2095-302X.2023050899
• Image Processing and Computer Vision •
DANG Hong-she1(), XU Huai-biao1, ZHANG Xuan-de2
Received: 2023-04-28
Accepted: 2023-08-01
Online: 2023-10-31
Published: 2023-10-31
About author: DANG Hong-she (1962-), professor, Ph.D. His main research interests cover industrial intelligent control (industrial robots), wireless sensor networks and digital image processing. E-mail: danghs@sust.edu.cn
DANG Hong-she, XU Huai-biao, ZHANG Xuan-de. Deep learning stereo matching algorithm fusing structural information[J]. Journal of Graphics, 2023, 44(5): 899-906.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2023050899
| Layer | Kernel size, Channels | Output |
| --- | --- | --- |
| Conv0_1 | 3×3, 32 | $\frac{1}{2}H\times \frac{1}{2}W\times 32$ |
| Conv0_2 | 1×1, 32 | $\frac{1}{2}H\times \frac{1}{2}W\times 32$ |
| Conv1_x | 3×3, 32; 1×1, 32 | $\frac{1}{2}H\times \frac{1}{2}W\times 32$ |
| Conv2_x+SE | 3×3, 64; 1×1, 64 | $\frac{1}{4}H\times \frac{1}{4}W\times 64$ |
| Conv3_x+SE | 3×3, 128; 1×1, 128 | $\frac{1}{4}H\times \frac{1}{4}W\times 128$ |
| Conv4_x+SE | 3×3, 128; 1×1, 128 | $\frac{1}{4}H\times \frac{1}{4}W\times 128$ |

Table 1 Feature extraction network structure parameters
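The output column of Table 1 implies where the two downsampling steps occur: Conv0_1 reduces the input to half resolution, and Conv2_x reduces it again to quarter resolution. A minimal sketch of this shape arithmetic, assuming stride-2 convolutions with "same" padding at those two layers (the stride placement is inferred from the table, not stated in it):

```python
# Illustrative shape tracer for the feature-extraction network in Table 1.
# Strides are inferred from the table's output sizes; the layer tuples are
# a hypothetical re-encoding of the table, not the authors' actual code.

def conv_out(size, stride):
    """Spatial size after a stride-s convolution with 'same' padding."""
    return (size + stride - 1) // stride

def trace(H, W):
    # (name, stride, out_channels)
    layers = [
        ("Conv0_1",    2, 32),   # 3x3 conv, downsamples to 1/2 resolution
        ("Conv0_2",    1, 32),   # 1x1 conv
        ("Conv1_x",    1, 32),   # residual blocks (3x3 then 1x1)
        ("Conv2_x+SE", 2, 64),   # residual blocks + SE attention, to 1/4
        ("Conv3_x+SE", 1, 128),
        ("Conv4_x+SE", 1, 128),
    ]
    shapes = {}
    h, w = H, W
    for name, stride, channels in layers:
        h, w = conv_out(h, stride), conv_out(w, stride)
        shapes[name] = (h, w, channels)
    return shapes

shapes = trace(256, 512)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

For a 256×512 input this reproduces the table's progression: 128×256×32 after Conv0_1 through Conv1_x, then 64×128 spatial resolution with 64 and 128 channels for the SE-augmented stages.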
| Method | EPE (px) | D1-all (%) | >1 px (%) | Time (s) |
| --- | --- | --- | --- | --- |
| 1. ResCNN | 1.07 | 3.89 | 7.68 | 0.73 |
| 2. Improved ResCNN | 1.02 | 3.54 | 7.65 | 0.67 |
| 3. Improved ResCNN + LSP | 0.78 | 1.67 | 7.31 | 0.70 |
| 4. Improved ResCNN + ACV | 0.51 | 1.95 | 7.03 | 0.60 |
| 5. Improved ResCNN + LSP + ACV | 0.45 | 1.55 | 6.87 | 0.62 |

Table 2 Ablation experimental results of network modules
| Method | D1-bg | D1-fg | D1-all |
| --- | --- | --- | --- |
| PSMNet[5] | 1.86 | 4.62 | 2.32 |
| GwcNet[6] | 1.71 | 3.93 | 2.11 |
| AANet[7] | 1.65 | 3.96 | 2.03 |
| LEAStereo[8] | 1.40 | 2.91 | 1.65 |
| CREStereo[9] | 1.45 | 2.86 | 1.69 |
| ACVNet[19] | 1.37 | 3.07 | 1.65 |
| RAFT-Stereo[10] | 1.58 | 3.05 | 1.82 |
| ILANet | 1.38 | 2.98 | 1.61 |

Table 3 KITTI2015 comparison results (%)
| Method | Out-Noc (%) | Out-All (%) | Avg-Noc (px) | Avg-All (px) |
| --- | --- | --- | --- | --- |
| PSMNet[5] | 1.49 | 1.89 | 0.5 | 0.6 |
| GwcNet[6] | 1.32 | 1.70 | 0.5 | 0.5 |
| AANet[7] | 1.91 | 2.42 | 0.5 | 0.6 |
| LEAStereo[8] | 1.13 | 1.45 | 0.5 | 0.5 |
| CREStereo[9] | 1.14 | 1.46 | 0.4 | 0.5 |
| ACVNet[19] | 1.13 | 1.47 | 0.4 | 0.5 |
| RAFT-Stereo[10] | 1.30 | 1.66 | 0.4 | 0.5 |
| ILANet | 1.10 | 1.46 | 0.4 | 0.5 |

Table 4 KITTI2012 comparison results
| Method | EPE (px) | D1-all (%) | >1 px (%) |
| --- | --- | --- | --- |
| GwcNet | 0.76 | 2.71 | 8.01 |
| PSMNet | 1.09 | 3.89 | 7.88 |
| ACVNet | 0.48 | 1.59 | 7.06 |
| ILANet | 0.44 | 1.53 | 6.89 |

Table 5 Comparison of actual experimental results of different networks
[1] | LI Y L, QING L B, HAN L M, et al. Survey on visual affordance research[J]. Computer Engineering and Applications, 2022, 58(18): 1-15. (in Chinese) |
[2] | CHEN Y, YANG L L, WANG Z P. Literature survey on stereo vision matching algorithms[J]. Journal of Graphics, 2020, 41(5): 702-708. (in Chinese) |
[3] | YIN C Y, ZHI H H, LI H B. Survey of binocular stereo-matching methods based on deep learning[J]. Computer Engineering, 2022, 48(10): 1-12. (in Chinese) |
[4] | MAYER N, ILG E, HÄUSSER P, et al. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 4040-4048. |
[5] | CHANG J R, CHEN Y S. Pyramid stereo matching network[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 5410-5418. |
[6] | GUO X Y, YANG K, YANG W K, et al. Group-wise correlation stereo network[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 3268-3277. |
[7] | XU H F, ZHANG J Y. AANet: adaptive aggregation network for efficient stereo matching[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2020: 1956-1965. |
[8] | CHENG X L, ZHONG Y R, HARANDI M, et al. Hierarchical neural architecture search for deep stereo matching[C]// The 34th International Conference on Neural Information Processing Systems. New York: ACM, 2020: 22158-22169. |
[9] | LI J K, WANG P S, XIONG P F, et al. Practical stereo matching via cascaded recurrent network with adaptive correlation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 16242-16251. |
[10] | LIPSON L, TEED Z, DENG J. RAFT-stereo: multilevel recurrent field transforms for stereo matching[C]// 2021 International Conference on 3D Vision. New York: IEEE Press, 2022: 218-227. |
[11] | LIU B Y, YU H M, LONG Y Q. Local similarity pattern and cost self-reassembling for deep stereo matching networks[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(2): 1647-1655. |
[12] | NEWELL A, YANG K Y, DENG J. Stacked hourglass networks for human pose estimation[M]// Computer Vision - ECCV 2016. Cham: Springer International Publishing, 2016: 483-499. |
[13] | LI T, MA W, XU S B, et al. Task-adaptive end-to-end networks for stereo matching[J]. Journal of Computer Research and Development, 2020, 57(7): 1531-1538. (in Chinese) |
[14] | KOUTINI K, EGHBAL-ZADEH H, DORFER M, et al. The receptive field as a regularizer in deep convolutional neural networks for acoustic scene classification[C]// The 27th European Signal Processing Conference. New York: IEEE Press, 2019: 1-5. |
[15] | TAN M, LE Q V. MixConv: mixed depthwise convolutional kernels[EB/OL]. [2023-01-18]. https://arxiv.org/abs/1907.09595. |
[16] | BULÒ S R, PORZI L, KONTSCHIEDER P. In-place activated BatchNorm for memory-optimized training of DNNs[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 5639-5647. |
[17] | RIDNIK T, LAWEN H, NOY A, et al. TResNet: high performance GPU-dedicated architecture[C]// 2021 IEEE Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2021: 1399-1408. |
[18] | WANG Y N, GU M J, ZHU Y F, et al. Improvement of AD-census algorithm based on stereo vision[J]. Sensors, 2022, 22(18): 6933. |
[19] | XU G W, CHENG J D, GUO P, et al. Attention concatenation volume for accurate and efficient stereo matching[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 12971-12980. |
[20] | KENDALL A, MARTIROSYAN H, DASGUPTA S, et al. End-to-end learning of geometry and context for deep stereo regression[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 66-75. |
[21] | GEIGER A, LENZ P, STILLER C, et al. Vision meets robotics: the KITTI dataset[J]. International Journal of Robotics Research, 2013, 32(11): 1231-1237. |