
Journal of Graphics ›› 2025, Vol. 46 ›› Issue (3): 697-708. DOI: 10.11996/JG.j.2095-302X.2025030697

• Digital Design and Manufacturing •

Transport-and-packing with buffer via deep reinforcement learning

LEI Yulin, LIU Ligang

  1. School of Mathematical Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China
  • Received: 2024-10-19 Accepted: 2025-02-19 Published: 2025-06-30 Online: 2025-06-13
  • Corresponding author: LIU Ligang (1975-), professor, Ph.D. His main research interests include computer graphics and CAD/CAE. E-mail: lgliu@ustc.edu.cn
  • First author: LEI Yulin (2000-), master's student. His main research interests include computer graphics and deep learning. E-mail: yllei@mail.ustc.edu.cn
  • Supported by:
    National Natural Science Foundation of China (62025207)


Abstract:

To address the limited container space utilization imposed by initial object-stacking constraints in physical scenarios, a neural optimization model based on a deep reinforcement learning framework was proposed for transport-and-packing with a buffer, incorporating a buffer transfer mechanism to raise container packing utilization. First, the state encoder dynamically encoded the priority information extracted from a priority graph together with buffer information, effectively handling the stacking relationships among objects and exploiting the transfer capacity of the buffer zone. Next, the sequence decoder perceived the current container state and employed an attention mechanism over the encoded feature vectors to compute selection probabilities for candidate rotation-state sequences, adaptively selecting a sequence for either transfer or packing. The target decoder then took the geometric and buffer information of the selected state as input, fused the information accumulated by the sequence decoder to construct a conditional embedding vector, and performed attention pooling over the encoded feature vectors to decide efficiently whether to buffer or pack each object. Finally, the REINFORCE algorithm with a baseline was used to train the network, yielding an optimized policy for bufferable object packing. Experimental results on 2D and 3D RAND datasets showed an improvement of roughly 4% in container packing utilization over the state-of-the-art TAP-Net model, and the proposed model clearly outperformed heuristic methods designed for this newly defined problem. Furthermore, a model trained on a fixed number of objects generalized effectively to packing instances with larger numbers of objects.
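The training scheme described above, REINFORCE with a baseline over attention-derived selection probabilities, can be sketched in miniature. This is an illustrative simplification, not the paper's implementation: in the actual model the logits come from the attention decoders over encoded object features, whereas here `masked_softmax` and `reinforce_step` are hypothetical names operating on a single raw logit vector for one decision step.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Selection probabilities over candidate rotation states;
    states that violate stacking constraints are masked out."""
    s = np.where(mask, scores, -np.inf)
    e = np.exp(s - s.max())
    return e / e.sum()

def reinforce_step(scores, mask, reward, baseline, lr=0.1):
    """One REINFORCE update with a baseline on the logits of a single
    selection step. For a softmax policy, the gradient of log pi(a)
    with respect to the logits is one_hot(a) - probs."""
    probs = masked_softmax(scores, mask)
    action = np.random.choice(len(probs), p=probs)
    advantage = reward - baseline       # the baseline reduces gradient variance
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    return action, scores + lr * advantage * grad_log_pi
```

With a positive advantage the update raises the probability of the sampled action, which is the core mechanism by which the packing policy is reinforced toward higher container utilization.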

Key words: bin packing problem, deep reinforcement learning, neural optimization, combinatorial optimization, attention mechanism
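The priority information the state encoder consumes reflects which objects rest on which in the initial pile: an object can only be transported once nothing sits on top of it. A minimal sketch of that bookkeeping, assuming for illustration a pair-list representation of the stacking relation and hypothetical function names:

```python
from collections import defaultdict

def movable_objects(remaining, stacking_pairs):
    """Objects in `remaining` with nothing resting on them can be
    picked up next. stacking_pairs holds (above, below) pairs,
    meaning `above` rests directly on `below`."""
    load = defaultdict(int)   # number of objects sitting directly on each object
    for above, below in stacking_pairs:
        load[below] += 1
    return [o for o in sorted(remaining) if load[o] == 0]

def pick_up(obj, remaining, stacking_pairs):
    """Remove an object from the scene, unblocking whatever it rested on."""
    return remaining - {obj}, [(a, b) for a, b in stacking_pairs if a != obj]
```

Picking up the top object of a stack unblocks the one beneath it, which is exactly the precedence structure the priority graph encodes and the buffer mechanism works around.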

CLC number: