Journal of Graphics, 2024, Vol. 45, Issue (6): 1165-1177. DOI: 10.11996/JG.j.2095-302X.2024061165
Received: 2024-08-09
Accepted: 2024-10-29
Published: 2024-12-31
Online: 2024-12-24
Corresponding author: HUANG Kaiqi (1977-), researcher, Ph.D. His main research interests cover computer vision, pattern recognition, and game theory and decision-making. E-mail: kqhuang@nlpr.ia.ac.cn
First author: XU Pei (1993-), assistant researcher, Ph.D. His main research interests cover reinforcement learning and multi-agent learning. E-mail: pei.xu@ia.ac.cn
Abstract: Deep reinforcement learning, the key technology underpinning breakthroughs such as AlphaGo and ChatGPT, has become a research hotspot at the scientific frontier. In practice, as an important intelligent decision-making technique, it is widely applied to planning and decision tasks such as obstacle avoidance in visual scenes, optimized generation of virtual scenes, robotic arm control, digital design and manufacturing, and industrial design decision-making. However, deep reinforcement learning suffers from low sample efficiency in real applications, which severely limits its effectiveness. To alleviate this problem and address the shortcomings of existing exploration mechanisms, this work combines large language model technology with several mainstream exploration techniques and proposes an efficient, large-model-guided exploration method to improve sample efficiency. By using a large model to guide the exploration behavior of deep reinforcement learning agents, the method achieves significant performance gains on several internationally recognized benchmark environments, demonstrating both the potential of large models for the exploration problem in deep reinforcement learning and a new way to improve sample efficiency in practical applications.
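The abstract describes coupling a large model with mainstream exploration bonuses such as RND. The following is a minimal, illustrative sketch of one way such guidance could enter the intrinsic reward; the tiny linear networks, the `llm_semantic_weight` stand-in, and the multiplicative combination are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard RND setup: a fixed random "target" network and a trainable "predictor"
# (both reduced to single linear maps here for brevity).
W_target = rng.normal(size=(16, 8))
W_predictor = np.zeros((16, 8))

def rnd_bonus(state):
    """Curiosity bonus = prediction error against the fixed random target network."""
    err = state @ W_predictor - state @ W_target
    return float(np.mean(err ** 2))

def llm_semantic_weight(state_caption):
    """Stand-in for an LLM call that scores how task-relevant a textual state
    description is (0..1); the real method would query an actual model."""
    return 1.0 if ("key" in state_caption or "door" in state_caption) else 0.2

def guided_intrinsic_reward(state, state_caption, beta=0.5):
    """Scale the curiosity bonus by the LLM-derived semantic weight (assumed combination)."""
    return (1.0 + beta * llm_semantic_weight(state_caption)) * rnd_bonus(state)

# States whose captions mention task-relevant objects receive a larger exploration bonus.
s = rng.normal(size=16)
print(guided_intrinsic_reward(s, "the agent stands next to a locked door"))
print(guided_intrinsic_reward(s, "an empty corridor"))
```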
徐沛, 黄凯奇. 大模型引导的高效强化学习方法[J]. 图学学报, 2024, 45(6): 1165-1177.
XU Pei, HUANG Kaiqi. An efficient reinforcement learning method based on large language model[J]. Journal of Graphics, 2024, 45(6): 1165-1177.
Fig. 1 Schematics of the interaction paradigm for reinforcement learning agents ((a) The standard paradigm; (b) The interaction paradigm after combining a single-policy exploration method; (c) The interaction paradigm after combining a population-based exploration approach)
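Figure 1(b) depicts the loop in which a single-policy exploration method augments the environment reward before the learner sees it. A schematic of that loop, assuming a Gym-style `env` API and an arbitrary `intrinsic_bonus` callable (both assumptions, not the paper's interface):

```python
def rollout_with_bonus(env, policy, intrinsic_bonus, beta=0.1):
    """One episode of the Fig. 1(b)-style loop: the extrinsic reward is augmented
    with an exploration bonus before the transition is stored for training."""
    obs, done, transitions = env.reset(), False, []
    while not done:
        action = policy(obs)
        next_obs, extrinsic_reward, done, info = env.step(action)
        reward = extrinsic_reward + beta * intrinsic_bonus(next_obs)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return transitions
```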
Fig. 8 Performance of each method in MiniGrid ((a) Easy: MultiRoom-N7-S8; (b) Easy: MultiRoom-N12-S10; (c) Medium: KeyCorridorS6R3; (d) Medium: ObstructedMaze-2Dlh; (e) Hard: ObstructedMaze-1Q; (f) Hard: ObstructedMaze-Full)
Table 1 State visit entropy for each method under the reward-free setting

| MiniGrid task | RND | S-RND | NovelD | S-NovelD |
|---|---|---|---|---|
| MultiRoom-N12-S10 | 2.65 | 3.87 | 4.05 | 4.08 |
| KeyCorridorS6R3 | 2.89 | 3.45 | 3.72 | 3.91 |
| ObstructedMaze-1Q | 0.91 | 1.35 | 2.25 | 2.39 |
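Table 1 reports state visit entropy, i.e., the Shannon entropy of the empirical state-visitation distribution; higher values mean the policy spreads its visits over more states. A minimal sketch of that computation over discretized states (the paper's exact estimator may differ):

```python
import math
from collections import Counter

def state_visit_entropy(visited_states):
    """Shannon entropy (in nats) of the empirical state-visitation distribution."""
    counts = Counter(visited_states)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# An agent that spreads its visits over more distinct grid cells scores higher.
print(state_visit_entropy([(0, 0), (0, 1), (1, 1), (2, 3)]))  # uniform over 4 cells
print(state_visit_entropy([(0, 0)] * 3 + [(0, 1)]))           # concentrated visits
```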
Table 2 Number of explored states (in thousands) for each method under the reward-free setting

| MPE task | MAE | S-MAE |
|---|---|---|
| Push_Box | 130.2±10.7 | 156.2±8.5 (+20%) |
| Pass | 263.7±9.9 | 300.6±10.3 (+14%) |
| Secret_Room | 148.9±3.2 | 171.2±4.5 (+15%) |
| Room | 565.1±7.2 | 593.5±6.4 (+5%) |
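Table 2 counts how many distinct states each method reaches (reported in thousands). Assuming states are discretized before counting, the metric reduces to the size of a visited-state set, as in this sketch:

```python
def count_explored_states(trajectories, discretize=tuple):
    """Number of distinct (discretized) states visited across all trajectories."""
    seen = set()
    for trajectory in trajectories:
        for state in trajectory:
            seen.add(discretize(state))
    return len(seen)

# Two short trajectories over 2-D grid positions visit three distinct cells.
print(count_explored_states([[(0, 0), (0, 1)], [(0, 1), (1, 1)]]))  # -> 3
```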
Table 3 Coverage of the state space for each method under the reward-free setting (%)

| VizDoom task | MAE | S-MAE |
|---|---|---|
| Maze-S20 | 95±1 | 96±0 |
| Maze-S50 | 84±3 | 87±4 |
| Maze-S80 | 71±2 | 80±4 |
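Table 3 reports coverage as the percentage of the maze's cells that the agent visits at least once. A one-line sketch, assuming the number of reachable cells is known:

```python
def coverage_percent(visited_cells, num_reachable_cells):
    """State-space coverage (%): share of reachable maze cells visited at least once."""
    return 100.0 * len(set(visited_cells)) / num_reachable_cells

# Example: 3 of 400 cells visited in a hypothetical 20x20 maze.
print(coverage_percent([(0, 0), (0, 1), (5, 7)], num_reachable_cells=400))  # -> 0.75
```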
Table 4 Impact of different ways of extracting state semantics on performance (%)

| VizDoom task | S-RND (Oracle) | S-RND (LLM) | S-MAE (Oracle) | S-MAE (LLM) |
|---|---|---|---|---|
| Maze-S20 | 96±0 | 95±1 | 96±0 | 94±4 |
| Maze-S50 | 81±1 | 79±3 | 87±4 | 86±8 |
| Maze-S80 | 67±1 | 65±3 | 80±4 | 75±7 |
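Table 4 contrasts ground-truth ("Oracle") state semantics with semantics extracted by a large model. The sketch below shows the general shape of the two variants; the prompt, the `llm_complete` callable, and the parsing are illustrative assumptions rather than the paper's interface:

```python
def oracle_semantics(state):
    """Ground-truth semantic label supplied directly by the simulator."""
    return state["room_type"], state["object_in_view"]

def llm_semantics(state_caption, llm_complete):
    """Extract a comparable label by prompting a language model on a textual
    state caption; `llm_complete` is any text-completion callable."""
    prompt = (
        "Summarize the agent's situation as '<room type>, <salient object>'.\n"
        f"Observation: {state_caption}\nAnswer:"
    )
    answer = llm_complete(prompt)
    room, _, obj = answer.partition(",")
    return room.strip(), obj.strip()

# Example with a toy completion function standing in for the large model.
print(llm_semantics("a red key lies on the floor of a small room",
                    lambda p: "small room, red key"))
```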