Journal of Graphics, 2024, Vol. 45, Issue (6): 1165-1177. DOI: 10.11996/JG.j.2095-302X.2024061165
Received: 2024-08-09
Accepted: 2024-10-29
Published: 2024-12-31
Online: 2024-12-24
Corresponding author: HUANG Kaiqi (1977-), researcher, Ph.D. His main research interests cover computer vision, pattern recognition, and game theory and decision-making. E-mail: kqhuang@nlpr.ia.ac.cn
First author: XU Pei (1993-), assistant researcher, Ph.D. His main research interests cover reinforcement learning and multi-agent learning. E-mail: pei.xu@ia.ac.cn
Abstract: Deep reinforcement learning, the key technology underpinning breakthroughs such as AlphaGo and ChatGPT, has become a research hotspot at the scientific frontier. In practice, as an important intelligent decision-making technique, it is widely applied to planning and decision tasks such as obstacle avoidance in visual scenes, optimized generation of virtual scenes, robotic arm control, digital design and manufacturing, and industrial design decision-making. However, deep reinforcement learning suffers from low sample efficiency in real applications, which severely limits its effectiveness. To alleviate this problem and address the shortcomings of existing exploration mechanisms, this work combines large language model technology with several mainstream exploration techniques and proposes an efficient, large-model-guided exploration method to improve sample efficiency. By using a large model to guide the exploration behavior of deep reinforcement learning agents, the method achieves significant performance gains on several internationally recognized benchmark environments, demonstrating both the potential of large models for the exploration problem in deep reinforcement learning and a new way to improve sample efficiency in practical applications.
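The abstract describes coupling a large model with mainstream exploration bonuses such as RND. The following is a minimal, illustrative sketch of one way such guidance could enter the intrinsic reward; the tiny linear networks, the `llm_semantic_weight` stand-in, and the multiplicative combination are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard RND setup: a fixed random "target" network and a trainable "predictor"
# (both reduced to single linear maps here for brevity).
W_target = rng.normal(size=(16, 8))
W_predictor = np.zeros((16, 8))

def rnd_bonus(state):
    """Curiosity bonus = prediction error against the fixed random target network."""
    err = state @ W_predictor - state @ W_target
    return float(np.mean(err ** 2))

def llm_semantic_weight(state_caption):
    """Stand-in for an LLM call that scores how task-relevant a textual state
    description is (0..1); the real method would query an actual model."""
    return 1.0 if ("key" in state_caption or "door" in state_caption) else 0.2

def guided_intrinsic_reward(state, state_caption, beta=0.5):
    """Scale the curiosity bonus by the LLM-derived semantic weight (assumed combination)."""
    return (1.0 + beta * llm_semantic_weight(state_caption)) * rnd_bonus(state)

# States whose captions mention task-relevant objects receive a larger exploration bonus.
s = rng.normal(size=16)
print(guided_intrinsic_reward(s, "the agent stands next to a locked door"))
print(guided_intrinsic_reward(s, "an empty corridor"))
```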
徐沛, 黄凯奇. 大模型引导的高效强化学习方法[J]. 图学学报, 2024, 45(6): 1165-1177.
XU Pei, HUANG Kaiqi. An efficient reinforcement learning method based on large language model[J]. Journal of Graphics, 2024, 45(6): 1165-1177.
Fig. 1 Schematics of the interaction paradigm for reinforcement learning agents ((a) The standard paradigm; (b) The interaction paradigm after combining a single-policy exploration method; (c) The interaction paradigm after combining a population-based exploration approach)
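Figure 1(b) depicts the loop in which a single-policy exploration method augments the environment reward before the learner sees it. A schematic of that loop, assuming a Gym-style `env` API and an arbitrary `intrinsic_bonus` callable (both assumptions, not the paper's interface):

```python
def rollout_with_bonus(env, policy, intrinsic_bonus, beta=0.1):
    """One episode of the Fig. 1(b)-style loop: the extrinsic reward is augmented
    with an exploration bonus before the transition is stored for training."""
    obs, done, transitions = env.reset(), False, []
    while not done:
        action = policy(obs)
        next_obs, extrinsic_reward, done, info = env.step(action)
        reward = extrinsic_reward + beta * intrinsic_bonus(next_obs)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return transitions
```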
Fig. 8 Performance of each method in MiniGrid ((a) Easy: MultiRoom-N7-S8; (b) Easy: MultiRoom-N12-S10; (c) Medium: KeyCorridorS6R3; (d) Medium: ObstructedMaze-2Dlh; (e) Hard: ObstructedMaze-1Q; (f) Hard: ObstructedMaze-Full)
Table 1 State visit entropy for each method under the reward-free setting

| MiniGrid task | RND | S-RND | NovelD | S-NovelD |
|---|---|---|---|---|
| MultiRoom-N12-S10 | 2.65 | 3.87 | 4.05 | 4.08 |
| KeyCorridorS6R3 | 2.89 | 3.45 | 3.72 | 3.91 |
| ObstructedMaze-1Q | 0.91 | 1.35 | 2.25 | 2.39 |
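Table 1 reports state visit entropy, i.e., the Shannon entropy of the empirical state-visitation distribution; higher values mean the policy spreads its visits over more states. A minimal sketch of that computation over discretized states (the paper's exact estimator may differ):

```python
import math
from collections import Counter

def state_visit_entropy(visited_states):
    """Shannon entropy (in nats) of the empirical state-visitation distribution."""
    counts = Counter(visited_states)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# An agent that spreads its visits over more distinct grid cells scores higher.
print(state_visit_entropy([(0, 0), (0, 1), (1, 1), (2, 3)]))  # uniform over 4 cells
print(state_visit_entropy([(0, 0)] * 3 + [(0, 1)]))           # concentrated visits
```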
Table 2 Number of explored states (in thousands) for each method under the reward-free setting

| MPE task | MAE | S-MAE |
|---|---|---|
| Push_Box | 130.2±10.7 | 156.2±8.5 (+20%) |
| Pass | 263.7±9.9 | 300.6±10.3 (+14%) |
| Secret_Room | 148.9±3.2 | 171.2±4.5 (+15%) |
| Room | 565.1±7.2 | 593.5±6.4 (+5%) |
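Table 2 counts how many distinct states each method reaches (reported in thousands). Assuming states are discretized before counting, the metric reduces to the size of a visited-state set, as in this sketch:

```python
def count_explored_states(trajectories, discretize=tuple):
    """Number of distinct (discretized) states visited across all trajectories."""
    seen = set()
    for trajectory in trajectories:
        for state in trajectory:
            seen.add(discretize(state))
    return len(seen)

# Two short trajectories over 2-D grid positions visit three distinct cells.
print(count_explored_states([[(0, 0), (0, 1)], [(0, 1), (1, 1)]]))  # -> 3
```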
Table 3 Coverage of the state space for each method under the reward-free setting (%)

| VizDoom task | MAE | S-MAE |
|---|---|---|
| Maze-S20 | 95±1 | 96±0 |
| Maze-S50 | 84±3 | 87±4 |
| Maze-S80 | 71±2 | 80±4 |
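Table 3 reports coverage as the percentage of the maze's cells that the agent visits at least once. A one-line sketch, assuming the number of reachable cells is known:

```python
def coverage_percent(visited_cells, num_reachable_cells):
    """State-space coverage (%): share of reachable maze cells visited at least once."""
    return 100.0 * len(set(visited_cells)) / num_reachable_cells

# Example: 3 of 400 cells visited in a hypothetical 20x20 maze.
print(coverage_percent([(0, 0), (0, 1), (5, 7)], num_reachable_cells=400))  # -> 0.75
```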
Table 4 Impact of different ways of extracting state semantics on performance (%)

| VizDoom task | S-RND (Oracle) | S-RND (LLM) | S-MAE (Oracle) | S-MAE (LLM) |
|---|---|---|---|---|
| Maze-S20 | 96±0 | 95±1 | 96±0 | 94±4 |
| Maze-S50 | 81±1 | 79±3 | 87±4 | 86±8 |
| Maze-S80 | 67±1 | 65±3 | 80±4 | 75±7 |
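Table 4 contrasts ground-truth ("Oracle") state semantics with semantics extracted by a large model. The sketch below shows the general shape of the two variants; the prompt, the `llm_complete` callable, and the parsing are illustrative assumptions rather than the paper's interface:

```python
def oracle_semantics(state):
    """Ground-truth semantic label supplied directly by the simulator."""
    return state["room_type"], state["object_in_view"]

def llm_semantics(state_caption, llm_complete):
    """Extract a comparable label by prompting a language model on a textual
    state caption; `llm_complete` is any text-completion callable."""
    prompt = (
        "Summarize the agent's situation as '<room type>, <salient object>'.\n"
        f"Observation: {state_caption}\nAnswer:"
    )
    answer = llm_complete(prompt)
    room, _, obj = answer.partition(",")
    return room.strip(), obj.strip()

# Example with a toy completion function standing in for the large model.
print(llm_semantics("a red key lies on the floor of a small room",
                    lambda p: "small room, red key"))
```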