Journal of Graphics ›› 2024, Vol. 45 ›› Issue (6): 1165-1177. DOI: 10.11996/JG.j.2095-302X.2024061165
• Special Topic on “Large Models and Graphics Technology and Applications” •
Received: 2024-08-09
Accepted: 2024-10-29
Online: 2024-12-31
Published: 2024-12-24
Contact: HUANG Kaiqi
About author: XU Pei (1993-), first author, assistant researcher, Ph.D. His main research interests cover reinforcement learning and multi-agent learning. E-mail: pei.xu@ia.ac.cn
XU Pei, HUANG Kaiqi. An efficient reinforcement learning method based on large language model[J]. Journal of Graphics, 2024, 45(6): 1165-1177.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2024061165
Fig. 1 Schematics of the interaction paradigm for reinforcement learning agents ((a) The standard paradigm; (b) The interaction paradigm after combining a single-policy exploration method; (c) The interaction paradigm after combining a population-based exploration approach)
Fig. 8 Performance of each method in MiniGrid ((a) Easy: MultiRoom-N7-S8; (b) Easy: MultiRoom-N12-S10; (c) Medium: KeyCorridorS6R3; (d) Medium: ObstructedMaze-2Dlh; (e) Hard: ObstructedMaze-1Q; (f) Hard: ObstructedMaze-Full)
Table 1 State visit entropy for each method under the reward-free setting

| MiniGrid task | RND | S-RND | NovelD | S-NovelD |
|---|---|---|---|---|
| MultiRoom-N12-S10 | 2.65 | 3.87 | 4.05 | 4.08 |
| KeyCorridorS6R3 | 2.89 | 3.45 | 3.72 | 3.91 |
| ObstructedMaze-1Q | 0.91 | 1.35 | 2.25 | 2.39 |
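The state visit entropy reported in Table 1 is, under the usual convention, the Shannon entropy of the empirical state-visitation distribution collected during reward-free exploration. The page does not reproduce how it is computed, so the snippet below is only a minimal illustrative sketch (not the authors' code); the `visit_counts` input and the natural-log base are assumptions.

```python
import numpy as np

def state_visit_entropy(visit_counts):
    """Shannon entropy of the empirical state-visitation distribution.

    visit_counts: iterable of per-state visit counts gathered during
    reward-free exploration (assumed representation; the paper's exact
    logging is not shown on this page).
    """
    counts = np.asarray(list(visit_counts), dtype=np.float64)
    counts = counts[counts > 0]            # unvisited states contribute nothing
    probs = counts / counts.sum()          # empirical distribution over states
    return float(-(probs * np.log(probs)).sum())  # natural-log entropy

# A more uniform visitation pattern yields higher entropy.
print(state_visit_entropy([10, 10, 10, 10]))  # ~1.386
print(state_visit_entropy([37, 1, 1, 1]))     # ~0.349
```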
Table 2 Number of explored states for each method under the reward-free setting (in thousands)

| MPE task | MAE | S-MAE |
|---|---|---|
| Push_Box | 130.2±10.7 | 156.2±8.5 (+20%) |
| Pass | 263.7±9.9 | 300.6±10.3 (+14%) |
| Secret_Room | 148.9±3.2 | 171.2±4.5 (+15%) |
| Room | 565.1±7.2 | 593.5±6.4 (+5%) |
Table 3 Coverage of the state space for each method under the reward-free setting (%)

| VizDoom task | MAE | S-MAE |
|---|---|---|
| Maze-S20 | 95±1 | 96±0 |
| Maze-S50 | 84±3 | 87±4 |
| Maze-S80 | 71±2 | 80±4 |
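The explored-state counts in Table 2 and the coverage percentages in Table 3 can both be read as set-based statistics over the states encountered during exploration. The sketch below illustrates that reading only; the discretization of states into hashable keys is an assumption, not the authors' implementation.

```python
def exploration_stats(visited_states, num_reachable_states):
    """Set-based exploration statistics.

    visited_states: iterable of hashable state keys encountered during
    exploration (e.g. discretized agent positions; assumed representation).
    num_reachable_states: size of the reachable state space, if known.
    """
    unique_states = set(visited_states)
    explored_k = len(unique_states) / 1000.0                      # explored states, in thousands
    coverage_pct = 100.0 * len(unique_states) / num_reachable_states
    return explored_k, coverage_pct

# Example with toy grid positions.
trajectory = [(0, 0), (0, 1), (1, 1), (0, 1), (2, 1)]
print(exploration_stats(trajectory, num_reachable_states=400))  # (0.004, 1.0)
```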
Table 4 Impact of different ways of extracting state semantics on performance (%)

| VizDoom task | S-RND (Oracle) | S-RND (LLM) | S-MAE (Oracle) | S-MAE (LLM) |
|---|---|---|---|---|
| Maze-S20 | 96±0 | 95±1 | 96±0 | 94±4 |
| Maze-S50 | 81±1 | 79±3 | 87±4 | 86±8 |
| Maze-S80 | 67±1 | 65±3 | 80±4 | 75±7 |
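Table 4 contrasts state semantics supplied by an oracle with semantics extracted by a large language model. The extraction mechanism is not spelled out on this page, so the sketch below is purely illustrative: it assumes states are rendered as short text descriptions and that a caller-supplied `query_llm` function (hypothetical) maps a prompt to a completion, from which a discrete semantic label is parsed; it is not the paper's method.

```python
from typing import Callable

def extract_state_semantics(state_description: str,
                            query_llm: Callable[[str], str],
                            labels=("corridor", "room", "door", "key", "goal")) -> str:
    """Illustrative LLM-based semantic labelling of a state (not the authors' code).

    state_description: textual rendering of the observation (assumed to exist).
    query_llm: any function mapping a prompt string to a completion string,
               e.g. a wrapper around a locally hosted chat model.
    labels: hypothetical label set used only for this example.
    """
    prompt = (
        "Classify the following game state into exactly one of these labels: "
        + ", ".join(labels) + ".\n"
        "State: " + state_description + "\nLabel:"
    )
    answer = query_llm(prompt).strip().lower()
    # Fall back to the first label if the completion is not a valid label.
    return answer if answer in labels else labels[0]
```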