Journal of Graphics ›› 2024, Vol. 45 ›› Issue (6): 1165-1177. DOI: 10.11996/JG.j.2095-302X.2024061165
• Special Topic on “Large Models and Graphics Technology and Applications” •
Received: 2024-08-09
Accepted: 2024-10-29
Online: 2024-12-31
Published: 2024-12-24
Contact: HUANG Kaiqi
About author: XU Pei (1993-), first author, assistant researcher, Ph.D. His main research interests cover reinforcement learning and multi-agent learning. E-mail: pei.xu@ia.ac.cn
XU Pei, HUANG Kaiqi. An efficient reinforcement learning method based on large language model[J]. Journal of Graphics, 2024, 45(6): 1165-1177.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2024061165
Fig. 1 Schematics of the interaction paradigm for reinforcement learning agents ((a) The standard paradigm; (b) The interaction paradigm after combining a single-policy exploration method; (c) The interaction paradigm after combining a population-based exploration approach)
Fig. 8 Performance of each method in MiniGrid ((a) Easy: MultiRoom-N7-S8; (b) Easy: MultiRoom-N12-S10; (c) Medium: KeyCorridorS6R3; (d) Medium: ObstructedMaze-2Dlh; (e) Hard: ObstructedMaze-1Q; (f) Hard: ObstructedMaze-Full)
Table 1 State visit entropy for each method under the reward-free setting

| MiniGrid task | RND | S-RND | NovelD | S-NovelD |
|---|---|---|---|---|
| MultiRoom-N12-S10 | 2.65 | 3.87 | 4.05 | 4.08 |
| KeyCorridorS6R3 | 2.89 | 3.45 | 3.72 | 3.91 |
| ObstructedMaze-1Q | 0.91 | 1.35 | 2.25 | 2.39 |
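The state visit entropy reported in Table 1 is, under the usual convention, the Shannon entropy of the empirical state-visitation distribution collected during reward-free exploration. The page does not reproduce how it is computed, so the snippet below is only a minimal illustrative sketch (not the authors' code); the `visit_counts` input and the natural-log base are assumptions.

```python
import numpy as np

def state_visit_entropy(visit_counts):
    """Shannon entropy of the empirical state-visitation distribution.

    visit_counts: iterable of per-state visit counts gathered during
    reward-free exploration (assumed representation; the paper's exact
    logging is not shown on this page).
    """
    counts = np.asarray(list(visit_counts), dtype=np.float64)
    counts = counts[counts > 0]            # unvisited states contribute nothing
    probs = counts / counts.sum()          # empirical distribution over states
    return float(-(probs * np.log(probs)).sum())  # natural-log entropy

# A more uniform visitation pattern yields higher entropy.
print(state_visit_entropy([10, 10, 10, 10]))  # ~1.386
print(state_visit_entropy([37, 1, 1, 1]))     # ~0.349
```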
Table 2 Number of explored states for each method under the reward-free setting (in thousands)

| MPE task | MAE | S-MAE |
|---|---|---|
| Push_Box | 130.2±10.7 | 156.2±8.5 (+20%) |
| Pass | 263.7±9.9 | 300.6±10.3 (+14%) |
| Secret_Room | 148.9±3.2 | 171.2±4.5 (+15%) |
| Room | 565.1±7.2 | 593.5±6.4 (+5%) |
Table 3 Coverage of the state space for each method under the reward-free setting (%)

| VizDoom task | MAE | S-MAE |
|---|---|---|
| Maze-S20 | 95±1 | 96±0 |
| Maze-S50 | 84±3 | 87±4 |
| Maze-S80 | 71±2 | 80±4 |
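The explored-state counts in Table 2 and the coverage percentages in Table 3 can both be read as set-based statistics over the states encountered during exploration. The sketch below illustrates that reading only; the discretization of states into hashable keys is an assumption, not the authors' implementation.

```python
def exploration_stats(visited_states, num_reachable_states):
    """Set-based exploration statistics.

    visited_states: iterable of hashable state keys encountered during
    exploration (e.g. discretized agent positions; assumed representation).
    num_reachable_states: size of the reachable state space, if known.
    """
    unique_states = set(visited_states)
    explored_k = len(unique_states) / 1000.0                      # explored states, in thousands
    coverage_pct = 100.0 * len(unique_states) / num_reachable_states
    return explored_k, coverage_pct

# Example with toy grid positions.
trajectory = [(0, 0), (0, 1), (1, 1), (0, 1), (2, 1)]
print(exploration_stats(trajectory, num_reachable_states=400))  # (0.004, 1.0)
```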
Table 4 Impact of different ways of extracting state semantics on performance (%)

| VizDoom task | S-RND (Oracle) | S-RND (LLM) | S-MAE (Oracle) | S-MAE (LLM) |
|---|---|---|---|---|
| Maze-S20 | 96±0 | 95±1 | 96±0 | 94±4 |
| Maze-S50 | 81±1 | 79±3 | 87±4 | 86±8 |
| Maze-S80 | 67±1 | 65±3 | 80±4 | 75±7 |
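Table 4 contrasts state semantics supplied by an oracle with semantics extracted by a large language model. The extraction mechanism is not spelled out on this page, so the sketch below is purely illustrative: it assumes states are rendered as short text descriptions and that a caller-supplied `query_llm` function (hypothetical) maps a prompt to a completion, from which a discrete semantic label is parsed; it is not the paper's method.

```python
from typing import Callable

def extract_state_semantics(state_description: str,
                            query_llm: Callable[[str], str],
                            labels=("corridor", "room", "door", "key", "goal")) -> str:
    """Illustrative LLM-based semantic labelling of a state (not the authors' code).

    state_description: textual rendering of the observation (assumed to exist).
    query_llm: any function mapping a prompt string to a completion string,
               e.g. a wrapper around a locally hosted chat model.
    labels: hypothetical label set used only for this example.
    """
    prompt = (
        "Classify the following game state into exactly one of these labels: "
        + ", ".join(labels) + ".\n"
        "State: " + state_description + "\nLabel:"
    )
    answer = query_llm(prompt).strip().lower()
    # Fall back to the first label if the completion is not a valid label.
    return answer if answer in labels else labels[0]
```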