We propose LDSC, a framework that integrates LLM-driven subgoal selection with option reuse to improve sample efficiency, generalization, and multi-task adaptability in Hierarchical Reinforcement Learning. LDSC improves exploration efficiency and learning performance, exceeding baselines by 55.9% in average reward across diverse tasks.
LDSC operates in three stages: LLM-based subgoal generation from a task description, learning and selection of reusable options, and an action-level policy that executes each option. By combining semantic reasoning with a hierarchical structure, the framework improves task decomposition and option reuse across multiple tasks.
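The following is a minimal sketch of how these three stages could fit together in code. It assumes a classic gym-style environment interface, and the names (`Option`, `llm_generate_subgoals`, `select_option`, `run_episode`) are illustrative placeholders rather than the authors' implementation.

```python
# Illustrative sketch of the three-stage LDSC loop; interfaces and names are
# assumptions, not the paper's actual implementation.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Option:
    """A reusable option: an initiation check, an intra-option policy, and a termination check."""
    subgoal: str
    can_initiate: Callable[[object], bool]
    policy: Callable[[object], object]
    terminates: Callable[[object], bool]

def llm_generate_subgoals(task_description: str) -> List[str]:
    # Stage 1: query an LLM with the task description and parse an ordered
    # list of subgoals. Stubbed here with a fixed decomposition.
    return ["reach subgoal 1", "reach subgoal 2", "reach goal"]

def select_option(state, subgoal: str, library: List[Option]) -> Option:
    # Stage 2: reuse a stored option whose subgoal matches and whose
    # initiation set covers the current state; in the full framework a new
    # option would be learned when no stored option applies.
    for option in library:
        if option.subgoal == subgoal and option.can_initiate(state):
            return option
    raise LookupError(f"no reusable option for subgoal: {subgoal}")

def run_episode(env, task_description: str, library: List[Option], max_steps: int = 1000):
    # Stage 3: the selected option's action-level policy controls the agent
    # until its termination condition fires, then the next subgoal is pursued.
    state = env.reset()
    for subgoal in llm_generate_subgoals(task_description):
        option = select_option(state, subgoal, library)
        for _ in range(max_steps):
            state, reward, done, info = env.step(option.policy(state))
            if option.terminates(state) or done:
                break
    return state
```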
We utilize Large Language Models (LLMs) to generate meaningful subgoals and guide hierarchical reinforcement learning through semantic understanding of tasks. The LLM enhances the agent's ability to decompose complex tasks and reuse learned options efficiently across different environments.
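As an illustration of the subgoal-generation stage, the sketch below shows one way a task description could be turned into an ordered list of subgoals. The prompt wording, the `generate_subgoals` helper, and the line-based parsing are assumptions for exposition; the paper's actual prompt is not reproduced here.

```python
# Hypothetical LLM-based subgoal generation; prompt and parsing are assumptions.

from typing import Callable, List

SUBGOAL_PROMPT = (
    "Task: {task}\n"
    "Environment: {layout}\n"
    "Decompose the task into an ordered list of subgoals, one per line, "
    "each describing a reachable intermediate state."
)

def generate_subgoals(task: str, layout: str, llm: Callable[[str], str]) -> List[str]:
    # `llm` is any text-completion callable, e.g. a thin wrapper around an API client.
    response = llm(SUBGOAL_PROMPT.format(task=task, layout=layout))
    # Keep non-empty lines and strip list markers such as "1." or "-".
    subgoals = []
    for line in response.splitlines():
        line = line.strip().lstrip("-*0123456789. ")
        if line:
            subgoals.append(line)
    return subgoals

# Usage with a stubbed LLM response:
if __name__ == "__main__":
    fake_llm = lambda prompt: (
        "1. Reach the top-right corridor\n"
        "2. Pass the bottom-right junction\n"
        "3. Reach the goal"
    )
    print(generate_subgoals("Navigate the maze to the goal", "Point Maze", fake_llm))
```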
LDSC outperforms existing methods on complex tasks, improving success rates by 72.7%, reducing task completion time by 53.1%, and achieving higher average rewards across environments such as Maze, FourRoom, E-Maze, and Tunnel.
| Map | DSC Success Rate ↑ | DSC Time (s) ↓ | DDPG Success Rate ↑ | DDPG Time (s) ↓ | Option-Critic Success Rate ↑ | Option-Critic Time (s) ↓ | LDSC (Ours) Success Rate ↑ | LDSC (Ours) Time (s) ↓ |
|---|---|---|---|---|---|---|---|---|
| Maze | 0% ± 0% | 1063 ± 274 | 0% ± 0% | 1187 ± 116 | 0% ± 0% | 1030 ± 147 | 100% ± 0% | 485 ± 174 |
| FourRoom | 86% ± 8% | 1035 ± 479 | 0% ± 0% | 1331 ± 355 | 0% ± 0% | 1181 ± 500 | 95% ± 2% | 678 ± 208 |
| E-Maze | 0% ± 0% | 1345 ± 145 | 0% ± 0% | 1344 ± 83 | 0% ± 0% | 960 ± 306 | 100% ± 0% | 86.5 ± 12.5 |
| Tunnel | 0% ± 0% | 1368 ± 370 | 0% ± 0% | 1527 ± 527 | 0% ± 0% | 1349 ± 349 | 81.8% ± 3.6% | 906 ± 306 |
Qualitative performance of the robot in the Point Maze environment. The upper row shows the initiation set for each option, illustrating the state-space coverage where the option can be executed. The lower row displays the corresponding policy plots, where orange regions indicate areas where the policy continues execution and green regions signify termination states. The robot follows a structured sequence: first reaching subgoal 1 (top-right), then subgoal 2 (bottom-right), and finally the goal (top-left).