We propose LDSC, a framework that integrates LLM-driven subgoal selection with option reuse to improve sample efficiency, generalization, and multi-task adaptability in Hierarchical Reinforcement Learning. LDSC improves exploration efficiency and learning performance, exceeding baselines by 55.9% in average reward across diverse tasks.
LDSC operates in three stages: LLM-based subgoal generation from a task description, learning and selection of reusable options, and an action-level policy that executes each option. By combining semantic reasoning with a hierarchical structure, the framework improves task decomposition and option reuse across multiple tasks.
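The following is a minimal sketch of how these three stages could fit together in code. It assumes a classic gym-style environment interface, and the names (`Option`, `llm_generate_subgoals`, `select_option`, `run_episode`) are illustrative placeholders rather than the authors' implementation.

```python
# Illustrative sketch of the three-stage LDSC loop; interfaces and names are
# assumptions, not the paper's actual implementation.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Option:
    """A reusable option: an initiation check, an intra-option policy, and a termination check."""
    subgoal: str
    can_initiate: Callable[[object], bool]
    policy: Callable[[object], object]
    terminates: Callable[[object], bool]

def llm_generate_subgoals(task_description: str) -> List[str]:
    # Stage 1: query an LLM with the task description and parse an ordered
    # list of subgoals. Stubbed here with a fixed decomposition.
    return ["reach subgoal 1", "reach subgoal 2", "reach goal"]

def select_option(state, subgoal: str, library: List[Option]) -> Option:
    # Stage 2: reuse a stored option whose subgoal matches and whose
    # initiation set covers the current state; in the full framework a new
    # option would be learned when no stored option applies.
    for option in library:
        if option.subgoal == subgoal and option.can_initiate(state):
            return option
    raise LookupError(f"no reusable option for subgoal: {subgoal}")

def run_episode(env, task_description: str, library: List[Option], max_steps: int = 1000):
    # Stage 3: the selected option's action-level policy controls the agent
    # until its termination condition fires, then the next subgoal is pursued.
    state = env.reset()
    for subgoal in llm_generate_subgoals(task_description):
        option = select_option(state, subgoal, library)
        for _ in range(max_steps):
            state, reward, done, info = env.step(option.policy(state))
            if option.terminates(state) or done:
                break
    return state
```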
We utilize Large Language Models (LLMs) to generate meaningful subgoals and guide hierarchical reinforcement learning through semantic understanding of tasks. The LLM enhances the agent's ability to decompose complex tasks and reuse learned options efficiently across different environments.
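As an illustration of the subgoal-generation stage, the sketch below shows one way a task description could be turned into an ordered list of subgoals. The prompt wording, the `generate_subgoals` helper, and the line-based parsing are assumptions for exposition; the paper's actual prompt is not reproduced here.

```python
# Hypothetical LLM-based subgoal generation; prompt and parsing are assumptions.

from typing import Callable, List

SUBGOAL_PROMPT = (
    "Task: {task}\n"
    "Environment: {layout}\n"
    "Decompose the task into an ordered list of subgoals, one per line, "
    "each describing a reachable intermediate state."
)

def generate_subgoals(task: str, layout: str, llm: Callable[[str], str]) -> List[str]:
    # `llm` is any text-completion callable, e.g. a thin wrapper around an API client.
    response = llm(SUBGOAL_PROMPT.format(task=task, layout=layout))
    # Keep non-empty lines and strip list markers such as "1." or "-".
    subgoals = []
    for line in response.splitlines():
        line = line.strip().lstrip("-*0123456789. ")
        if line:
            subgoals.append(line)
    return subgoals

# Usage with a stubbed LLM response:
if __name__ == "__main__":
    fake_llm = lambda prompt: (
        "1. Reach the top-right corridor\n"
        "2. Pass the bottom-right junction\n"
        "3. Reach the goal"
    )
    print(generate_subgoals("Navigate the maze to the goal", "Point Maze", fake_llm))
```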
LDSC outperforms existing methods on complex tasks, improving success rates by 72.7%, reducing task completion time by 53.1%, and achieving higher average rewards across environments such as Maze, FourRoom, E-Maze, and Tunnel.
| Map | DSC Success Rate ↑ | DSC Time (s) ↓ | DDPG Success Rate ↑ | DDPG Time (s) ↓ | Option-Critic Success Rate ↑ | Option-Critic Time (s) ↓ | LDSC (Ours) Success Rate ↑ | LDSC (Ours) Time (s) ↓ |
|---|---|---|---|---|---|---|---|---|
| Maze | 0% ± 0% | 1063 ± 274 | 0% ± 0% | 1187 ± 116 | 0% ± 0% | 1030 ± 147 | 100% ± 0% | 485 ± 174 |
| FourRoom | 86% ± 8% | 1035 ± 479 | 0% ± 0% | 1331 ± 355 | 0% ± 0% | 1181 ± 500 | 95% ± 2% | 678 ± 208 |
| E-Maze | 0% ± 0% | 1345 ± 145 | 0% ± 0% | 1344 ± 83 | 0% ± 0% | 960 ± 306 | 100% ± 0% | 86.5 ± 12.5 |
| Tunnel | 0% ± 0% | 1368 ± 370 | 0% ± 0% | 1527 ± 527 | 0% ± 0% | 1349 ± 349 | 81.8% ± 3.6% | 906 ± 306 |
Qualitative performance of the robot in the Point Maze environment. The upper row shows the initiation set for each option, illustrating the state-space coverage where the option can be executed. The lower row displays the corresponding policy plots, where orange regions indicate areas where the policy continues execution and green regions signify termination states. The robot follows a structured sequence: first reaching subgoal 1 (top-right), then subgoal 2 (bottom-right), and finally the goal (top-left).