LDSC: Option Discovery Using LLM-guided Semantic Hierarchical Reinforcement Learning

Chak Lam Shek1, Pratap Tokekar1

1. University of Maryland, College Park

Abstract

We propose LDSC, a framework that integrates LLM-driven subgoal selection and option reuse to enhance sample efficiency, generalization, and multi-task adaptability in hierarchical reinforcement learning. LDSC improves exploration efficiency and learning performance, outperforming baselines by 55.9% in average reward across diverse tasks.

LDSC Framework

Method

LDSC operates in three stages: (1) LLM-based subgoal generation from task descriptions, (2) learning and selection of reusable options, and (3) an action-level policy that executes each option. By combining semantic reasoning with a hierarchical structure, the framework improves task decomposition and enables options to be reused across multiple tasks.

[Figure: Overview of the LDSC framework]
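To make the control flow concrete, the sketch below shows how the three stages could fit together. It is a minimal illustration under our own assumptions: the `Option` dataclass, the `llm_propose_subgoals` stub, and the gym-style `env` interface are placeholders, not the authors' implementation.

```python
# Minimal sketch of the three-stage loop described above (illustrative only).

from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Option:
    """A reusable, temporally extended action aimed at one subgoal."""
    subgoal: np.ndarray                          # target proposed by the LLM
    policy: Callable[[np.ndarray], np.ndarray]   # action-level policy a = pi(s)
    initiation: Callable[[np.ndarray], bool]     # states where the option may start
    termination: Callable[[np.ndarray], bool]    # states where the option stops


def llm_propose_subgoals(task_description: str) -> List[np.ndarray]:
    """Stage 1: query an LLM for semantically meaningful subgoals (stubbed here)."""
    # In practice this would parse LLM output into coordinates or predicates.
    return [np.array([8.0, 8.0]), np.array([8.0, 0.0]), np.array([0.0, 8.0])]


def run_episode(env, options: List[Option], max_steps: int = 1000) -> float:
    """Stages 2-3: pick an applicable option, then roll out its low-level policy."""
    state, total_reward, steps = env.reset(), 0.0, 0
    while steps < max_steps:
        # Stage 2: select among options whose initiation set contains the state.
        candidates = [o for o in options if o.initiation(state)]
        if not candidates:
            break
        option = candidates[0]  # e.g. highest-value or closest-subgoal option

        # Stage 3: execute the option's action-level policy until it terminates.
        while steps < max_steps and not option.termination(state):
            action = option.policy(state)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            steps += 1
            if done:
                return total_reward
    return total_reward


# Stage 1 output seeds the option set, e.g. one option per LLM subgoal:
# subgoals = llm_propose_subgoals("navigate the point maze to the top-left goal")
# options  = [make_option(g) for g in subgoals]   # make_option left abstract here
```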

LLM Integration

We utilize Large Language Models (LLMs) to generate meaningful subgoals and guide hierarchical reinforcement learning through semantic understanding of tasks. The LLM enhances the agent's ability to decompose complex tasks and reuse learned options efficiently across different environments.

[Figure: LLM integration in the LDSC framework]
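The sketch below illustrates one way a task description could be turned into subgoals via an LLM prompt. The prompt wording, the JSON schema, and the `query_llm` stub are assumptions made for illustration; the paper's actual prompting pipeline may differ.

```python
# Hedged sketch: prompting an LLM for ordered subgoals and parsing the reply.

import json
from typing import List

PROMPT_TEMPLATE = """You are planning for a maze-navigation robot.
Task: {task}
Known landmarks: {landmarks}
Return a JSON list of subgoals, each as {{"name": str, "xy": [float, float]}},
ordered so that reaching them in sequence completes the task."""


def query_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call; returns a canned reply here."""
    return '[{"name": "doorway", "xy": [8.0, 8.0]}, {"name": "goal", "xy": [0.0, 8.0]}]'


def generate_subgoals(task: str, landmarks: dict) -> List[dict]:
    prompt = PROMPT_TEMPLATE.format(task=task, landmarks=json.dumps(landmarks))
    reply = query_llm(prompt)
    subgoals = json.loads(reply)   # expects the JSON list described in the prompt
    # Each subgoal can then seed one reusable option (policy + initiation/termination).
    return subgoals


if __name__ == "__main__":
    print(generate_subgoals("reach the top-left goal",
                            {"doorway": [8.0, 8.0], "goal": [0.0, 8.0]}))
```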

Results

LDSC outperforms existing methods on complex tasks, improving success rates by 72.7%, reducing task-completion time by 53.1%, and achieving higher average rewards across environments such as Maze, FourRoom, E-Maze, and Tunnel.

Quantitative Results

[Figure: Learning curves for the Maze, FourRoom, E-Maze, and Tunnel environments, with the corresponding error plots for each environment]
Success Rate ↑ (higher is better), Time (s) ↓ (lower is better); values are mean ± std.

| Map | DSC Success ↑ | DSC Time (s) ↓ | DDPG Success ↑ | DDPG Time (s) ↓ | Option-Critic Success ↑ | Option-Critic Time (s) ↓ | LDSC (Ours) Success ↑ | LDSC (Ours) Time (s) ↓ |
|---|---|---|---|---|---|---|---|---|
| Maze | 0% ± 0% | 1063 ± 274 | 0% ± 0% | 1187 ± 116 | 0% ± 0% | 1030 ± 147 | 100% ± 0% | 485 ± 174 |
| FourRoom | 86% ± 8% | 1035 ± 479 | 0% ± 0% | 1331 ± 355 | 0% ± 0% | 1181 ± 500 | 95% ± 2% | 678 ± 208 |
| E-Maze | 0% ± 0% | 1345 ± 145 | 0% ± 0% | 1344 ± 83 | 0% ± 0% | 960 ± 306 | 100% ± 0% | 86.5 ± 12.5 |
| Tunnel | 0% ± 0% | 1368 ± 370 | 0% ± 0% | 1527 ± 527 | 0% ± 0% | 1349 ± 349 | 81.8% ± 3.6% | 906 ± 306 |

Qualitative Results

Qualitative performance of the robot in the Point Maze environment. The upper row shows the initial set of each option, illustrating the region of the state space in which the option can be executed. The lower row shows the corresponding policy plots, where orange regions indicate areas in which the policy continues execution and green regions mark termination states. The robot follows a structured sequence: it first reaches subgoal 1 (top-right), then subgoal 2 (bottom-right), and finally the goal (top-left).

[Figure: Initial sets (Initial Set 1-7, upper row) and policy plots (Policy Plot 1-7, lower row) for the seven learned options]
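The initial-set regions in the upper row can be thought of as learned classifiers over states. The sketch below shows one way such a classifier might be fit from states that previously led to successful option execution; the use of a one-class SVM is an assumption borrowed from DSC-style skill chaining, not a statement about LDSC's exact implementation.

```python
# Illustrative sketch: representing an option's initial set as a one-class SVM.

import numpy as np
from sklearn.svm import OneClassSVM


def fit_initiation_set(successful_start_states: np.ndarray) -> OneClassSVM:
    """Fit a classifier that is positive on states from which the option tends to succeed."""
    clf = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
    clf.fit(successful_start_states)
    return clf


def in_initiation_set(clf: OneClassSVM, state: np.ndarray) -> bool:
    """True if the option may be invoked from this state."""
    return clf.predict(state.reshape(1, -1))[0] == 1


# Example: (x, y) states from which the robot previously reached subgoal 1.
starts = np.random.uniform(low=[6.0, 6.0], high=[10.0, 10.0], size=(200, 2))
clf = fit_initiation_set(starts)
print(in_initiation_set(clf, np.array([8.0, 8.0])))   # likely True (inside the region)
print(in_initiation_set(clf, np.array([0.0, 0.0])))   # likely False (outside the region)
```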

Trajectory Example

[Figure: Example trajectory]