Abstract


Reinforcement Learning (RL) has shown remarkable progress in simulation environments, yet its application to real-world robotic tasks remains limited due to challenges in exploration and generalization. To address these issues, we introduce PLANRL, a framework that decides when the robot should use classical motion planning and when it should learn a policy. To further improve exploration efficiency, we bootstrap exploration with imitation data. PLANRL dynamically switches between two modes of operation: reaching a waypoint using classical techniques when away from objects, and reinforcement learning for fine-grained manipulation control when about to interact with objects. The PLANRL architecture is composed of ModeNet for mode classification, NavNet for waypoint prediction, and InteractNet for precise manipulation. By combining the strengths of RL and Imitation Learning (IL), PLANRL improves sample efficiency and mitigates distribution shift, ensuring robust task execution. We evaluate our approach across multiple challenging simulation environments and real-world tasks, demonstrating superior adaptability, efficiency, and generalization compared to existing methods. In simulations, PLANRL surpasses baseline methods by 10-15% in training success rate at 30k samples and by 30-40% during evaluation. In real-world scenarios, it achieves a 30-40% higher success rate on simpler tasks than baselines and uniquely succeeds in complex, two-stage manipulation tasks.



Method Overview
Architecture of PLANRL: During training, PLANRL learns to predict waypoints, low-level actions, and the operational mode at each time step. One network (InteractNet) predicts the low-level action a_t, and another network (ModeNet) predicts the mode m_t. A separate network (NavNet) predicts the high-level waypoint w_t. At test time, the system samples m_t and either moves to the predicted waypoint (when m_t = 0) or executes a dense low-level action (when m_t = 1). This architecture allows dynamic switching between motion-planning and interaction modes, enabling robust performance in complex tasks. An example of how the two modes are interleaved during execution is shown on the right.
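The switching logic can be summarized as a minimal sketch of the test-time control loop. All names below (mode_net, nav_net, interact_net, motion_planner) are hypothetical stand-ins for ModeNet, NavNet, InteractNet, and the classical planner, not the released implementation.

```python
# Minimal sketch of PLANRL's test-time control loop; every interface here is a
# hypothetical stand-in, not the actual released code.

def planrl_rollout(env, mode_net, nav_net, interact_net, motion_planner, max_steps=500):
    obs = env.reset()
    for _ in range(max_steps):
        m_t = mode_net.predict(obs)            # 0 = motion-planning mode, 1 = interaction mode
        if m_t == 0:
            w_t = nav_net.predict(obs)         # high-level waypoint
            obs = motion_planner.move_to(w_t)  # classical planner drives to the waypoint
        else:
            a_t = interact_net.predict(obs)    # dense low-level action from the learned policy
            obs, reward, done, info = env.step(a_t)
            if done:
                break
    return obs
```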

ModeNet


ModeNet: Dynamic Mode Classification

  • Decision-Making: Identifies when to switch between motion-planning and interaction modes based on input observations (see the sketch below).
  • Adaptability: Enables the system to adapt its strategy dynamically, ensuring efficient task execution.
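As a concrete illustration, a ModeNet-style binary mode classifier could look like the sketch below. The encoder layers and head sizes are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ModeNetSketch(nn.Module):
    """Illustrative binary mode classifier: image observation -> mode logit.
    Layer sizes are assumptions, not the paper's exact architecture."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 1)  # logit for mode m_t (0 = motion planning, 1 = interaction)

    def forward(self, image):
        return self.head(self.encoder(image))

# Training would use a standard binary cross-entropy loss on mode labels, e.g.:
# loss = nn.functional.binary_cross_entropy_with_logits(model(images), mode_labels)
```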

InteractNet


InteractNet: Precise Manipulation

  • Execution: Executes fine-grained manipulation tasks with precision, guided by learned RL policies (a sketch follows below).
  • Adaptation: Learns from demonstrations and adjusts movements in real time for efficient task completion.
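A minimal sketch of how an InteractNet-style dense-action head and its imitation bootstrap could look is given below; the layer sizes, the 7-DoF action space, and the helper names are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class InteractNetSketch(nn.Module):
    """Illustrative dense-action policy: observation features -> low-level action.
    Sizes and the 7-DoF action space are assumptions for illustration."""

    def __init__(self, feat_dim=64, action_dim=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, features):
        return self.mlp(features)

def bc_loss(policy, features, demo_actions):
    # Behavior-cloning term used to bootstrap exploration from imitation data;
    # the online RL objective would be optimized on top of this warm start.
    return F.mse_loss(policy(features), demo_actions)
```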

Simulation Results


Assembly Environment


BoxClose Environment


CoffeePush Environment


[Result plots for the Assembly, BoxClose, and CoffeePush environments]

Real-World Experiments


PLANRL: Lift Env Training

For this setup, we use only the wrist camera for the BC policy, whereas both the wrist and environment cameras are used for waypoint prediction.
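A hedged sketch of how this observation setup could be expressed as a configuration (the key names are hypothetical, not from the released code):

```python
# Hypothetical configuration mirroring the Lift setup described above:
# the BC policy uses only the wrist camera, while waypoint prediction
# uses both the wrist and environment cameras.
LIFT_OBS_CONFIG = {
    "bc_policy_cameras": ["wrist"],
    "waypoint_cameras": ["wrist", "environment"],
}
```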

Total Training Time: 40 minutes


Data Collection using Teleoperation: Lift [No Randomization]

Zero-Shot Generalization with Fruit

Training progress snapshots:

Steps: 2k     Time: 10 mins
Steps: 4k     Time: 20 mins
Steps: 6k     Time: 30 mins
Steps: 8k     Time: 40 mins

PLANRL: Pick and Place Env Training

This setup uses both the wrist camera and the environment camera for the BC policy and for waypoint prediction. The task has two stages, "pick" and "place", and involves three waypoints, as discussed in the paper.
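A comparable hypothetical configuration for this task (the key names are illustrative assumptions):

```python
# Hypothetical configuration for the two-stage Pick and Place task described
# above: both cameras feed the BC policy and waypoint prediction, and the
# task is decomposed into three waypoints across the "pick" and "place" stages.
PICK_PLACE_CONFIG = {
    "bc_policy_cameras": ["wrist", "environment"],
    "waypoint_cameras": ["wrist", "environment"],
    "stages": ["pick", "place"],
    "num_waypoints": 3,
}
```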

Total Training Time: 3 hours


Data Collection using Teleoperation: Pick and Place

Zero-Shot Generalization with Fruit

Training progress snapshots:

Steps: 2k     Time: 10 mins
Steps: 10k     Time: 50 mins
Steps: 16k     Time: 90 mins
Steps: 26k     Time: 130 mins

Project Contributors