Inferring Intentions for Active Assistance

How VLMs Can Turn Robots into Proactive Helpers

Large language models (LLMs) have taken the world by storm. They are now so common that they’re integrated into personal messaging apps. Some have evolved into vision-language models (VLMs), capable of understanding images and answering questions about them. This revolutionary technology has also made waves in Embodied AI (EAI) research.

As defined by Julian Eßer, Embodied AI integrates AI into physical systems, such as robots, enabling them to interact meaningfully with their surroundings.

With VLMs, these physical systems can see the world through a camera, understand it, and interact with humans using natural language.

How VLMs Enhance Embodied AI

Given these capabilities, VLMs allow humans to instruct robots naturally (e.g., "bring me some sugar"). The VLM/LLM translates these requests into structured commands for the robot. But if VLMs encode so much knowledge about how the everyday world works, can they anticipate our needs instead of waiting for explicit instructions?
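To make the translation step concrete, here is a minimal sketch of how a request could be turned into a structured command. The `query_vlm` helper and the JSON schema are placeholders I'm assuming for illustration, not the actual pipeline we used; in practice you would swap in your VLM/LLM client of choice.

```python
import json

# Stand-in for whatever VLM/LLM client is available; it returns the model's
# text reply. Here it echoes a canned answer so the sketch runs end to end.
def query_vlm(prompt: str) -> str:
    return '{"action": "fetch", "object": "sugar", "destination": "dining room"}'

SYSTEM_PROMPT = (
    "You translate household requests into robot commands. "
    'Reply with JSON only, e.g. {"action": "fetch", "object": "<name>", '
    '"destination": "<room>"}.'
)

def request_to_command(user_request: str) -> dict:
    """Turn a natural-language request into a structured command for the robot."""
    reply = query_vlm(f"{SYSTEM_PROMPT}\nRequest: {user_request}")
    return json.loads(reply)

print(request_to_command("bring me some sugar"))
# {'action': 'fetch', 'object': 'sugar', 'destination': 'dining room'}
```

The structured output is what makes the idea practical: once the request is reduced to an action, an object, and a destination, the rest of the system can treat it like any other navigation task.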

The Experiment: Predicting Human Needs

To explore this, my colleagues and I worked on one of the most exciting projects of my PhD. We asked: can an AI assistant with a camera understand what I’m doing, predict what I may need next, and instruct a robot to fetch it, all without me explicitly talking to the assistant?

Setting Up the Experiment

We built a house-like environment in our lab using Amazon-bought curtains as walls, zip ties to hold them up, and warnings from facility management to keep them away from fire hoses. These curtains created separate rooms.

We mapped the "home", labeling rooms (pantry, dining room, etc.) and setting predefined locations where the robot should go. For perception, we used an overhead camera at the workbench (our "kitchen" area). I favor overhead views for robot tasks because they provide a lot of relevant information (e.g., where a robot can move) without unnecessary detail (e.g., a full 3D map isn't required for path planning). While we used an overhead camera, any sufficiently wide-area camera (e.g., a smart device on the kitchen counter) could serve the same purpose.
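For illustration, here is one way such a labeled map could be stored: each named location points to a predefined goal pose the robot can navigate to. The room names echo ours, but the coordinates and the `goal_for` helper are made up for this sketch rather than taken from the real setup.

```python
# Each labeled location maps to a predefined (x, y, yaw) goal pose in the map
# frame. Names mirror our room labels; coordinates are illustrative only.
LOCATIONS = {
    "kitchen":     {"x": 0.0, "y": 0.0,  "yaw": 0.0},
    "pantry":      {"x": 2.5, "y": 1.0,  "yaw": 1.57},
    "dining room": {"x": 4.0, "y": -1.5, "yaw": 3.14},
}

def goal_for(room: str) -> dict:
    """Look up the predefined navigation goal for a labeled room."""
    return LOCATIONS[room]

print(goal_for("pantry"))  # {'x': 2.5, 'y': 1.0, 'yaw': 1.57}
```

Keeping the map this simple is deliberate: for point-to-point fetching, a handful of named goal poses is enough, and the planner never needs a full 3D reconstruction of the space.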

Lab Map

Our helper was a TurtleBot, responsible for moving objects. Ideally, we’d have used a mobile manipulator such as the Stretch from Hello Robot, but for this experiment we assumed the robot could pick up objects once it reached them. Since the overhead camera had a limited view, the TurtleBot also carried its own camera to locate objects independently.

TurtleBot Image

VLM-to-Robot Instructions

The robot navigation process involves many moving parts (pun intended). We used ROS to drive the robot around and implemented some high-level functions to complete the assigned tasks: