How VLMs Can Turn Robots into Proactive Helpers
Large language models (LLMs) have taken the world by storm. They are now so common that they’re integrated into personal messaging apps. Some have evolved into vision-language models (VLMs), capable of understanding images and answering questions about them. This revolutionary technology has also made waves in Embodied AI (EAI) research.
As defined by Julian Eßer, Embodied AI integrates AI into physical systems, such as robots, enabling them to interact meaningfully with their surroundings.
With VLMs, these physical systems can see the world through a camera, understand it, and interact with humans using natural language.
Given these capabilities, VLMs allow humans to instruct robots naturally (no code, controllers, or special interfaces required).
To explore this, my colleagues and I worked on one of the most exciting projects of my PhD. We asked: Can an AI assistant with a camera understand what I’m doing, predict what I may need next, and instruct a robot to fetch it, all without me explicitly talking to the assistant?
We built a house-like environment in our lab using Amazon-bought curtains as walls, zip ties to hold them up, and warnings from facility management to keep them away from fire hoses. These curtains created separate rooms.
We mapped the "home", labeling rooms (pantry, dining room, etc.) and setting predefined locations where the robot should go. For perception, we used an overhead camera at the workbench (our "kitchen" area). I favor overhead views for robot tasks because they provide a lot of relevant information (e.g., where a robot can move) without unnecessary detail (e.g., a full 3D map isn't required for path planning). While we used an overhead camera, any sufficiently wide-area camera (e.g., a smart device on the kitchen counter) could serve the same purpose.
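To make that concrete, the map can be as simple as a lookup table from room names to navigation waypoints. Here is a minimal sketch; the room names mirror ours, but the coordinates are illustrative, not our actual lab layout:

```python
# Hypothetical room-to-waypoint table: each named room maps to an
# (x, y, yaw) pose in the map frame that the robot can navigate to.
# Coordinates are made up for illustration.
ROOM_WAYPOINTS = {
    "kitchen":     (1.2, 0.5, 0.00),   # workbench area, under the overhead camera
    "pantry":      (3.4, 2.1, 1.57),
    "dining_room": (2.0, 4.3, 3.14),
    "garden":      (5.1, 0.8, 0.00),
    "garage":      (5.9, 3.6, -1.57),
}
```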
Our helper was a TurtleBot, responsible for moving objects. Ideally, we’d use a mobile manipulator such as Hello Robot’s Stretch, but for this experiment, we assumed the robot could pick up objects once it reached them. Since the overhead camera had a limited view, the TurtleBot also had its own camera to locate objects independently.
The robot navigation process involves many moving parts (pun intended). We used ROS to drive the robot around and implemented high-level functions on top of it to complete the assigned tasks.
For object search, we employed YOLO-World, an open-vocabulary detector: the VLM suggests a target object class (e.g., "spoon"), and the robot rotates in place to scan for it. Once the object is detected, the robot drives toward it, treating the approach like a PointNav task.
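Here is a minimal sketch of that scan-and-approach step, assuming ROS 1 and the ultralytics YOLO-World wrapper; `get_frame` and the velocity-publisher wiring are placeholders for however your camera and base are hooked up:

```python
# Sketch: rotate in place until the VLM-suggested object class is detected.
# Assumes ROS 1 (rospy) and the ultralytics YOLO-World wrapper.
import rospy
from geometry_msgs.msg import Twist
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")

def scan_for(target_class, get_frame, cmd_vel_pub, rate_hz=5):
    """Spin slowly until `target_class` appears in the robot's camera."""
    model.set_classes([target_class])  # open vocabulary: no retraining needed
    rate = rospy.Rate(rate_hz)
    spin = Twist()
    spin.angular.z = 0.4  # rad/s, slow rotation while scanning
    while not rospy.is_shutdown():
        frame = get_frame()  # latest image from the TurtleBot's onboard camera
        boxes = model.predict(frame, verbose=False)[0].boxes
        if len(boxes) > 0 and float(boxes.conf.max()) > 0.3:
            cmd_vel_pub.publish(Twist())   # zero velocity: stop rotating
            return boxes.xyxy[0].tolist()  # [x1, y1, x2, y2] of the target
        cmd_vel_pub.publish(spin)
        rate.sleep()
    return None
```

From the returned bounding box, the robot keeps the target centered in the image while driving forward, which is what makes the approach feel like a PointNav task.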
At the workbench, we took two images three seconds apart and asked the VLM to analyze the scene.
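We won’t reproduce our exact prompt, but a two-frame query can look like the sketch below, assuming an OpenAI-compatible VLM endpoint; the model name and prompt are illustrative stand-ins:

```python
# Sketch of a two-frame scene query against an OpenAI-compatible VLM.
# The model name and prompt are illustrative, not our exact setup.
import base64
from openai import OpenAI

client = OpenAI()

def encode(path):
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def analyze_scene(frame_t0_path, frame_t1_path):
    response = client.chat.completions.create(
        model="gpt-4o",  # any VLM that accepts multiple images works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                    "These two images were taken three seconds apart. "
                    "What is the person doing, and what object might they "
                    "need next that is not already within reach?"},
                {"type": "image_url", "image_url": {"url": encode(frame_t0_path)}},
                {"type": "image_url", "image_url": {"url": encode(frame_t1_path)}},
            ],
        }],
    )
    return response.choices[0].message.content
```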
In this example, I had milk, a cup, and coffee around me and was interacting with them. The VLM correctly inferred that I was making coffee and noticed I lacked a spoon for scooping and stirring. It directed the robot to fetch one from the pantry.
The VLM then created a plan for the robot to follow.
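We won’t paste the model’s verbatim output, but in spirit the plan was a short sequence of skills that a simple executor could dispatch. A hypothetical reconstruction, with made-up skill names:

```python
# Hypothetical reconstruction of the spoon-fetching plan. Skill names
# and the dict format are illustrative, not the VLM's verbatim output.
plan = [
    {"skill": "go_to", "target": "pantry"},
    {"skill": "find",  "target": "spoon"},   # the YOLO-World scan shown above
    {"skill": "pick",  "target": "spoon"},   # assumed to succeed in our setup
    {"skill": "go_to", "target": "kitchen"}, # the workbench is in the kitchen
]

def execute(plan, skills):
    """Dispatch each plan step to the matching low-level skill."""
    for step in plan:
        skills[step["skill"]](step["target"])
```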
Note that the workbench is in the kitchen, which is why the plan’s final step brings the robot back to the kitchen.
Here is how it all looked:
The red arrow on the left is the goal and the green curve is the planned path to it.
The left image shows the raw camera feed, and the right one shows the detection result.
Amazing, right?
We wondered: what if we presented more complex situations to the VLM? Would it still understand our actions and needs correctly? Here are some case studies we tried:
If I prepared food in the kitchen and then started moving with it, the VLM inferred I was heading to the dining room for a meal. It instructed the robot to follow me or carry the food.
If I was in the kitchen working on a plant with gloves, the VLM deduced I might need a gardening tool, even though none were visible. It sent the robot to fetch one from the garden.
When I was working with a drone, the VLM couldn’t identify specific details but inferred that tools might be needed. It suggested searching the garage. A future enhancement could involve the robot asking the user which tool is required.
One aspect I love about this project is how it demonstrates world models enabling non-verbal communication. As these models improve, robots will become even more capable assistants. I’m excited to see where this leads.
One last curiosity: VLMs are also good at OCR (Optical Character Recognition). Just for fun, I tested one more scenario, and the results were impressive.
This project was done under the guidance of Dr. Pratap Tokekar at UMD College Park. I am grateful to Amisha Bhaskar and Zahir Mahammad for helping with the setup and experiments.
The post's contents were refined with ChatGPT's help.