How VLMs Can Turn Robots into Proactive Helpers
Large language models (LLMs) have taken the world by storm. They are now so common that they’re integrated into personal messaging apps. Some have evolved into vision-language models (VLMs), capable of understanding images and answering questions about them. This revolutionary technology has also made waves in Embodied AI (EAI) research.
As defined by Julian Eßer, Embodied AI integrates AI into physical systems, such as robots, enabling them to interact meaningfully with their surroundings.
With VLMs, these physical systems can see the world through a camera, understand it, and interact with humans using natural language.
Given these capabilities, VLMs allow humans to instruct robots naturally (no code, controllers, or special interfaces required).
To explore this, my colleagues and I worked on one of the most exciting projects of my PhD. We asked: Can an AI assistant with a camera understand what I’m doing, predict what I may need next, and instruct a robot to fetch it, all without me explicitly talking to the assistant?
We built a house-like environment in our lab using Amazon-bought curtains as walls, zip ties to hold them up, and warnings from facility management to keep them away from fire hoses. These curtains created separate rooms.
We mapped the "home", labeling rooms (pantry, dining room, etc.) and setting predefined locations where the robot should go. For perception, we used an overhead camera at the workbench (our "kitchen" area). I favor overhead views for robot tasks because they provide a lot of relevant information (e.g., where a robot can move) without unnecessary detail (e.g., a full 3D map isn't required for path planning). While we used an overhead camera, any sufficiently wide-area camera (e.g., a smart device on the kitchen counter) could serve the same purpose.
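To make that concrete, the map can be as simple as a lookup table from room names to navigation waypoints. Here is a minimal sketch; the room names mirror ours, but the coordinates are illustrative, not our actual lab layout:

```python
# Hypothetical room-to-waypoint table: each named room maps to an
# (x, y, yaw) pose in the map frame that the robot can navigate to.
# Coordinates are made up for illustration.
ROOM_WAYPOINTS = {
    "kitchen":     (1.2, 0.5, 0.00),   # workbench area, under the overhead camera
    "pantry":      (3.4, 2.1, 1.57),
    "dining_room": (2.0, 4.3, 3.14),
    "garden":      (5.1, 0.8, 0.00),
    "garage":      (5.9, 3.6, -1.57),
}
```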
Our helper was a TurtleBot, responsible for moving objects. Ideally, we’d use a mobile manipulator such as Hello Robot’s Stretch, but for this experiment, we assumed the robot could pick up objects once it reached them. Since the overhead camera had a limited view, the TurtleBot also had its own camera to locate objects independently.
The robot navigation process involves many moving parts (pun intended). We used ROS to drive the robot around and implemented high-level functions on top of it to complete the assigned tasks.
For object search, we employed YOLO-World, an open-vocabulary detector: the VLM suggests a target object class (e.g., "spoon"), and the robot rotates in place to scan for it. Once the object is detected, the robot drives toward it, treating the approach like a PointNav task.
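Here is a minimal sketch of that scan-and-approach step, assuming ROS 1 and the ultralytics YOLO-World wrapper; `get_frame` and the velocity-publisher wiring are placeholders for however your camera and base are hooked up:

```python
# Sketch: rotate in place until the VLM-suggested object class is detected.
# Assumes ROS 1 (rospy) and the ultralytics YOLO-World wrapper.
import rospy
from geometry_msgs.msg import Twist
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")

def scan_for(target_class, get_frame, cmd_vel_pub, rate_hz=5):
    """Spin slowly until `target_class` appears in the robot's camera."""
    model.set_classes([target_class])  # open vocabulary: no retraining needed
    rate = rospy.Rate(rate_hz)
    spin = Twist()
    spin.angular.z = 0.4  # rad/s, slow rotation while scanning
    while not rospy.is_shutdown():
        frame = get_frame()  # latest image from the TurtleBot's onboard camera
        boxes = model.predict(frame, verbose=False)[0].boxes
        if len(boxes) > 0 and float(boxes.conf.max()) > 0.3:
            cmd_vel_pub.publish(Twist())   # zero velocity: stop rotating
            return boxes.xyxy[0].tolist()  # [x1, y1, x2, y2] of the target
        cmd_vel_pub.publish(spin)
        rate.sleep()
    return None
```

From the returned bounding box, the robot keeps the target centered in the image while driving forward, which is what makes the approach feel like a PointNav task.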
At the workbench, we took two images three seconds apart and asked the VLM to analyze the scene.
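We won’t reproduce our exact prompt, but a two-frame query can look like the sketch below, assuming an OpenAI-compatible VLM endpoint; the model name and prompt are illustrative stand-ins:

```python
# Sketch of a two-frame scene query against an OpenAI-compatible VLM.
# The model name and prompt are illustrative, not our exact setup.
import base64
from openai import OpenAI

client = OpenAI()

def encode(path):
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def analyze_scene(frame_t0_path, frame_t1_path):
    response = client.chat.completions.create(
        model="gpt-4o",  # any VLM that accepts multiple images works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                    "These two images were taken three seconds apart. "
                    "What is the person doing, and what object might they "
                    "need next that is not already within reach?"},
                {"type": "image_url", "image_url": {"url": encode(frame_t0_path)}},
                {"type": "image_url", "image_url": {"url": encode(frame_t1_path)}},
            ],
        }],
    )
    return response.choices[0].message.content
```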
In this example, I had milk, a cup, and coffee around me and was interacting with them. The VLM correctly inferred that I was making coffee and noticed I lacked a spoon for scooping and stirring. It directed the robot to fetch one from the pantry.
The VLM then created a plan for the robot to follow.
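We won’t paste the model’s verbatim output, but in spirit the plan was a short sequence of skills that a simple executor could dispatch. A hypothetical reconstruction, with made-up skill names:

```python
# Hypothetical reconstruction of the spoon-fetching plan. Skill names
# and the dict format are illustrative, not the VLM's verbatim output.
plan = [
    {"skill": "go_to", "target": "pantry"},
    {"skill": "find",  "target": "spoon"},   # the YOLO-World scan shown above
    {"skill": "pick",  "target": "spoon"},   # assumed to succeed in our setup
    {"skill": "go_to", "target": "kitchen"}, # the workbench is in the kitchen
]

def execute(plan, skills):
    """Dispatch each plan step to the matching low-level skill."""
    for step in plan:
        skills[step["skill"]](step["target"])
```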
Note that the workbench is in the kitchen, which is why the plan’s final step brings the robot back to the kitchen.
Here is how it all looked:
The red arrow on the left is the goal and the green curve is the planned path to it.
The left image shows the raw camera feed, and the right one shows the detection result.
Amazing, right?
We wondered: what if we presented more complex situations to the VLM? Would it still understand our actions and needs correctly? Here are some case studies we tried:
If I prepared food in the kitchen and then started moving with it, the VLM inferred I was heading to the dining room for a meal. It instructed the robot to follow me or carry the food.
If I was in the kitchen working on a plant with gloves, the VLM deduced I might need a gardening tool, even though none were visible. It sent the robot to fetch one from the garden.
When I was working with a drone, the VLM couldn’t identify specific details but inferred that tools might be needed. It suggested searching the garage. A future enhancement could involve the robot asking the user which tool is required.
One aspect I love about this project is how it demonstrates world models enabling non-verbal communication. As these models improve, robots will become even more capable assistants. I’m excited to see where this leads.
One last curiosity: VLMs are also good at OCR (Optical Character Recognition). Just for fun, I tested one more scenario, and the results were impressive.
This project was done under the guidance of Dr. Pratap Tokekar at UMD College Park. I am grateful to Amisha Bhaskar and Zahir Mahammad for helping with the setup and experiments.
The post's contents were refined with ChatGPT's help.