Gestures may be as important as words
Brown University
Whether in the kitchen or on a workshop floor, robot assistants that can fetch items for people could be extremely useful. Now, a team of Brown University researchers has developed a way of making robots better at figuring out exactly which items a user might want them to retrieve.
The new approach enables robots to use inputs from both
human language and gesture as they reason about how to locate and retrieve
target objects. In a study that
will be presented on Tuesday, March 17, during the International Conference on
Human-Robot Interaction in Edinburgh, Scotland, the researchers show
that the approach had an 89% success rate in finding the correct object in
complex environments, outperforming other object retrieval approaches.
“Searching for things requires a robot to navigate large environments,” said Ivy He, a graduate student at Brown and the study’s lead author. “With current technology, robots are pretty good at identifying objects, but when the environment is cluttered, things are moving around or things are hidden by other objects, that makes things much more difficult. So this work is about using both language and gesture to help in that search task.”
The research makes use of an approach to robot planning
called a POMDP (partially observable Markov decision process), a mathematical
framework that allows a robot to reason under uncertainty. In the real world,
robots rarely have a perfect understanding of the world. Different types of
objects can look similar. There may be more than one of a particular object in
a room. Items might be partially or completely hidden from view.
To succeed in a search, a robot has to act even when it isn’t sure what it’s seeing. Without a way to manage that uncertainty, it might freeze. Or worse, it might make overconfident final decisions based on incomplete information. A POMDP casts that ambiguity in probabilistic terms, helping the robot track how confident it is about what’s in the world and update those beliefs as new information arrives, including information from large vision and language models. In the process, it can choose actions that help it learn more — for example, moving to get a better view — before committing to a final decision.
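The belief-tracking idea described above can be illustrated with a Bayes update over candidate object locations. This is a minimal sketch, not the authors' implementation: the locations, the detector, and all probabilities below are invented for illustration.

```python
# Minimal sketch of a POMDP-style belief update over candidate object
# locations. States, observation model, and numbers are hypothetical.

def update_belief(belief, likelihood):
    """Bayes update: new_belief[s] is proportional to P(obs | s) * belief[s]."""
    posterior = {s: likelihood.get(s, 0.0) * p for s, p in belief.items()}
    total = sum(posterior.values())
    if total == 0:
        return belief  # observation carried no information; keep the prior
    return {s: p / total for s, p in posterior.items()}

# Uniform prior over three hypothetical locations.
belief = {"counter": 1/3, "shelf": 1/3, "table": 1/3}

# A (made-up) detection suggests the object is probably on the shelf.
belief = update_belief(belief, {"counter": 0.1, "shelf": 0.7, "table": 0.2})
```

After the update, the robot's confidence concentrates on the shelf, but the other hypotheses keep nonzero probability, so a later contradictory observation can still shift the belief rather than locking in an overconfident decision.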
The innovation in this latest research is a POMDP that
incorporates inputs from both language and human gestures, such as pointing
toward the object of interest. To incorporate the gesture component, He drew on
insights from a Brown laboratory led by Associate Professor of Cognitive and
Psychological Sciences Daphna Buchsbaum, on how the undisputed world
champions of fetch — dogs — interpret human pointing.
Building on this expertise, He and Ph.D. student Madeline
Pelgrim performed a study of the finer points of human pointing, as well
as how dogs interpret pointing gestures. The study helped He to model the
target of a pointing gesture within a cone of probability.
“What we have found is that humans use eye gaze to align
with what they’re pointing to,” He said. “So it was natural to create a cone
based on a connecting line from the eye to elbow to the wrist. That turns out
to be a fairly good approximation of where someone is pointing.”
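The cone idea He describes can be sketched as follows: treat the pointing ray as running from the eye through the wrist, and score a candidate target by its angular distance from that ray. This is an illustrative sketch, not the study's model; the 20-degree half-angle and all coordinates are assumptions.

```python
# Sketch of a pointing cone: a target counts as "pointed at" when the
# angle between the eye->wrist ray and the eye->target ray is small.
# The half-angle and the coordinates below are invented for illustration.
import math

def angle_from_ray(eye, wrist, target):
    """Angle (radians) between the eye->wrist ray and the eye->target ray."""
    ray = [w - e for w, e in zip(wrist, eye)]
    to_target = [t - e for t, e in zip(target, eye)]
    dot = sum(a * b for a, b in zip(ray, to_target))
    norm = math.dist(eye, wrist) * math.dist(eye, target)
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def in_cone(eye, wrist, target, half_angle_deg=20.0):
    return angle_from_ray(eye, wrist, target) <= math.radians(half_angle_deg)

eye, wrist = (0.0, 1.6, 0.0), (0.3, 1.3, 0.5)
mug = (1.2, 0.4, 2.0)    # lies roughly along the pointing ray
lamp = (-2.0, 1.6, 0.5)  # well off to the side
```

A probabilistic version, closer in spirit to the paper's approach, would replace the hard threshold with a likelihood that decays smoothly with angular distance, so the POMDP can weigh borderline targets rather than discard them.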
Buchsbaum adds, “Our work in the Brown Dog Lab has shown just how sophisticated
dogs are in their communication with humans, solving many of the cooperation
problems we want robots to solve. This makes them a natural model for intuitive
human-non-human cooperation. This work translates the dog's intuitive
understanding of human gaze and pointing into a probabilistic model, which
allows the robot to handle the ambiguity inherent in human communication. It
moves us closer to truly intuitive robotic assistants.”
He then combined the gesture model with a vision-language
model, or VLM, an AI system designed to interpret visual scenes together with
natural language descriptions. The result was a POMDP capable of incorporating
both language and gesture for robot planning.
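One simple way to combine the two channels, sketched below under the assumption that the language and gesture likelihoods are conditionally independent given the target, is to multiply them and renormalize. The function names, objects, and numbers are hypothetical, not taken from the paper.

```python
# Illustrative fusion of two observation channels: multiply per-object
# likelihoods and normalize. All names and values are invented.

def fuse(language_likelihood, gesture_likelihood):
    """Combine channel likelihoods multiplicatively, then normalize."""
    combined = {obj: language_likelihood[obj] * gesture_likelihood[obj]
                for obj in language_likelihood}
    total = sum(combined.values())
    return {obj: p / total for obj, p in combined.items()}

# "The red mug" is ambiguous between two objects by language alone...
language = {"red mug A": 0.45, "red mug B": 0.45, "blue bowl": 0.10}
# ...but the pointing cone strongly favors one of them.
gesture = {"red mug A": 0.70, "red mug B": 0.15, "blue bowl": 0.15}

posterior = fuse(language, gesture)
```

The example mirrors the study's central finding: each channel alone leaves ambiguity, while the product sharply favors the object consistent with both the description and the gesture.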
In lab experiments, the researchers asked a quadruped robot
to find various objects scattered around the lab space. The experiments showed
that the robot was able to locate the correct object nearly 90% of the time using
combined gesture and language, far better than using either input alone.
For He and her coauthors, the research is a step toward
robots that are able to operate side-by-side with people at home and in the
workplace.
“The framework we developed helps pave the way for seamless
multimodal human-robot interaction,” said research co-author Jason Liu, a
postdoctoral researcher at MIT who worked on the project while completing his
Ph.D. at Brown. “In the future, we can communicate with our assistant robots
the same way people interact through language, gestures, eye gazes,
demonstrations and much more.”
The work was supported through Brown’s AI Research Institute
on Interaction for AI Assistants (ARIA), which is funded by the National
Science Foundation.
“This is a really great illustration of how we can
enable more natural and effective human-machine interaction by strengthening
collaborations between computer science and cognitive science,” said Ellie
Pavlick, an associate professor of computer science at Brown who leads ARIA.
“Embracing what we know about how humans naturally want to communicate, and
building systems aligned with those human tendencies and intuitions about
behavior, is the right way forward.”
The work was supported by the National Science Foundation
(2433429, GR5250131) and the Office of Naval Research (N0001424-1-2784,
N0001424-1-2603).
