Humanoid robot development has moved at a snail’s pace for the better part of two decades, but a collaboration between Figure AI and OpenAI is rapidly accelerating it. The result is the most stunning bit of real humanoid robot video we’ve ever seen.
On Wednesday, startup robotics firm Figure AI released a video update of its Figure 01 robot running a new Visual Language Model (VLM) that has somehow transformed the bot from a rather uninteresting automaton into a full-fledged sci-fi bot that approaches C-3PO-level capabilities.
In the video, Figure 01 stands behind a table set with a plate, an apple, and a cup. To the left is a drainer. A human stands in front of the robot and asks the robot, “Figure 01, what do you see right now?”
After a few seconds, Figure 01 responds in a remarkably human-sounding voice (there is no face, just an animated light that moves in sync with the voice), describing everything on the table as well as the man standing before it.
Then the man asks, “Hey, can I have something to eat?”
Figure 01 responds, “Sure thing,” and then, with a dexterous flourish of fluid movement, picks up the apple and hands it to the guy.
Next, the man empties some crumpled debris from a bin in front of Figure 01 while asking, “Can you explain why you did what you just did while you pick up this trash?”
Figure 01 wastes no time explaining its reasoning while placing the paper back into the bin. “So, I gave you the apple because it’s the only edible item I could provide you with from the table.”
Speech-to-speech
The company explained in a release that Figure 01 engages in “speech-to-speech” reasoning, using a pre-trained multimodal model from OpenAI (the VLM) to understand images and text, and that it draws on the entire voice conversation to craft its responses. This is different from, say, OpenAI’s GPT-4, which focuses on written prompts.
It’s also using what the company calls “learned low-level bimanual manipulation.” The system matches precise image calibrations (down to a pixel level) with its neural network to control movement. “These networks take in onboard images at 10hz, and generate 24-DOF actions (wrist poses and finger joint angles) at 200hz,” Figure AI wrote in a release.
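To put those two rates in perspective, here is a minimal Python sketch of the arithmetic. Only the three numbers (10 Hz images, 200 Hz actions, 24 degrees of freedom) come from Figure AI's release. The `interpolate_actions` helper and the linear blending it performs are hypothetical stand-ins for illustration, not a description of how Figure 01's networks actually work.

```python
from typing import List

IMAGE_HZ = 10    # onboard images consumed per second (from the release)
ACTION_HZ = 200  # actions generated per second (from the release)
DOF = 24         # wrist poses and finger joint angles (from the release)

# At these rates, the robot emits 20 actions for every image it takes in.
ACTIONS_PER_FRAME = ACTION_HZ // IMAGE_HZ


def interpolate_actions(
    start: List[float],
    goal: List[float],
    steps: int = ACTIONS_PER_FRAME,
) -> List[List[float]]:
    """Hypothetical helper: fill the gap between two image frames by
    linearly blending a 24-value action from `start` toward `goal`.
    This is an assumption for illustration, not Figure AI's method."""
    if len(start) != DOF or len(goal) != DOF:
        raise ValueError(f"expected {DOF} values per action")
    return [
        [s + (g - s) * (k / steps) for s, g in zip(start, goal)]
        for k in range(1, steps + 1)
    ]


if __name__ == "__main__":
    chunk = interpolate_actions([0.0] * DOF, [1.0] * DOF)
    print(f"{len(chunk)} actions of {len(chunk[0])} values per image frame")
```

In other words, whatever happens inside the networks, each camera frame has to be stretched across roughly 20 separate motor commands, one every 5 milliseconds.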
The company claims that every behavior in the video is learned rather than teleoperated, meaning there’s no one behind the scenes puppeteering Figure 01.
Without seeing Figure 01 in person, and asking my own questions, it’s hard to verify these claims. There is the possibility that this is not the first time Figure 01 has run through this routine. It could’ve been the 100th time, which might account for its speed and fluidity.