Generative AI taught a robot dog to scramble around a new environment

Teaching robots to navigate new environments is tough. You can train them on physical, real-world data taken from recordings made by humans, but that’s scarce and expensive to collect. Digital simulations are a rapid, scalable way to teach them to do new things, but the robots often fail when they’re pulled out of virtual worlds and asked to do the same tasks in the real one.

Now there’s a potentially better option: a new system that uses generative AI models in conjunction with a physics simulator to develop virtual training grounds that more accurately mirror the physical world. Robots trained using this method achieved a higher success rate in real-world tests than those trained using more traditional techniques.

Researchers used the system, called LucidSim, to train a robot dog in parkour, getting it to scramble over a box and climb stairs even though it had never seen any real-world data. The approach demonstrates how helpful generative AI could be when it comes to teaching robots to do challenging tasks. It also raises the possibility that we could ultimately train them in entirely virtual worlds. The research was presented at the Conference on Robot Learning (CoRL) last week.

“We’re in the middle of an industrial revolution for robotics,” says Ge Yang, a postdoc at MIT’s Computer Science and Artificial Intelligence Laboratory, who worked on the project. “This is our attempt at understanding the impact of these [generative AI] models outside of their original intended purposes, with the hope that it will lead us to the next generation of tools and models.”

LucidSim uses a combination of generative AI models to create the visual training data. First the researchers generated thousands of prompts for ChatGPT, getting it to create descriptions of a range of environments that represent the conditions the robot would encounter in the real world, including different types of weather, times of day, and lighting conditions. These included “an ancient alley lined with tea houses and small, quaint shops, each displaying traditional ornaments and calligraphy” and “the sun illuminates a somewhat unkempt lawn dotted with dry patches.”
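To make the prompt-generation step concrete, here is a minimal Python sketch of how a batch of scene-description requests could be assembled from combinations of conditions. It is an illustration only, not the researchers' code: the condition lists, the template wording, and the `build_meta_prompts` helper are all assumptions, since the article names only the categories (weather, time of day, lighting). The sketch stops short of calling any language model.

```python
from itertools import product

# Hypothetical condition values; the article gives only the categories.
WEATHER = ["clear", "overcast", "rainy"]
TIMES_OF_DAY = ["dawn", "midday", "dusk"]
LIGHTING = ["harsh direct light", "soft diffuse light"]


def build_meta_prompts(weather, times_of_day, lighting):
    """Return one request string per combination of conditions."""
    # Assumed template wording, chosen for illustration.
    template = (
        "Describe, in one sentence, an outdoor scene a small legged robot "
        "might walk through. Weather: {w}. Time of day: {t}. Lighting: {l}."
    )
    return [
        template.format(w=w, t=t, l=l)
        for w, t, l in product(weather, times_of_day, lighting)
    ]


prompts = build_meta_prompts(WEATHER, TIMES_OF_DAY, LIGHTING)
print(len(prompts))  # 3 * 3 * 2 = 18 combinations
print(prompts[0])
```

Crossing even a few short lists multiplies quickly, which is one plausible way to reach thousands of distinct prompts without writing each by hand.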

These descriptions were fed into a system that maps 3D geometry and physics data onto AI-generated images, creating short videos that trace a trajectory for the robot to follow. The robot draws on this information to work out the height, width, and depth of the things it has to navigate, such as a box or a set of stairs.

The researchers tested LucidSim by instructing a four-legged robot equipped with a webcam to complete several tasks, including locating a traffic cone or soccer ball, climbing over a box, and walking up and down stairs. The robot performed consistently better with LucidSim than with a system trained on traditional simulations. In 20 trials to locate the cone, LucidSim had a 100% success rate, versus 70% for systems trained on standard simulations. Similarly, in another 20 trials, the robot reached the soccer ball 85% of the time with LucidSim, versus just 35% with the other system.

Finally, when the robot was running LucidSim, it successfully completed all 10 stair-climbing trials, compared with just 50% for the other system.
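For readers who prefer counts to percentages, the short Python sketch below converts the reported success rates into numbers of successful trials. The percentages and the LucidSim trial counts come from the article. The assumption that the comparison system's stair-climbing figure also covers 10 trials is ours, and the `successes` and `summarize` helpers are illustrative names.

```python
from fractions import Fraction

# (task, trials, LucidSim success %, comparison-system success %)
# Assumption: the comparison system's stair figure also covers 10 trials.
RESULTS = [
    ("locate traffic cone", 20, 100, 70),
    ("reach soccer ball", 20, 85, 35),
    ("climb stairs", 10, 100, 50),
]


def successes(trials, percent):
    """Convert a success percentage into a whole number of trials."""
    count = Fraction(percent, 100) * trials
    if count.denominator != 1:
        raise ValueError("percentage does not match a whole number of trials")
    return int(count)


def summarize(results):
    """Return (task, LucidSim successes, comparison successes, trials) rows."""
    return [
        (task, successes(trials, lucid_pct), successes(trials, base_pct), trials)
        for task, trials, lucid_pct, base_pct in results
    ]


for task, lucid, base, trials in summarize(RESULTS):
    print(f"{task}: LucidSim {lucid}/{trials}, comparison {base}/{trials}")
```

Under those assumptions, the figures work out to 20 versus 14 successes for the cone, 17 versus 7 for the soccer ball, and 10 versus 5 for the stairs.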

From left: Phillip Isola, Ge Yang, and Alan Yu

These results are likely to improve further if LucidSim draws directly from sophisticated generative video models rather than a rigged-together combination of language, image, and physics models, says Phillip Isola, an associate professor at MIT who worked on the research.

The researchers’ approach to using generative AI is a novel one that will pave the way for more interesting new research, says Mahi Shafiullah, a PhD student at New York University who is using AI models to train robots. He did not work on the project.

“The more interesting direction I see personally is a mix of both real and realistic ‘imagined’ data that can help our current data-hungry methods scale quicker and better,” he says.

The ability to train a robot from scratch purely on AI-generated situations and scenarios is a significant achievement and could extend beyond machines to more generalized AI agents, says Zafeirios Fountas, a senior research scientist at Huawei specializing in brain-inspired AI.

“The term ‘robots’ here is used very generally; we’re talking about some sort of AI that interacts with the real world,” he says. “I can imagine this being used to control any sort of visual information, from robots and self-driving cars up to controlling your computer screen or smartphone.”

In terms of next steps, the authors are interested in trying to train a humanoid robot using wholly synthetic data, which they acknowledge is an ambitious goal because bipedal robots are typically less stable than their four-legged counterparts. They’re also turning their attention to another new challenge: using LucidSim to train the kinds of robotic arms that work in factories and kitchens. The tasks those arms have to perform require a lot more dexterity and physical understanding than running around a landscape.

“To actually pick up a cup of coffee and pour it is a very hard, open problem,” says Isola. “If we could take a simulation that’s been augmented with generative AI to create a lot of diversity and train a very robust agent that can operate in a café, I think that would be very cool.”