Deep Science: Combining vision and language could be the key to more capable AI

Depending on the theory of intelligence to which you subscribe, achieving “human-level” AI will require a system that can leverage multiple modalities (e.g., sound, vision and text) to reason about the world. For example, when shown an image of a toppled truck and a police cruiser on a snowy freeway, a human-level AI might infer that dangerous road conditions caused an accident. Or, running on a robot, when asked to grab a can of soda from the refrigerator, it would navigate around people, furniture and pets to retrieve the can and place it within reach of the requester.

Today’s AI falls short. But new research shows signs of encouraging progress, from robots that can figure out steps to satisfy basic commands (e.g., “get a water bottle”) to text-producing systems that learn from explanations. In this revived edition of Deep Science, our weekly series about the latest developments in AI and the broader scientific field, we’re covering work out of DeepMind, Google and OpenAI that makes strides toward systems that can — if not perfectly understand the world — solve narrow tasks like generating images with impressive robustness.

OpenAI’s improved DALL-E, DALL-E 2, is easily the most impressive project to emerge from an AI research lab this week. As my colleague Devin Coldewey writes, while the original DALL-E demonstrated a remarkable prowess for creating images to match virtually any prompt (for example, “a dog wearing a beret”), DALL-E 2 takes this further. The images it produces are much more detailed, and DALL-E 2 can intelligently replace a given area in an image, for example, inserting a table into a photo of a marbled floor, complete with the appropriate reflections.

An example of the types of images DALL-E 2 can generate. Image Credits: OpenAI
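To make the inpainting idea concrete, here is a minimal sketch of what such an edit request could look like in code, using the image-editing endpoint OpenAI later exposed in its legacy Python client. The file names, prompt and environment variable are illustrative assumptions, not details from OpenAI’s announcement.

```python
# Minimal sketch of DALL-E 2-style inpainting via OpenAI's Images edit endpoint.
# Assumes the legacy `openai` Python client (<1.0) and an OPENAI_API_KEY in the
# environment; "floor.png" and "floor_mask.png" are hypothetical files, where the
# mask's transparent region marks the area the model should fill in.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Image.create_edit(
    image=open("floor.png", "rb"),      # original photo of the marbled floor
    mask=open("floor_mask.png", "rb"),  # transparent where the table should go
    prompt="a wooden table on a marbled floor, with matching reflections",
    n=1,
    size="1024x1024",
)

print(response["data"][0]["url"])  # URL of the edited image
```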

DALL-E 2 received most of the attention this week. But on Thursday, researchers at Google detailed an equally impressive visual understanding system called Visually-Driven Prosody for Text-to-Speech — VDTTS — in a post published to Google’s AI blog. VDTTS can generate realistic-sounding, lip-synced speech given nothing more than text and video frames of the person talking.

VDTTS’ generated speech, while not a perfect stand-in for recorded dialogue, is still quite good, with convincingly human-like expressiveness and timing. Google sees it one day being used in a studio to replace original audio that might’ve been recorded in noisy conditions.

Of course, visual understanding is just one step on the path to more capable AI. Another component is language understanding, which lags behind in many aspects — even setting aside AI’s well-documented toxicity and bias issues. In a stark example, a cutting-edge system from Google, Pathways Language Model (PaLM), memorized 40% of the data that was used to “train” it, according to a paper, resulting in PaLM plagiarizing text down to copyright notices in code snippets.

Fortunately, DeepMind, the AI lab backed by Alphabet, is among those exploring techniques to address this. In a new study, DeepMind researchers investigate whether AI language systems, which learn to generate text from many examples of existing text (think books and social media), could benefit from being given explanations of those texts. After annotating dozens of language tasks (e.g., “Answer these questions by identifying whether the second sentence is an appropriate paraphrase of the first, metaphorical sentence”) with explanations (e.g., “David’s eyes were not literally daggers, it is a metaphor used to imply that David was glaring fiercely at Paul.”) and evaluating different systems’ performance on them, the DeepMind team found that the explanations indeed improved the systems’ performance.
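In prompting terms, the recipe is simple: each few-shot example carries its answer plus a short explanation, and the model is then asked to complete a new, unexplained instance. Below is a minimal sketch of that pattern; the example data and the `generate` call are illustrative placeholders, not DeepMind’s actual benchmark or model interface.

```python
# Sketch of few-shot prompting with explanations appended to each example.
# The examples below are illustrative stand-ins, not DeepMind's evaluation data;
# `generate` is a placeholder for whatever language model API is being used.

EXAMPLES = [
    {
        "question": (
            "Is the second sentence an appropriate paraphrase of the first, "
            "metaphorical sentence?\n"
            "1) David's eyes were daggers.\n"
            "2) David was glaring fiercely at Paul."
        ),
        "answer": "Yes",
        "explanation": (
            "David's eyes were not literally daggers; it is a metaphor used to "
            "imply that David was glaring fiercely at Paul."
        ),
    },
    # ... more annotated examples ...
]


def build_prompt(examples, new_question, with_explanations=True):
    """Concatenate few-shot examples (optionally with explanations) and a new question."""
    parts = []
    for ex in examples:
        block = f"Question: {ex['question']}\nAnswer: {ex['answer']}"
        if with_explanations:
            block += f"\nExplanation: {ex['explanation']}"
        parts.append(block)
    parts.append(f"Question: {new_question}\nAnswer:")
    return "\n\n".join(parts)


# prompt = build_prompt(EXAMPLES, "Is 'time is money' literally true?")
# completion = generate(prompt)  # placeholder for a language model call
```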

DeepMind’s approach, if it passes muster within the academic community, could one day be applied in robotics, forming the building blocks of a robot that can understand vague requests (e.g., “throw out the garbage”) without step-by-step instructions. Google’s new “Do As I Can, Not As I Say” project gives a glimpse into this future — albeit with significant limitations.

A collaboration between Robotics at Google and the Everyday Robots team at Alphabet’s X lab, Do As I Can, Not As I Say seeks to condition an AI language system to propose actions that are “feasible” and “contextually appropriate” for a robot, given an arbitrary task. The robot acts as the language system’s “hands and eyes” while the system supplies high-level semantic knowledge about the task; the theory is that the language system encodes a wealth of knowledge useful to the robot.

Image Credits: Robotics at Google

A system called SayCan selects which skill the robot should perform in response to a command, factoring in (1) the probability that a given skill is useful and (2) the probability of successfully executing that skill. For example, in response to someone saying “I spilled my coke, can you bring me something to clean it up?,” SayCan can direct the robot to find a sponge, pick up the sponge and bring it to the person who asked for it.
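That selection rule boils down to scoring each candidate skill by the product of two estimates, one from the language model and one reflecting what the robot can actually accomplish in its current state. The sketch below assumes hypothetical `lm_usefulness` and `affordance` functions standing in for the language-model scoring and the learned value functions the researchers describe.

```python
# Sketch of SayCan-style skill selection: combine a language model's estimate of
# how useful each skill is for the instruction with an affordance estimate of how
# likely the robot is to execute it successfully, then pick the highest product.
# Skill names and the two scoring functions are illustrative placeholders.

SKILLS = ["find a sponge", "pick up the sponge", "bring it to you", "find a coke can"]


def lm_usefulness(instruction: str, skill: str) -> float:
    """Placeholder: probability the language model assigns to `skill` as the next step."""
    raise NotImplementedError


def affordance(skill: str, state) -> float:
    """Placeholder: estimated probability the robot can complete `skill` from `state`."""
    raise NotImplementedError


def select_skill(instruction: str, state, skills=SKILLS) -> str:
    """Return the skill with the highest usefulness-times-feasibility score."""
    scores = {
        skill: lm_usefulness(instruction, skill) * affordance(skill, state)
        for skill in skills
    }
    return max(scores, key=scores.get)


# select_skill("I spilled my coke, can you bring me something to clean it up?", state)
```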

SayCan is limited by robotics hardware; on more than one occasion, the research team observed the robot they chose for their experiments accidentally dropping objects. Still, along with DALL-E 2 and DeepMind’s work in contextual understanding, it illustrates how AI systems, when combined, can inch us that much closer to a Jetsons-type future.