DeepMind says its new language model can beat others 25 times its size

In the two years since OpenAI released its language model GPT-3, most big-name AI labs have developed language mimics of their own. Google, Facebook, and Microsoft—as well as a handful of Chinese firms—have all built AIs that can generate convincing text, chat with humans, answer questions, and more.

Known as large language models because of the massive size of the neural networks underpinning them, they have become a dominant trend in AI, showcasing both its strengths—the remarkable ability of machines to use language—and its weaknesses, particularly AI’s inherent biases and the unsustainable amount of computing power it can consume.

Until now, DeepMind has been conspicuous by its absence. But this week the UK-based company, which has been behind some of the most impressive achievements in AI, including AlphaZero and AlphaFold, is entering the discussion with three large studies on language models. DeepMind’s main result is an AI with a twist: it’s enhanced with an external memory in the form of a vast database containing passages of text, which it uses as a kind of cheat sheet when generating new sentences.

Called RETRO (for “Retrieval-Enhanced Transformer”), the AI matches the performance of neural networks 25 times its size, cutting the time and cost needed to train very large models. The researchers also claim that the database makes it easier to analyze what the AI has learned, which could help with filtering out bias and toxic language.

“Being able to look things up on the fly instead of having to memorize everything can often be useful, in the same way as it is for humans,” says Jack Rae at DeepMind, who leads the firm’s research in large language models.

Language models generate text by predicting what words come next in a sentence or conversation. The larger a model, the more information about the world it can learn during training, which makes its predictions better. GPT-3 has 175 billion parameters, the values in a neural network that store data and get adjusted as the model learns. Megatron-Turing NLG, a language model built by Microsoft and Nvidia, has 530 billion parameters. But large models also take vast amounts of computing power to train, putting them out of reach of all but the richest organizations.
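To make next-word prediction concrete, here is a minimal Python sketch. It is a toy illustration, not how GPT-3 or any model named in this article works: it simply counts which word follows which in a tiny made-up corpus and predicts the most frequent follower. The function names (train_bigrams, predict_next) and the corpus are assumptions chosen for the example. The counts play a role loosely similar to parameters, in that they are values adjusted as the program sees more text.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    following = counts.get(word.lower())
    if not following:
        return None
    return following.most_common(1)[0][0]

# Made-up training text, purely for illustration.
corpus = "the cat sat on the mat the cat ran"
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" only once
```

A real large language model replaces these raw counts with billions of learned numeric weights and looks at far more context than one preceding word, but the task is the same: given the text so far, guess what comes next.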

With RETRO, DeepMind has tried to cut the cost of training without reducing the amount the AI learns. The researchers trained the model on a vast data set of news articles, Wikipedia pages, books, and text from GitHub, an online code repository. The data set contains text in 10 languages, including English, Spanish, German, French, Russian, Chinese, Swahili, and Urdu.

RETRO’s neural network has only 7 billion parameters. But the system makes up for this with a database containing around 2 trillion tokens (words and word fragments) of text. Both the database and the neural network are trained at the same time.

When RETRO generates text, it uses the database to look up and compare passages similar to the one it is writing, which makes its predictions more accurate. Outsourcing some of the neural network’s memory to the database lets RETRO do more with less.
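The look-up step can be sketched in a few lines of Python. This is a toy under stated assumptions, not DeepMind's implementation: it scores passages by simple word overlap (cosine similarity over word counts), whereas a system like RETRO would rely on representations learned by a neural network. The three-passage DATABASE and the names embed, cosine, and retrieve are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stand-in for the text database; the real one is vastly larger.
DATABASE = [
    "The US Open is a tennis tournament held in New York.",
    "AlphaFold predicts the 3D structure of proteins.",
    "Swahili is spoken widely across East Africa.",
]

def embed(text):
    """Toy 'embedding': lowercase word counts (an assumption, not a learned vector)."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return Counter(w for w in words if w)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, database, k=1):
    """Return the k passages most similar to the text being written."""
    q = embed(query)
    ranked = sorted(database, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

print(retrieve("Who plays tennis in New York?", DATABASE))
```

In a retrieval-enhanced model, the passages returned by a step like this are fed to the neural network alongside the text written so far, so the network can draw on them when predicting the next words.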

The idea isn’t new, but this is the first time a look-up system has been developed for a large language model, and the first time the results from this approach have been shown to rival the performance of the best language AIs around.

Bigger isn’t always better

RETRO draws from two other studies released by DeepMind this week, one looking at how the size of a model affects its performance and one looking at the potential harms caused by these AIs.

To study size, DeepMind built a large language model called Gopher, with 280 billion parameters. It beat state-of-the-art models on 82% of the more than 150 common language challenges the researchers used for testing. The team then pitted it against RETRO and found that the 7-billion-parameter model matched Gopher’s performance on most tasks.

The ethics study is a comprehensive survey of well-known problems inherent in large language models. These models pick up biases, misinformation, and toxic language such as hate speech from the articles and books they are trained on. As a result, they sometimes spit out harmful statements, mindlessly mirroring what they have encountered in the training text without knowing what it means. “Even a model that perfectly mimicked the data would be biased,” says Rae.

According to DeepMind, RETRO could help address this issue because it is easier to see what the AI has learned by examining the database than by studying the neural network. In theory, this could allow examples of harmful language to be filtered out or balanced with non-harmful examples. But DeepMind has not yet tested this claim. “It’s not a fully resolved problem, and work is ongoing to address these challenges,” says Laura Weidinger, a research scientist at DeepMind.

The database can also be updated without retraining the neural network. This means that new information, such as who won the US Open, can be added quickly—and out-of-date or false information removed.
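A small Python sketch shows why this separation matters. It is an illustration under assumptions, not RETRO's actual design: RetrievalStore and frozen_model are hypothetical names, the matching is crude word overlap, and "Player A" and "Player B" are placeholders rather than real results. The point is only that the function standing in for the trained network never changes, while its answers do once the store is edited.

```python
class RetrievalStore:
    """Hypothetical passage store kept separate from the model's weights."""

    def __init__(self):
        self.passages = {}

    def add(self, key, text):
        self.passages[key] = text

    def remove(self, key):
        self.passages.pop(key, None)

    def lookup(self, query):
        """Return stored passages that share at least one word with the query."""
        q = set(query.lower().split())
        return [t for t in self.passages.values() if q & set(t.lower().split())]

def frozen_model(query, store):
    """Stand-in for a trained network: its logic is fixed and never retrained."""
    hits = store.lookup(query)
    return hits[0] if hits else "unknown"

store = RetrievalStore()
store.add("last-year", "tennis champion Player A")
print(frozen_model("tennis champion", store))  # uses the old entry

# Swap out-of-date information for new information; the model is untouched.
store.remove("last-year")
store.add("this-year", "tennis champion Player B")
print(frozen_model("tennis champion", store))  # now reflects the update
```

With a conventional language model, the same correction would mean further training, because the fact would be stored inside the network's parameters rather than in an editable table.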

Systems like RETRO are more transparent than black-box models like GPT-3, says Devendra Sachan, a PhD student at McGill University in Canada. “But this is not a guarantee that it will prevent toxicity and bias.” Sachan developed a forerunner of RETRO in a previous collaboration with DeepMind, but he was not involved in this latest work.

For Sachan, fixing the harmful behavior of language models requires thoughtful curation of the training data before training begins. Still, systems like RETRO may help: “It’s easier to adopt these guidelines when a model makes use of external data for its predictions,” he says.

DeepMind may be late to the debate. But rather than leapfrogging existing AIs, it is matching them with an alternative approach. “This is the future of large language models,” says Sachan.