Google DeepMind’s new Gemini model looks amazing—but could signal peak AI hype

Hype about Gemini, Google DeepMind’s long-rumored response to OpenAI’s GPT-4, has been building for months. Today the company finally revealed what it has been working on in secret all this time. Was the hype justified? Yes—and no.

Gemini is Google’s biggest AI launch yet—its push to take on competitors OpenAI and Microsoft in the race for AI supremacy. There is no doubt that the model is pitched as best-in-class across a wide range of capabilities—an “everything machine,” as one observer puts it.

“The model is innately more capable,” Sundar Pichai, the CEO of Google and its parent company Alphabet, told MIT Technology Review. “It’s a platform. AI is a profound platform shift, bigger than web or mobile. And so it represents a big step for us.”

It’s a big step for Google, but not necessarily a giant leap for the field as a whole. Google DeepMind claims that Gemini outmatches GPT-4 on 30 out of 32 standard measures of performance. And yet the margins between them are thin. What Google DeepMind has done is pull AI’s best current capabilities into one powerful package. To judge from demos, it does many things very well—but few things that we haven’t seen before. For all the buzz about the next big thing, Gemini could be a sign that we’ve reached peak AI hype. At least for now.

Chirag Shah, a professor at the University of Washington who specializes in online search, compares the launch to Apple’s introduction of a new iPhone every year. “Maybe we just have risen to a different threshold now, where this doesn’t impress us as much because we’ve just seen so much,” he says.

Like GPT-4, Gemini is multimodal, meaning it is trained to handle multiple kinds of input: text, images, audio. It can combine these different formats to answer questions about everything from household chores to college math to economics.

In a demo for journalists yesterday, Google showed Gemini’s ability to take an existing screenshot of a chart, analyze hundreds of pages of research with new data, and then update the chart with that new information. In another example, Gemini is shown pictures of an omelet cooking in a pan and asked (using speech, not text) if the omelet is cooked yet. “It’s not ready because the eggs are still runny,” it replies.

Most people will have to wait for the full experience, however. The version launched today is a back end to Bard, Google’s text-based search chatbot, which the company says will give it more advanced reasoning, planning, and understanding capabilities. Gemini’s full release will be staggered over the coming months. The new Gemini-boosted Bard will initially be available in English in more than 170 countries, not including the EU and the UK. This is to let the company “engage” with local regulators, says Sissie Hsiao, a Google vice president in charge of Bard.

Gemini also comes in three sizes: Ultra, Pro and Nano. Ultra is the full-powered version; Pro and Nano are tailored to applications that run with more limited computing resources. Nano is designed to run on devices, such as Google’s new Pixel phones. Developers and businesses will be able to access Gemini Pro starting December 13. Gemini Ultra, the most powerful model, will be available “early next year” following “extensive trust and safety checks,” Google executives told reporters on a press call.

“I think of it as the Gemini era of models,” Pichai told us. “This is how Google DeepMind is going to build and make progress on AI. So it will always represent the frontier of where we are making progress on AI technology.”

Bigger, better, faster, stronger?

OpenAI’s most powerful model, GPT-4, is seen as the industry’s gold standard. While Google boasted that Gemini outperforms OpenAI’s previous model, GPT-3.5, company executives dodged questions about how far the model exceeds GPT-4.

But the firm highlights one benchmark in particular, called MMLU (massive multitask language understanding). This is a set of text-based tests designed to measure the performance of models on tasks including reading comprehension, college math, and multiple-choice quizzes in physics, economics, and social sciences. On these questions, Gemini scores 90%, just above the approximately 89% scored by human experts, says Pichai. “It’s the first model to cross that threshold,” he says. GPT-4 scores 86% on the same questions. On a separate set of multimodal questions, Gemini scores 59%, while GPT-4 scores 57%.

Gemini’s performance against benchmark data sets is very impressive, says Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico.

“It’s clear that Gemini is a very sophisticated AI system,” says Mitchell. But “it’s not obvious to me that Gemini is actually substantially more capable than GPT-4,” she adds.

While the model has good benchmark scores, it is hard to know how to interpret these numbers given that we don’t know what’s in the training data, says Percy Liang, director of Stanford’s Center for Research on Foundation Models.

Mitchell also notes that Gemini performs much better on language and code benchmarks than on images and video. “Multimodal foundation models still have a ways to go to be generally and robustly useful for many tasks,” she says.

Using feedback from human testers, Google DeepMind has trained Gemini to be more factually accurate, to give attribution when asked to, and to hedge rather than spit out nonsense when faced with a question it cannot answer. The company claims that this mitigates the problem of hallucinations. But without a radical overhaul of the base technology, large language models will continue to make things up.

Experts say it’s unclear whether the benchmarks Google is using to measure Gemini’s performance offer that much insight, and without transparency, it’s hard to check Google’s claims.

“Google is advertising Gemini as an everything machine—a general-purpose model that can be used in many different ways,” says Emily Bender, a professor of computational linguistics at the University of Washington. But the company is using narrow benchmarks to evaluate models that it expects to be used for these diverse purposes. “This means it effectively can’t be thoroughly evaluated,” she says.

Ultimately, for the average user, the incremental improvement over competing models might not make much difference, says Shah. “It’s more about convenience, brand recognition, existing integration, than people really thinking ‘Oh, this is better,’” he says.

A long, slow buildup

Gemini has been a long time coming. In April 2023, Google announced it was merging its AI research unit Google Brain with DeepMind, Alphabet’s London-based AI research lab. So Google has had all year to develop its answer to OpenAI’s most advanced large language model, GPT-4, which debuted in March and is the backbone of the paid version of ChatGPT.

Google has been under intense pressure to show investors it can match and overtake competitors in AI. Although the company has been developing and using powerful AI models for years, it has been hesitant to launch tools that the public can play with, for fear of reputational damage and because of safety concerns.

“Google has been very cautious about releasing this stuff to the public,” Geoffrey Hinton told MIT Technology Review in April when he left the company. “There are too many bad things that could happen, and Google didn’t want to ruin its reputation.” Faced with tech that seemed untrustworthy or unmarketable, Google played it safe—until the greater risk became missing out.

Google has learned the hard way how launching flawed products can backfire. When it unveiled its ChatGPT competitor Bard in February, scientists soon noticed a factual error in the company’s own advertisement for the chatbot, an incident that subsequently wiped $100 billion off its market value.

In May, Google announced it was rolling out generative AI into most of its products, from email to productivity software. But the results failed to impress critics: the chatbot made references to emails that didn’t exist, for example.

This is a consistent problem with large language models. Although excellent at generating text that sounds like something a human could have written, generative AI systems regularly make things up. And that’s not the only problem with them. They are also easy to hack, and riddled with biases. Using them is also highly polluting.

Google has solved neither these problems nor the hallucination issue. Its solution to the latter problem is a tool that lets people use Google search to double-check the chatbot’s answers, but that relies on the accuracy of the online search results themselves.

Gemini may be the pinnacle of this wave of generative AI. But it’s not clear where AI built on large language models goes next. Some researchers believe this could be a plateau rather than the foot of the next peak.

Pichai is undeterred. “Looking ahead, we do see a lot of headroom,” he says. “I think multimodality will be big. As we teach these models to reason more, there will be bigger and bigger breakthroughs. Deeper breakthroughs are to come yet.

“When I take in the totality of it, I genuinely feel like we are at the very beginning.”

Mat Honan contributed reporting.