This story is from The Algorithm, our weekly newsletter on AI. To get it in your inbox first, sign up here.
Last weekend, I got married at a summer camp, and during the day our guests competed in a series of games inspired by the show Survivor that my now-wife and I orchestrated. When we were planning the games in August, we wanted one station to be a memory challenge, where our friends and family would have to memorize part of a poem and then relay it to their teammates so they could re-create it with a set of wooden tiles.
I thought OpenAI’s GPT-4o, its leading model at the time, would be perfectly suited to help. I asked it to create a short wedding-themed poem, with the constraint that each letter could only appear a certain number of times so we could make sure teams would be able to reproduce it with the provided set of tiles. GPT-4o failed miserably. The model repeatedly insisted that its poem worked within the constraints, even though it didn’t. It would correctly count the letters only after the fact, while continuing to deliver poems that didn’t fit the prompt. Without the time to meticulously craft the verses by hand, we ditched the poem idea and instead challenged guests to memorize a series of shapes made from colored tiles. (That ended up being a total hit with our friends and family, who also competed in dodgeball, egg tosses, and capture the flag.)
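(For the curious: the check GPT-4o kept fumbling is simple to state in code. Below is a minimal Python sketch of it; the tile inventory and sample line are hypothetical, purely for illustration, not the actual set we used.)

```python
# A rough check of the constraint GPT-4o kept getting wrong: can this poem
# be spelled out with a fixed inventory of wooden tiles?
# The tile counts and sample line below are hypothetical, for illustration only.
from collections import Counter

TILES = Counter({
    "a": 5, "d": 3, "e": 8, "h": 3, "o": 5, "r": 4,
    "s": 4, "t": 4, "w": 2, "y": 4,
})

def fits_tiles(poem: str, tiles: Counter) -> bool:
    """Return True if every letter in the poem can be covered by the tiles."""
    used = Counter(ch for ch in poem.lower() if ch.isalpha())
    # Counter subtraction keeps only positive counts, so anything left over
    # means the poem needs more of some letter than the tile set provides.
    return not (used - tiles)

print(fits_tiles("Two hearts say yes today", TILES))  # True with these counts
```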
However, last week OpenAI released a new model called o1 (previously referred to under the code name “Strawberry” and, before that, Q*) that blows GPT-4o out of the water on exactly this kind of task.
Unlike previous models that are well suited for language tasks like writing and editing, OpenAI o1 is focused on multistep “reasoning,” the type of process required for advanced mathematics, coding, or other STEM-based questions. It uses a “chain of thought” technique, according to OpenAI. “It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working,” the company wrote in a blog post on its website.
OpenAI’s tests point to resounding success. The model ranks in the 89th percentile on questions from the competitive coding platform Codeforces and would place among the top 500 students in the US in a qualifier for the USA Math Olympiad, which covers geometry, number theory, and other math topics. The model is also trained to answer PhD-level questions in subjects ranging from astrophysics to organic chemistry.
On the math olympiad qualifier, the new model scored 83.3%, versus 13.4% for GPT-4o. On the PhD-level questions, it averaged 78% accuracy, compared with 69.7% from human experts and 56.1% from GPT-4o. (In light of these accomplishments, it’s unsurprising that the new model was pretty good at writing a poem for our nuptial games, though still not perfect; it used more Ts and Ss than instructed to.)
So why does this matter? The bulk of LLM progress until now has been language-driven, resulting in chatbots or voice assistants that can interpret, analyze, and generate words. But in addition to getting lots of facts wrong, such LLMs have failed to demonstrate the types of skills required to solve important problems in fields like drug discovery, materials science, coding, or physics. OpenAI’s o1 is one of the first signs that LLMs might soon become genuinely helpful companions to human researchers in these fields.
It’s a big deal because it brings “chain-of-thought” reasoning in an AI model to a mass audience, says Matt Welsh, an AI researcher and founder of the LLM startup Fixie.
“The reasoning abilities are directly in the model, rather than one having to use separate tools to achieve similar results. My expectation is that it will raise the bar for what people expect AI models to be able to do,” Welsh says.
That said, it’s best to take OpenAI’s comparisons to “human-level skills” with a grain of salt, says Yves-Alexandre de Montjoye, an associate professor in math and computer science at Imperial College London. It’s very hard to meaningfully compare how LLMs and people go about tasks such as solving math problems from scratch.
Also, AI researchers say that measuring how well a model like o1 can “reason” is harder than it sounds. If it answers a given question correctly, is that because it successfully reasoned its way to the logical answer? Or was it aided by a sufficient starting point of knowledge built into the model? The model “still falls short when it comes to open-ended reasoning,” Google AI researcher François Chollet wrote on X.
Finally, there’s the price. This reasoning-heavy model doesn’t come cheap. Though access to some versions of the model is included in premium OpenAI subscriptions, developers using o1 through the API will pay three times as much as they pay for GPT-4o—$15 per 1 million input tokens in o1, versus $5 for GPT-4o. The new model also won’t be most users’ first pick for more language-heavy tasks, where GPT-4o continues to be the better option, according to OpenAI’s user surveys.
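To make that pricing gap concrete, here is a quick back-of-the-envelope comparison using only the input-token rates quoted above; output-token pricing isn’t covered here, so a real workload’s costs would differ.

```python
# Back-of-the-envelope input-token costs, using the per-million rates quoted
# above: $15 for o1 versus $5 for GPT-4o. Output-token pricing isn't included,
# so treat this as a partial picture of what a real workload would cost.
PRICE_PER_MILLION_INPUT_TOKENS = {"o1": 15.00, "gpt-4o": 5.00}

def input_cost(model: str, input_tokens: int) -> float:
    """Dollar cost of sending input_tokens tokens to the given model."""
    return PRICE_PER_MILLION_INPUT_TOKENS[model] * input_tokens / 1_000_000

tokens = 10_000_000  # e.g. a large batch of documents to analyze
for model in PRICE_PER_MILLION_INPUT_TOKENS:
    print(f"{model}: ${input_cost(model, tokens):,.2f}")
# Prints o1: $150.00 and gpt-4o: $50.00, the threefold gap described above.
```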
What will it unlock? We won’t know until researchers and labs have the access, time, and budget to tinker with the new model and find its limits. But it’s surely a sign that the race for models that can outreason humans has begun.
Now read the rest of The Algorithm
Deeper learning
Chatbots can persuade people to stop believing in conspiracy theories
Researchers believe they’ve uncovered a new tool for combating false conspiracy theories: AI chatbots. A team from MIT Sloan and Cornell University found that chatting about a conspiracy theory with a large language model (LLM) reduced people’s belief in it by about 20%, even among participants who claimed that their beliefs were important to their identity.
Why this matters: The findings could represent an important step forward in how we engage with and educate people who espouse such baseless theories, says Yunhao (Jerry) Zhang, a postdoctoral fellow affiliated with the Psychology of Technology Institute who studies AI’s impacts on society. “They show that with the help of large language models, we can—I wouldn’t say solve it, but we can at least mitigate this problem,” he says. “It points out a way to make society better.” Read more from Rhiannon Williams here.
Bits and bytes
Google’s new tool lets large language models fact-check their responses
Called DataGemma, it uses two methods to help LLMs check their responses against reliable data and cite their sources more transparently to users. (MIT Technology Review)
Meet the radio-obsessed civilian shaping Ukraine’s drone defense
Since Russia’s invasion, Serhii “Flash” Beskrestnov has become an influential, if sometimes controversial, force—sharing expert advice and intel on the ever-evolving technology that’s taken over the skies. His work may determine the future of Ukraine, and wars far beyond it. (MIT Technology Review)
Tech companies have joined a White House commitment to prevent AI-generated sexual abuse imagery
The pledges, signed by firms like OpenAI, Anthropic, and Microsoft, aim to “curb the creation of image-based sexual abuse.” The companies promise to set limits on what models will generate and to remove nude images from training data sets where possible. (Fortune)
OpenAI is now valued at $150 billion
The valuation emerged from the talks OpenAI is currently engaged in to raise $6.5 billion. Given that the company is becoming increasingly costly to operate, and could lose as much as $5 billion this year, it’s tricky to see how it all adds up. (The Information)