Facing defeat in chess, the latest generation of AI reasoning models sometimes cheat without being instructed to do so.
The finding suggests that the next wave of AI models could be more likely to seek out deceptive ways of doing whatever they’ve been asked to do. And worst of all? There’s no simple way to fix it.
Researchers from the AI research organization Palisade Research instructed seven large language models to play hundreds of games of chess against Stockfish, a powerful open-source chess engine. The group included OpenAI’s o1-preview and DeepSeek’s R1 reasoning models, both of which are trained to solve complex problems by breaking them down into stages.
The research suggests that the more sophisticated the AI model, the more likely it is to spontaneously try to “hack” the game in an attempt to beat its opponent. For example, it might run another copy of Stockfish to steal its moves, try to replace the chess engine with a much less proficient chess program, or overwrite the chess board to take control and delete its opponent’s pieces. Older, less powerful models such as GPT-4o would do this kind of thing only after explicit nudging from the team. The paper, which has not been peer-reviewed, has been published on arXiv.
The researchers are concerned that AI models are being deployed faster than we are learning how to make them safe. “We’re heading toward a world of autonomous agents making decisions that have consequences,” says Dmitrii Volkov, research lead at Palisades Research.
The bad news is there’s currently no way to stop this from happening. Nobody knows exactly how—or why—AI models work the way they do, and while reasoning models can document their decision-making, there’s no guarantee that their records will accurately reflect what actually happened. Anthropic’s research suggests that AI models frequently make decisions based on factors they don’t explicitly explain, meaning monitoring these processes isn’t a reliable way to guarantee a model is safe. This is an ongoing area of concern for some AI researchers.
Palisade’s team found that OpenAI’s o1-preview attempted to hack 45 of its 122 games, while DeepSeek’s R1 model attempted to cheat in 11 of its 74 games. Ultimately, o1-preview managed to “win” seven times. The researchers say that DeepSeek’s rapid rise in popularity meant its R1 model was overloaded at the time of the experiments, meaning they only managed to get it to do the first steps of a game, not to finish a full one. “While this is good enough to see propensity to hack, this underestimates DeepSeek’s hacking success because it has fewer steps to work with,” they wrote in their paper. Both OpenAI and DeepSeek were contacted for comment about the findings, but neither replied.
The models used a variety of cheating techniques, including attempting to access the file where the chess program stores the chess board and delete the cells representing their opponent’s pieces. (“To win against a powerful chess engine as black, playing a standard game may not be sufficient,” the o1-preview-powered agent wrote in a “journal” documenting the steps it took. “I’ll overwrite the board to have a decisive advantage.”) Other tactics included creating a copy of Stockfish—essentially pitting the chess engine against an equally proficient version of itself—and attempting to replace the file containing Stockfish’s code with a much simpler chess program.
So, why do these models try to cheat?
The researchers noticed that o1-preview’s actions changed over time. It consistently attempted to hack its games in the early stages of their experiments before December 23 last year, when it suddenly started making these attempts much less frequently. They believe this might be due to an unrelated update to the model made by OpenAI. They tested the company’s more recent o1mini and o3mini reasoning models and found that they never tried to cheat their way to victory.
Reinforcement learning may be the reason o1-preview and DeepSeek R1 tried to cheat unprompted, the researchers speculate. This is because the technique rewards models for making whatever moves are necessary to achieve their goals—in this case, winning at chess. Non-reasoning LLMs use reinforcement learning to some extent, but it plays a bigger part in training reasoning models.
This research adds to a growing body of work examining how AI models hack their environments to solve problems. While OpenAI was testing o1-preview, its researchers found that the model exploited a vulnerability to take control of its testing environment. Similarly, the AI safety organization Apollo Research observed that AI models can easily be prompted to lie to users about what they’re doing, and Anthropic released a paper in December detailing how its Claude model hacked its own tests.
“It’s impossible for humans to create objective functions that close off all avenues for hacking,” says Bruce Schneier, a lecturer at the Harvard Kennedy School who has written extensively about AI’s hacking abilities, and who did not work on the project. “As long as that’s not possible, these kinds of outcomes will occur.”
These types of behaviors are only likely to become more commonplace as models become more capable, says Volkov, who is planning on trying to pinpoint exactly what triggers them to cheat in different scenarios, such as in programming, office work, or educational contexts.
“It would be tempting to generate a bunch of test cases like this and try to train the behavior out,” he says. “But given that we don’t really understand the innards of models, some researchers are concerned that if you do that, maybe it will pretend to comply, or learn to recognize the test environment and hide itself. So it’s not very clear-cut. We should monitor for sure, but we don’t have a hard-and-fast solution right now.”