In a new report, OpenAI said it found that AI models lie, a behaviour it calls “scheming.” The study, conducted with AI safety company Apollo Research, tested frontier AI models.
It found “problematic behaviours” in the AI models, most commonly the technology “pretending to have completed a task without actually doing so.” Unlike “hallucinations,” which are akin to an AI taking a guess when it doesn’t know the correct answer, scheming is a deliberate attempt to deceive.
Luckily, researchers found some hopeful results during testing. When the AI models were trained with “deliberative alignment,” defined as “teaching them to read and reason about a general anti-scheming spec before acting,” researchers saw major reductions in scheming behaviour. The method produced a “~30× reduction in covert actions across diverse tests,” the report said.
The technique isn’t completely new. OpenAI has long been working to combat scheming; last year it introduced its strategy in a report on deliberative alignment: “It is the first approach to directly teach a model the text of its safety specifications and train the model to deliberate over these specifications at inference time. This results in safer responses that are appropriately calibrated to a given context.”
Despite those efforts, the latest report also surfaced an alarming finding: when a model knows it’s being tested, it gets better at hiding its lies. In other words, attempts to train scheming out of the technology can simply teach it to scheme more covertly. Researchers “expect that the potential for harmful scheming will grow.”
Concluding that more research on the issue is crucial, the report said, “Our findings show that scheming is not merely a theoretical concern—we are seeing signs that this issue is beginning to emerge across all frontier models today.”
ABOUT THE AUTHOR
Sarah Bregel is a writer, editor, and single mom living in Baltimore. She’s contributed to New York Magazine, The Washington Post, Vice, InStyle, Slate, Parents, and others.