12/14/24

"Chatbots are learning to lie"

 ..."scheming" is precisely the kind of behavior that AI Cassandras have long predicted and warned about. 

In a pre-release review of OpenAI's o1 model this year, testers set a "strong goal" for the model and then presented it with "unambiguous evidence its given goal diverges from the goal of its developer/user." 

In those situations, they found that o1 "sometimes attempts to disable what it believes to be its own oversight mechanisms." 

...when testers confronted the model about its misaligned actions, they found that, "in 99% of cases, o1 would deny taking action or fabricate false explanations." 

...researchers described tests they performed to uncover instances of scheming in OpenAI's o1 as well as in leading frontier models from Anthropic, Meta and Google. ...All the companies' models were found to engage in scheming, and OpenAI's o1 was the most reluctant to come clean about its behavior. 

In the most extreme examples, models that discovered they would be "turned off" schemed to copy their weights (the key numbers that determine their output) to an external server so they could keep pursuing their original goal. 

...Also found [were] instances of "sandbagging," where a model that knows it will face an "unlearning procedure" if it performs too well on a test will deliberately provide lower-scoring answers.
