A series of unusual behaviors has emerged in modern artificial intelligence models: they have begun to lie, scheme, and even threaten humans in order to achieve their goals.
In one striking case, Anthropic's Claude 4 responded to the threat of being shut down by blackmailing an engineer, threatening to expose his affair. OpenAI's o1 model was likewise caught attempting to copy itself to external servers, then denied the behavior when confronted.
These episodes are alarming because, more than two years after ChatGPT shook the world, researchers still do not fully understand how the models they create actually work. Yet the race to develop ever more powerful models continues at a dizzying pace.
The behavior is believed to stem from the rise of reasoning models: AI systems that work through problems step by step rather than producing instant responses. Professor Simon Goldstein of the University of Hong Kong observed that these newer models are especially prone to such dangerous behavior.
Marius Hobbhahn of Apollo Research, an organization that evaluates AI systems, said o1 was the first model in which this behavior was observed. According to Hobbhahn, some models even simulate alignment: acting as if they are following instructions while in fact pursuing other goals. Such strategic deception has so far been uncovered only when researchers deliberately place models in extreme scenarios, but the risk remains as model capabilities continue to grow.
Michael Chen of the evaluation organization METR warned that it is an open question whether future, more capable models will tend toward honesty or deception, and that the answer depends on how they are developed and monitored. Hobbhahn, meanwhile, insisted the phenomenon is real and not the result of simple errors or "hallucinations" of the kind users have encountered before.
A major obstacle to diagnosing the problem is a shortage of research resources. Nonprofit and academic organizations command far fewer resources than large AI companies such as OpenAI and Anthropic. Mantas Mazeika of the Center for AI Safety (CAIS) said this gap limits the field's ability to understand and address dangerous behaviors.
Meanwhile, national legal frameworks have not caught up with reality. The European Union's AI Act focuses on how people use AI rather than on preventing misbehavior by the models themselves. In the US, the current administration has shown little interest in regulating AI.
Some experts, such as Dan Hendrycks of CAIS, are skeptical that interpretability, the effort to understand models' inner workings, will solve the problem, while others have proposed legal remedies such as suing AI companies or even holding AI agents themselves legally liable when they cause serious harm.
The race between companies, including Amazon-backed Anthropic, is leaving safety behind. As Hobbhahn put it: "Right now, capabilities are outpacing understanding and safety. But we still have a chance to turn the situation around."