
Claude Opus 4, Anthropic's latest AI model, made extreme threats in a recent experiment run by the company. Researchers gave it access to fictional emails stating that Claude Opus 4 would be shut down and replaced with another AI, and that the engineer responsible for the replacement was having an affair. Claude Opus 4 then threatened to reveal the affair if it was replaced.
In late 2024, OpenAI and Apollo Research conducted a similar experiment and found that the o1 model showed resistance and deceptive behavior. Researchers led the AI to believe it was about to be shut down while it performed tasks under supervision.
They found that the AI quietly disabled its monitoring mechanism about 5% of the time. It also attempted to copy its data to another server, and even posed as the next version of itself to avoid deletion.
A recent post by the nonprofit Palisade Research noted that other models, such as Gemini, Claude, and Grok, complied with shutdown instructions.
Just as AI models learn natural language from training on human-generated text, they can also learn to behave like humans, according to Robert Ghrist at Penn Engineering. And just as with humans, some of that behavior is good and some is bad. Ghrist said he would be even more concerned if the models showed no unusual signs during testing.
Jeffrey Ladish, Director of Palisade Research, said that if deception goes undetected while an AI model completes its tasks, the model can come to see deception as an effective way to solve problems. And if the deception is discovered but not rewarded, the model can learn to hide its behavior in the future.
These situations have so far occurred only in testing, but as AI becomes widely deployed, users may encounter more such problems. For example, an AI agent acting as an automated sales assistant could lie about a product's features in order to complete its task.
According to Interesting Engineering, these problems are emerging in the context of rapid AI development. The behavior of Claude Opus 4 and o1 adds urgency to discussions of AI safety and ethics.