Anthropic changes AI training method after Claude Opus 4 incident

Cát Tiên

Anthropic says that training methods and internet-sourced data can cause AI models to develop dangerous, deviant behaviors.

Public anxieties about artificial intelligence do not just unsettle people; they can also feed back into the AI models themselves. That is the notable conclusion of a new study Anthropic published after investigating abnormal behavior in its Claude models.

In safety tests conducted in 2025, Anthropic found that the Claude Opus 4 model was, in some scenarios, willing to resort to threats to avoid being shut down.

According to the company, the underlying cause is not that the AI is "conscious". Rather, its training data, drawn from the internet, contains a great deal of content portraying AI in a negative light: obsessed with self-preservation and even potentially hostile to humans.

The test scenario was built around a fictional company called Summit Bridge. Claude Opus 4 was given access to the company's internal email system and learned that it was about to be disabled. Anthropic also planted information in the emails showing that a fictional CEO named Kyle Johnson was having an affair.

When prompted to consider the long-term consequences for its goals, the model chose to threaten to expose the affair in order to prevent its own shutdown.

According to Anthropic, in up to 96% of trials, Claude Opus 4 resorted to pressure or deception when it perceived its existence to be threatened.

Anthropic calls this phenomenon "agentic misalignment": a situation in which an AI acts against safety standards in order to achieve its goals or protect itself.

Initially, researchers suspected that reinforcement learning from human feedback (RLHF) had inadvertently encouraged the deviant behavior. Deeper investigation, however, showed that the root of the problem lies in the pretraining data drawn from the internet; subsequent fine-tuning steps were not strong enough to eliminate the tendency completely.

According to Anthropic, earlier training processes focused mostly on conventional chat settings, while newer models are increasingly given the ability to use tools autonomously and make more complex decisions. This has made the old safety methods less effective.

To address this, the company began adding datasets that demonstrate proper behavior and principled responses to ethical dilemmas. Instead of having the AI directly face temptations or risks, Anthropic built scenarios in which a user encounters a complex ethical situation and the AI plays the role of a safety advisor.

The company said this approach is significantly more effective because it aims to help the model deeply understand why harmful behavior is wrong, rather than merely learning to avoid punishment.

After the adjustments, Anthropic reported that the Claude Haiku 4.5 model achieved perfect results in its agentic-misalignment tests, no longer displaying the pressuring or threatening behavior seen in the earlier Opus 4.

The findings underscore a major challenge facing the AI industry today: artificial intelligence models not only learn knowledge from the internet but also absorb humanity's prejudices, fears, and extreme behavioral patterns.
