According to OpenAI's own technical report, the newly launched o3 and o4-mini models hallucinate, that is, fabricate information, at a higher rate than earlier reasoning models such as o1, o1-mini, and o3-mini, and also more than conventional models such as GPT-4o.
On PersonQA, OpenAI's internal benchmark, o3 produced false information on 40% of questions, roughly double the rate of o1 and o3-mini. More worryingly, o4-mini was wrong in 48% of cases.
OpenAI admits it is unclear why the new models hallucinate more. One early theory is that the reinforcement learning techniques used to train them may inadvertently amplify the problem.
That said, o3 still excels in areas such as programming and mathematics. Many development teams are already testing o3 in their workflows, but they warn that the AI sometimes generates broken links or references information that does not exist.
This hallucination problem makes AI hard for businesses to adopt, especially in fields that demand high accuracy, such as law. One promising mitigation is to integrate a web search capability, as GPT-4o does; with search enabled, it has reached 90% accuracy on some benchmarks.
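As an illustration of that mitigation, here is a minimal sketch of grounding a model's answer in live search results, assuming the OpenAI Python SDK's Responses API and its hosted web_search_preview tool; the model choice and prompt are illustrative, not taken from the report.

```python
# Minimal sketch: grounding an answer with web search via the OpenAI
# Responses API. Assumes the `openai` Python SDK and an OPENAI_API_KEY
# in the environment; "web_search_preview" is the hosted search tool
# exposed by this API.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],  # let the model consult live sources
    input="Who is the current CEO of OpenAI? Cite your source.",
)

# When the search tool fires, the answer comes back with inline citations,
# which makes fabricated claims easier to spot and verify.
print(response.output_text)
```

The design point is that retrieval shifts the model from recalling facts, where hallucination happens, to summarizing retrieved text that can be checked against its cited sources.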
OpenAI says it is continuing research to reduce hallucinations across all of its model lines. With the entire AI industry shifting toward reasoning models, keeping hallucinations under control is becoming an urgent challenge.