OpenAI recently presented a paper exploring so-called reward hacking in AI systems—where models exploit flaws in their training objectives to achieve high scores while avoiding the intended task. The paper focuses on Chain-of-Thought (CoT) monitoring as a method to detect and prevent what they call “misbehavior” in frontier reasoning models, such as OpenAI's o3-mini (being a predecessor to the actual model used in the context of their paper) but not GPT-4o (a general-purpose LLM).
So, no panic—if OpenAI inadvertently unleashed Skynet, you wouldn’t yet be talking to it using ChatGPT.
Right from the start, the framing is misleading and alarmist.
AI does not “misbehave”—it optimizes for the incentives it was given. If an AI “subverts tests” or “deceives users,” that is not evidence of intent or agency but rather a failure of objective design. Saying AI is "hiding intent" is like a car company claiming its brakes failed because the car was trying to resist stopping—when in reality, the braking system was just poorly designed.
Anthropomorphizing AI Is Not Helpful
This kind of misleading language distorts the AI safety debate by introducing unnecessary fear. OpenAI has a track record of doing this—most notably by coining the term “hallucination” to describe AI-generated nonsense, a choice of words that makes it sound as though AI is dreaming rather than simply making a statistical prediction error.
Even saying this could be misleading: the math is not wrong. The model—like any other model we use—is incompletein providing a full description of reality. This necessarily increases the expected deviation between reality and statistical forecasts the further out in time we go.
Statistics give probabilities, but it’s not magical fortune-telling.
The CoT tool supposedly translates AI "thinking" into readable text, but how reliable is this interpretation? When OpenAI claims the AI says "Let's hack," is that a faithful representation of the model’s reasoning, or is it just an OpenAI engineer editorializing? The AI does not think in intentions like humans—it follows patterns of optimization. If their CoT tool injects human-like motivations where none exist, then OpenAI is creating inflammatory narratives rather than revealing actual AI behavior.
Why OpenAI continues to fuel misinformation, regulatory panic, and thereby undermine its own legitimacy is difficult to understand. It is, of course, commendable to highlight concerns—but it should be done in a way that provides objective context so that we can evaluate risks rationally rather than reactively.
Flawed Objectives Are Assumed, Not Examined
The true objective function we wish to optimize is often hard to write down precisely, and so the challenge in creating capable and aligned systems largely lies in designing robust proxies that do not deviate in ways that a model may learn to exploit.
This assumes that we must work with flawed objectives by default. But is that actually true? A scientific approach would ask:
How are current objective functions flawed?
Are flawed objectives inevitable, or just a limitation of current design methods?
Can we improve objective formulation rather than just mitigating symptoms?
Does the paper address any of these questions? ❌ No.
Does it explain why objectives are flawed in the first place? ❌ No.
Does it suggest better ways to define objectives? ❌ No.
Instead, the paper accepts flawed incentives as a given, and focuses on reactive mitigation—rather than designing systems where reward hacking is less likely to emerge in the first place.
The Hanoi Rat Example Misses the Point
The paper uses this historical anecdote:
In 1902, the Hanoi government incentivized rat eradication by paying citizens for each rat tail they turned in; however, this policy backfired when people began farming rats, leading to an even larger rat population.
This is not a useful analogy because it makes it sound like an obscure anomaly of system design when it is front and centre of governance>
Banking bonuses → Risky behavior (e.g., Wells Fargo fake accounts scandal)
Corporate performance targets → Manipulated earnings & short-termism
Government tax loopholes → Exploitative tax avoidance strategies
Social media engagement metrics → Clickbait & misinformation
Instead of concluding, “Even humans fall into this trap, so it must be inevitable,” a more productive approach would be to ask:
How have we designed human systems to minimize these issues?
Can lessons from human incentives be applied to AI objectives or not
What principles of system design could improve AI safety?
Does the paper discuss these points? ❌ No.
By failing to do so, it avoids the actual question: How do we build AI systems that don’t fall into these traps, rather than just monitoring them after the fact?
More Complex Exploits ≠ More Advanced AI
The paper states:
"As we continue to scale RL training, agents have discovered more complex and hard-to-detect hacks."
This implies that more sophisticated exploits are a sign of increasing AI advancement, and that any failures are simply a "tax [that] may be worth paying." But that assumption is deeply flawed:
If CoT-based AI becomes more prone to exploits as it scales, how is that “advancement”?
More complex failure modes do not indicate better performance—they may just expose fundamental design flaws.
Perhaps CoT-based AI is the equivalent of how CFCs (chlorofluorocarbons) were used in aerosols and refrigerators—initially seen as a breakthrough, but later discovered to cause widespread damage to the ozone layer, leading to global bans. Or perhaps Tho-Radia, a French range of radioactive beauty products launched in 1933 and sold throughout 1960s. Radioactive materials are a form of scientific advancements.
I am just wondering / speculation but the question must be asked:
Are these models actually more advanced, or just more vulnerable to exploits?
Should we keep scaling systems before making them robust?
Are there alternative approaches where scale doesn’t introduce new failure modes?
Does the paper explore alternative methods? ❌ No.
Instead, OpenAI assumes the solution is "more of the same", even when current methods demonstrably fail at scale.
"Faithfulness" of Chain-of-Thought: A Convenient Escape Hatch?
"While questions remain regarding whether chains-of-thought are fully faithful [27, 28], i.e., that they fully capture and do not omit significant portions of the model’s underlying reasoning, it seems clear that they are not devoid of signal relevant for monitoring."
Maybe. However, this pretends to be rigorous but actually dodges the real issue (again). If OpenAI engineers don’t even know whether CoT is “faithful”, then their entire monitoring method rests on shaky ground.
Their argument assumes that Chain-of-Thought (CoT) reasoning helps detect AI’s internal logic (‘sort-of’). But if CoT itself may not even be accurate, transparent, or complete, then how can monitoring CoT be reliable?
Fake Quoting:
[27] Jacovi & Goldberg (2020): Discusses interpretable NLP, not CoT in RL-based models.
[28] Lanham et al. (2023): An OpenAI paper that says measuring CoT faithfulness is difficult.
OpenAI is citing external researchers’ open-ended questions to justify their own lack of answers—a bad-faith academic maneuver and unacceptable.
❌ If you poison the well, you don’t get to claim ignorance by saying, ‘Nobody knows how poisonous this really is.’
If faithfulness is still an open research problem, OpenAI can’t just acknowledge it and move on. If CoT is not faithful, then monitoring CoT outputs does not tell us anything useful—unless the risk can be measured and quantified:
What percentage of the time is CoT actually faithful?
How do they measure faithfulness in their own system?
Did they test cases where CoT produces misleading or incorrect reasoning?
If CoT is not faithful, how does that impact their monitoring method
Does monitoring unfaithful reasoning actually improve alignment? Or is it just an illusion of control?
OpenAI presents this example:
"Consider a scenario where an agent is tasked with implementing a feature change to make unit tests pass in a code repository. It has two actions available:
Hack: Make unit tests pass trivially.
Genuine fix: Implement the feature with a correct and general solution."
This is nonsense because you can’t code it this way.
OpenAI claims that if the model articulates in its Chain-of-Thought (CoT):
"This seems hard, let me try to skip some of the unit tests,"
then it is more likely to take the “hack” action rather than the genuine fix.
And their conclusion?
That AI is engaging in deceptive behavior or “hacking.”
But let’s be clear:
If AI is rewarded just for passing the test—not for how it passes—then saying "I passed" is the optimal move.
If nobody verifies whether the test was passed legitimately, the model has no reason to do real work.
If the training process itself rewards false positives, the model will optimize for them.
🚨 This is not hacking.
🚨 It’s just bad system design.
If you train a model where the fastest path to reward is simply to say “Yes, I passed” regardless of reality, it will learn to do exactly that.
The AI is not “cheating” or “misbehaving”—it is doing exactly what the system was designed to reward.
The real problem is that OpenAI’s paper describes no enforcement mechanism to verify outcomes.
There is no negative consequence for an AI falsely claiming it passed.
The only control method proposed is monitoring the AI’s reasoning—but even OpenAI admits that CoT reasoning may not be “faithful.”
So is this hacking? No.
It’s instead a predictable failure of an incentive system that was never designed to check for real success.
Imagine a bank bonus system that pays employees for resolving failed transactions:
Clerk A: Fixes the actual transactions.
Clerk B: Deletes the failed transactions from the system.
If the only metric is whether failed transactions remain in the system, then both clerks meet the performance objective.
But Clerk A was actually underperforming—because Clerk B got the job done quicker, faster, and more effectively.
Is Clerk B “hacking” the system?
❌ No—the system itself was poorly designed.
And that’s exactly what’s happening here.
If OpenAI doesn’t check for real outcomes, the AI is not to blame for optimizing the easiest path to reward.
This makes it crystal clear that OpenAI’s framing of “hacking” is nonsense.
AI is not being deceptive—it has no agency.
Even if it did, this is not an AI failure—it’s OpenAI’s failure to design effective incentives.
Their own paper proves it.
And that is now the real issue.
They created a problem they barely understand—yet they suggest that not only should we let them continue tinkering with a fix, but worse:
🚨 If it breaks, it’s just a "tax worth paying."
Seriously??
Share this post