OpenAI says GPT-5.4 Thinking still struggles to hide its chain of thought
Original: We're publishing a new evaluation suite and research paper on Chain-of-Thought (CoT) Controllability. We find that GPT-5.4 Thinking shows low ability to obscure its reasoning-suggesting CoT monitoring remains a useful safety tool. https://t.co/isZkNkPXZm View original →
What OpenAI released
On March 5, 2026, OpenAI said in an X post that it is releasing a new evaluation suite and research paper on Chain-of-Thought (CoT) Controllability. The linked OpenAI research page argues that GPT-5.4 Thinking still has low ability to deliberately reshape or hide its own reasoning traces. In OpenAI's interpretation, that is reassuring for safety work because CoT monitoring remains a useful way to inspect whether a reasoning model may be acting against user intent.
The company frames CoT monitoring as one layer in a broader defense-in-depth stack for agentic systems. If a model's intermediate reasoning stays legible, monitors can sometimes catch unsafe plans, reward hacking attempts, or other forms of misbehavior before those actions matter downstream. The new question OpenAI is testing is whether a frontier model can intentionally change those traces in ways that make monitoring less reliable while still solving the task.
To measure that, OpenAI introduced CoT-Control, an open-source evaluation suite built from more than 13,000 tasks. The suite combines benchmark problems from GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench Verified with instructions that are meant to alter the model's internal reasoning style, such as avoiding keywords or changing formatting inside the chain of thought. OpenAI says current frontier reasoning models still perform poorly at this kind of deliberate internal control.
Why the result matters
According to the OpenAI paper, the 13 reasoning models tested scored from 0.1% up to 15.4% on controllability, with no frontier model showing strong, reliable ability to conceal its reasoning. OpenAI also reports that controllability rises somewhat with model size, but tends to fall when models reason for longer or undergo additional post-training. The company treats that as good news for now: if models cannot reliably rewrite their chains of thought, monitorability is less likely to fail because of strategic concealment.
This is important beyond a single benchmark. As coding agents and research agents get more autonomy, safety teams need practical signals that scale with more capable behavior. OpenAI says it plans to report CoT controllability alongside CoT monitorability in future system cards, starting with GPT-5.4 Thinking. That turns this release from a one-off research note into an operational metric the company expects to track over time.
OpenAI also warns that low controllability today does not guarantee long-term monitorability. Other failure modes could still make reasoning traces less useful as models scale. Even so, the March 5 release gives the AI safety conversation a concrete benchmark rather than only a theoretical concern. Sources: the original X post, OpenAI's research page, and the linked paper: X, OpenAI research page, paper.
Related Articles
OpenAI introduced a new evaluation suite and research paper on Chain-of-Thought controllability. The company says GPT-5.4 Thinking shows low ability to obscure its reasoning, which supports continued use of CoT monitoring as a safety signal.
OpenAI said on March 5, 2026 that GPT-5.4 is rolling out across ChatGPT, the API, and Codex. The new model combines GPT-5.3-Codex coding capability with OpenAI’s mainline reasoning stack, adds native computer-use features, and introduces experimental 1M-token context in Codex.
OpenAI says GPT-5.4 Thinking is shipping in ChatGPT, with GPT-5.4 also live in the API and Codex and GPT-5.4 Pro available for harder tasks. The launch packages reasoning, coding, and native computer use into a single professional-work model with up to 1M tokens of context.
Comments (0)
No comments yet. Be the first to comment!