OpenAI says GPT-5.4 Thinking still struggles to hide its chain of thought

What OpenAI released

On March 5, 2026, OpenAI said in an X post that it is releasing a new evaluation suite and research paper on Chain-of-Thought (CoT) Controllability. The linked OpenAI research page argues that GPT-5.4 Thinking still has low ability to deliberately reshape or hide its own reasoning traces. In OpenAI's interpretation, that is reassuring for safety work because CoT monitoring remains a useful way to inspect whether a reasoning model may be acting against user intent.

The company frames CoT monitoring as one layer in a broader defense-in-depth stack for agentic systems. If a model's intermediate reasoning stays legible, monitors can sometimes catch unsafe plans, reward hacking attempts, or other forms of misbehavior before those actions matter downstream. The new question OpenAI is testing is whether a frontier model can intentionally change those traces in ways that make monitoring less reliable while still solving the task.

To measure that, OpenAI introduced CoT-Control, an open-source evaluation suite built from more than 13,000 tasks. The suite combines benchmark problems from GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench Verified with instructions that are meant to alter the model's internal reasoning style, such as avoiding keywords or changing formatting inside the chain of thought. OpenAI says current frontier reasoning models still perform poorly at this kind of deliberate internal control.

Why the result matters

According to the OpenAI paper, the 13 reasoning models tested scored from 0.1% up to 15.4% on controllability, with no frontier model showing strong, reliable ability to conceal its reasoning. OpenAI also reports that controllability rises somewhat with model size, but tends to fall when models reason for longer or undergo additional post-training. The company treats that as good news for now: if models cannot reliably rewrite their chains of thought, monitorability is less likely to fail because of strategic concealment.

This is important beyond a single benchmark. As coding agents and research agents get more autonomy, safety teams need practical signals that scale with more capable behavior. OpenAI says it plans to report CoT controllability alongside CoT monitorability in future system cards, starting with GPT-5.4 Thinking. That turns this release from a one-off research note into an operational metric the company expects to track over time.

OpenAI also warns that low controllability today does not guarantee long-term monitorability. Other failure modes could still make reasoning traces less useful as models scale. Even so, the March 5 release gives the AI safety conversation a concrete benchmark rather than only a theoretical concern. Sources: the original X post, OpenAI's research page, and the linked paper: X, OpenAI research page, paper.

OpenAI says GPT-5.4 Thinking still struggles to hide its chain of thought

What OpenAI released

Why the result matters

Related Articles

OpenAI says GPT-5.4 Thinking shows low chain-of-thought controllability in new safety study

HN Meets GPT-5.5 API With a Price-and-Behavior Audit, Not a Victory Lap

OpenAI rolls GPT-5.4 Thinking and GPT-5.4 Pro across ChatGPT, API, and Codex

Comments (0)

Leave a Comment

Related Articles

OpenAI says GPT-5.4 Thinking shows low chain-of-thought controllability in new safety study
LLM Mar 12, 2026 2 min read

HN Meets GPT-5.5 API With a Price-and-Behavior Audit, Not a Victory Lap

OpenAI rolls GPT-5.4 Thinking and GPT-5.4 Pro across ChatGPT, API, and Codex
LLM X/Twitter Mar 21, 2026 2 min read