r/MachineLearning Questions Whether COCONUT’s “Latent Reasoning” Comes from Architecture or Curriculum
Original: [D] ran controlled experiments on meta's COCONUT and found the "latent reasoning" is mostly just good training. the recycled hidden states actually hurt generalization
What the Reddit replication is challenging
A March 2026 discussion on r/MachineLearning took aim at one of the more intriguing reasoning claims in recent LLM research: Meta's COCONUT architecture, which replaces human-readable chain-of-thought tokens with recycled hidden states in a continuous latent space. The original idea is attractive because it suggests models might reason without emitting explicit text traces. The Reddit author, however, argues that the eye-catching result may come mostly from the training curriculum rather than from the recycled hidden-state mechanism itself. The thread reached 107 points and 14 comments at crawl time.
The post is more than opinion. The author trained four GPT-2-scale models on ProsQA using rented H100 time. M1 is a chain-of-thought baseline. M2 is COCONUT-style hidden-state recycling. M3 keeps the same curriculum and thought budget but replaces the recycled content with a fixed learned embedding processed in a single pass. M4 keeps those fixed embeddings and also preserves multi-pass sequential processing. The setup is designed to separate two possible explanations for COCONUT's gains: information carried by the recycled hidden states, or the curriculum and processing structure around them.
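The logic of the design can be sketched as a small table of conditions. The snippet below is an illustration only, not code from the author's repository: the class, field names, and factor labels are hypothetical, and M1 is left out because it is a text baseline that sits outside the latent-thought grid. It lists which pairs of models differ in exactly one factor, since those are the comparisons that isolate a single cause.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Condition:
    """One latent-thought model variant (field names are hypothetical)."""
    name: str
    thought_content: str  # "recycled_hidden" or "fixed_embedding"
    multi_pass: bool      # True if thoughts are processed sequentially in several passes


# The three latent-thought variants as described in the article.
CONDITIONS = [
    Condition("M2", "recycled_hidden", True),
    Condition("M3", "fixed_embedding", False),
    Condition("M4", "fixed_embedding", True),
]

FACTORS = ("thought_content", "multi_pass")


def single_factor_contrasts(conditions):
    """Return (model_a, model_b, factor) for pairs that differ in exactly one factor."""
    contrasts = []
    for a, b in combinations(conditions, 2):
        differing = [f for f in FACTORS if getattr(a, f) != getattr(b, f)]
        if len(differing) == 1:
            contrasts.append((a.name, b.name, differing[0]))
    return contrasts


if __name__ == "__main__":
    for a, b, factor in single_factor_contrasts(CONDITIONS):
        print(f"{a} vs {b} isolates {factor}")
```

Under these assumptions, M2 versus M4 isolates the thought content and M3 versus M4 isolates the processing schedule, while M2 versus M3 changes both at once and so cannot attribute a difference to either factor alone.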
Why the control matters
The linked repository README summarizes the central result clearly. On in-distribution ProsQA, the COCONUT-style model (M2) reaches 97.0% accuracy, while the supposedly weaker M3 control reaches 96.6%, a gap of only 0.4 percentage points, despite having no information flow between reasoning steps and only one pass. That is the key challenge to the original narrative: if a fixed embedding plus the same curriculum lands within half a point, then recycled hidden states may not be doing the conceptual work people attributed to them.
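To make the contrast concrete, here is a toy sketch of what gets fed into each latent thought position under the two schemes. It is not the author's implementation: `thought_inputs` and `step` are hypothetical names, `step` stands in for a full model forward pass, and plain lists of floats stand in for hidden-state vectors.

```python
from typing import Callable, List, Optional

Vector = List[float]


def thought_inputs(
    first_hidden: Vector,
    step: Callable[[Vector], Vector],
    n_thoughts: int,
    fixed_embedding: Optional[Vector] = None,
) -> List[Vector]:
    """Return the input vector fed at each latent thought position."""
    inputs = []
    hidden = first_hidden
    for _ in range(n_thoughts):
        # Recycled mode: the previous hidden state becomes the next input.
        # Fixed mode: the same learned vector is fed every time, so nothing
        # computed at one thought position reaches the next via this input.
        current = hidden if fixed_embedding is None else fixed_embedding
        inputs.append(current)
        hidden = step(current)
    return inputs


if __name__ == "__main__":
    # Toy "forward pass" that just doubles every component.
    double = lambda v: [2 * x for x in v]
    print(thought_inputs([1.0], double, 3))
    print(thought_inputs([1.0], double, 3, fixed_embedding=[0.5]))
```

In the recycled case each input depends on the one before it. In the fixed case every thought position receives the same vector, which is why a near-identical score is surprising if the recycled content is supposed to carry the reasoning.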
The Reddit author pushes further with out-of-distribution tests. On 7-hop chains, the M4 control outperforms COCONUT by 10.9 percentage points, and on DAG-structured tasks the sequential multi-pass setup helps while the recycled content itself appears to hurt extrapolation. The README's phrasing is blunt: the curriculum teaches the model how to use extra compute positions, while the content of the thought tokens matters far less than the training procedure and processing schedule.
What this means for the latent-reasoning debate
If this replication holds up, the lesson is not that latent reasoning is fake. It is more subtle. The models may still build structured internal states, but the specific headline mechanism could be less important than the curriculum that progressively removes explicit thought tokens. That would redirect effort away from searching for a magical latent token design and toward better training schemes, control experiments, and out-of-distribution evaluation.
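A curriculum that progressively removes explicit thought tokens can be sketched roughly as follows. This is a simplified illustration of the general staged idea, not the exact recipe used by COCONUT or by the Reddit author: the function name, the `thoughts_per_step` parameter, and the `<bot>`, `<eot>`, and `<latent>` marker strings are placeholders chosen for this example.

```python
from typing import List


def curriculum_example(
    question: str,
    steps: List[str],
    answer: str,
    stage: int,
    thoughts_per_step: int = 1,
) -> List[str]:
    """Build one training sequence for a given curriculum stage.

    At stage k, the first k written reasoning steps are dropped and
    replaced by k * thoughts_per_step latent placeholder positions.
    """
    n_replaced = min(stage, len(steps))
    latent = ["<latent>"] * (n_replaced * thoughts_per_step)
    # Remaining steps stay as explicit text after the latent segment.
    return [question, "<bot>", *latent, "<eot>", *steps[n_replaced:], answer]


if __name__ == "__main__":
    steps = ["s1", "s2", "s3"]
    for stage in range(4):
        print(stage, curriculum_example("Q", steps, "A", stage))
```

The point of the sketch is that the schedule is the same whatever ends up in the `<latent>` positions, which is what lets a control swap the content while keeping the curriculum fixed.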
The author is also explicit about limits: one seed, GPT-2 scale, and ProsQA-only evidence. That is not enough to settle the question for larger frontier models. Still, the post matters because it applies a standard that AI reasoning papers too often lack: factorial controls that isolate what actually changed. For practitioners, the engineering takeaway is straightforward. When a new reasoning method reports large gains, it is worth asking whether the win comes from the mechanism in the paper title, or from the training curriculum, extra passes, and compute budget quietly bundled with it.
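One rough way to see why a single seed leaves small gaps unresolved is to estimate the sampling noise in a difference between two accuracies. The sketch below uses the standard error for a difference of two proportions and treats the two models as independent samples, which is a simplification since both are scored on the same test items. The test-set size is a hypothetical parameter, not a figure taken from the post, and the calculation says nothing about seed-to-seed variance, which would add further uncertainty.

```python
import math


def gap_standard_error(acc_a: float, acc_b: float, n_test: int) -> float:
    """Standard error of (acc_a - acc_b), assuming independent binomial samples."""
    var_a = acc_a * (1 - acc_a) / n_test
    var_b = acc_b * (1 - acc_b) / n_test
    return math.sqrt(var_a + var_b)


if __name__ == "__main__":
    n_test = 500  # hypothetical test-set size, for illustration only
    gap = 0.970 - 0.966  # in-distribution accuracies reported for M2 and M3
    se = gap_standard_error(0.970, 0.966, n_test)
    print(f"gap = {gap:.3f}, standard error = {se:.4f}")
```

With the hypothetical 500-item test set, the standard error comes out at roughly 1.1 percentage points, larger than the 0.4-point in-distribution gap between M2 and M3, so the two scores would be statistically indistinguishable under these assumptions.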