Karpathy says autoresearch cut nanochat Time to GPT-2 by about 11%

LLM · Mar 13, 2026 · By Insights AI · 2 min read

On March 9, 2026, Andrej Karpathy said his open-source autoresearch setup improved nanochat enough to cut the project's Time to GPT-2 metric from 2.02 hours to 1.80 hours, or about 11 percent. In the same X post, he said the agent ran autonomously for roughly two days on a depth=12 model, found about 20 additive improvements, and explored around 700 changes overall before he promoted the results to larger depth=24 models.
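The reported percentage checks out against the two times given. A quick arithmetic sanity check:

```python
# Verify the reported speedup: Time to GPT-2 went from 2.02 h to 1.80 h.
before_h = 2.02
after_h = 1.80
improvement = (before_h - after_h) / before_h
print(f"{improvement:.1%}")  # -> 10.9%, i.e. "about 11 percent"
```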

The claim matters because autoresearch is not a benchmark wrapper or a static hyperparameter sweep. In the repository README, Karpathy describes it as a recipe where an AI agent edits a small but real training setup, runs five-minute experiments, evaluates validation bits per byte, and keeps or discards changes. The core target is a simplified single-GPU version of nanochat, with the agent mainly editing train.py while the human changes program.md to steer the research process.

Karpathy also published the commit behind this round. The changes span optimizer and schedule settings, attention scaling, initialization, attention windows, and regularization choices. He highlighted examples including sharper post-QK-norm scaling, new per-group Adam settings, tuned weight decay schedules, and tighter treatment of value embeddings. The README says the default setup targets a single NVIDIA GPU and was tested on an H100, which keeps the project small enough for overnight autonomous experimentation.
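"Per-group Adam settings" generally means routing different parameter types into optimizer groups that each carry their own hyperparameters. A minimal sketch of that pattern, with illustrative group names and values that are assumptions, not the actual settings from the commit:

```python
def build_param_groups(named_params):
    """Illustrative sketch (not the real nanochat configuration): split
    named parameters into groups so each can get its own learning rate
    and weight decay when passed to an Adam-style optimizer."""
    groups = {
        "embeddings": {"params": [], "lr": 3e-3, "weight_decay": 0.0},
        "matrices":   {"params": [], "lr": 1e-3, "weight_decay": 0.1},
        "norms":      {"params": [], "lr": 1e-3, "weight_decay": 0.0},
    }
    for name, p in named_params:
        if "embed" in name:
            groups["embeddings"]["params"].append(p)
        elif "norm" in name or name.endswith(".bias"):
            groups["norms"]["params"].append(p)
        else:
            groups["matrices"]["params"].append(p)
    # The resulting list has the shape torch.optim.Adam(param_groups) expects.
    return list(groups.values())
```

The design point is that an autonomous agent can tune each group's settings independently, which is exactly the kind of tedious, iterative search the article describes.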

Even so, the result should be read as a source claim from Karpathy rather than an independently audited benchmark. He explicitly said the work is not yet novel research and framed it as an engineering proof point: agent swarms can already search for model-training improvements that used to require manual, iterative work. That framing is arguably a bigger story than the 11 percent number itself.

If the approach generalizes, autoresearch-style loops could become a standard part of model-development stacks, especially for smaller experiments that can cheaply proxy larger runs. The practical takeaway for teams is that the bottleneck may shift from writing every training tweak by hand to designing the evaluation loop, constraints, and research instructions that autonomous agents follow.
