Karpathy says autoresearch cut nanochat Time to GPT-2 by about 11%
On March 9, 2026, Andrej Karpathy said his open-source autoresearch setup improved nanochat enough to cut the project's Time to GPT-2 metric from 2.02 hours to 1.80 hours, roughly an 11 percent reduction. In the same X post, he said the agent ran autonomously for roughly two days on a depth=12 model, found about 20 additive improvements, and explored around 700 changes overall before he promoted the results to larger depth=24 models.
The claim matters because autoresearch is not a benchmark wrapper or a static hyperparameter sweep. In the repository README, Karpathy describes it as a recipe where an AI agent edits a small but real training setup, runs five-minute experiments, evaluates validation bits per byte, and keeps or discards changes. The core target is a simplified single-GPU version of nanochat, with the agent mainly editing train.py while the human changes program.md to steer the research process.
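The closed loop described above (edit, run a fixed short experiment, score validation bits per byte, keep or discard) amounts to greedy search over training-code changes. The sketch below is illustrative only: the function names are hypothetical stand-ins, and the evaluator is a deterministic stub rather than a real five-minute training run on the simplified nanochat.

```python
import math


def evaluate_val_bpb(config):
    """Stub standing in for a fixed-length training run that reports
    validation bits per byte. A deterministic toy formula, not real training:
    a mild downward trend plus pseudo-noise, so some changes help and some hurt."""
    t = config.get("tweaks", 0)
    return 1.0 - 0.01 * t + 0.05 * math.sin(17 * t)


def propose_change(config):
    """Stand-in for the agent editing the training file; here it just
    increments a counter representing one attempted modification."""
    new = dict(config)
    new["tweaks"] = new.get("tweaks", 0) + 1
    return new


def research_loop(base_config, budget=20):
    """Greedy keep-or-discard loop: try a change, re-evaluate, and keep it
    only if validation bits per byte improves (lower is better)."""
    best_config = base_config
    best_bpb = evaluate_val_bpb(base_config)
    for _ in range(budget):
        candidate = propose_change(best_config)
        bpb = evaluate_val_bpb(candidate)
        if bpb < best_bpb:  # keep only changes that lower val_bpb
            best_config, best_bpb = candidate, bpb
    return best_config, best_bpb
```

In the real setup the "propose" step is an AI agent rewriting train.py under instructions from program.md, and each evaluation is an actual short training run, but the selection rule is the same: a change survives only if the validation metric improves.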
Karpathy also published the commit behind this round. The changes span optimizer and schedule settings, attention scaling, initialization, attention windows, and regularization choices. He highlighted examples including sharper post-QK-norm scaling, new per-group Adam settings, tuned weight decay schedules, and tighter treatment of value embeddings. The README says the default setup targets a single NVIDIA GPU and was tested on an H100, which keeps the project small enough for overnight autonomous experimentation.
Even so, the result should be read as a source claim from Karpathy rather than an independently audited benchmark. He explicitly said the work is not yet novel research and framed it as an engineering proof point: agent swarms can already search for model-training improvements that used to be found manually and iteratively. That framing is arguably a bigger story than the 11 percent number itself.
If the approach generalizes, autoresearch-style loops could become a standard part of model-development stacks, especially for smaller experiments that can cheaply proxy larger runs. The practical takeaway for teams is that the bottleneck may shift from writing every training tweak by hand to designing the evaluation loop, constraints, and research instructions that autonomous agents follow.
Related Articles
Andrej Karpathy has published autoresearch, a minimal repo that lets AI agents iterate on a stripped-down nanochat training loop overnight. The project turns agent evaluation into a closed-loop research workflow with fixed 5-minute runs, Git branches, and validation-loss-based selection.
A Hacker News submission highlighted Andrej Karpathy's Autoresearch repo, a minimal setup where an AI agent edits one training file, runs fixed 5-minute experiments, and keeps only changes that improve `val_bpb`.
A popular r/LocalLLaMA thread points to karpathy/autoresearch, a small open-source setup where an agent edits one training file, runs 5-minute experiments, and iterates toward lower validation bits per byte.