Karpathy open-sources autoresearch for autonomous single-GPU nanochat experiments
What Karpathy published
On March 7, Andrej Karpathy said he had packaged his recent autoresearch work into a self-contained repository that others could try over a weekend. The tweet describes a stripped-down single-GPU version of the nanochat training core where the human edits Markdown instructions and the AI agent edits the Python training code. Instead of prompting for one-off answers, the project turns an agent into a loop that proposes code changes, runs training, measures the outcome, and keeps iterating.
How the repo works
The GitHub page describes autoresearch as AI agents automatically running research on single-GPU nanochat training. Each experiment is budgeted at exactly five minutes, which gives roughly 12 runs per hour and about 100 over a night of a little more than eight hours. The agent works on a Git feature branch, accumulates commits, and selects for lower validation loss rather than for subjective impressions. Karpathy’s framing is that the human should stop hand-editing the training loop and instead program the research organization itself through files such as program.md.
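The propose, run, measure, keep cycle and the run-budget arithmetic can be illustrated with a minimal Python sketch. This is not code from the autoresearch repository: `runs_in`, `run_experiment`, `research_loop`, and `Candidate` are hypothetical names, the five-minute training run is replaced by a random stand-in, and a plain list stands in for commits on the feature branch.

```python
import random
from dataclasses import dataclass

RUN_BUDGET_MINUTES = 5  # from the article: each experiment gets five minutes


def runs_in(hours: float, budget_minutes: float = RUN_BUDGET_MINUTES) -> int:
    """Number of fixed-budget experiments that fit in the given wall-clock time."""
    return int(hours * 60 // budget_minutes)


@dataclass
class Candidate:
    description: str  # what the agent changed (hypothetical label)
    val_loss: float   # validation loss measured after the run


def run_experiment(description: str, rng: random.Random) -> float:
    """Hypothetical stand-in for a five-minute training run.

    A real run would train the model and report validation loss;
    here we only draw a noisy number so the loop is runnable.
    """
    return 1.0 + rng.uniform(-0.05, 0.05)


def research_loop(n_runs: int, baseline_loss: float, seed: int = 0):
    """Keep a proposed change only if it lowers validation loss."""
    rng = random.Random(seed)
    best = baseline_loss
    kept = []  # stands in for commits accumulated on a feature branch
    for i in range(n_runs):
        description = f"change-{i}"  # assumption: the agent proposes an edit
        loss = run_experiment(description, rng)
        if loss < best:  # selection is on the metric, not on impressions
            best = loss
            kept.append(Candidate(description, loss))
    return best, kept


if __name__ == "__main__":
    print(runs_in(1))  # runs per hour at a five-minute budget
    print(runs_in(8))  # runs in an eight-hour night
    best, kept = research_loop(runs_in(8), baseline_loss=1.0)
    print(best, len(kept))
```

The point of the sketch is the selection rule: because every run has the same budget, validation loss is directly comparable across runs, and only strict improvements are retained.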
The repository is intentionally minimal. Karpathy says the training core is compressed into about 630 lines for a single-GPU setup, which makes the loop easier for an agent to inspect and modify. The README also notes that the current version expects one NVIDIA GPU, while forks can extend the idea to other platforms. That scope choice matters because it keeps the benchmark small enough to iterate quickly but still real enough to test whether an agent can improve a nontrivial training system.
Why this matters
The broader significance is not the specific nanochat baseline. It is the attempt to make autonomous research measurable, cheap, and repeatable. Fixed five-minute runs, Git-native versioning, and validation-loss-based selection create a cleaner testbed for comparing prompts, agents, and coordination strategies. If projects like this mature, the relevant question for research teams shifts from "can an agent write code?" to "can an agent run a disciplined experimental program that compounds over time?"
Sources: Karpathy X post, GitHub