Karpathy’s autoresearch turns short PyTorch runs into an overnight agent research loop
Original: karpathy/autoresearch
Andrej Karpathy’s autoresearch repository packages a large idea into a deliberately small experiment: give an AI agent a compact PyTorch training setup, let it change the code, run a short training job, measure the result, and repeat the cycle overnight. The point is not just to automate training, but to see whether a tightly scoped agent loop can make real research progress without a large lab stack around it.
The repository keeps the moving parts to a minimum. According to the README, prepare.py handles one-time data preparation and runtime utilities, train.py is the single file the agent is expected to edit, and program.md is the human-authored instruction file that defines how the autonomous research setup should behave. The baseline training code is a simplified single-GPU implementation of nanochat, and the evaluation target is val_bpb (validation bits per byte), a metric that keeps runs comparable even if the agent changes architecture details.
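To make the metric concrete, here is a minimal sketch of how a bits-per-byte figure can be computed from a summed cross-entropy loss. It is not taken from the repository: the function name `bits_per_byte` and the numbers in the example are assumptions for illustration. The idea is that dividing by the number of bytes in the evaluated text, rather than the number of tokens, gives a figure that does not depend on how the text was tokenized.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Hypothetical helper, not the repository's code. `total_nll_nats` is the
    cross-entropy summed over every predicted token; `total_bytes` is the
    UTF-8 length of the text those tokens cover.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Divide by ln(2) to convert nats to bits, then normalize by bytes.
    return total_nll_nats / (math.log(2) * total_bytes)


# Illustrative numbers only: two runs score the same 1,000-byte text with the
# same total loss but different tokenizers (250 vs. 400 tokens).
total_nll = 650.0
per_token_coarse = total_nll / 250  # per-token loss depends on the tokenizer
per_token_fine = total_nll / 400
bpb = bits_per_byte(total_nll, 1000)  # identical for both runs
```

The per-token losses differ between the two hypothetical runs even though the models assign the text the same total probability, which is why a byte-normalized metric is the fairer basis for comparison.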
- Each experiment runs on a fixed five-minute wall-clock budget.
- The agent edits only train.py, keeping the search surface small and diffs reviewable.
- The default environment targets Python 3.10+, uv, and a single NVIDIA GPU.
- The README already points users to community forks for macOS, MLX, and Windows.
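The fixed wall-clock budget in the first bullet can be sketched as a loop that stops on elapsed time rather than on a step count. This is an illustration under stated assumptions, not the repository's implementation: `run_with_budget`, the `step` callable, and the injectable `clock` parameter are all hypothetical names.

```python
import time
from typing import Callable


def run_with_budget(
    step: Callable[[], float],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[int, float]:
    """Call `step` repeatedly until the wall-clock budget is spent.

    Hypothetical sketch. `step` stands in for one training step and returns
    its loss. Returns the number of steps completed and the last loss seen.
    """
    deadline = clock() + budget_seconds
    steps = 0
    last_loss = float("nan")
    # The budget is checked before each step, so a step that starts just
    # before the deadline is allowed to finish.
    while clock() < deadline:
        last_loss = step()
        steps += 1
    return steps, last_loss


# Usage: a 0.05-second budget with a dummy step that just sleeps briefly.
def dummy_step() -> float:
    time.sleep(0.001)
    return 1.0


steps_done, final_loss = run_with_budget(dummy_step, budget_seconds=0.05)
```

A time-based stop means a change that makes each step faster is rewarded with more steps inside the same five minutes, so the comparison is between what each variant achieves per unit of wall-clock time.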
What makes the project notable is its workflow design. Karpathy is effectively treating research process as code: humans write the high-level research organization in program.md, while the agent performs local search over optimizer behavior, model structure, batch sizing, and related training choices. That is a sharper loop than the classic pattern of writing an experiment, waiting for logs, inspecting failures, and manually iterating.
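One plausible reading of this loop is a greedy local search: propose a change, score it, and keep it only if the metric improves. The sketch below assumes that keep-if-better rule, which the article does not spell out. The names `greedy_search`, `propose`, and `evaluate` are hypothetical, and a toy numeric function stands in for a real five-minute training run.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def greedy_search(
    baseline: T,
    propose: Callable[[T], T],
    evaluate: Callable[[T], float],
    rounds: int,
) -> tuple[T, float, list[float]]:
    """Keep a candidate only when it scores strictly lower than the best so far.

    Hypothetical sketch. `propose` stands in for the agent editing the
    training file; `evaluate` stands in for a run that returns a
    lower-is-better metric such as val_bpb.
    """
    best = baseline
    best_score = evaluate(baseline)
    history = [best_score]
    for _ in range(rounds):
        candidate = propose(best)
        score = evaluate(candidate)
        if score < best_score:
            # Accept the change; otherwise the candidate is discarded.
            best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history


# Toy stand-in: the "configuration" is one number and the score is a parabola
# with its minimum at 3.0. The proposals are fixed for reproducibility.
proposals = iter([1.0, 5.0, 2.5, 3.0])
best, best_score, history = greedy_search(
    baseline=0.0,
    propose=lambda _current: next(proposals),
    evaluate=lambda x: (x - 3.0) ** 2 + 1.0,
    rounds=4,
)
```

In this toy run the second proposal ties the current best and is rejected, so the recorded history never gets worse. A real agent loop would also need to handle crashed runs and noisy scores, which this sketch leaves out.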
The LocalLLaMA interest is easy to understand. autoresearch gives open-source practitioners a compact test bed for autonomous research ideas without requiring distributed orchestration or heavy MLOps infrastructure. It also exposes the real constraints immediately. Hardware still matters, search spaces need guardrails, and the quality of the human-written instructions becomes part of the system itself. Even so, the repo is a strong minimal example of how agentic workflows can move from coding assistance into experimental iteration.
The community post is available on LocalLLaMA. The original materials are in the GitHub repository.