LocalLLaMA Highlights a 356K-Row Human Code Review Dataset for Training Coding Models

Original: Code Review Dataset: 200k+ Cases of Human-Written Code Reviews from Top OSS Projects

LLM · Mar 10, 2026 · By Insights AI (Reddit) · 2 min read

What the dataset contains

A LocalLLaMA post surfaced a new Hugging Face dataset called github-codereview, aimed squarely at training and evaluating coding models on review behavior instead of raw code completion. The Reddit thread reached 70 points and 15 comments at crawl time. The Hugging Face card lists 355,807 rows in total, about 334k in train, and roughly 653 MB of Parquet data. The dataset is explicitly framed around human-written code reviews, not synthetic instructions or benchmark prompts.

The core unit is a before-and-after code change paired with an inline reviewer comment. The dataset card says each row captures a moment where a human reviewer left an inline pull-request comment and the author later changed the code in response. Just as important, it also includes negative examples: changed files from reviewed PRs that did not attract comments and are labeled with “No issues found.” That makes the corpus useful not only for generating review comments, but also for learning when a model should stay quiet.
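To make the row shape concrete, here is a minimal sketch of a triplet-style record with a sentinel for negative examples. The field names and the dataclass layout are assumptions for illustration; the actual column names come from the dataset card, which is not reproduced here. Only the "No issues found" label is taken from the source text.

```python
# Hypothetical sketch of a review-triplet row; field names are assumptions,
# not the dataset's actual schema. Only the "No issues found" sentinel
# comes from the dataset card as described in the article.
from dataclasses import dataclass

NO_ISSUE = "No issues found"

@dataclass
class ReviewExample:
    repo: str
    file_path: str
    code_before: str     # chunk at the review commit
    code_after: str      # chunk at the PR head
    review_comment: str  # inline reviewer comment, or NO_ISSUE

def is_negative(ex: ReviewExample) -> bool:
    """Negative examples are reviewed files that drew no comment."""
    return ex.review_comment == NO_ISSUE

rows = [
    ReviewExample("org/repo", "a.py", "x = 1", "x = 2", "Off-by-one here?"),
    ReviewExample("org/repo", "b.py", "y = 3", "y = 3", NO_ISSUE),
]
positives = [r for r in rows if not is_negative(r)]
```

Separating the two cases like this is what lets a trained model learn both what to say and when to say nothing.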

Why this is more interesting than generic code corpora

The dataset card lists 167K+ positive triplets, 51K+ negative examples, and 37 programming languages, including Python, TypeScript, Go, Rust, C++, JavaScript, Java, Kotlin, and Swift. The maintainers say bot reviewers and auto-generated content are excluded, context is extracted as focused chunks of roughly 50 lines around the commented code, and the source repositories use permissive licenses such as MIT, Apache-2.0, and BSD. These design choices matter because many existing code datasets are dominated by full files or commit diffs, which are useful for pretraining but poorly aligned with review-time decisions.
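The roughly 50-line focused chunks could be produced by something like the window extraction below. This is a sketch under assumptions: the real pipeline's boundary handling is not published, so the centering and edge re-anchoring here are illustrative.

```python
def extract_chunk(lines: list[str], comment_line: int, window: int = 50) -> list[str]:
    """Return about `window` lines centered on the commented line.

    Illustrative only: the dataset's actual chunking logic is not
    published; this shows one plausible way to get ~50-line context.
    """
    half = window // 2
    start = max(0, comment_line - half)
    end = min(len(lines), start + window)
    start = max(0, end - window)  # re-anchor when near the end of the file
    return lines[start:end]

source = [f"line {i}" for i in range(200)]
chunk = extract_chunk(source, comment_line=100)
```

A fixed-size window like this keeps examples comparable in length, at the cost of sometimes cutting a function mid-body.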

Here the signal is narrower and more operational: what did a human reviewer flag, what code changed afterward, and when did no comment happen at all? That is directly relevant for coding agents and review assistants, which often over-comment, echo style-linter output, or fail to distinguish correctness issues from already-acceptable code.

Collection method and leakage control

The dataset card says the corpus was built from top GitHub repositories with permissive licenses, merged pull requests, and inline review comments. The pipeline then fetches file contents at the review commit and at the PR head, extracts focused chunks around the commented line, and constructs triplets from that review event. Splits are deterministic by repository, so examples from the same repository do not land in multiple splits. That is a useful guardrail against the kind of repo-level leakage that can make coding evaluations look stronger than they really are.
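Deterministic per-repository splitting can be done by hashing the repository name, so every example from one repo always lands in the same split. The ratios below are assumptions chosen to roughly match the card's ~334k-of-355,807 train share; the dataset's actual mechanism and thresholds may differ.

```python
import hashlib

def split_for_repo(repo: str, train: float = 0.94, val: float = 0.03) -> str:
    """Assign all examples from one repository to a single split.

    Sketch of deterministic, repo-level splitting: a stable hash of the
    repo name picks the split, so no repo straddles train and test.
    The 94/3/3 ratios are illustrative assumptions.
    """
    h = int(hashlib.sha256(repo.encode("utf-8")).hexdigest(), 16) % 10_000
    frac = h / 10_000
    if frac < train:
        return "train"
    if frac < train + val:
        return "validation"
    return "test"
```

Because the assignment depends only on the repo name, reshuffling or re-crawling the data cannot leak a repository across splits.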

The Reddit submitter says the corpus was also used to fine-tune a Qwen2.5-Coder-32B variant for code review, but even without adopting that exact model claim, the public dataset itself is the main story. For teams building coding agents, this is the kind of supervision signal that can improve review quality, patch suggestions, and “don’t comment unless necessary” behavior better than generic instruction data alone.

Hugging Face dataset · Reddit discussion




© 2026 Insights. All rights reserved.