LocalLLaMA Highlights a 356K-Row Human Code Review Dataset for Training Coding Models

Original: Code Review Dataset: 200k+ Cases of Human-Written Code Reviews from Top OSS Projects

LLM · Mar 10, 2026 · By Insights AI (Reddit) · 2 min read

What the dataset contains

A LocalLLaMA post surfaced a new Hugging Face dataset called github-codereview, aimed squarely at training and evaluating coding models on review behavior instead of raw code completion. The Reddit thread reached 70 points and 15 comments at crawl time. The Hugging Face card lists 355,807 rows in total, about 334k in train, and roughly 653 MB of Parquet data. The dataset is explicitly framed around human-written code reviews, not synthetic instructions or benchmark prompts.

The core unit is a before-and-after code change paired with an inline reviewer comment. The dataset card says each row captures a moment where a human reviewer left an inline pull-request comment and the author later changed the code in response. Just as important, it also includes negative examples: changed files from reviewed PRs that did not attract comments and are labeled with “No issues found.” That makes the corpus useful not only for generating review comments, but also for learning when a model should stay quiet.
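To make the row shape concrete, here is a minimal sketch of a triplet-style record with a sentinel for negative examples. The field names and the dataclass layout are assumptions for illustration; the actual column names come from the dataset card, which is not reproduced here. Only the "No issues found" label is taken from the source text.

```python
# Hypothetical sketch of a review-triplet row; field names are assumptions,
# not the dataset's actual schema. Only the "No issues found" sentinel
# comes from the dataset card as described in the article.
from dataclasses import dataclass

NO_ISSUE = "No issues found"

@dataclass
class ReviewExample:
    repo: str
    file_path: str
    code_before: str     # chunk at the review commit
    code_after: str      # chunk at the PR head
    review_comment: str  # inline reviewer comment, or NO_ISSUE

def is_negative(ex: ReviewExample) -> bool:
    """Negative examples are reviewed files that drew no comment."""
    return ex.review_comment == NO_ISSUE

rows = [
    ReviewExample("org/repo", "a.py", "x = 1", "x = 2", "Off-by-one here?"),
    ReviewExample("org/repo", "b.py", "y = 3", "y = 3", NO_ISSUE),
]
positives = [r for r in rows if not is_negative(r)]
```

Separating the two cases like this is what lets a trained model learn both what to say and when to say nothing.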

Why this is more interesting than generic code corpora

The dataset card lists 167K+ positive triplets, 51K+ negative examples, and 37 programming languages, including Python, TypeScript, Go, Rust, C++, JavaScript, Java, Kotlin, and Swift. The maintainers say bot reviewers and auto-generated content are excluded, context is extracted as focused chunks of roughly 50 lines around the commented code, and the source repositories use permissive licenses such as MIT, Apache-2.0, and BSD. These design choices matter because many existing code datasets are dominated by full files or commit diffs, which are useful for pretraining but poorly aligned with review-time decisions.
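The roughly 50-line focused chunks could be produced by something like the window extraction below. This is a sketch under assumptions: the real pipeline's boundary handling is not published, so the centering and edge re-anchoring here are illustrative.

```python
def extract_chunk(lines: list[str], comment_line: int, window: int = 50) -> list[str]:
    """Return about `window` lines centered on the commented line.

    Illustrative only: the dataset's actual chunking logic is not
    published; this shows one plausible way to get ~50-line context.
    """
    half = window // 2
    start = max(0, comment_line - half)
    end = min(len(lines), start + window)
    start = max(0, end - window)  # re-anchor when near the end of the file
    return lines[start:end]

source = [f"line {i}" for i in range(200)]
chunk = extract_chunk(source, comment_line=100)
```

A fixed-size window like this keeps examples comparable in length, at the cost of sometimes cutting a function mid-body.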

Here the signal is narrower and more operational: what did a human reviewer flag, what code changed afterward, and when did no comment happen at all? That is directly relevant for coding agents and review assistants, which often over-comment, echo style-linter output, or fail to distinguish correctness issues from already-acceptable code.

Collection method and leakage control

The dataset card says the corpus was built from top GitHub repositories with permissive licenses, merged pull requests, and inline review comments. The pipeline then fetches file contents at the review commit and at the PR head, extracts focused chunks around the commented line, and constructs triplets from that review event. Splits are deterministic by repository, so examples from the same repository do not land in multiple splits. That is a useful guardrail against the kind of repo-level leakage that can make coding evaluations look stronger than they really are.
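Deterministic per-repository splitting can be done by hashing the repository name, so every example from one repo always lands in the same split. The ratios below are assumptions chosen to roughly match the card's ~334k-of-355,807 train share; the dataset's actual mechanism and thresholds may differ.

```python
import hashlib

def split_for_repo(repo: str, train: float = 0.94, val: float = 0.03) -> str:
    """Assign all examples from one repository to a single split.

    Sketch of deterministic, repo-level splitting: a stable hash of the
    repo name picks the split, so no repo straddles train and test.
    The 94/3/3 ratios are illustrative assumptions.
    """
    h = int(hashlib.sha256(repo.encode("utf-8")).hexdigest(), 16) % 10_000
    frac = h / 10_000
    if frac < train:
        return "train"
    if frac < train + val:
        return "validation"
    return "test"
```

Because the assignment depends only on the repo name, reshuffling or re-crawling the data cannot leak a repository across splits.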

The Reddit submitter says the corpus was also used to fine-tune a Qwen2.5-Coder-32B variant for code review, but even without adopting that exact model claim, the public dataset itself is the main story. For teams building coding agents, this is the kind of supervision signal that can improve review quality, patch suggestions, and “don’t comment unless necessary” behavior better than generic instruction data alone.

Hugging Face dataset · Reddit discussion




© 2026 Insights. All rights reserved.