r/LocalLLaMA Highlights Graph-RAG Work That Lets Llama 8B Challenge 70B Multi-Hop QA

The March 21, 2026 r/LocalLLaMA thread titled "Llama 8B matching 70B on multi-hop QA with structured prompting, no fine-tuning" had 86 upvotes and 7 comments when checked on March 22, 2026. The post summarized experiments with Graph RAG, specifically KET-RAG, and argued that retrieval is often no longer the main limitation. According to the author, the correct answer was already present in the retrieved context 77% to 91% of the time, while 73% to 84% of the wrong answers were caused by reasoning failures rather than missing information.

The linked arXiv paper, "The Reasoning Bottleneck in Graph-RAG: Structured Prompting and Context Compression for Multi-Hop QA," backs that claim with a more formal evaluation. The paper studies HotpotQA, MuSiQue, and 2WikiMultiHopQA and proposes two augmentations: SPARQL chain-of-thought prompting, which decomposes questions into graph-aware query patterns, and graph-walk compression, which shrinks context by roughly 60% without additional LLM calls. The authors report that a fully augmented Llama-8B, the budget option, can match or exceed an unaugmented Llama-70B baseline on all three benchmarks at about 12x lower cost.
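To make the compression idea concrete, here is a minimal sketch of what a graph walk that prunes context without calling an LLM might look like. It is an illustration only, not the paper's algorithm, which this summary does not describe in detail: the function name `graph_walk_compress`, the `(subject, predicate, object)` triple format, and the `max_hops` cutoff are all assumptions.

```python
from collections import deque


def graph_walk_compress(triples, seed_entities, max_hops=2):
    """Keep only triples reachable within max_hops of the seed entities.

    Hypothetical sketch: a plain breadth-first walk, no LLM calls.
    `triples` is a list of (subject, predicate, object) tuples.
    """
    # Index every entity to the positions of the triples that mention it.
    index = {}
    for i, (subj, _pred, obj) in enumerate(triples):
        index.setdefault(subj, []).append(i)
        index.setdefault(obj, []).append(i)

    kept = set()
    visited = set(seed_entities)
    frontier = deque((entity, 0) for entity in seed_entities)

    while frontier:
        entity, depth = frontier.popleft()
        if depth >= max_hops:
            continue  # do not expand past the hop budget
        for i in index.get(entity, []):
            kept.add(i)
            subj, _pred, obj = triples[i]
            for neighbor in (subj, obj):
                if neighbor not in visited:
                    visited.add(neighbor)
                    frontier.append((neighbor, depth + 1))

    # Return surviving triples in their original order.
    return [triples[i] for i in sorted(kept)]


if __name__ == "__main__":
    toy = [
        ("A", "related_to", "B"),
        ("B", "related_to", "C"),
        ("C", "related_to", "D"),
        ("X", "related_to", "Y"),
    ]
    print(graph_walk_compress(toy, ["A"], max_hops=2))
```

In this toy graph, the walk seeded at `A` keeps the two triples within two hops and drops both the distant `C`-`D` edge and the disconnected `X`-`Y` edge. How much real context such a walk removes depends on the graph and the hop budget.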

Retrieval coverage: the gold answer is already in context 77% to 91% of the time
Error source: 73% to 84% of failures come from reasoning, not retrieval
Augmentations: SPARQL chain-of-thought prompting and graph-walk compression
Efficiency claim: augmented Llama-8B can rival the plain 70B baseline at roughly 12x lower cost
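The coverage and error-source figures above rest on a simple split: was the gold answer in the retrieved context, and did the model still get it wrong? A rough sketch of that bookkeeping follows. The record fields (`context`, `gold`, `prediction`) and the normalized substring match are assumptions made for illustration; the paper's actual matching criteria may differ.

```python
def normalize(text):
    """Lowercase and collapse whitespace for a lenient substring match."""
    return " ".join(text.lower().split())


def attribute_errors(records):
    """Split failures into reasoning vs. retrieval (hypothetical sketch).

    Each record is a dict with 'context', 'gold', and 'prediction' strings.
    Returns the retrieval coverage and the share of wrong answers where the
    gold answer was nonetheless present in the context.
    """
    total = len(records)
    covered = 0
    wrong = 0
    reasoning_failures = 0

    for record in records:
        gold = normalize(record["gold"])
        in_context = gold in normalize(record["context"])
        correct = gold in normalize(record["prediction"])

        covered += in_context
        if not correct:
            wrong += 1
            # Gold was retrievable, so the miss is attributed to reasoning.
            reasoning_failures += in_context

    return {
        "coverage": covered / total if total else 0.0,
        "reasoning_share": reasoning_failures / wrong if wrong else 0.0,
    }
```

Substring matching is a deliberately crude proxy. It counts an answer as "present" even when the supporting facts are scattered or ambiguous, so a stricter matcher would likely shift both numbers.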

This matters to the LocalLLaMA crowd because it shifts the optimization target. If retrieval is already strong enough, then scaling the base model is not the only path left. Better decomposition, routing, and context shaping at inference time may produce larger gains per dollar than simply moving to a bigger checkpoint. That is a particularly attractive proposition for developers who want open models and bounded local or hosted inference costs.
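As an illustration of inference-time decomposition, the sketch below builds a prompt that asks the model to restate a question as SPARQL-style triple patterns before answering. The template wording, the function name `build_structured_prompt`, and the triple rendering are hypothetical; they are not taken from the paper or the Reddit post.

```python
def build_structured_prompt(question, triples):
    """Assemble a graph-aware, step-by-step prompt (hypothetical template).

    `triples` is a list of (subject, predicate, object) tuples that were
    already retrieved for the question.
    """
    # Render each triple on its own line so the model can cite it directly.
    facts = "\n".join(f"({s}) -[{p}]-> ({o})" for s, p, o in triples)

    return (
        "Context triples:\n"
        f"{facts}\n\n"
        f"Question: {question}\n\n"
        "Step 1: Rewrite the question as SPARQL-style triple patterns, "
        "using ?variables for unknown entities.\n"
        "Step 2: Resolve each pattern in order against the context triples, "
        "stating the binding for every ?variable.\n"
        "Step 3: Give the final answer on a line starting with 'Answer:'.\n"
    )


if __name__ == "__main__":
    prompt = build_structured_prompt(
        "Which country is the birthplace of the director of Film X in?",
        [
            ("Film X", "directed_by", "Person Y"),
            ("Person Y", "born_in", "City Z"),
            ("City Z", "located_in", "Country W"),
        ],
    )
    print(prompt)
```

The point of the structure is that each hop becomes an explicit, checkable binding rather than an implicit leap. This costs a few extra prompt tokens, not a larger checkpoint.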

There are still reasons to be cautious. Benchmark-driven gains do not automatically transfer to every production Graph-RAG system, and quality will depend on graph construction, question routing, and domain-specific corpora. Even so, the Reddit thread captures an important change in emphasis: the next leap in multi-hop QA may come less from retrieving more documents and more from reasoning better over the context that is already there.
