The r/LocalLLaMA post "Llama 8B matching 70B on multi-hop QA with structured prompting, no fine-tuning," published on March 21, 2026, had 86 upvotes and 7 comments as of March 22, 2026. The post summarizes experiments on Graph RAG, specifically KET-RAG, and argues that retrieval is already largely solved while reasoning still limits accuracy. According to the author, the gold answer was already present in the retrieved context 77% to 91% of the time, yet 73% to 84% of the actual errors were reasoning failures.

The linked arXiv paper, "The Reasoning Bottleneck in Graph-RAG: Structured Prompting and Context Compression for Multi-Hop QA," supports the same message in more detail. The paper evaluates KET-RAG on three benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) and proposes two augmentations: SPARQL chain-of-thought prompting and graph-walk compression. The latter compresses the context by about 60% without any additional LLM calls, and the paper reports that a fully augmented Llama-8B matched or exceeded the plain Llama-70B baseline at roughly 12x lower cost.

Retrieval coverage: the gold answer appears in the retrieved context in 77% to 91% of cases
Error source: 73% to 84% of errors are reasoning failures
Augmentations: SPARQL chain-of-thought prompting, graph-walk compression
Efficiency claim: augmented Llama-8B matches or exceeds the unaugmented 70B baseline at roughly 12x lower cost
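The split between retrieval coverage and reasoning failure in the figures above can be illustrated with a small error-attribution sketch. This is not the paper's evaluation code: the `QAResult` type, the substring-match rule, and the toy records are all assumptions made for illustration. The idea is simply that a wrong answer counts as a reasoning failure when the gold answer was already in the retrieved context, and as a retrieval failure otherwise.

```python
from dataclasses import dataclass


@dataclass
class QAResult:
    """One evaluated question (hypothetical record layout)."""
    gold: str        # reference answer
    prediction: str  # model output
    context: str     # retrieved context passed to the model


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so substring checks are less brittle."""
    return " ".join(text.lower().split())


def attribute_errors(results: list[QAResult]) -> dict[str, int]:
    """Bucket each result as correct, reasoning failure, or retrieval failure."""
    counts = {"correct": 0, "reasoning_failure": 0, "retrieval_failure": 0}
    for r in results:
        gold = normalize(r.gold)
        if gold in normalize(r.prediction):
            counts["correct"] += 1
        elif gold in normalize(r.context):
            # The answer was retrievable, so the model failed to use it.
            counts["reasoning_failure"] += 1
        else:
            # The answer never reached the model.
            counts["retrieval_failure"] += 1
    return counts


def reasoning_share(counts: dict[str, int]) -> float:
    """Fraction of all errors that are reasoning failures."""
    errors = counts["reasoning_failure"] + counts["retrieval_failure"]
    return counts["reasoning_failure"] / errors if errors else 0.0


# Toy data, invented for this sketch only.
results = [
    QAResult("Paris", "Paris", "The capital of France is Paris."),
    QAResult("1969", "1972", "Apollo 11 landed on the Moon in 1969."),
    QAResult("Marie Curie", "Pierre Curie", "Marie Curie won two Nobel Prizes."),
    QAResult("Oslo", "Stockholm", "Stockholm is the capital of Sweden."),
]
counts = attribute_errors(results)
share = reasoning_share(counts)
```

Substring matching is a deliberately crude stand-in for whatever answer-matching metric a real evaluation would use; the point is only the bucketing logic.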

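The result matters to r/LocalLLaMA because it shows that model scaling is not the only lever for better performance. If retrieval is already strong enough, the next improvement may come not from a larger vector store or a larger base model but from how questions are decomposed and how context is structured. In other words, even a smaller open model can compete with a much larger one when inference-time orchestration is done well.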
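To make "question decomposition and context structuring" concrete, here is a minimal prompt-building sketch. It is not the paper's SPARQL chain-of-thought template, which this article does not reproduce; the function name `build_structured_prompt`, the triple layout, and the instruction wording are all hypothetical. It only shows the general shape: present retrieved facts as numbered graph triples and ask the model to answer hop by hop.

```python
def build_structured_prompt(
    question: str, triples: list[tuple[str, str, str]]
) -> str:
    """Format (subject, relation, object) triples and a stepwise instruction.

    Hypothetical helper for illustration; not taken from the paper.
    """
    facts = "\n".join(
        f"{i}. ({s}) -[{p}]-> ({o})"
        for i, (s, p, o) in enumerate(triples, start=1)
    )
    return (
        "Answer the question using only the numbered facts.\n"
        "Step 1: Break the question into single-hop sub-questions.\n"
        "Step 2: Answer each sub-question, citing a fact number.\n"
        "Step 3: Combine the sub-answers into one final answer.\n\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {question}\n"
        "Sub-questions and answers:"
    )


# Toy two-hop example, invented for this sketch.
prompt = build_structured_prompt(
    "Where was the director of Inception born?",
    [
        ("Inception", "directed_by", "Christopher Nolan"),
        ("Christopher Nolan", "born_in", "London"),
    ],
)
```

The resulting string can be sent to any local model as-is; whether this particular wording helps would need to be measured on the target workload.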

That said, a method that works on benchmarks cannot be assumed to carry over to production workloads in general. Results may vary with question routing, graph quality, and the maturity of the domain-specific knowledge base. Even so, the thread is a clear sign that the Graph-RAG discussion is shifting from "retrieve more" to "reason better with what is already retrieved."

#graph-rag

Graph-RAG as seen on r/LocalLLaMA: Llama 8B can approach 70B on multi-hop QA