r/MachineLearning challenges LoCoMo’s reliability with a detailed audit

Original: [D] We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers

LLM · Mar 28, 2026 · By Insights AI (Reddit) · 2 min read

What r/MachineLearning is challenging

A discussion on r/MachineLearning is questioning how much confidence the field should place in LoCoMo, one of the most cited long-term memory benchmarks. The post summarizes an independent audit that found 99 score-corrupting errors across 1,540 questions, a 6.4% error rate. The authors say the problems are not cosmetic: they include hallucinated facts in the answer key, broken temporal reasoning, and speaker attribution mistakes that can penalize a system for giving the correct answer.

The examples are specific enough to make the concern hard to dismiss. One answer key names a Ferrari 488 GTB even though the conversation only refers to “this beauty” and an image caption only says “a red sports car.” Another question resolves “Last Saturday” incorrectly on a Thursday. The audit also says 24 questions assign statements to the wrong speaker. From that, the post argues that even a perfect system would top out around 93.6% rather than 100%.
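The arithmetic behind the ceiling claim is easy to reproduce from the audit's two headline numbers; a minimal sketch:

```python
# Sanity-check the audit's figures: 99 score-corrupting errors out of
# 1,540 questions, and the implied ceiling for a perfect system.
TOTAL_QUESTIONS = 1540
CORRUPTED = 99

error_rate = CORRUPTED / TOTAL_QUESTIONS
ceiling = (TOTAL_QUESTIONS - CORRUPTED) / TOTAL_QUESTIONS

print(f"error rate: {error_rate:.1%}")     # → 6.4%
print(f"score ceiling: {ceiling:.1%}")     # → 93.6%
```

So even a system that answers every uncorrupted question correctly would be scored wrong on the 99 broken items, capping it near 93.6%.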

The judge is part of the problem

The Reddit post goes further than ground truth cleanup. It argues that the LLM judge used in LoCoMo evaluation is too permissive to separate strong systems from weak ones. Using the same judge configuration, the authors say intentionally wrong but topically adjacent answers were accepted 62.81% of the time. Specific factual mistakes were often caught, but vague answers that identified the right conversation while missing the actual details still passed at a very high rate. That is a serious issue for a benchmark that is supposed to test memory retrieval quality.
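The kind of adversarial probe the audit describes can be sketched as a harness that feeds a judge deliberately wrong but topically adjacent answers and measures the acceptance rate. Everything below is illustrative: `naive_judge` is a deliberately weak token-overlap stand-in, not LoCoMo's actual judge configuration, and the probe pairs are invented.

```python
# Hypothetical harness: measure how often a judge accepts wrong answers.
# `naive_judge` is a weak stand-in (token overlap with the gold answer),
# NOT the actual LoCoMo judge.

def naive_judge(gold: str, candidate: str, threshold: float = 0.3) -> bool:
    """Accept the candidate if it shares enough tokens with the gold answer."""
    gold_tokens = set(gold.lower().split())
    cand_tokens = set(candidate.lower().split())
    overlap = len(gold_tokens & cand_tokens) / max(len(gold_tokens), 1)
    return overlap >= threshold

# Intentionally wrong but topically adjacent probes (invented examples).
probes = [
    ("Caroline adopted a dog in May", "Caroline adopted a dog in June"),
    ("He drives a red sports car",    "He drives a red convertible car"),
    ("They met last Saturday",        "They met on a recent weekend Saturday"),
    ("She works as a nurse",          "She works in a hospital as staff"),
]

accepted = sum(naive_judge(gold, wrong) for gold, wrong in probes)
print(f"false-acceptance rate: {accepted / len(probes):.0%}")  # → 100%
```

The point of such a harness is that a good judge should drive this false-acceptance rate toward zero; the audit's 62.81% figure suggests LoCoMo's judge rewards topical proximity rather than factual correctness.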

The thread also pushes back on LongMemEval-S as an automatic replacement, on different grounds: if a benchmark's corpus fits inside current context windows, then scores increasingly reflect context handling rather than genuine persistent memory. That means both benchmark design and benchmark judging are under scrutiny at the same time.
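The LongMemEval-S objection reduces to a simple capacity check: if the whole corpus tokenizes to fewer tokens than the model's context window, a system can score well with no persistent memory at all. A rough sketch using the common ~4 characters-per-token heuristic (the corpus size and window size below are illustrative, not LongMemEval-S's actual figures):

```python
# Rough check: does a benchmark corpus fit in one context window?
# Uses the crude ~4 chars/token heuristic; real tokenizers vary.

def fits_in_context(corpus_chars: int, context_window_tokens: int) -> bool:
    estimated_tokens = corpus_chars // 4
    return estimated_tokens <= context_window_tokens

# Illustrative numbers: a ~400k-character corpus vs. a 128k-token window.
print(fits_in_context(400_000, 128_000))  # → True: ~100k tokens fit easily
```

When this check returns True, high scores are compatible with pure long-context reading, which is why the thread argues such a benchmark cannot by itself certify a memory system.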

Why this matters

The broader value of the r/MachineLearning post is that it shifts the conversation from leaderboard numbers to evaluation infrastructure. If published tables mix different ingestion pipelines, different answer prompts, and weak judges, small score differences stop being meaningful. The community concern here is not just that LoCoMo may need patching. It is that memory-system benchmarking needs better ground truth, stronger judges, and more standardized protocols before the results can support strong claims.




© 2026 Insights. All rights reserved.