r/MachineLearning challenges LoCoMo’s reliability with a detailed audit

Original: [D] We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers

LLM · Mar 28, 2026 · By Insights AI (Reddit) · 2 min read

What r/MachineLearning is challenging

A discussion on r/MachineLearning is questioning how much confidence the field should place in LoCoMo, one of the most cited long-term memory benchmarks. The post summarizes an independent audit that found 99 score-corrupting errors across 1,540 questions, a 6.4% error rate. The authors say the problems are not cosmetic: they include hallucinated facts in the answer key, broken temporal reasoning, and speaker attribution mistakes that can penalize a system for giving the correct answer.

The examples are specific enough to make the concern hard to dismiss. One answer key names a Ferrari 488 GTB even though the conversation only refers to “this beauty” and an image caption only says “a red sports car.” Another question resolves “Last Saturday” incorrectly on a Thursday. The audit also says 24 questions assign statements to the wrong speaker. From that, the post argues that even a perfect system would top out around 93.6% rather than 100%.
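The arithmetic behind the ceiling claim is easy to reproduce from the audit's two headline numbers; a minimal sketch:

```python
# Sanity-check the audit's figures: 99 score-corrupting errors out of
# 1,540 questions, and the implied ceiling for a perfect system.
TOTAL_QUESTIONS = 1540
CORRUPTED = 99

error_rate = CORRUPTED / TOTAL_QUESTIONS
ceiling = (TOTAL_QUESTIONS - CORRUPTED) / TOTAL_QUESTIONS

print(f"error rate: {error_rate:.1%}")     # → 6.4%
print(f"score ceiling: {ceiling:.1%}")     # → 93.6%
```

So even a system that answers every uncorrupted question correctly would be scored wrong on the 99 broken items, capping it near 93.6%.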

The judge is part of the problem

The Reddit post goes further than ground truth cleanup. It argues that the LLM judge used in LoCoMo evaluation is too permissive to separate strong systems from weak ones. Using the same judge configuration, the authors say intentionally wrong but topically adjacent answers were accepted 62.81% of the time. Specific factual mistakes were often caught, but vague answers that identified the right conversation while missing the actual details still passed at a very high rate. That is a serious issue for a benchmark that is supposed to test memory retrieval quality.
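The kind of adversarial probe the audit describes can be sketched as a harness that feeds a judge deliberately wrong but topically adjacent answers and measures the acceptance rate. Everything below is illustrative: `naive_judge` is a deliberately weak token-overlap stand-in, not LoCoMo's actual judge configuration, and the probe pairs are invented.

```python
# Hypothetical harness: measure how often a judge accepts wrong answers.
# `naive_judge` is a weak stand-in (token overlap with the gold answer),
# NOT the actual LoCoMo judge.

def naive_judge(gold: str, candidate: str, threshold: float = 0.3) -> bool:
    """Accept the candidate if it shares enough tokens with the gold answer."""
    gold_tokens = set(gold.lower().split())
    cand_tokens = set(candidate.lower().split())
    overlap = len(gold_tokens & cand_tokens) / max(len(gold_tokens), 1)
    return overlap >= threshold

# Intentionally wrong but topically adjacent probes (invented examples).
probes = [
    ("Caroline adopted a dog in May", "Caroline adopted a dog in June"),
    ("He drives a red sports car",    "He drives a red convertible car"),
    ("They met last Saturday",        "They met on a recent weekend Saturday"),
    ("She works as a nurse",          "She works in a hospital as staff"),
]

accepted = sum(naive_judge(gold, wrong) for gold, wrong in probes)
print(f"false-acceptance rate: {accepted / len(probes):.0%}")  # → 100%
```

The point of such a harness is that a good judge should drive this false-acceptance rate toward zero; the audit's 62.81% figure suggests LoCoMo's judge rewards topical proximity rather than factual correctness.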

The thread also pushes back on LongMemEval-S as an automatic replacement, on different grounds: if a benchmark's corpus fits inside current context windows, then scores increasingly reflect context handling rather than genuine persistent memory. That means both benchmark design and benchmark judging are under scrutiny at the same time.
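The LongMemEval-S objection reduces to a simple capacity check: if the whole corpus tokenizes to fewer tokens than the model's context window, a system can score well with no persistent memory at all. A rough sketch using the common ~4 characters-per-token heuristic (the corpus size and window size below are illustrative, not LongMemEval-S's actual figures):

```python
# Rough check: does a benchmark corpus fit in one context window?
# Uses the crude ~4 chars/token heuristic; real tokenizers vary.

def fits_in_context(corpus_chars: int, context_window_tokens: int) -> bool:
    estimated_tokens = corpus_chars // 4
    return estimated_tokens <= context_window_tokens

# Illustrative numbers: a ~400k-character corpus vs. a 128k-token window.
print(fits_in_context(400_000, 128_000))  # → True: ~100k tokens fit easily
```

When this check returns True, high scores are compatible with pure long-context reading, which is why the thread argues such a benchmark cannot by itself certify a memory system.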

Why this matters

The broader value of the r/MachineLearning post is that it shifts the conversation from leaderboard numbers to evaluation infrastructure. If published tables mix different ingestion pipelines, different answer prompts, and weak judges, small score differences stop being meaningful. The community concern here is not just that LoCoMo may need patching. It is that memory-system benchmarking needs better ground truth, stronger judges, and more standardized protocols before the results can support strong claims.




© 2026 Insights. All rights reserved.