r/MachineLearning Flags LoCoMo Errors and Weak Judge Reliability

Original: [D] We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers

LLM · Mar 30, 2026 · By Insights AI (Reddit) · 2 min read

A March 27, 2026 discussion post on r/MachineLearning reopened a basic question about long-term memory benchmarks: are teams optimizing for real memory, or for noisy tests that reward the wrong behavior? The post, written by Penfield Labs, argues that LoCoMo remains widely cited even though its answer key contains 99 score-corrupting errors across 1,540 questions. The authors also say the benchmark's LLM judge, configured with gpt-4o-mini, accepted 62.81% of intentionally wrong but topically adjacent answers in their audit.

The concrete examples are hard to dismiss. One answer key expects "Ferrari 488 GTB" even though the conversation only says "this beauty" and the model-accessible caption is "a red sports car"; the specific car name appears only in an internal query field that systems never ingest. Another question resolves "last Saturday" to Sunday instead of the preceding Saturday. The audit also flags 24 questions with wrong speaker attribution. If those findings hold, a perfect system could never score 100%; the post estimates the ceiling at roughly 93.6%.
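The two quantitative claims above are easy to check. A minimal sketch (the constants come from the post; the `last_saturday` helper is an illustrative implementation of the reading the audit says the answer key gets wrong):

```python
from datetime import date, timedelta

# Score ceiling implied by the audit: with 99 corrupted answers
# out of 1,540 questions, a perfect system tops out below 100%.
TOTAL_QUESTIONS = 1540
CORRUPTED = 99
error_rate = CORRUPTED / TOTAL_QUESTIONS  # ~0.064, i.e. 6.4%
ceiling = 1 - error_rate                  # ~0.936, i.e. roughly 93.6%

def last_saturday(today: date) -> date:
    """Resolve 'last Saturday' to the most recent Saturday strictly
    before `today` -- not Sunday, and not today itself."""
    days_back = (today.weekday() - 5) % 7  # Saturday has weekday() == 5
    if days_back == 0:
        days_back = 7  # if today is Saturday, go back a full week
    return today - timedelta(days=days_back)
```

Running the arithmetic reproduces the post's numbers: 99/1,540 rounds to 6.4%, and the corresponding ceiling rounds to 93.6%.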

The critique goes beyond LoCoMo itself. The same post argues that LongMemEval-S now measures context-window management more than long-term memory retrieval because each question's corpus is around 115K tokens, a size that fits inside contemporary 128K to 1M-token windows. LoCoMo-Plus earns some credit for adding "cognitive" questions with weaker lexical overlap, but the authors note that it inherits the original 1,540 LoCoMo questions unchanged, along with the same broken ground truth and the same dependence on gpt-4o-mini-style judging for older categories.
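The context-window argument reduces to a simple comparison: if the entire per-question corpus fits in the model's window, nothing forces retrieval. A hedged sketch, using the common ~4 characters-per-token heuristic (a real tokenizer such as tiktoken would be more exact):

```python
def fits_in_context(corpus_tokens: int, window_tokens: int) -> bool:
    """True when the whole corpus can simply be stuffed into the prompt,
    turning a 'memory' benchmark into a context-management test."""
    return corpus_tokens <= window_tokens

CORPUS = 115_000  # approximate per-question corpus size cited for LongMemEval-S

fits_in_context(CORPUS, 128_000)    # True: fits a contemporary 128K window
fits_in_context(CORPUS, 1_000_000)  # True: trivially fits a 1M window
fits_in_context(CORPUS, 32_000)     # False: a smaller window would force retrieval
```

This is the post's point in miniature: a corpus has to exceed current windows before a benchmark can claim to measure long-term retrieval rather than prompt packing.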

What makes the thread important is not just the error count. It is the reminder that benchmark governance is infrastructure. If different projects use different ingestion pipelines, prompts, embedding models, and judge configurations, published tables stop being apples-to-apples comparisons long before anyone notices. The post calls for corpora that exceed current context windows, adversarial judge validation, stronger evaluators, and full methodological disclosure. For anyone building persistent memory systems, the message from r/MachineLearning is blunt: scoreboards matter only when the measurement stack is trustworthy.
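The "adversarial judge validation" the post asks for can be sketched in a few lines: probe the judge with intentionally wrong but topically adjacent answers and measure its false-accept rate. Everything below is illustrative; `overlap_judge` is a deliberately naive stand-in for an LLM judge, built to exhibit exactly the failure mode the audit describes:

```python
from typing import Callable

def false_accept_rate(judge: Callable[[str, str, str], bool],
                      probes: list[tuple[str, str, str]]) -> float:
    """probes: (question, gold_answer, wrong_but_adjacent_answer).
    A trustworthy judge should reject every probe, giving a rate near 0."""
    accepted = sum(judge(q, gold, wrong) for q, gold, wrong in probes)
    return accepted / len(probes)

def overlap_judge(question: str, gold: str, candidate: str) -> bool:
    # Toy judge: accept any answer sharing a word with the gold answer.
    # Topically adjacent wrong answers sail straight through.
    return bool(set(gold.lower().split()) & set(candidate.lower().split()))

probes = [
    ("What car did he buy?", "a red sports car", "a red pickup truck"),
    ("When did they meet?", "last Saturday", "last Sunday"),
]
rate = false_accept_rate(overlap_judge, probes)  # 1.0: accepts every wrong answer
```

Swapping in the actual judge configuration a benchmark uses (prompt, model, temperature) and a larger probe set is what turns this toy into the validation step the post argues every leaderboard should publish.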




© 2026 Insights. All rights reserved.