r/MachineLearning Flags LoCoMo Errors and Weak Judge Reliability
Original: [D] We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers
A March 27, 2026 discussion post on r/MachineLearning reopened a basic question about long-term memory benchmarks: are teams optimizing for real memory, or for noisy tests that reward the wrong behavior? The post, written by Penfield Labs, argues that LoCoMo remains widely cited even though its answer key contains 99 score-corrupting errors across 1,540 questions. The authors also say the benchmark's LLM judge, configured with gpt-4o-mini, accepted 62.81% of intentionally wrong but topically adjacent answers in their audit.
The concrete examples are hard to dismiss. One answer-key entry expects "Ferrari 488 GTB" even though the conversation only says "this beauty" and the model-accessible caption is "a red sports car"; the specific car name appears only in an internal query field that systems never ingest. Another question resolves "last Saturday" to Sunday instead of the preceding Saturday. The audit also flags 24 questions with wrong speaker attribution. If those findings hold, a perfect system could never score 100%; the post estimates the ceiling at roughly 93.6%.
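For readers who want to sanity-check the arithmetic, here is a minimal sketch using the figures quoted in the post; the assumption that each corrupted key costs exactly one point is ours, not the audit's:

```python
# Back-of-the-envelope check of the numbers cited in the r/MachineLearning post.
total_questions = 1540      # LoCoMo question count reported in the audit
corrupted_answers = 99      # score-corrupting answer-key errors claimed

error_rate = corrupted_answers / total_questions
score_ceiling = 1 - error_rate  # best score a perfect system could reach if every bad key costs its point

print(f"answer-key error rate: {error_rate:.1%}")    # ~6.4%
print(f"implied score ceiling: {score_ceiling:.1%}")  # ~93.6%
```

Those two lines of division reproduce both headline figures: 99/1,540 is the 6.4% error rate, and its complement is the roughly 93.6% ceiling.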
The critique goes beyond LoCoMo itself. The same post argues that LongMemEval-S now measures context-window management more than long-term memory retrieval because each question's corpus is around 115K tokens, a size that fits inside contemporary 128K to 1M-token windows. LoCoMo-Plus earns some credit for adding "cognitive" questions with weaker lexical overlap, but the authors note that it inherits the original 1,540 LoCoMo questions unchanged, along with the same broken ground truth and the same dependence on gpt-4o-mini-style judging for older categories.
What makes the thread important is not just the error count. It is the reminder that benchmark governance is infrastructure. If different projects use different ingestion pipelines, prompts, embedding models, and judge configurations, published tables stop being apples-to-apples comparisons long before anyone notices. The post calls for corpora that exceed current context windows, adversarial judge validation, stronger evaluators, and full methodological disclosure. For anyone building persistent memory systems, the message from r/MachineLearning is blunt: scoreboards matter only when the measurement stack is trustworthy.
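As a rough illustration of what "adversarial judge validation" could look like in practice, here is a minimal sketch: feed the judge deliberately wrong but topically adjacent answers and measure how often it accepts them. The `judge_fn` signature, the keyword-overlap stand-in judge, and the probe pairs below are hypothetical, not LoCoMo's actual harness.

```python
from typing import Callable, List, Tuple

def adversarial_acceptance_rate(
    judge_fn: Callable[[str, str, str], bool],  # (question, gold_answer, candidate) -> accepted?
    probes: List[Tuple[str, str, str]],         # (question, gold_answer, wrong_but_adjacent_answer)
) -> float:
    """Fraction of intentionally wrong answers the judge accepts (lower is better)."""
    accepted = sum(judge_fn(q, gold, wrong) for q, gold, wrong in probes)
    return accepted / len(probes) if probes else 0.0

# Toy usage: a deliberately sloppy keyword-overlap "judge" standing in for an LLM judge.
def sloppy_judge(question: str, gold: str, candidate: str) -> bool:
    return bool(set(gold.lower().split()) & set(candidate.lower().split()))

probes = [
    ("What car did the speaker show off?", "a red sports car", "a red pickup truck"),
    ("When did the museum visit happen?", "last Saturday", "last Sunday"),
]
print(f"adversarial acceptance: {adversarial_acceptance_rate(sloppy_judge, probes):.0%}")
```

Run against a real LLM judge instead of the toy stand-in, a high acceptance rate on probes like these would mirror the 62.81% figure the audit reports for gpt-4o-mini.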
Related Articles
A post on r/MachineLearning argues that LoCoMo’s leaderboard is being treated with more confidence than its evaluation setup deserves. The audit claims the benchmark has a 6.4% ground-truth error rate and that its judge accepts intentionally wrong but topically adjacent answers far too often, turning attention from raw scores to benchmark reliability.
A high-scoring discussion in r/MachineLearning asks what benchmarking papers are for when proprietary models change monthly and old versions disappear. The strongest replies argued that model rankings go stale fast, but the datasets and failure cases can remain useful as durable eval assets.
A rerun benchmark posted to r/LocalLLaMA argues that Apple’s M5 Max shows its clearest gains in prompt processing rather than raw generation. The post reports 2,845 tok/s PP512 for Qwen 3.5 35B-A3B MoE and 92.2 tok/s generation, but these remain community measurements rather than independent lab benchmarks.