r/MachineLearning Flags LoCoMo Errors and Weak Judge Reliability
Original: [D] We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers
A March 27, 2026 discussion post on r/MachineLearning reopened a basic question about long-term memory benchmarks: are teams optimizing for real memory, or for noisy tests that reward the wrong behavior? The post, written by Penfield Labs, argues that LoCoMo remains widely cited even though its answer key contains 99 score-corrupting errors across 1,540 questions. The authors also say the benchmark's LLM judge, configured with gpt-4o-mini, accepted 62.81% of intentionally wrong but topically adjacent answers in their audit.
The concrete examples are hard to dismiss. One answer-key entry expects "Ferrari 488 GTB" even though the conversation only says "this beauty" and the model-accessible caption is "a red sports car"; the specific car name appears only in an internal query field that systems never ingest. Another question resolves "last Saturday" to Sunday instead of the preceding Saturday. The audit also flags 24 questions with wrong speaker attribution. If those findings hold, a perfect system could never score 100%; the post estimates the ceiling at roughly 93.6%.
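For readers who want to sanity-check the arithmetic, here is a minimal sketch using the figures quoted in the post; the assumption that each corrupted key costs exactly one point is ours, not the audit's:

```python
# Back-of-the-envelope check of the numbers cited in the r/MachineLearning post.
total_questions = 1540      # LoCoMo question count reported in the audit
corrupted_answers = 99      # score-corrupting answer-key errors claimed

error_rate = corrupted_answers / total_questions
score_ceiling = 1 - error_rate  # best score a perfect system could reach if every bad key costs its point

print(f"answer-key error rate: {error_rate:.1%}")    # ~6.4%
print(f"implied score ceiling: {score_ceiling:.1%}")  # ~93.6%
```

Those two lines of division reproduce both headline figures: 99/1,540 is the 6.4% error rate, and its complement is the roughly 93.6% ceiling.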
The critique goes beyond LoCoMo itself. The same post argues that LongMemEval-S now measures context-window management more than long-term memory retrieval because each question's corpus is around 115K tokens, a size that fits inside contemporary 128K to 1M-token windows. LoCoMo-Plus earns some credit for adding "cognitive" questions with weaker lexical overlap, but the authors note that it inherits the original 1,540 LoCoMo questions unchanged, along with the same broken ground truth and the same dependence on gpt-4o-mini-style judging for older categories.
What makes the thread important is not just the error count. It is the reminder that benchmark governance is infrastructure. If different projects use different ingestion pipelines, prompts, embedding models, and judge configurations, published tables stop being apples-to-apples comparisons long before anyone notices. The post calls for corpora that exceed current context windows, adversarial judge validation, stronger evaluators, and full methodological disclosure. For anyone building persistent memory systems, the message from r/MachineLearning is blunt: scoreboards matter only when the measurement stack is trustworthy.
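As a rough illustration of what "adversarial judge validation" could look like in practice, here is a minimal sketch: feed the judge deliberately wrong but topically adjacent answers and measure how often it accepts them. The `judge_fn` signature, the keyword-overlap stand-in judge, and the probe pairs below are hypothetical, not LoCoMo's actual harness.

```python
from typing import Callable, List, Tuple

def adversarial_acceptance_rate(
    judge_fn: Callable[[str, str, str], bool],  # (question, gold_answer, candidate) -> accepted?
    probes: List[Tuple[str, str, str]],         # (question, gold_answer, wrong_but_adjacent_answer)
) -> float:
    """Fraction of intentionally wrong answers the judge accepts (lower is better)."""
    accepted = sum(judge_fn(q, gold, wrong) for q, gold, wrong in probes)
    return accepted / len(probes) if probes else 0.0

# Toy usage: a deliberately sloppy keyword-overlap "judge" standing in for an LLM judge.
def sloppy_judge(question: str, gold: str, candidate: str) -> bool:
    return bool(set(gold.lower().split()) & set(candidate.lower().split()))

probes = [
    ("What car did the speaker show off?", "a red sports car", "a red pickup truck"),
    ("When did the museum visit happen?", "last Saturday", "last Sunday"),
]
print(f"adversarial acceptance: {adversarial_acceptance_rate(sloppy_judge, probes):.0%}")
```

Run against a real LLM judge instead of the toy stand-in, a high acceptance rate on probes like these would mirror the 62.81% figure the audit reports for gpt-4o-mini.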
Related Articles
A post on r/MachineLearning argues that LoCoMo’s leaderboard is being treated with more confidence than its evaluation setup deserves. The audit claims the benchmark has a 6.4% ground-truth error rate and that its judge accepts intentionally wrong but topically adjacent answers far too often, turning attention from raw scores to benchmark reliability.
A high-scoring discussion in r/MachineLearning asks what benchmarking papers are for when proprietary models change monthly and old versions disappear. The strongest replies argued that model rankings go stale fast, but the datasets and failure cases can remain useful as durable eval assets.
A rerun benchmark posted to r/LocalLLaMA argues that Apple’s M5 Max shows its clearest gains in prompt processing rather than raw generation. The post reports 2,845 tok/s PP512 for Qwen 3.5 35B-A3B MoE and 92.2 tok/s generation, but these remain community measurements rather than independent lab benchmarks.