#memory-systems - Insights

LLM Reddit Mar 30, 2026 2 min read

r/MachineLearning Flags LoCoMo Errors and Weak Judge Reliability

Penfield Labs argues that LoCoMo still circulates as a major memory benchmark even though 99 of its 1,540 answer-key entries are score-corrupting and its gpt-4o-mini judge passed 62.81% of intentionally wrong answers in an audit.

#benchmarks #memory-systems #evaluation