LoCoMo, LongMemEval, MemoryAgentBench (ICLR 2026), LoCoMo-Plus: the benchmark landscape for AI agent memory, and how memory systems are evaluated and compared.
HUST-AI-HYZ/MemoryAgentBench (GitHub): "Open source code for ICLR 2026 Paper: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions"
MemoryAgentBench, from HUST, is the newest formal benchmark, accepted to ICLR 2026. It evaluates memory in LLM agents specifically via incremental multi-turn interactions, simulating how agents remember and forget over extended conversations.
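To make the incremental protocol concrete, here is a minimal sketch of what such an evaluation loop looks like: turns are fed to the agent one session at a time, and probe questions are scored after each session. The `MemoryAgent` class and the session format are hypothetical illustrations, not the actual MemoryAgentBench API.

```python
# Hedged sketch of an incremental multi-turn memory evaluation loop.
# `MemoryAgent`, `observe`, and `answer` are invented names for illustration.
from dataclasses import dataclass, field

@dataclass
class MemoryAgent:
    """Toy agent that stores every observed turn verbatim."""
    memory: list = field(default_factory=list)

    def observe(self, turn: str) -> None:
        self.memory.append(turn)

    def answer(self, question: str) -> str:
        # Naive recall: return the most recent turn containing the
        # question's last word (a stand-in for real retrieval).
        keyword = question.rstrip("?").split()[-1].lower()
        for turn in reversed(self.memory):
            if keyword in turn.lower():
                return turn
        return ""

def evaluate(agent: MemoryAgent, sessions) -> float:
    """Feed turns incrementally; probe with questions between sessions."""
    correct = total = 0
    for turns, probes in sessions:
        for turn in turns:
            agent.observe(turn)          # memory accumulates over time...
        for question, expected in probes:
            total += 1
            if expected.lower() in agent.answer(question).lower():
                correct += 1             # ...and is probed after each session
    return correct / max(total, 1)

sessions = [
    (["Alice adopted a cat named Mochi."],
     [("What is the name of Alice's cat?", "Mochi")]),
    (["Bob moved to Lisbon last spring."],
     [("Where did Bob move to?", "Lisbon")]),
]
print(evaluate(MemoryAgent(), sessions))  # → 1.0
```

The point of the incremental setup is that later probes can interrogate earlier sessions, so forgetting (or interference from new turns) shows up directly in the score.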
"Based on LOCOMO, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event reasoning, and preference recall." — snap-research.github.io/locomo
LoCoMo (Long-Term Conversation Memory) is the most widely cited benchmark for AI agent memory. It evaluates:
"We introduce LoCoMo-Plus, a benchmark that targets beyond-factual cognitive memory evaluation for LLM agents." — arXiv:2602.10715v1, February 11, 2026
LoCoMo-Plus extends LoCoMo beyond factual recall to cognitive memory: testing reasoning, inference, and application of past context.
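The factual/cognitive split can be illustrated with a toy check: a factual probe's answer appears verbatim in the stored conversation, while a cognitive probe requires applying that context to a new situation. The context and probes below are invented for illustration, not actual LoCoMo-Plus items.

```python
# Hypothetical sketch of the factual-vs-cognitive distinction LoCoMo-Plus
# targets; the context and probes are invented, not benchmark data.

context = ("Dana mentioned she gets migraines from bright light, "
           "so she skipped the beach trip.")

def answerable_by_extraction(answer_keywords, context):
    """A factual probe's answer can be located verbatim in the context;
    a cognitive probe requires inference beyond what is stated."""
    return all(k.lower() in context.lower() for k in answer_keywords)

# Factual recall: the answer is stated in the conversation itself.
print(answerable_by_extraction(["migraines", "bright light"], context))  # → True

# Cognitive memory: "Would Dana enjoy a sunny rooftop bar?" needs the
# agent to apply her stated trigger, not just retrieve a sentence.
print(answerable_by_extraction(["avoid", "sunny rooftop bar"], context))  # → False
```

A system that scores well on extraction-style probes can still fail cognitive ones, which is the gap LoCoMo-Plus is designed to expose.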
"LoCoMo and LongMemEval are still a valid foundation — the question formats are good, the evaluation methodology is reasonable, and they remain the best available benchmarks." — Vectorize Hindsight, 2 weeks ago
"LOCOMO is a solid benchmark for measuring general long-term memory recall, but it does not capture application-level memory performance." — Mem0: State of AI Agent Memory 2026, 2 days ago
| Benchmark | Type | Focus | Status |
|---|---|---|---|
| ★ agent-memory | — | Production memory layer | MIT, Available |
| MemoryAgentBench | ICLR 2026 | Incremental multi-turn interactions | New |
| LoCoMo | Research | Long-term conversational memory | Widely used |
| LoCoMo-Plus | arXiv Feb 2026 | Beyond-factual cognitive memory | New |
| LongMemEval | Research | Long conversation memory | Secondary standard |
Mem0 claims on their GitHub:
"+26% Accuracy over OpenAI Memory on the LOCOMO benchmark" — mem0ai/mem0 on GitHub, 4 days ago
MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions. ICLR 2026. github.com/HUST-AI-HYZ/MemoryAgentBench
LoCoMo: Evaluating Very Long-Term Conversational Memory. snap-research.github.io/locomo
LoCoMo-Plus: Beyond-Factual Cognitive Memory Evaluation. arXiv:2602.10715v1 (Feb 11, 2026)
Memory in the Age of AI Agents: A Survey. TsinghuaC3I curated paper list.
agent-memory is a production memory layer, not a benchmark. But it excels where benchmarks matter: