- The paper introduces a tiered memory architecture decomposing agent memory into episodic, semantic, and procedural layers.
- It employs a two-stage retrieval process using BM25 and adaptive signal weights, yielding a 33-percentage-point improvement over the full-context baseline.
- Empirical diagnostics reveal that the semantic tier is critical while legacy BM25 limits performance, motivating a shift toward dense retrieval.
MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents
Background and Motivation
The evolution of LLM-based autonomous agents from simple stateless chatbots to persistent, tool-using systems has introduced new challenges in long-term memory management and agent coherence. Current memory architectures, particularly in widely deployed runtimes such as OpenClaw, rely on flat-file structures with limited context windows, leading to systematic degradation of agent performance over extended operation periods. Four principal failure modes are identified: context collapse, compaction discontinuity, structural blindness, and lack of attribution-based feedback. These weaknesses materially affect tool invocation success rates, reducing them by 14 percentage points over a 72-hour operational window.
MEMTIER Architecture
MemTier proposes a tripartite architecture that decomposes agent memory into structured episodic logs, a semantic tier for distilled facts, and a procedural tier. The episodic layer utilizes JSONL files with explicit project isolation and per-entry cognitive weights reflecting the historical utility of individual memory records. The semantic tier aggregates distilled, deduplicated facts extracted via LLM heuristics and provides cross-agent sharing while maintaining context isolation.
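As a rough illustration of the episodic layer described above, the following minimal sketch stores entries as JSONL with one file per project. The field names, the one-file-per-project layout, and the helper functions are assumptions made for this example; the summary does not specify the actual schema.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class EpisodicEntry:
    project: str                   # isolation key (hypothetical field name)
    session_id: str
    timestamp: float               # seconds since epoch
    text: str
    cognitive_weight: float = 1.0  # historical utility of this record


def append_entry(root: Path, entry: EpisodicEntry) -> Path:
    """Append one entry to the JSONL log of its project (assumed layout)."""
    path = root / f"{entry.project}.jsonl"
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(entry)) + "\n")
    return path


def load_entries(root: Path, project: str) -> list[EpisodicEntry]:
    """Read only the requested project's log, so other projects stay invisible."""
    path = root / f"{project}.jsonl"
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [EpisodicEntry(**json.loads(line)) for line in f if line.strip()]


# Small demonstration with two isolated projects.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    append_entry(root, EpisodicEntry("alpha", "s1", 0.0, "ran migration tool", 1.2))
    append_entry(root, EpisodicEntry("beta", "s7", 5.0, "fetched weather report"))
    alpha_entries = load_entries(root, "alpha")
    missing_entries = load_entries(root, "gamma")
```

Keeping one append-only file per project is just one simple way to get isolation; a real runtime could equally use directories or a key inside a shared log.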
The retrieval subsystem is formulated as a five-signal weighted scoring function whose signals include BM25 lexical relevance, exponential time decay, cognitive weight (learned from downstream tool execution outcomes), and tier boosts. Retrieval proceeds in two stages: BM25 first selects candidate sessions from the semantic tier, and the episodic entries within those sessions are then scored. A PPO-based policy framework is introduced for automated signal weight adaptation. Empirical evaluation nonetheless shows that the architecture itself, rather than the weights or the underlying LLM, is the chief limiting factor: performance is invariant across weight configurations and between a 7B generator and a 284B MoE generator.
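The two-stage flow can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: it combines only the four signals named in this summary as a linear sum, uses a textbook Okapi BM25 with whitespace tokenization, and all function names, weight values, the half-life parameter, and the dictionary keys are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized doc against the tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores


def combined_score(bm25, age_hours, cognitive_weight, tier_boost, w,
                   half_life_hours=24.0):
    """Weighted sum of the named signals; the linear form is an assumption."""
    decay = math.exp(-math.log(2) * age_hours / half_life_hours)
    return (w["bm25"] * bm25 + w["decay"] * decay
            + w["cog"] * cognitive_weight + w["tier"] * tier_boost)


def two_stage_retrieve(query, semantic_facts, episodic, w, now_hours,
                       n_sessions=2, k=2):
    q = query.lower().split()
    # Stage 1: rank sessions by BM25 over their distilled semantic facts.
    session_ids = list(semantic_facts)
    session_docs = [" ".join(semantic_facts[s]).lower().split() for s in session_ids]
    ranked = sorted(zip(session_ids, bm25_scores(q, session_docs)),
                    key=lambda p: p[1], reverse=True)
    keep = {s for s, _ in ranked[:n_sessions]}
    # Stage 2: score only the episodic entries of the selected sessions.
    cands = [e for e in episodic if e["session_id"] in keep]
    if not cands:
        return []
    lexical = bm25_scores(q, [e["text"].lower().split() for e in cands])
    scored = [
        (combined_score(s, now_hours - e["t_hours"], e["cognitive_weight"],
                        e["tier_boost"], w), e)
        for s, e in zip(lexical, cands)
    ]
    scored.sort(key=lambda p: p[0], reverse=True)
    return [e for _, e in scored[:k]]


WEIGHTS = {"bm25": 1.0, "decay": 0.5, "cog": 0.3, "tier": 0.2}  # illustrative only
facts = {
    "s1": ["user prefers postgres database"],
    "s2": ["user likes hiking trails"],
    "s3": ["deploy uses docker"],
}
episodic = [
    {"session_id": "s1", "text": "user said they prefer the postgres database",
     "t_hours": 90.0, "cognitive_weight": 1.0, "tier_boost": 0.0},
    {"session_id": "s1", "text": "tool call to run tests succeeded",
     "t_hours": 95.0, "cognitive_weight": 0.5, "tier_boost": 0.0},
    {"session_id": "s2", "text": "user asked about database of hiking trails",
     "t_hours": 99.0, "cognitive_weight": 1.0, "tier_boost": 0.0},
]
hits = two_stage_retrieve("which database does the user prefer", facts, episodic,
                          WEIGHTS, now_hours=100.0, n_sessions=1, k=2)
```

Because the BM25 term is unbounded while the decay term stays within (0, 1], lexical score can easily swamp the other signals in a sum like this, which is consistent with the BM25 dominance the summary reports.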
Empirical Evaluation
Using the LongMemEval-S benchmark (500 questions, 53-session haystacks), MemTier with semantic pre-population achieves Accuracy = 0.382 and F1 = 0.412 with Qwen2.5-7B on a consumer-grade 6GB GPU, a 33-percentage-point improvement over the full-context baseline (0.050 → 0.382). On single-session recall tasks, MemTier reaches 0.686–0.732, substantially outperforming the RAG BM25 GPT-4o baseline (0.560). Temporal reasoning and multi-session synthesis also gain from structured semantic extraction, although scores on both remain low (0.323 and 0.173, respectively).
Ablation studies reveal that the semantic tier is the dominant performance driver (removing it results in a 51× F1 reduction, ΔAcc -0.128), with two-stage scoping and the individual signals (decay, cognitive weight, tier boost) providing additive, non-redundant benefits. Tuning the number of retrieved entries (k=2) and the token budget (600 tokens) further improves accuracy.
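To make the two tuned parameters concrete, here is a minimal sketch of packing the top k scored entries into a fixed token budget. Whitespace-separated words stand in for real tokenizer tokens, and the function name and truncation policy are assumptions for illustration only.

```python
def pack_context(scored_entries, k=2, token_budget=600):
    """Keep the k highest-scoring texts, truncating so the total fits the budget.

    scored_entries: iterable of (score, text) pairs.
    Returns (list of kept texts, number of budget tokens used).
    """
    chosen, used = [], 0
    top = sorted(scored_entries, key=lambda p: p[0], reverse=True)[:k]
    for _, text in top:
        remaining = token_budget - used
        if remaining <= 0:
            break
        tokens = text.split()          # crude proxy for a model tokenizer
        kept = tokens[:remaining]
        chosen.append(" ".join(kept))
        used += len(kept)
    return chosen, used


scored = [(0.9, "a " * 500), (0.5, "b " * 300), (0.7, "c " * 200)]
context, used = pack_context(scored)
```

In this toy input the lowest-scoring entry is dropped by k=2, and the second-ranked entry is cut to the 100 tokens left after the first one, so the budget is filled exactly.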
LoCoMo, a conversational memory benchmark in which the full conversation history is provided at query time, shows that memory architecture is irrelevant in that setting: MemTier and baseline scores are statistically identical. This underscores the need for benchmarks that explicitly test long-term, agentic retrieval separately from in-context comprehension.
Diagnostic Findings and Analysis
MemTier's evaluations identify a three-layer invariance: performance is bounded primarily by the BM25 retrieval architecture rather than by generator size or the adaptive signal weights. PPO-based adaptation of retrieval weights yields negligible gains because of BM25 signal dominance and proxy-derived reward traps. This finding motivates a transition to recall-first dense retrieval systems.
Token efficiency analysis shows semantic tier condensation (from ∼509 to ∼3.1 facts/question) drives both compression and substantial F1 improvements, recommending precision over coverage in fact extraction pipelines.
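For scale, the condensation quoted above corresponds to roughly the following reduction in facts per question. Both inputs are the approximate figures given in this summary, so the ratio is itself only approximate.

```latex
\[
\text{condensation ratio} \approx \frac{509}{3.1} \approx 164
\]
```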
Limitations include hardware constraints that block optimal logprob-based attribution, BM25 score dominance that masks RL weight adaptation, and coarse KV-pattern heuristics in semantic fact extraction. These suggest further gains from dense retrieval and more advanced relation extraction.
Practical and Theoretical Implications
MemTier's approach establishes robust, high-precision memory isolation and structured retrieval as prerequisites for sustained autonomous agent performance. The findings demonstrate that architectural choices (e.g., memory tiering, the retrieval method, consolidated semantic distillation) directly govern agentic operational boundaries. While generator scaling and adaptive weighting are nominally attractive, the observed invariance indicates that architectural memory constraints set the performance ceiling.
Future implementations should transition to dense retrieval for better multi-session synthesis and temporal resolution, integrate fine-grained NLP-based fact extraction to enrich the semantic tier, and support higher-fidelity RL or attribution loops. The findings endorse LongMemEval-S as the standard for evaluating persistent agent memory and caution against reliance on benchmarks with context-injected queries (e.g., LoCoMo).
Conclusion
MemTier demonstrates that structured, tiered memory architectures with principled retrieval pipelines enable significant advances in agentic long-horizon performance, as validated through rigorous benchmarking. The primary bottleneck lies in legacy BM25 retrieval, suggesting that further progress requires recall-first dense retrieval architectures and advanced extraction mechanisms. The architecture's modularity and empirical diagnostics delineate a clear trajectory for advancing autonomous agent memory systems in both practical deployments and theoretical research (2605.03675).