Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Published 21 May 2026 in cs.AI | (2605.22791v1)

Abstract: Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.

Summary

  • The paper's main contribution is decoupling erase and write operations using independent channel-wise gates, enhancing memory editing in recurrent attention models.
  • It employs a fast-weight update and WY matrix transformation to maintain efficient chunkwise training and robust retrieval in long contexts.
  • Empirical results demonstrate improved language modeling, reduced interference, and state-of-the-art retrieval across benchmarks using the proposed architecture.

Decoupling Erase and Write in Linear Attention: The Gated DeltaNet-2 Architecture

Background and Motivation

Linear recurrent attention mechanisms have been developed to address the quadratic complexity of conventional self-attention in sequence models. By substituting the standard attention matrix with a fixed-size recurrent state, these approaches maintain constant memory usage and achieve linear computational cost in sequence length. However, fixed-state models face inherent limitations, notably the challenge of interference and information loss when compressing long contexts into bounded memory. In prior work, memory management hinged on applying scalar gates for global decay (forgetting) and targeted overwriting (editing) of state content. State-of-the-art recurrent models such as Gated DeltaNet and Kimi Delta Attention (KDA) incorporate decay and delta-rule residual editing, but these models fundamentally conflate the mechanisms for erasing obsolete information and committing new content to memory, tying both operations to a single scalar parameter per head.

Contributions of Gated DeltaNet-2

Gated DeltaNet-2 introduces a principled decoupling of the erase and write operations in linear recurrent attention, implementing channel-wise gates for each. The architecture maintains KDA's channel-wise decay but replaces its tied scalar delta gate with two independent, channel-wise gates: a key-side erase gate ($b_t \in [0,1]^{d_k}$) and a value-side write gate ($W_t \in [0,1]^{d_v}$, written $w_t$ in the abstract). This separation enables selective, channel-specific removal of outdated associations and insertion of new values, providing finer-grained memory editing. Gated DeltaNet-2 generalizes existing delta-rule recurrent attention models: it reduces to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay is scalar as well.
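
The two reductions can be checked by direct substitution into the recurrence given in the Methodology section. This is a sketch under stated assumptions: $k_t z_t$ is read as an outer product, and the symbols $\beta_t$ (a shared scalar gate), $\alpha_t$ (a scalar decay), and $\mathbf{1}$ (an all-ones vector of the matching dimension) are introduced here for illustration and do not appear in the summary.

```latex
\begin{aligned}
\text{General:}\quad
  & S_t = (I - k_t e_t^{\top})\, D_t\, S_{t-1} + k_t z_t,
    \qquad e_t = b_t \odot k_t,\quad z_t = W_t \odot U_t \\
\text{Tied scalar gates } (b_t = \beta_t \mathbf{1},\; W_t = \beta_t \mathbf{1}):\quad
  & S_t = (I - \beta_t\, k_t k_t^{\top})\, D_t\, S_{t-1} + \beta_t\, k_t U_t \\
\text{Scalar decay as well } (D_t = \alpha_t I):\quad
  & S_t = \alpha_t\, (I - \beta_t\, k_t k_t^{\top})\, S_{t-1} + \beta_t\, k_t U_t
\end{aligned}
```

The second line is the tied form that the text associates with KDA, and the third is the one it associates with Gated DeltaNet.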

The authors also develop a fast-weight update perspective that preserves efficient chunkwise training through a WY matrix formulation, where channel-specific decays are absorbed into asymmetric rank-one factors for erase and write, ensuring compatibility with efficient vectorized kernels and backward passes.
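
To make the triangular-solve structure concrete, here is a minimal, dependency-free sketch for one chunk. It is deliberately simplified and is not the paper's kernel: decay is omitted ($D_t = I$), the update is written as $S_t = S_{t-1} + k_t (z_t - e_t^{\top} S_{t-1})$, and all function names and toy numbers are hypothetical. Writing $u_t = z_t - e_t^{\top} S_{t-1}$, the per-token corrections satisfy a unit-lower-triangular system whose off-diagonal entries are $e_t \cdot k_s$ for $s < t$, so the whole chunk can be resolved with one forward substitution and one matrix product.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sequential(S0, K, E, Z):
    """Token-by-token update S <- S + k (z - e^T S); decay omitted (D_t = I)."""
    S = [row[:] for row in S0]
    d_k, d_v = len(S), len(S[0])
    for k, e, z in zip(K, E, Z):
        r = [sum(e[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
        S = [[S[i][j] + k[i] * (z[j] - r[j]) for j in range(d_v)]
             for i in range(d_k)]
    return S


def chunkwise(S0, K, E, Z):
    """Same final state via a unit-lower-triangular solve over the chunk."""
    d_k, d_v, C = len(S0), len(S0[0]), len(K)
    U = []
    for t in range(C):
        # Right-hand side: z_t - e_t^T S_0 (depends only on the chunk's initial state).
        rhs = [Z[t][j] - sum(E[t][i] * S0[i][j] for i in range(d_k))
               for j in range(d_v)]
        # Forward substitution: subtract (e_t . k_s) * u_s for every earlier token s.
        for s in range(t):
            coeff = dot(E[t], K[s])
            rhs = [rhs[j] - coeff * U[s][j] for j in range(d_v)]
        U.append(rhs)
    # Final state: S_C = S_0 + K^T U.
    return [[S0[i][j] + sum(K[t][i] * U[t][j] for t in range(C))
             for j in range(d_v)] for i in range(d_k)]


# Hypothetical toy chunk: 3 tokens, d_k = 2, d_v = 3.
K = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
B = [[1.0, 0.5], [0.2, 0.9], [0.7, 0.3]]          # channel-wise erase gates
E = [[b * k for b, k in zip(bt, kt)] for bt, kt in zip(B, K)]
Z = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0], [2.0, 0.0, 1.0]]
S0 = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]

seq, chk = sequential(S0, K, E, Z), chunkwise(S0, K, E, Z)
print(max(abs(a - b) for ra, rb in zip(seq, chk) for a, b in zip(ra, rb)))
```

The pure-Python loops are only there to show the algebra. In a real kernel the forward substitution and the final product would be batched matrix operations, and the channel-wise decay would be folded into the erase and write factors as the text describes.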

Methodology

Detailed Architectural Advances

  1. Channel-wise Erase and Write Gates: Instead of using a single scalar to control how much to erase and write, Gated DeltaNet-2 learns independent channel-wise gates for each. The key-side erase gate determines which components of the memory state, projected onto the key dimension, should be overwritten, while the value-side write gate independently selects which value components are allowed to update the state. This breaks the restrictive coupling in earlier models, where a single scalar set the strength of both erasing and writing.
  2. Mathematical Formulation: The recurrence is as follows:

$$S_t = (I - k_t e_t^{\top}) D_t S_{t-1} + k_t z_t,$$

where $e_t = b_t \odot k_t$, $z_t = W_t \odot U_t$, and $D_t$ is the diagonal channel-wise decay. This structure accommodates targeted editing along different coordinate axes for reading out and writing in.

  3. Efficient Parallel Training: Through the WY transformation, the authors show that the recurrence, including channel-wise erase and write, admits an efficient matrix-product and triangular-solve structure, keeping both forward and backward passes hardware-efficient.
  4. Flexible Reductions: Gated DeltaNet-2 recovers KDA and Gated DeltaNet by appropriately tying its gates and decay, so it strictly generalizes these prior channel- and scalar-gated variants.
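
The recurrence above can be sketched as a single state update in plain Python. This is a minimal illustration, not the released implementation: the function name `gdn2_step`, the list-of-lists state layout, and the toy numbers are assumptions, `u` stands for the value-side vector the formula calls $U_t$, and $k_t z_t$ is read as an outer product.

```python
def gdn2_step(S, k, b, w, u, decay):
    """One step of S_t = (I - k e^T) D S_{t-1} + k z^T, with e = b*k and z = w*u.

    S is a d_k x d_v list of lists; k, b, decay have length d_k; w, u have length d_v.
    """
    d_k, d_v = len(S), len(S[0])
    # D_t is diagonal, so it rescales row i of the state by decay[i].
    M = [[decay[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    e = [b[i] * k[i] for i in range(d_k)]   # erase direction e_t = b_t ⊙ k_t
    z = [w[j] * u[j] for j in range(d_v)]   # gated write   z_t = W_t ⊙ U_t
    # r = e^T M: what the erase direction currently reads from the decayed state.
    r = [sum(e[i] * M[i][j] for i in range(d_k)) for j in range(d_v)]
    # (I - k e^T) M + k z^T is the same as M + k (z - r)^T.
    return [[M[i][j] + k[i] * (z[j] - r[j]) for j in range(d_v)]
            for i in range(d_k)]


S = [[5.0, 5.0], [7.0, 7.0]]   # toy state, d_k = d_v = 2
k = [1.0, 0.0]                 # key pointing at row 0
no_decay = [1.0, 1.0]

# Full erase, full write: row 0 is replaced by u.
full = gdn2_step(S, k, b=[1.0, 1.0], w=[1.0, 1.0], u=[2.0, 3.0], decay=no_decay)
# Full erase, write only value channel 0: channel 1 of row 0 is cleared, not rewritten.
partial = gdn2_step(S, k, b=[1.0, 1.0], w=[1.0, 0.0], u=[2.0, 3.0], decay=no_decay)
# No erase, full write: the new value is added on top of the old content.
additive = gdn2_step(S, k, b=[0.0, 0.0], w=[1.0, 1.0], u=[2.0, 3.0], decay=no_decay)

print(full)      # [[2.0, 3.0], [7.0, 7.0]]
print(partial)   # [[2.0, 0.0], [7.0, 7.0]]
print(additive)  # [[7.0, 8.0], [7.0, 7.0]]
```

The `partial` and `additive` cases are the ones a single tied scalar cannot express: erase strength and write strength are set independently, and per channel.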

Hybrid Model Integration

Gated DeltaNet-2 is deployed both as a pure recurrent stack and as the recurrent backbone in hybrid architectures with Sliding-Window Attention (SWA). The hybrid block sequence alternates the recurrent mixer with SWA and MLP layers, ensuring local evidence integration while the recurrent state manages globally persistent memory.
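
As a rough illustration of the local-attention half of the hybrid, the mask below lets each token attend only to itself and the few tokens just before it. The function name, the tiny window, and the exact boundary convention are assumptions for illustration, not details taken from the paper.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Causal, and limited to the most recent `window` positions (including i itself).
    """
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


mask = sliding_window_mask(seq_len=5, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Anything older than the window is invisible to these layers, which is why the recurrent state has to carry the long-range information.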

Empirical Results and Analysis

Language Modeling and General Reasoning

On multiple language modeling and commonsense reasoning benchmarks, Gated DeltaNet-2 demonstrates the strongest aggregate performance under controlled parameter and state-size settings versus Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants. In both recurrent-only and hybrid settings, it achieves the strongest overall perplexity and accuracy. Because parameter count and state size are matched, this suggests that the gains come from improved memory updates rather than increased memory capacity.

Long-Context Retrieval and Interference

The most significant improvements occur on long-context in-context retrieval tasks, specifically on RULER synthetic benchmarks. Gated DeltaNet-2 maintains high recall under severe association interference scenarios—especially in the multi-key setting, where competing associations are encoded in a fixed-state. This attests to the practical utility of decoupled erase and write gates for minimizing destructive interference and sustaining reliable content retrieval across extended sequences.

Real-World Benchmark Evaluation

Across several real-world extraction and question-answering datasets, Gated DeltaNet-2 achieves the best or near-best retrieval accuracy in both recurrent and hybrid model forms. The largest relative improvements are observed in tasks emphasizing robust memory recovery amid noise and distracting evidence. The hybrid variant maintains state-of-the-art recall as sequence lengths scale, confirming that SWA integration does not compromise the underlying model's long-context capabilities.

Ablation and Efficiency Studies

Ablation experiments show that both channel-wise erase and write gates contribute to the observed gains, though the erase gate carries more weight in the context of memory preservation and interference reduction. Prohibiting channel-wise structure in either gate (i.e., reverting to scalar gating) causes consistent degradation on both modeling and retrieval tasks. Throughput analysis confirms that Gated DeltaNet-2 retains near-linear scaling in hardware efficiency, with a negligible constant overhead attributable to its finer-grained gating mechanisms.

Implications and Future Directions

Gated DeltaNet-2 addresses a critical bottleneck in fixed-size recurrent sequence models: the destructive coupling between forgetting (erasing) and learning (writing) in compressed memory states. By enabling axis-aligned, channel-wise control over the content lifecycle, it minimizes association interference and supports more robust retrieval over long contexts. The architecture preserves the efficiency necessary for scalable training and dense deployment. Future work may explore dynamic adaptivity in gating (beyond channel-wise structure), gating policies informed by external memory or hierarchical cues, and broader applications beyond language tasks, such as vision transformers or time-series modeling where long-term memory fidelity under fixed-capacity constraints is essential. Integrating such decoupled gating with further advances in state-space models or leveraging hybrid recurrent-attention backbones may yield further improvements in efficiency, generalization, and scalability.

Conclusion

Gated DeltaNet-2 generalizes and unifies prior delta-rule linear attention models by decoupling the erase and write operations with independent channel-wise gates, improving the ability of linear recurrent architectures to store, retrieve, and update associations in long sequences. Under fixed memory, it reports the strongest overall results among the compared models in language modeling and retrieval, making it a strong baseline for efficient, interference-resistant recurrent attention layers in large-scale sequence modeling (2605.22791).

Explain it Like I'm 14

Gated DeltaNet-2, explained simply

1) What is this paper about?

This paper is about making AI models remember long stories or documents better and faster. Standard Transformers remember by comparing every word to every other word, which gets very slow as the text gets longer. Gated DeltaNet-2 is a new way to remember that keeps a small “memory” the same size no matter how long the text is, while still finding the right facts when needed.

2) What questions does it try to answer?

The paper focuses on a few simple questions:

  • How can a model with a small, fixed memory remember long texts without mixing things up?
  • Can we control forgetting and writing into memory more precisely, so old facts aren’t accidentally ruined?
  • Can we do this without slowing training and inference?
  • Does this actually help on real tasks like language modeling and finding specific information in long contexts?

3) How does it work? (with simple analogies)

Think of the model’s memory as a small whiteboard whose size never changes. Each new word in the text is like a student who:

  1. Reads what’s currently written at a specific “address” (the key),
  2. Decides what to erase,
  3. Decides what new stuff to write (the value),
  4. And slightly fades the old writing over time (decay), so very old notes slowly disappear unless refreshed.

Some basic ideas and the upgrades in this paper:

  • Queries, keys, and values:
    • Key = the “address label” for a memory slot.
    • Value = the “note” you store at that address.
    • Query = the “question” you ask the memory to get an answer.
  • Linear attention with a fixed state: Instead of keeping a huge list of past notes, the model keeps one small board (a matrix). Every token updates this board in constant space and linear time, so speed and memory stay under control as text grows.
  • The delta rule (targeted editing): Before writing new content, the model first reads what’s already stored for the current key and subtracts it. This is like erasing the specific old note at that address, then writing the new note. It prevents piling new notes on top of old ones and getting a mess.
  • What earlier models did:
    • Mamba-2: Adds a “fade” knob (decay) to make older content slowly disappear.
    • DeltaNet/Gated DeltaNet: Uses the delta rule plus a global gate to control how strongly to erase and write.
    • KDA (Kimi Delta Attention): Makes the fade (decay) smarter by tuning it separately for each “color channel” of the memory. Think of the memory note as having multiple colored layers; KDA can fade each color differently. But it still uses a single shared knob to control both erase and write strength.
  • What Gated DeltaNet-2 changes:
    • Two separate knobs instead of one:
      • Erase gate b_t (key side): a set of per-channel dimmer switches that decide which parts of the old note to erase. Different channels (think “colors” or “features”) can be erased more or less.
      • Write gate w_t (value side): another set of per-channel dimmer switches that decide which parts of the new note to write.
    • Channel-wise decay stays: each channel can fade at its own rate, like in KDA.
    • Why this helps: Erasing and writing are different actions. Sometimes you want to erase a lot but only write a little, or erase one set of features and write a different set. A single shared knob can’t do both well; two knobs give finer control.
  • Efficient training and inference:
    • Chunking: The model processes the sequence in small blocks (chunks) so it still trains in parallel on GPUs.
    • A math trick (WY form) and a “gate-aware” backward pass keep this efficient even with the new per-channel gates.
    • Result: Nearly the same speed as earlier linear-attention models, and far faster scaling than a standard Transformer as sequences get longer.
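
Here is a tiny, made-up example of the "erase first, then write" idea from the list above. It compares just piling notes onto the board with the delta rule, using a 2×2 board and one address. The function names and numbers are invented for illustration, and decay and gates are left out to keep it short.

```python
def read(S, q):
    """Ask the memory a question: returns q^T S (one number per value channel)."""
    return [sum(q[i] * S[i][j] for i in range(len(S)))
            for j in range(len(S[0]))]


def additive_write(S, k, v):
    """Plain linear attention: pile k v^T on top of whatever is stored."""
    return [[S[i][j] + k[i] * v[j] for j in range(len(v))]
            for i in range(len(k))]


def delta_write(S, k, v):
    """Delta rule: subtract what is currently stored for k, then write v."""
    old = read(S, k)
    return [[S[i][j] + k[i] * (v[j] - old[j]) for j in range(len(v))]
            for i in range(len(k))]


empty = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]  # the "address" (unit length)

# Write the note [1, 1], then update the same address to [4, 2].
piled = additive_write(additive_write(empty, key, [1.0, 1.0]), key, [4.0, 2.0])
edited = delta_write(delta_write(empty, key, [1.0, 1.0]), key, [4.0, 2.0])

print(read(piled, key))   # [5.0, 3.0]  old and new notes blurred together
print(read(edited, key))  # [4.0, 2.0]  only the new note remains
```

Gated DeltaNet-2 goes one step further than `delta_write`: it adds dimmer switches so the amount erased and the amount written can differ, channel by channel.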

4) What did they find, and why does it matter?

Main takeaways:

  • Better long-context memory: On tough “needle-in-a-haystack” tests (find the exact item buried in very long text), Gated DeltaNet-2 does best overall, especially when there are many distracting keys and only one correct one. This is where precise erase vs. write control really matters.
  • Strong general performance: With a 1.3B-parameter model trained on 100B tokens, it gets the best overall results among similar models (like Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants) across:
    • Language modeling (predicting the next word),
    • Commonsense reasoning,
    • Retrieval from long contexts (both synthetic tests and real datasets).
  • Real-world retrieval: It achieves the best average on practical tasks like extracting facts from web pages or PDFs and answering questions when lots of distracting text is present.
  • Ablations (what matters most?): Both gates help, but the erase gate contributes most of the gain. This makes sense: if you can carefully remove the right old content, you avoid interference and keep the memory clean.
  • Efficiency: Training speed remains high and scales well with longer sequences, with only a small overhead compared to previous linear-attention models.

Why it matters:

  • Models that can remember long documents accurately without huge memory cost are crucial for tasks like long-form question answering, code understanding, medical or legal document analysis, and multi-step reasoning across many pages.

5) What’s the bigger impact?

  • Smarter memory editing: Decoupling erasing from writing is a simple but powerful idea. It reduces “interference,” where many facts crowded into a small memory blur together.
  • Practical for long inputs: You can handle thousands of tokens with steady memory use, making it useful for servers and possibly edge devices.
  • Plays well with hybrids: You can combine this fixed-memory recurrence for long-range information with a small sliding-window attention for precise local details. This keeps things fast and accurate.
  • Future directions: The same principle—more precise control over what to remove and what to add—could inspire even better memory systems, leading to models that are both efficient and reliable over very long contexts.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored in the paper, framed to guide actionable future work:

  • Scaling behavior beyond 1.3B parameters and 100B tokens: do the gains from decoupled erase/write persist or widen at larger scales (e.g., 7B–70B) and longer training runs (≥1T tokens), and how do depth/width, number of heads, and d_k/d_v affect outcomes?
  • Ultra-long-context regime: the model is trained at 4K and evaluated up to 8K (with a 16K throughput plot); efficacy, stability, and interference control for 32K–1M contexts remain untested, as do training strategies for such lengths.
  • Decoding latency and end-to-end inference metrics: while a forward-only recurrent kernel is provided, the paper does not report decode tokens/s, latency under batching, cache/memory footprint, or beam search performance relative to KDA, Mamba-3, and Transformers.
  • Theoretical capacity and interference analysis: no formal characterization of memory capacity, retrieval error, or interference as a function of d_k, d_v, chunk size, and gate statistics; conditions for stable overwrite and non-interference under channel-wise erase/write are not derived.
  • Stability guarantees: absent analysis of convergence and spectral stability when combining channel-wise decay with asymmetric erase factors (including the [0,2] erase scaling); criteria to prevent oscillation or drift in long streams are not established.
  • Gate parameterization design space: only sigmoid gating is explored; alternatives (e.g., softmax/entmax over channels, temperature scheduling, sparsity/entropy regularizers, hard gating, low-rank/shared gate factors) and their effects on capacity, stability, and efficiency are open.
  • Gate dynamics and interpretability: no analysis of learned gate patterns across layers/heads/tokens (e.g., which channels are erased vs. written for which structures), nor causal probes to validate the claimed selective erase/write behavior.
  • Sensitivity to chunk size and algorithmic hyperparameters: chunk size C is fixed (64); effects of varying C, adaptive/dynamic chunking, and trade-offs between accuracy, numerical error in the triangular solve, and throughput are not studied.
  • Precision and quantization robustness: the approach relies on fp32 accumulators and fp32 triangular solves; resilience to bf16/fp8/int8 (training and inference), mixed-precision triangular solves, and quantization-aware training is not evaluated.
  • Numerical stability at extreme lengths: the decay accumulation and triangular inverse are recognized as precision-sensitive but not stress-tested on very deep sequences, very small/large gates, or low-precision hardware configurations.
  • Hybrid architecture design choices: SWA window is fixed at 2K; no ablations on window size, adaptive/learned windows, or removal of SWA to quantify division-of-labor between recurrent memory and local attention across tasks.
  • Generalization across domains and modalities: evaluations focus on English LM and retrieval; performance on multilingual corpora, code, math, long-form reasoning, and multimodal settings is unknown.
  • Fine-tuning and instruction-following: the impact of decoupled gates under supervised fine-tuning, instruction tuning, RLHF, and tool/retrieval-augmented generation has not been assessed.
  • Robustness and adversarial stress-tests: beyond RULER and selected retrieval tasks, robustness to noisy inputs, distractor overload, adversarial prompts, and domain shift (e.g., document structure perturbations) is not systematically measured.
  • Fairness of baseline tuning: Mamba-3 MIMO rank is fixed at R=4 and other baselines may be under-tuned; sensitivity of conclusions to stronger baseline hyperparameter sweeps (rank, state size, discretization choices) is unverified.
  • Grouped value head design: the paper repeats q, k, g, b across value-head groups; the trade-off between grouped vs. independent gates per value group, and the impact on capacity/efficiency, is not ablated.
  • External-memory and retrieval augmentation: interoperability with explicit retrieval (kNN, memory tables), key-value caches, or tool-use pipelines to further reduce interference is unexplored.
  • Composition with other recurrent advances: compatibility and synergies with MIMO fast-weights, complex rotations (as in Mamba-3), or multi-input/multi-output delta updates have not been tested.
  • Regularization of gates: potential gate saturation (collapse to 0/1), temporal smoothness, or sparsity constraints and their effect on stability and generalization are not examined.
  • Negative-eigenvalue variant: expanding b_t to [0,2] showed no clear gain at 1.3B; when, why, and at what scale this variant helps (or harms) stability and retrieval remains unclear.
  • Memory utilization diagnostics: no direct measurements of collision rates, effective capacity (e.g., #distinct associations retained), or per-layer memory usage over long sequences in naturalistic data.
  • Data scaling and curriculum: training uses FineWeb-Edu with 4K sequences; curricula for progressively longer contexts, mixture-of-length training, and their interaction with gate learning are open.
  • Hardware portability and kernel generality: performance and numerical behavior on non-Hopper GPUs, non-NVIDIA hardware, and different compiler stacks are not reported; autotuning coverage and failure modes are not mapped.
  • Safety, privacy, and calibration: the effects of decoupled memory edits on hallucination rates, calibration, and memorization/privacy risks are not addressed.

Practical Applications

Immediate Applications

The following applications can be built or piloted now by leveraging the published code, kernels, and training recipe for Gated DeltaNet-2, which preserves linear-time training/inference with constant memory and improves long-context retrieval and recall.

  • Long-context enterprise document QA and summarization (contracts, policies, manuals)
    • Sector: Legal, Enterprise software
    • What it does: Answers questions and produces summaries over very long documents or multi-document packs without quadratic attention cost, reducing interference among many references via decoupled erase/write gates.
    • Tools/products/workflows: Hybrid Gated DeltaNet-2 + Sliding-Window Attention (SWA) block as a drop-in token mixer in an LLM; document ingestion pipeline with chunking and optional vector store; GPU-serving with fixed-size recurrent state instead of growing KV cache.
    • Assumptions/dependencies: Performance reported at 1.3B on FineWeb-Edu; domain adaptation/fine-tuning recommended for legal corpora; best throughput on NVIDIA GPUs with Triton kernels; long-context formatting still benefits from SWA for local comparisons.
  • Retrieval-augmented generation (RAG) with improved context packing and recall
    • Sector: Software, Search, Customer support
    • What it does: Packs more retrieved passages per request with constant memory inference; decoupled erase/write reduces interference in distractor-heavy RAG, improving answer recall and grounding.
    • Tools/products/workflows: Swap the Transformer attention mixer for Gated DeltaNet-2 in a RAG stack; preserve SWA for local reasoning; maintain smaller or no KV cache; integrate with existing retrievers and rerankers.
    • Assumptions/dependencies: RAG quality also depends on retriever and prompt design; verify task-specific gains since benchmarks in the paper focus on recall-heavy settings.
  • PDF/HTML key–value extraction at scale
    • Sector: Document AI, Compliance, Finance
    • What it does: Improves structured field extraction from long PDFs and web pages (matches gains on FDA and SWDE) with stable recall over long, noisy contexts.
    • Tools/products/workflows: Batch ETL pipeline feeding Gated DeltaNet-2-based models for field extraction; constant memory enables higher document lengths per GPU.
    • Assumptions/dependencies: Domain fine-tuning improves robustness; layout-aware preprocessing still helpful.
  • Multi-turn customer support and CRM assistants with long session memory
    • Sector: CX, Sales/CRM
    • What it does: Maintains and edits long dialogue histories efficiently, reducing memory interference across many user issues within the same session.
    • Tools/products/workflows: Replace attention module in existing assistants; maintain a small recurrent state across turns; log-aware SWA for local slot filling.
    • Assumptions/dependencies: Guardrails and conversation safety layers still required; cross-session identity linking is a separate system concern.
  • Code assistants over large repositories and long files
    • Sector: Developer tools
    • What it does: Handles long code contexts (multi-file diffs, large notebooks) with constant memory; decoupled erase/write helps disambiguate many symbol associations.
    • Tools/products/workflows: Use hybrid Gated DeltaNet-2 blocks in code LLMs; plug into IDEs; optionally combine with file-level retrieval.
    • Assumptions/dependencies: Requires code-pretrained checkpoints; static analysis and LSP signals remain complementary.
  • Streaming meeting/call transcription and summarization
    • Sector: Productivity, Enterprise IT
    • What it does: Processes hours-long transcripts in a streaming fashion with fixed memory; selectively forgets stale context while writing salient updates.
    • Tools/products/workflows: Real-time ASR followed by Gated DeltaNet-2 summarizer; chunkwise processing with minimal buffering; SWA for short-range discourse links.
    • Assumptions/dependencies: ASR quality limits ceiling; tune decay/erase behavior for latency/recall trade-offs.
  • Log and telemetry analytics (security, reliability)
    • Sector: Observability, Security
    • What it does: Long-horizon pattern detection and anomaly explanation in streams of events without growing memory footprint.
    • Tools/products/workflows: Online inference with the provided recurrent decoding kernel; alerting workflows consuming model outputs.
    • Assumptions/dependencies: Domain adaptation and labeling strategy needed; privacy/governance controls for log data.
  • Cost/performance optimization for LLM serving
    • Sector: Cloud/Inference platforms
    • What it does: Reduces or eliminates the per-request KV cache growth by using a fixed-size state, enabling more concurrent users or longer contexts per GPU.
    • Tools/products/workflows: Retool serving stack to store a small fp32 recurrent state; throughput tuning with Triton kernels; use hybrid blocks to match Transformer quality on local tasks.
    • Assumptions/dependencies: Gains depend on request length distribution; some tasks still favor global attention if exact token–token interactions dominate.
  • Edge inference for long-context NLP on NVIDIA Jetson-class devices
    • Sector: Embedded/Edge AI
    • What it does: Brings long-context summarization or QA to constrained devices using constant-memory recurrence.
    • Tools/products/workflows: Quantized Gated DeltaNet-2 variants; on-device chunkwise processing; local data privacy.
    • Assumptions/dependencies: Kernel support and memory bandwidth on target hardware; potential accuracy loss from aggressive quantization.
  • Academic baseline for long-context memory interference research
    • Sector: Academia/Research
    • What it does: Provides a competitive open baseline with controllable gates for studying interference, forgetting, and fast-weight dynamics.
    • Tools/products/workflows: Use the released code and ablations (channel vs scalar gates, erase range) to design experiments and coursework.
    • Assumptions/dependencies: Reproducing results requires matched training recipe and hardware; benchmarks like RULER, LAMBADA, and real-world retrieval included.
  • Framework integration and kernels for practitioners
    • Sector: Software tooling
    • What it does: Incorporates the gate-aware WY chunkwise algorithm and fused Triton kernels into PyTorch/JAX ecosystems for wider adoption.
    • Tools/products/workflows: Package as an attention drop-in; expose knobs for chunk size, SWA window, precision flags.
    • Assumptions/dependencies: Maintenance of custom kernels; compatibility with BF16/FP32 and future GPUs; test coverage for variable-length batches.
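
For the serving item above (a fixed-size recurrent state in place of a growing KV cache), a back-of-the-envelope comparison can be sketched as follows. All model dimensions here are hypothetical and chosen only to show the scaling behavior; the 2-byte cache and 4-byte (fp32) state element sizes are assumptions as well.

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    """Keys and values for every past token, in every layer and head."""
    return 2 * seq_len * n_layers * n_heads * head_dim * bytes_per_elem


def recurrent_state_bytes(n_layers, n_heads, d_k, d_v, bytes_per_elem=4):
    """One d_k x d_v matrix per head and layer, independent of sequence length."""
    return n_layers * n_heads * d_k * d_v * bytes_per_elem


# Hypothetical configuration, not taken from the paper.
cfg = dict(n_layers=24, n_heads=16, head_dim=128)
state = recurrent_state_bytes(n_layers=24, n_heads=16, d_k=128, d_v=128)

for seq_len in (4_096, 32_768, 262_144):
    kv = kv_cache_bytes(seq_len, **cfg)
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:9.1f} MiB, "
          f"fixed state {state / 2**20:5.1f} MiB")
```

The cache grows linearly with context length while the state does not. In a hybrid stack the SWA layers still hold a cache, but it is bounded by the window size rather than by the full context.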

Long-Term Applications

These applications are promising but may require larger-scale training, multimodal integration, regulatory work, or hardware/compiler co-design.

  • Lifelong, privacy-preserving personal assistants with controllable forgetting
    • Sector: Consumer AI, Privacy
    • What it could do: Maintain years-long histories with explicit, per-channel erase controls aligned to user preferences (e.g., “forget financial details”).
    • Tools/products/workflows: Policy-aware gating APIs; on-device or federated deployment with constant memory.
    • Assumptions/dependencies: Stronger safety/alignment; UI and policy layers to surface and verify forgetting; larger models and personalization.
  • Longitudinal clinical summarization and decision support
    • Sector: Healthcare
    • What it could do: Reason over multiyear EHR timelines, selectively retaining clinically salient signals while decaying noise.
    • Tools/products/workflows: Fine-tuned medical LLMs with Gated DeltaNet-2 mixers; integration with EHR systems; audit logs of memory edits.
    • Assumptions/dependencies: Regulatory compliance (HIPAA/GDPR); medical pretraining; rigorous validation; human-in-the-loop oversight.
  • Robotics and autonomy with long-horizon memory
    • Sector: Robotics, Automotive
    • What it could do: Stream sensor/action histories while editing state to prevent memory interference; support task decomposition over long horizons.
    • Tools/products/workflows: Multimodal recurrent blocks (vision/audio/text) using decoupled gates; control stacks integrating fast-weight memory.
    • Assumptions/dependencies: Multimodal extensions and real-time guarantees; sim2real transfer; safety certifications.
  • Financial analytics and compliance monitoring at scale
    • Sector: Finance, RegTech
    • What it could do: Scan continuous streams (filings, chats, trades) with selective erase/write to track entities and obligations across long contexts.
    • Tools/products/workflows: Domain-adapted checkpoints; compliance dashboards; event-driven pipelines.
    • Assumptions/dependencies: High-stakes accuracy; explainability of memory edits; robust handling of adversarial inputs.
  • Ultra-long context LLMs (100K–1M tokens) with constant memory
    • Sector: Foundation models
    • What it could do: Train and serve models handling book-length inputs and multi-episode histories without quadratic cost.
    • Tools/products/workflows: Scaled Gated DeltaNet-2 layers; curriculum for long-context training; memory diagnostics for interference.
    • Assumptions/dependencies: Larger models and datasets; stability at extreme lengths; improved chunkwise solvers and precision controls.
  • Hybrid architectures that combine decoupled delta-rule memory with SSM rotations (e.g., Mamba-3 MIMO)
    • Sector: AI architecture R&D
    • What it could do: Merge channel-wise erase/write with data-dependent rotations for richer dynamics and better decoding latency.
    • Tools/products/workflows: New blocks that integrate WY updates with SSM inputs; kernel fusion strategies.
    • Assumptions/dependencies: Nontrivial kernel and backward-pass complexity; careful state-size and latency trade-offs.
  • Continual learning with explicit associative memory editing
    • Sector: ML research, Edge AI
    • What it could do: Task-adaptive fast weights with selective erasure of stale associations, mitigating catastrophic forgetting.
    • Tools/products/workflows: Training curricula that drive gate policies; evaluation on lifelong learning suites.
    • Assumptions/dependencies: Stable optimization with gate dynamics; monitoring tools for interference and drift.
  • Energy- and cost-aware AI policy and procurement
    • Sector: Public policy, Sustainability
    • What it could do: Favor linear-time, constant-memory models for long-context workloads to lower energy use, use-phase emissions, and hardware cost.
    • Tools/products/workflows: Benchmarks and reporting standards for energy per token vs. context length; procurement guidelines.
    • Assumptions/dependencies: Transparent energy metrics; comparable quality benchmarks across architectures.
  • Multimodal long-context understanding (video/audio+text)
    • Sector: Media, Safety
    • What it could do: Handle hours-long video transcripts and audio streams with selective memory editing (e.g., tracking characters/threads).
    • Tools/products/workflows: Tokenization and gating per modality; fusion layers; temporal SWA for local correlations.
    • Assumptions/dependencies: Robust multimodal training; efficient tokenization; licensing for media datasets.
  • Hardware and compiler co-design for gate-aware WY kernels
    • Sector: Semiconductors, Systems
    • What it could do: Accelerate triangular solves (A = (I + T)^{-1}) and gate-aware dot products in tensor cores/ASICs, with standardized ops in cuDNN/XLA.
    • Tools/products/workflows: Primitive support for lower-triangular solves with mixed precision; autotuning for fused forward/backward kernels.
    • Assumptions/dependencies: Vendor buy-in; sustained demand for recurrent linear mixers; correctness and numerical stability guarantees.
  • Safety and privacy-by-design via controllable erase semantics
    • Sector: Trust & Safety
    • What it could do: Implement audited “forgetting” at the fast-weight memory level to reduce accidental leakage across prompts or users.
    • Tools/products/workflows: Telemetry of gate activations; policy constraints on erase ranges; red-teaming frameworks targeting interference.
    • Assumptions/dependencies: Careful separation of per-user states; formalization of memory-edit guarantees; interaction with higher-level caches.
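
To put rough numbers on the constant-memory argument behind the ultra-long-context and energy/cost items, the sketch below compares a softmax KV cache, which grows with the number of tokens, against a fixed-size recurrent state. The layer count, head count, and head dimension are hypothetical values chosen for illustration, not the paper's configuration, and the function names are made up for this sketch. It also assumes a bf16 cache (2 bytes per element) and an fp32 state (4 bytes per element).

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Softmax attention: keys and values are stored for every past token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


def recurrent_state_bytes(layers, heads, key_dim, value_dim, bytes_per_elem=4):
    """Linear attention: one key_dim x value_dim matrix per head, at any length."""
    return layers * heads * key_dim * value_dim * bytes_per_elem


# Hypothetical configuration, used only for illustration.
cfg = dict(layers=24, heads=16, dim=128)

for tokens in (100_000, 1_000_000):
    cache = kv_cache_bytes(tokens, cfg["layers"], cfg["heads"], cfg["dim"])
    state = recurrent_state_bytes(cfg["layers"], cfg["heads"], cfg["dim"], cfg["dim"])
    print(f"{tokens:>9} tokens: KV cache {cache / 2**30:7.1f} GiB, "
          f"recurrent state {state / 2**20:5.1f} MiB")
```

The cache term scales linearly with `tokens`, while the state term does not depend on it at all. Hybrid models that add sliding-window attention keep a small extra cache bounded by the window size.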

Notes on cross-cutting dependencies and assumptions

  • Training and hardware: Reported results use 1.3B-parameter models trained on 100B FineWeb-Edu tokens and evaluated on NVIDIA GPUs with Triton-based fused kernels and chunk size C=64; reproducing the efficiency assumes similar hardware and kernel availability.
  • Hybrid design: For many tasks, pairing the recurrent mixer with SWA is important to capture exact local interactions; recurrent-only variants may underperform on purely local reasoning.
  • Stability/precision: L2-normalized queries/keys, fp32 state/accumulators, and careful triangular solve precision are part of the recipe; deviations can affect long-context stability.
  • Domain adaptation: Task/domain fine-tuning is recommended for regulated or specialized use (healthcare, legal, finance).
  • Benchmarks vs. production: The strongest gains are on long-context retrieval and interference-heavy settings; validate on your production distribution before wholesale migration.
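
The triangular solve mentioned under stability/precision can be illustrated with plain forward substitution. The sketch below inverts I + T for a small T and checks the result. It assumes T is strictly lower triangular, runs in Python floats (float64), and is a reference computation rather than the paper's fused Triton kernel; `inverse_unit_lower` and `matmul` are hypothetical helper names.

```python
def inverse_unit_lower(T):
    """Return A = (I + T)^{-1} for a strictly lower-triangular T (list of rows).

    Row i of A follows from (I + T) A = I:
        A[i] = e_i - sum_{j < i} T[i][j] * A[j]
    """
    n = len(T)
    A = []
    for i in range(n):
        row = [1.0 if c == i else 0.0 for c in range(n)]
        for j in range(i):
            t = T[i][j]
            for c in range(j + 1):  # A[j] is zero to the right of its diagonal
                row[c] -= t * A[j][c]
        A.append(row)
    return A


def matmul(X, Y):
    """Dense matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(r, col)) for col in zip(*Y)] for r in X]


T = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [-0.25, 2.0, 0.0]]

A = inverse_unit_lower(T)
I_plus_T = [[T[i][j] + (1.0 if i == j else 0.0) for j in range(3)] for i in range(3)]
product = matmul(I_plus_T, A)  # should be the identity matrix

print(A)
print(product)
```

Each row depends on all earlier rows, so rounding errors can accumulate down the matrix. That dependency is one reason the precision of this step matters more than that of the surrounding matrix multiplications.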

Glossary

  • Asymmetric delta recurrence: A recurrence update where the erase and write directions are asymmetric due to gating and decay normalization. "Eq. 10 becomes a pure asymmetric delta recurrence,"
  • Autoregressive decoding: Generating tokens one by one, conditioning on previously generated tokens, often with a recurrent kernel for inference. "A forward-only recurrent kernel is provided for autoregressive decoding at short sequence lengths."
  • bfloat16: A 16-bit floating-point format with 8-bit exponent and 7-bit mantissa, used to speed training with acceptable precision. "In bfloat16, the error follows the bfloat16 mantissa."
  • Causal mask: A masking matrix that enforces causality by preventing attention to future tokens. "where M is the causal mask."
  • Causal score matrix: The masked attention score matrix ensuring each position only attends to past positions. "Define the causal score matrix"
  • Channel-wise decay: Forgetting coefficients applied per channel (dimension) rather than as a single scalar, enabling finer control of memory retention. "channel-wise decay absorbed into asymmetric erase factors"
  • Complex-valued state transitions: State updates that use complex numbers (e.g., rotations) in state-space models to increase expressivity. "complex-valued state transitions"
  • Data-dependent decay: A decay factor that is computed from the input data to control forgetting dynamically. "Mamba-2 uses data-dependent decay to regulate the memory horizon [8]."
  • Data-dependent rotations: Input-driven rotations applied to the state (often in complex SSMs) to enhance modeling capacity. "data-dependent rotations"
  • Decay-normalized state: A state reparameterization that absorbs cumulative decay into the state for efficient computation. "Define the decay-normalized state S̃_t by S̃_t = Diag(r) S_t."
  • Delta rule: An update that subtracts the current read before writing the new value, performing a residual correction in memory. "DeltaNet replaces additive writes with the delta rule, enabling targeted overwrite"
  • Exponential-trapezoidal discretization: A numerical integration scheme for discretizing continuous-time state-space models that blends exponential and trapezoidal rules. "exponential-trapezoidal discretization, complex-valued state transitions, and a multi-input, multi-output formulation"
  • Fast-weight memory: A transient, rapidly updated associative memory implemented via fast-weight updates during sequence processing. "an online update of a fast-weight memory state"
  • Gate-aware backward pass: A backpropagation method that explicitly accounts for per-channel gates inside matrix products to compute correct gradients. "a gate-aware backward pass that preserves efficient parallel training."
  • Gated Delta Rule-2: The decoupled delta-rule update with separate channel-wise erase and write gates operating on key and value axes, respectively. "We refer to Eq. 10 as Gated Delta Rule-2."
  • Hebbian-style accumulation: A learning rule that accumulates associations additively, inspired by Hebbian plasticity, often contrasted with delta-rule edits. "improving associative memory over Hebbian-style accumulation"
  • Log-decay: The logarithm of the decay factors, used for numerical stability when accumulating decays across long sequences. "The log-decay follows the Gated DeltaNet parameterization,"
  • Multi-input, multi-output (MIMO): A formulation where multiple inputs drive multiple outputs per step, increasing expressivity of the recurrence. "and a multi-input, multi-output formulation for stronger and more efficient recurrence [13]."
  • Needle-In-A-Haystack (NIAH): Benchmarks that test long-context retrieval by hiding a “needle” among many distractors. "Single Needle-In-A-Haystack (S-NIAH) and Multi-Key Needle-In-A-Haystack (MK-NIAH) tasks from RULER."
  • Negative-eigenvalue variant: A modification allowing negative eigenvalues in the state transition, affecting stability and spectrum. "We also support the negative-eigenvalue variant of [20]"
  • Projector (in linear algebra): A matrix that idempotently projects onto a subspace; here, rank-one k kᵀ when the key is unit-normalized. "the matrix kkᵀ is a projector,"
  • RMSNorm: Root Mean Square Layer Normalization, a normalization technique without mean subtraction. "the output is passed through an RMSNorm and SiLU gate"
  • RULER: A suite for evaluating long-context retrieval and interference control in LLMs. "On the RULER needle-in-a-haystack tasks in Table 3,"
  • Sliding-Window Attention (SWA): Attention restricted to a fixed local window to keep computation and memory linear in sequence length. "Sliding-Window Attention (SWA)"
  • State-space model (SSM): A model class that represents sequences via latent states with linear dynamics and learned inputs/outputs. "the complex SSM view"
  • Triangular solve: Solving a (lower/upper) triangular linear system, used here for forward substitution in chunkwise computations. "The triangular solve for A = (I + T)^{-1} is the most precision-sensitive part of the chunk computation."
  • Triton kernels: GPU kernels written in the Triton language to fuse and accelerate custom tensor operations. "fused Triton kernels"
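  • UT transform: A reformulation of accumulated rank-one (Householder-like) updates in which the WY factors are obtained from the inverse of a triangular matrix, so the chunk computation runs as matrix multiplications rather than a sequential loop. "We use the UT transform [22]"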
  • Vector-Jacobian product: The operation used in reverse-mode autodiff to propagate gradients efficiently. "The inverse itself has the standard triangular vector-Jacobian product"
  • WY form: A compact factorization of the form I − WYᵀ used to represent products of rank-one updates efficiently. "the recurrence admits a compact WY form"
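
The delta rule, Hebbian-style accumulation, and projector entries can be tied together with a small sketch. It uses the generic scalar-gated delta rule, S ← S + β k (v − Sᵀk)ᵀ, which equals (I − β k kᵀ) S + β k vᵀ. It is not the paper's Gated Delta Rule-2 (Eq. 10), and it omits decay and channel-wise gates. The helper names and the toy key and value vectors are assumptions made for illustration.

```python
import math


def normalize(k):
    """L2-normalize a key so that k k^T is a rank-one projector."""
    norm = math.sqrt(sum(x * x for x in k))
    return [x / norm for x in k]


def read(S, q):
    """o = S^T q for a state S of shape (d_k, d_v)."""
    return [sum(S[i][j] * q[i] for i in range(len(q))) for j in range(len(S[0]))]


def hebbian_write(S, k, v):
    """S <- S + k v^T (purely additive accumulation)."""
    return [[S[i][j] + k[i] * v[j] for j in range(len(v))] for i in range(len(k))]


def delta_write(S, k, v, beta=1.0):
    """S <- S + beta * k (v - S^T k)^T: subtract the current read, then write."""
    old = read(S, k)
    return [[S[i][j] + beta * k[i] * (v[j] - old[j]) for j in range(len(v))]
            for i in range(len(k))]


d_k, d_v = 3, 2
zero = [[0.0] * d_v for _ in range(d_k)]
k = normalize([3.0, 0.0, 4.0])
v_old, v_new = [1.0, 2.0], [5.0, -1.0]

# Write v_old under key k, then write v_new under the same key.
S_hebb = hebbian_write(hebbian_write(zero, k, v_old), k, v_new)
S_delta = delta_write(delta_write(zero, k, v_old), k, v_new)

print(read(S_hebb, k))   # approximately v_old + v_new: both associations blend
print(read(S_delta, k))  # approximately v_new: the old value was overwritten
```

With a unit-norm key and β = 1, the erase term removes exactly the component of the state along k, so the second write replaces the first instead of adding to it. Values of β below 1 give a partial overwrite.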

Open Problems

We found no open problems mentioned in this paper.
