
Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning

Published 6 May 2026 in cs.LG | (2605.05495v1)

Abstract: Identifying and exploiting common features across domains is at the heart of the human ability to make analogies, and is believed to be crucial for the ability to continually learn. To do this successfully, general and flexible computational strategies must be developed. While the extent to which Transformer neural network models can perform compositional reasoning has been the subject of intensive recent investigation, little work has been done to systematically understand how well these models can leverage their representations to learn new, related experiences. To address this gap, we expand the previously developed Learning Equality and Group Operations (LEGO) framework to a continual learning (CL) setting ("continual LEGO"). Using this continual LEGO experimental paradigm, we study the capability of feedforward and recurrent Transformer models to perform CL. We find that BERT, a canonical feedforward Transformer model, learns shortcut solutions that limit its ability to generalize and prevent strong forward transfer to new experiences. In contrast, we find evidence supporting the hypothesis that ALBERT, a recurrent version of BERT, learns a For loop-esque solution, which leads to better CL performance. When applying BERT and ALBERT models to a CL setting that requires composition across experiences, we find that both model families fail. Our investigation suggests that ALBERT models can have their performance drop rescued by use of training strategies that combine data across experiences, but this is not true for BERT models, where a detrimental shortcut solution becomes entrenched with initial training. Our results demonstrate that the recurrent ALBERT model may have an inductive bias better suited for CL and motivate future investigation of the interplay between Transformer architecture and computational solutions that emerge in modern models and tasks.

Summary

  • The paper demonstrates that feedforward Transformers (BERT) rely on shortcut solutions that limit forward transfer in continual compositional tasks.
  • It reveals that recurrent ALBERT develops algorithmic attention patterns, enabling more robust generalization and systematic reasoning over sequential tasks.
  • Replay buffers and incremental data mixing are shown to mitigate catastrophic forgetting, although neither architecture fully integrates knowledge across experiences.

Shortcut Solutions Impair Continual Compositional Reasoning in Transformers: An Expert Analysis

Introduction

This paper addresses the intersection of continual learning (CL) and compositional reasoning in Transformer neural network architectures, with a focus on how model architecture influences the ability to generalize and transfer knowledge across sequential tasks. Centering on the Learning Equality and Group Operations (LEGO) synthetic reasoning framework, the study examines the CL performance of two canonical Transformer variants: feedforward BERT and recurrent ALBERT. The principal contribution is a systematic investigation of whether these architectures leverage generalizable solutions or resort to shortcut solutions, and how this impacts continual learning and forward transfer across compositional subproblems.

The Continual LEGO Task Design

The LEGO task employs the D₃ symmetry group to construct well-defined, theoretically grounded subproblems (flip-flop experiences) with recurring algebraic structure, thus enabling controlled probing of compositional reasoning (Figure 1).

Figure 1: The structure of the D₃ LEGO task—illustrating its symmetries, task instantiation as group operations, and equivalence to directed graph traversal.

The continual LEGO adaptation introduces three sequential experiences, each a subtask with nearly isomorphic algebraic structure, reflecting realistic requirements for continual adaptation and transfer while minimizing confounds from distributional shift.
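To make the algebraic setting concrete, the following is a minimal sketch of D₃ represented as the six permutations of a triangle's vertices, together with a LEGO-style chain in which each variable is obtained by applying one group element to the previous variable. The clause format and the helper names (`compose`, `make_chain`, `resolve_chain`) are assumptions made for illustration; the paper's exact task encoding and its flip-flop construction are not reproduced here.

```python
from itertools import permutations
import random

# D3 as the six permutations of the triangle vertices (0, 1, 2).
# A tuple p maps vertex i to p[i].
D3 = list(permutations(range(3)))
IDENTITY = (0, 1, 2)


def compose(p, q):
    """Return the permutation 'apply q first, then p'."""
    return tuple(p[q[i]] for i in range(3))


def make_chain(length, rng):
    """Hypothetical LEGO-style chain: x_0 = g_0, x_k = g_k * x_{k-1}."""
    return [rng.choice(D3) for _ in range(length)]


def resolve_chain(ops):
    """Resolve every variable by walking the chain one clause at a time."""
    value = IDENTITY
    values = []
    for g in ops:
        value = compose(g, value)
        values.append(value)
    return values


if __name__ == "__main__":
    chain = make_chain(5, random.Random(0))
    for g, x in zip(chain, resolve_chain(chain)):
        print("apply", g, "->", x)
```

Because D₃ is non-commutative, the value of a late variable depends on the order of every clause before it, which is what makes the chain a test of step-by-step composition rather than of counting.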

Main Results: Architectural Inductive Bias and Emergent Computations

Sequential training on the continual LEGO tasks reveals a stark architectural divergence. Both BERT and ALBERT achieve perfect in-distribution accuracy on the first flip-flop experience. However, only ALBERT displays strong generalization to longer sequences and robust forward transfer to subsequent experiences, with BERT languishing at chance for compositional generalization and showing inconsistent sample efficiency across seeds and model sizes (Figure 2).

Figure 2: BERT and ALBERT performance on the continual LEGO task across three flip-flop experiences, highlighting forward transfer and catastrophic forgetting.

A hyperparameter sweep further elucidates these architectural differences (Figure 3): ALBERT networks demonstrate monotonic improvement in generalization and forward transfer as model depth grows, whereas BERT's capacity does not correlate cleanly with performance, and catastrophic forgetting is ubiquitous in both architectures.

Figure 3: Relationship between Transformer depth/width and continual LEGO performance metrics—task accuracy, generalization, forward transfer, and performance retention—for BERT and ALBERT.
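Forward transfer and performance retention can be computed from a matrix of accuracies recorded after each training stage. The sketch below follows one common convention from the continual learning literature; it is an assumption for illustration and not necessarily the definition used in the paper. The names `acc`, `baseline`, `backward_transfer`, and `forward_transfer` are hypothetical.

```python
def backward_transfer(acc):
    """Mean change in accuracy on earlier experiences once all training is done.

    acc[i][j] is the accuracy on experience j after training through
    experience i. Strongly negative values indicate forgetting.
    """
    last = len(acc) - 1
    diffs = [acc[last][j] - acc[j][j] for j in range(last)]
    return sum(diffs) / len(diffs)


def forward_transfer(acc, baseline):
    """Mean head start on each experience just before it is trained on.

    baseline[j] is a reference accuracy on experience j, for example that
    of an untrained model (an assumed choice of reference).
    """
    diffs = [acc[j - 1][j] - baseline[j] for j in range(1, len(acc))]
    return sum(diffs) / len(diffs)
```

Under this convention, a model that forgets everything scores a backward transfer near the negative of its original accuracy, while a model that reuses earlier structure shows positive forward transfer.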

Analysis of attention patterns in minimal architectures (single-head, six-layer variants) demonstrates qualitative distinctions in learned computation (Figure 4). ALBERT develops attention regimes resembling algorithmic For loops, attending systematically to preceding clauses, thereby supporting reusable, compositional computation. In contrast, BERT converges on shortcut solutions, wherein later layers fixate attention on a single token from the initial clause, bypassing the algebraic structure of the problem and thereby failing to generalize beyond the training regime.

Figure 4: Attention pattern analyses for minimal ALBERT and BERT, demonstrating algorithmic For loop-like solutions in ALBERT versus shortcut solutions in BERT.
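The For loop intuition can be illustrated with a toy computation in which one shared update is applied repeatedly and each application resolves one more clause from its already-resolved predecessor. This is only a schematic analogy, assuming a simplified sign-flip chain (x_0 = s_0, x_k = s_k · x_{k-1}); it is not the computation learned by either model, and `shared_step` and `run_shared` are hypothetical names.

```python
def shared_step(signs, state):
    """One application of the shared update.

    A clause is resolved only if its predecessor was resolved in the
    previous state, so each application advances the chain by one clause.
    """
    new_state = list(state)
    for k, s in enumerate(signs):
        if state[k] is not None:
            continue  # already resolved
        if k == 0:
            new_state[k] = s
        elif state[k - 1] is not None:
            new_state[k] = s * state[k - 1]
    return new_state


def run_shared(signs, depth):
    """Apply the same step `depth` times, as a weight-shared stack would."""
    state = [None] * len(signs)
    for _ in range(depth):
        state = shared_step(signs, state)
    return state


if __name__ == "__main__":
    signs = [1, -1, -1, 1]
    print(run_shared(signs, 2))
    print(run_shared(signs, 4))
```

In this toy, the same rule works at every position and every depth, which is the sense in which a reusable step can extend to chains it was not tuned for; a rule tied to one fixed token position offers no such guarantee.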

Catastrophic Forgetting and the Limits of Replay

Both architectures suffer from catastrophic forgetting, losing nearly all prior experience performance when learning new subtasks. Replay buffers, a canonical CL intervention, mitigate this effect, restoring performance on previous experiences without degrading forward transfer in both architectures when implemented with modest memory fractions (Figure 5). Replay stabilizes learned attention patterns, as measured via attention similarity metrics pre- and post-replay.

Figure 5: Replay buffer efficacy in mitigating catastrophic forgetting in ALBERT and BERT minimal models, shown by accuracy retention and attention similarity over experiences.
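A replay buffer of the kind described above can be sketched as a fixed-capacity store of past examples that is mixed into each new training batch. The version below uses reservoir sampling to decide what to keep; the buffer policy, the `replay_fraction` parameter, and all names are assumptions for illustration, since the summary does not state how the paper's buffer is populated or sized.

```python
import random


class ReplayBuffer:
    """Fixed-capacity buffer filled by reservoir sampling (assumed policy)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        """Keep each example seen so far with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, n):
        """Draw up to n stored examples without replacement."""
        return self.rng.sample(self.items, min(n, len(self.items)))


def mixed_batch(new_examples, buffer, replay_fraction):
    """Append replayed examples to a batch from the current experience."""
    n_replay = int(len(new_examples) * replay_fraction)
    return list(new_examples) + buffer.sample(n_replay)
```

A typical use would add examples to the buffer while training on one experience, then call `mixed_batch` for every batch of the next experience so that old and new data are seen together.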

Compositional Generalization: Incremental and Generative Data Mixing

Despite the benefits of replay, neither BERT nor ALBERT can fully integrate knowledge across experiences when the test set demands compositional reasoning that bridges independently acquired subtasks. This failure persists when only naive episodic replay is used. For ALBERT, training on incrementally combined or generatively replayed experience data substantially rescues compositional generalization performance, but BERT remains limited by entrenched shortcut behaviors even with such interventions (Figure 6).

Figure 6: Effects of naive replay and incrementally combined experiences on the ability of minimal ALBERT and BERT to generalize compositionally across the full task.
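The difference between purely sequential training and incrementally combined experiences comes down to which data each training stage sees. The sketch below contrasts the two schedules under the assumption that combining means taking the union of all experiences seen so far; the paper's actual mixing proportions are not given in this summary, and `naive_schedule` and `incremental_schedule` are hypothetical names.

```python
def naive_schedule(experiences):
    """Stage k trains only on experience k."""
    return [list(data) for data in experiences]


def incremental_schedule(experiences):
    """Stage k trains on the union of experiences 0..k (assumed mixing rule)."""
    combined = []
    schedule = []
    for data in experiences:
        combined = combined + list(data)
        schedule.append(list(combined))
    return schedule


if __name__ == "__main__":
    experiences = [["a1", "a2"], ["b1"], ["c1", "c2", "c3"]]
    for stage, data in enumerate(incremental_schedule(experiences)):
        print("stage", stage, "trains on", data)
```

Only the incremental schedule ever presents examples from different experiences in the same stage, which is a precondition for learning computations that bridge them.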

Theoretical and Practical Implications

These results offer mechanistic evidence that recurrent inductive biases, as instantiated in ALBERT, enable the formation of generalizable, compositional routines that favor continual adaptation and transfer in structured reasoning domains. Feedforward BERT, in contrast, is prone to shortcut strategies that undercut compositional learning. The work establishes that even when two models reach the same in-distribution accuracy, their underlying attention mechanisms may implement fundamentally distinct, and not equally adaptable, solutions.

Practically, this exposes limitations in relying on standard feedforward Transformer architectures for scenarios requiring transfer, continual updating, and compositional reasoning—settings ubiquitous in both real-world applications and theoretical cognitive models. Architectures embedding recurrence or weight-sharing thus warrant prioritization in CL deployments requiring robust reasoning over sequential, related tasks.

From a theoretical standpoint, the results highlight a mismatch between model expressiveness and the ability (or tendency) to converge on structurally invariant solutions, reinforcing the significance of task-aligned inductive biases and interpretation of emergent circuit-level mechanisms.

Future Perspectives

The study motivates several future directions: (1) rigorous causal dissection and surgical intervention in attention mechanisms to establish necessity/sufficiency for algorithmic generalization; (2) porting the continual LEGO framework to autoregressive and larger-scale models, thereby examining if and how shortcut behaviors persist under the training regimes of modern LLMs; (3) systematic comparison and hybridization of alternative CL mechanisms beyond replay, such as modularity, distillation, or adaptive plasticity; and (4) generalization of findings to real-world, non-synthetic compositional tasks such as multi-step logical inference or graph navigation.

Conclusion

This paper provides compelling empirical and mechanistic evidence that shortcut solutions learned by Transformer architectures can significantly impair continual and compositional reasoning, with pronounced negative effects in feedforward variants like BERT. Recurrent architectures like ALBERT serve as a partial remedy by encouraging algorithmic, transferable computation strategies, although they too have compositional limitations without sophisticated CL interventions such as incremental data mixing. The findings underscore the complexity of continual compositional learning in deep architectures and offer concrete guidance for the design and analysis of future models intended for robust, reasoning-centric, lifelong learning scenarios.
