- The paper introduces HyperMixer, an MLP-based architecture that dynamically generates token mixing weights via hypernetworks to replace conventional attention mechanisms.
- It achieves linear computational complexity and competitive performance on NLU benchmarks like GLUE, excelling in low-resource settings and on tasks such as QNLI.
- The approach simplifies hyperparameter tuning and promotes energy-efficient Green AI practices, making it a practical and low-cost alternative for real-world applications.
This paper introduces HyperMixer, an MLP-based architecture proposed as a low-cost alternative to Transformers for Natural Language Understanding (NLU) tasks. The motivation stems from the significant computational cost, data requirements, and hyperparameter tuning effort associated with large Transformer models, aligning with the concept of "Green AI".
HyperMixer builds upon the MLPMixer architecture, which uses separate MLPs for feature mixing (applied per token) and token mixing (applied per feature across tokens). However, the standard MLPMixer has limitations for NLP due to its fixed-size token mixing MLP and position-specific weights, making it unsuitable for variable-length inputs and lacking position invariance necessary for generalization in language tasks.
HyperMixer addresses these limitations by employing hypernetworks to dynamically generate the weights of the token mixing MLP. This "HyperMixing" mechanism acts as a drop-in replacement for the attention mechanism in a standard Transformer encoder layer (Figure 1 in the paper illustrates this). Instead of learning a fixed weight matrix for token mixing, HyperMixer learns smaller hypernetworks that take token representations as input and output the weights for the token mixing MLP.
The core idea of HyperMixing is to generate the weight matrices $W_1$ and $W_2$ of the token mixing MLP, $\text{TM-MLP}(x) = W_2\,\sigma(W_1^{T} x)$, dynamically from the input tokens. The paper describes this dynamic generation using two hypernetworks, $h_1$ and $h_2$. Specifically, $h_1$ and $h_2$ are implemented so that each row of a weight matrix is generated independently from one input token's representation, optionally combined with token information such as position embeddings.
The pseudocode provided in the paper for the HyperMixing layer demonstrates this process:
```python
from torch import nn

# Pseudocode: MLP, add_token_information and compose_TM_MLP are helpers
# that are not defined here. d_prime stands for d' in the paper.
class HyperMixing(nn.Module):
    def __init__(self, d, d_prime):
        super().__init__()
        # Learnable parameters: the hypernetworks are MLPs that output weights.
        # hypernetwork_in generates W1, hypernetwork_out generates W2.
        # Example MLP structure: input d, hidden d, output d_prime.
        self.hypernetwork_in = MLP([d, d, d_prime])
        self.hypernetwork_out = MLP([d, d, d_prime])
        # Layer normalization for stability, applied to the final output of the block
        self.layer_norm = nn.LayerNorm(d)

    def forward(self, queries, keys, values):
        # queries: [B, M, d] - sequence length M, feature dim d
        # keys / values: [B, N, d] - sequence length N, feature dim d

        # Add token information (e.g., position embeddings) to keys and queries
        # so the hypernetworks are aware of position or other token properties.
        hyp_in = add_token_information(keys)      # e.g., keys + position_embeddings_k
        hyp_out = add_token_information(queries)  # e.g., queries + position_embeddings_q

        # Generate the weight matrices W1 and W2 dynamically.
        # Each hypernetwork is applied per token.
        W1 = self.hypernetwork_in(hyp_in)    # [B, N, d] -> [B, N, d_prime]
        W2 = self.hypernetwork_out(hyp_out)  # [B, M, d] -> [B, M, d_prime]

        # The TM-MLP operates on the transposed values [B, d, N].
        # For each batch element and each feature dimension it computes
        # W2 @ GELU(W1.T @ x), where x is that feature's length-N vector
        # across the sequence. compose_TM_MLP implements this per-feature
        # MLP logic using the generated W1 and W2.
        token_mixing_mlp = compose_TM_MLP(W1, W2)

        # Transpose values to apply the TM-MLP across the sequence dimension N
        values = values.transpose(1, 2)    # [B, d, N]
        # Apply the dynamic token mixing MLP
        output = token_mixing_mlp(values)  # [B, d, M]
        # Transpose back to the standard sequence-first shape
        output = output.transpose(1, 2)    # [B, M, d]

        # Apply LayerNorm to the output of the mixing component (as done in Alg 1)
        return self.layer_norm(output)
```
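The pseudocode leaves `compose_TM_MLP` undefined. The following is a minimal NumPy sketch of what such a helper might look like, assuming the shapes annotated in the pseudocode (`W1` of shape `[B, N, d']`, `W2` of shape `[B, M, d']`, transposed values of shape `[B, d, N]`). The name `compose_tm_mlp`, the tanh approximation of GELU, and the random inputs are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def compose_tm_mlp(W1, W2):
    """Return a function that applies the generated token-mixing MLP.

    W1: [B, N, d_prime], rows generated from the N keys
    W2: [B, M, d_prime], rows generated from the M queries
    """
    def token_mixing_mlp(values_t):
        # values_t: [B, d, N]. Contract over the key length N (W1^T x per feature).
        hidden = gelu(np.einsum("bdn,bnh->bdh", values_t, W1))  # [B, d, d_prime]
        # Expand to the query length M (W2 @ hidden per feature).
        return np.einsum("bdh,bmh->bdm", hidden, W2)            # [B, d, M]
    return token_mixing_mlp

rng = np.random.default_rng(0)
B, N, M, d, d_prime = 2, 7, 5, 4, 3
W1 = rng.normal(size=(B, N, d_prime))
W2 = rng.normal(size=(B, M, d_prime))
values = rng.normal(size=(B, N, d))

mix = compose_tm_mlp(W1, W2)
output = mix(values.transpose(0, 2, 1)).transpose(0, 2, 1)
print(output.shape)  # (2, 5, 4), i.e. [B, M, d]
```

Both contractions cost on the order of N·d·d' and M·d·d' multiplications per batch element, and no N×M interaction matrix is ever materialized.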
This dynamic weight generation provides HyperMixer with the necessary inductive biases for NLP:
- Adaptive Size: The hypernetworks generate weights of size proportional to the input sequence length, handling variable inputs.
- Position Invariance: The hypernetworks' MLPs process each token independently (after potentially adding position info), and the generated weights are used across the sequence dimension in a consistent manner, making the core mixing operation position-invariant similar to attention.
- Global Receptive Field: The token mixing MLP mixes information across the entire sequence length.
- Dynamicity: The mixing weights are a function of the input sequence itself.
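Two of these properties can be checked numerically. The sketch below is a toy NumPy version of self-mixing for a single sequence, with randomly initialized two-layer hypernetworks and no position information added. The helper names `make_hypernetwork` and `hyper_mix` are hypothetical, and reading "position invariance" as permutation equivariance is one interpretation of the claim.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def make_hypernetwork(d, d_prime, rng):
    """Toy two-layer MLP with random weights, applied to each token independently."""
    A1 = rng.normal(size=(d, d)) / np.sqrt(d)
    A2 = rng.normal(size=(d, d_prime)) / np.sqrt(d)
    return lambda tokens: gelu(tokens @ A1) @ A2  # [N, d] -> [N, d_prime]

def hyper_mix(x, h1, h2):
    """Self-mixing for one sequence x of shape [N, d], without position information."""
    W1, W2 = h1(x), h2(x)       # each [N, d_prime], one row per token
    return W2 @ gelu(W1.T @ x)  # [N, d]

rng = np.random.default_rng(0)
d, d_prime = 8, 16
h1 = make_hypernetwork(d, d_prime, rng)
h2 = make_hypernetwork(d, d_prime, rng)

# Adaptive size: the same parameters handle different sequence lengths.
for n in (3, 10, 50):
    assert hyper_mix(rng.normal(size=(n, d)), h1, h2).shape == (n, d)

# Permutation equivariance: shuffling the tokens shuffles the outputs the same way.
x = rng.normal(size=(6, d))
perm = rng.permutation(6)
equivariant = np.allclose(hyper_mix(x[perm], h1, h2), hyper_mix(x, h1, h2)[perm])
print(equivariant)
```

The check passes because each row of `W1` and `W2` depends only on its own token, so reordering the tokens reorders the rows without changing the sum over the sequence. Once position embeddings are added to the hypernetwork inputs, the output is expected to depend on token order.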
Empirical evaluation on GLUE benchmark tasks shows that HyperMixer performs better than other MLP-based models and achieves performance on par with or slightly better than vanilla Transformers, particularly excelling on QNLI.
More importantly, HyperMixer demonstrates substantially lower costs according to the Green AI metric $\mathrm{Cost}(R) \propto E \cdot D \cdot H$:
- Processing Time (E): HyperMixer has linear complexity $\mathcal{O}(N \cdot d \cdot d')$ with respect to input length $N$ (compared to $\mathcal{O}(N^2 \cdot d)$ for standard self-attention), leading to faster wall-clock time, especially for long sequences (Figure 2). This makes it suitable for applications requiring low-latency inference or processing very long documents.
- Training Data (D): HyperMixer shows a larger relative performance improvement over Transformers in low-resource settings (using only 10% of training data) (Figure 3), suggesting better data efficiency. This is beneficial when training data is limited.
- Hyperparameter Tuning (H): HyperMixer is empirically shown to be easier to tune than Transformers, achieving higher expected validation performance at lower tuning budgets (Figure 4). This reduces the computational cost and time spent on hyperparameter optimization.
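To make the processing-time term concrete, the short calculation below compares the two leading-term operation counts quoted above. The sizes `d = 256` and `d_prime = 512` are illustrative assumptions, not values from the paper, and the counts ignore constant factors and the cost of the hypernetworks themselves.

```python
def token_mixing_ops(n, d, d_prime):
    """Leading-term multiply count for the generated token-mixing MLP: N * d * d'."""
    return n * d * d_prime

def attention_ops(n, d):
    """Leading-term multiply count for the attention score matrix: N^2 * d."""
    return n * n * d

d, d_prime = 256, 512  # illustrative sizes, not taken from the paper
for n in (128, 512, 2048, 8192):
    ratio = attention_ops(n, d) / token_mixing_ops(n, d, d_prime)
    print(f"N={n:5d}  attention / hypermixing = {ratio:.2f}")
```

The ratio simplifies to N / d', so under this simplified count the quadratic term only dominates once the sequence is longer than the TM-MLP hidden dimension.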
Experiments on a synthetic task further illustrate that HyperMixer learns attention-like patterns of interaction between tokens, supporting the idea that its architecture captures effective inductive biases for modeling relationships in sequences.
Implementation Considerations:
- Architecture: HyperMixer can generally replace the multi-head self-attention block in a standard Transformer encoder layer. It typically consists of alternating HyperMixing and feature mixing (MLP) blocks, with Layer Normalization and skip connections.
- Hypernetwork Implementation: The hypernetworks (`hypernetwork_in` and `hypernetwork_out`) are standard MLPs (e.g., two linear layers with GELU activation). They should be designed to process individual token representations and output vectors that form the rows/columns of the TM-MLP weights.
- TM-MLP Implementation: The `compose_TM_MLP` part requires careful implementation to apply the dynamic weights. It involves transposing the input sequence (`values`), performing matrix multiplications with the generated $W_1$ and $W_2$ weights (potentially batch-wise across features), and applying the GELU non-linearity. Ensure correct handling of batch dimensions and matrix shapes.
- Position Information: Adding positional embeddings (learned or fixed) to the token representations before feeding them to the hypernetworks is crucial for the model to utilize positional information.
- Normalization and Layout: Layer Normalization significantly improves training stability for HyperMixer, especially when using different Transformer layer layouts (pre-norm, post-norm, etc.). Adding LayerNorm after the HyperMixing component is recommended (Appendix F).
- Tied Hypernetworks: Tying `hypernetwork_in` and `hypernetwork_out` (i.e., $W_1 = W_2$) reduces parameter count and was found beneficial in the paper's low-resource setting.
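For the position-information point above, one possible `add_token_information` is sketched below. It assumes fixed sinusoidal position embeddings in the style of the original Transformer and an even feature dimension. The summary only says the embeddings may be learned or fixed, so this particular choice and the helper `sinusoidal_positions` are assumptions for illustration.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Fixed sinusoidal position embeddings of shape [n, d] (d assumed even)."""
    positions = np.arange(n)[:, None]                          # [n, 1]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, d, 2) / d)  # [d / 2]
    emb = np.zeros((n, d))
    emb[:, 0::2] = np.sin(positions * freqs)  # even feature indices
    emb[:, 1::2] = np.cos(positions * freqs)  # odd feature indices
    return emb

def add_token_information(tokens):
    """Add position embeddings to tokens of shape [B, N, d] before the hypernetworks."""
    _, n, d = tokens.shape
    return tokens + sinusoidal_positions(n, d)[None, :, :]

tokens = np.zeros((2, 5, 8))
with_pos = add_token_information(tokens)
print(with_pos.shape)      # (2, 5, 8)
print(with_pos[0, 0, :4])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Because the embeddings are added only to the hypernetwork inputs, the generated weights become position-aware while the values being mixed are left unchanged.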
Trade-offs:
- Parameter Count: While HyperMixer can be configured to have a similar parameter count to Transformers, the number of parameters in the hypernetworks and the hidden dimension $d'$ of the TM-MLP affect efficiency and capacity. The tied hypernetwork configuration offers a good balance.
- Complexity vs. Expressiveness: HyperMixer's linear complexity comes from its MLP-based mixing. While shown to learn attention-like patterns, whether this mechanism is as universally powerful or expressive as quadratic attention in all scenarios (especially very large scale pretraining) remains an open question.
Limitations and Future Work:
The study primarily focuses on small models trained on limited data. Scaling HyperMixer to billions of parameters and pretraining on massive corpora, akin to LLMs, is necessary to confirm its efficiency benefits in that regime. Adapting HyperMixing for generative tasks requiring causal masking (like standard language modeling with decoder-only architectures) needs significant modeling advancements, which is highlighted as promising future work. Evaluating HyperMixer on a wider range of NLP tasks and domains is also needed to establish its versatility compared to Transformers.
In summary, HyperMixer presents a compelling MLP-based approach to NLU that achieves competitive performance with Transformers while offering significant advantages in terms of computational cost, data efficiency, and ease of tuning, particularly valuable for low-resource scenarios and promoting Green AI principles.