- The paper introduces a novel adaptive semantic ID framework that handles collisions through semantic-adaptive relaxation and load-adaptive strengthening.
- It uses a two-stage regulation system to improve codebook utilization, achieving roughly 6–10% relative gains in Recall@3 and NDCG@3 on the Amazon Toys benchmark.
- The framework’s industrial evaluation demonstrates significant uplifts in GMV, orders, and GPM in a production-scale A/B test for multimodal recommendation.
Adaptive Semantic ID Learning for Multimodal Recommendation: The AdaSID Framework
Introduction
Industrial-scale recommender systems increasingly depend on effective representations of items with rich multimodal signals, including text, images, and video. Semantic IDs (SIDs), which discretize multimodal features into short token sequences, have emerged as a practical solution, offering a more semantically informed alternative to sparse ID embeddings. However, learning discriminative and recommendation-aligned SIDs remains challenging, primarily because of collisions in the discrete space: distinct items that share identical or overly similar SIDs. Addressing this requires adaptive collision handling and better codebook utilization in the SID space. This work presents AdaSID, an adaptive SID learning framework that introduces two-stage adaptive overlap regulation and demonstrates measurable gains in both offline and industrial online settings.
The central issue in SID learning is that fixed, static treatments of SID collisions are overly rigid. Collisions may arise from semantic ambiguity or from benign multimodal similarity, necessitating differentiated regulation. Furthermore, uniform penalty allocation across collisions fails to model the highly heterogeneous collision load and evolving training dynamics in large-scale datasets.
Figure 1: Static overlap regulation applies fixed judgments and treatments; AdaSID instead executes adaptive semantic qualification and load-adaptive regulation.
AdaSID proposes that an SID collision should be adaptively (1) judged for semantic admissibility and (2) regulated based on its collision context and training progression. The overarching objective is to preserve multimodal semantics and collaborative signals while promoting a highly disentangled and uniformly utilized SID space.
The AdaSID Framework
AdaSID is designed as a two-stage adaptive regulation system atop conventional SID tokenization via residual quantization. The method operates as follows.
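For readers unfamiliar with the underlying tokenizer, the following is a minimal sketch of residual quantization, assuming nearest-neighbor lookup under squared Euclidean distance and one codebook per layer. The function names (`nearest_code`, `residual_quantize`) and the toy codebooks are illustrative and not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(x: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def sq_dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(
    x: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize x layer by layer; each layer encodes the residual left by the previous ones.

    Returns the SID (one code index per layer) and the summed quantized vector.
    """
    residual = list(x)
    quantized = [0.0] * len(x)
    sid: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        sid.append(idx)
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return sid, quantized


# Toy example: two layers, two codewords each (hypothetical values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, -0.5]],
]
sid, quantized = residual_quantize([1.4, 0.6], codebooks)
print(sid, quantized)  # [1, 1] [1.5, 0.5]
```

Two items collide when their SIDs agree on some or all layers, which is the situation the two regulation stages below address.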
Figure 2: AdaSID architecture: collaborative item pairs are mapped to quantized embeddings and SIDs, then regulated via semantic-adaptive relaxation and spatial/temporal adaptive collision handling.
1. Semantic-Adaptive Overlap Relaxation
Every observed overlap (collision) between item SIDs is first semantically qualified. The cosine similarity between the items' continuous multimodal embeddings determines whether a discrete overlap should be relaxed (i.e., allowed). Semantic thresholds are depth-aware: shallow overlaps require only moderate semantic proximity, while deeper overlaps demand near-equivalent semantics before they are relaxed. This mechanism avoids pushing apart semantically coherent items and over-segmenting semantically admissible overlaps.
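The qualification step can be sketched as follows. This is only an illustration under stated assumptions: overlap depth is taken to be the length of the shared SID prefix, and the per-depth thresholds (0.5, 0.8, 0.95) are hypothetical values chosen to show the "stricter with depth" behavior, not numbers from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def overlap_depth(sid_a: Sequence[int], sid_b: Sequence[int]) -> int:
    """Number of leading layers on which two SIDs agree (assumed overlap definition)."""
    depth = 0
    for a, b in zip(sid_a, sid_b):
        if a != b:
            break
        depth += 1
    return depth


def is_relaxed(emb_a, emb_b, sid_a, sid_b, thresholds=(0.5, 0.8, 0.95)) -> bool:
    """True if the overlap is semantically admissible and should not be penalized.

    thresholds[d - 1] is the minimum cosine similarity required at overlap depth d
    (hypothetical values; deeper overlaps require higher similarity).
    """
    depth = overlap_depth(sid_a, sid_b)
    if depth == 0:
        return False  # no overlap, so there is nothing to relax
    return cosine_similarity(emb_a, emb_b) >= thresholds[depth - 1]


# Same pair of embeddings (cosine similarity about 0.6), different overlap depths.
emb_a, emb_b = [1.0, 0.0], [0.6, 0.8]
shallow = is_relaxed(emb_a, emb_b, [3, 1, 2], [3, 7, 5])  # depth 1, threshold 0.5
deep = is_relaxed(emb_a, emb_b, [3, 7, 2], [3, 7, 5])     # depth 2, threshold 0.8
print(shallow, deep)  # True False
```

The same moderately similar pair is allowed to share one token but not two, which is the depth-aware behavior described above.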
2. Adaptive Pressure Allocation
For overlaps not relaxed in the first stage, AdaSID distributes repulsion via two adaptive principles:
- Load-Adaptive Collision Strengthening: Collisions in densely overloaded SID regions (high-frequency overlap signatures) receive amplified repulsion, scaling with their local "collision load". This spatial adaptivity mitigates region-specific collapse.
- Progress-Adaptive Objective Rebalancing: Collision penalty and collaborative alignment weights are scheduled according to normalized training progress. Initially, collision penalties are strong to prevent SID-space collapse; over time, collaborative alignment dominates to optimize downstream recommendation performance.
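The two principles above can be illustrated with a short sketch. The functional forms are assumptions made for illustration: a logarithmic scaling of repulsion with the frequency of an overlap signature, and a linear schedule over normalized training progress. The names `load_adaptive_weights` and `progress_weights`, and the parameters `alpha`, `w_min`, and `w_max`, are hypothetical; the paper's exact formulas may differ.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def load_adaptive_weights(signatures: Sequence[Hashable], alpha: float = 1.0) -> List[float]:
    """Per-collision repulsion weight that grows with the local collision load.

    signatures[i] identifies the overlap of collision i (for example, the shared
    SID prefix). Assumed form: weight = 1 + alpha * log(count of that signature).
    """
    load = Counter(signatures)
    return [1.0 + alpha * math.log(load[s]) for s in signatures]


def progress_weights(
    step: int, total_steps: int, w_min: float = 0.1, w_max: float = 1.0
) -> Tuple[float, float]:
    """Return (collision_weight, alignment_weight) for the current training step.

    Assumed linear schedule: the collision weight decays from w_max to w_min while
    the collaborative alignment weight rises from w_min to w_max.
    """
    p = min(max(step / total_steps, 0.0), 1.0)  # normalized training progress
    collision_w = w_max - (w_max - w_min) * p
    alignment_w = w_min + (w_max - w_min) * p
    return collision_w, alignment_w


# Three collisions share the signature (3, 7); one collision has signature (1, 2).
weights = load_adaptive_weights([(3, 7), (3, 7), (3, 7), (1, 2)])
early = progress_weights(0, 100)   # collision penalty dominates
late = progress_weights(100, 100)  # collaborative alignment dominates
```

Here the three collisions in the crowded region each receive weight 1 + log 3, while the isolated collision keeps the base weight of 1.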
Together, these mechanisms lead to an overall training objective that combines reconstruction, residual quantization, adaptive collision, and collaborative alignment terms, weighted via training-stage awareness.
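One way to write such an objective, as a sketch only: the symbols below are introduced here for illustration and are not the paper's notation, and unit weights on the first two terms are an assumption.

```latex
\mathcal{L}(p) = \mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{rq}}
  + \lambda_{\mathrm{col}}(p)\,\mathcal{L}_{\mathrm{col}}
  + \lambda_{\mathrm{align}}(p)\,\mathcal{L}_{\mathrm{align}},
\qquad p = \frac{t}{T} \in [0, 1]
```

Here $\mathcal{L}_{\mathrm{rec}}$ is the reconstruction term, $\mathcal{L}_{\mathrm{rq}}$ the residual quantization term, $\mathcal{L}_{\mathrm{col}}$ the adaptive collision penalty over overlaps that were not relaxed, and $\mathcal{L}_{\mathrm{align}}$ the collaborative alignment term. With $t$ the current step and $T$ the total number of steps, $\lambda_{\mathrm{col}}(p)$ decreases and $\lambda_{\mathrm{align}}(p)$ increases as training progresses.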
Analyses of SID Space Structure
AdaSID directly optimizes for codebook utilization and SID space diversity. Comparative analyses highlight its effectiveness.

Figure 3: SID space landscape: AdaSID tokenizers yield higher entropy, reduced dominant-code concentration, and improved weakest-layer utilization compared to baselines.
AdaSID demonstrates:
- Higher normalized minimum perplexity and SID entropy: Improved utilization across all codebook layers.
- Reduced top-1 code load: Weaker index collapse, more evenly distributed SIDs.
- Structural balance: Gains extend across all diversity/utilization axes, not arising from aggressive tuning of any single statistic.
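The statistics named above can be computed from code assignments as in the following sketch. The definitions are assumptions where the summary does not spell them out: perplexity is taken as the exponential of the Shannon entropy of code usage, normalized by codebook size, and "normalized minimum perplexity" is taken as the minimum of that value across layers. Function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def code_usage_stats(codes: Sequence[int], codebook_size: int) -> Dict[str, float]:
    """Usage statistics for the codes assigned at one codebook layer."""
    counts = Counter(codes)
    total = len(codes)
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)  # Shannon entropy in nats
    perplexity = math.exp(entropy)                  # effective number of codes in use
    return {
        "entropy": entropy,
        "normalized_perplexity": perplexity / codebook_size,
        "top1_load": max(probs),  # share of items assigned to the most used code
    }


def normalized_min_perplexity(layers: Sequence[Sequence[int]], codebook_size: int) -> float:
    """Normalized perplexity of the weakest (least utilized) layer."""
    return min(code_usage_stats(layer, codebook_size)["normalized_perplexity"] for layer in layers)


balanced = code_usage_stats([0, 1, 2, 3], codebook_size=4)   # every code used once
collapsed = code_usage_stats([0, 0, 0, 0], codebook_size=4)  # one code absorbs everything
weakest = normalized_min_perplexity([[0, 1, 2, 3], [0, 0, 0, 0]], codebook_size=4)
```

A perfectly balanced layer has normalized perplexity near 1 and a top-1 load of 1/K for K codes; a fully collapsed layer has normalized perplexity 1/K and a top-1 load of 1.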
Empirical Evaluation
Offline Benchmarks
AdaSID was evaluated on the Amazon Beauty and Toys multimodal datasets, using standardized item encoders and a downstream TIGER architecture. Compared with a range of recent quantization/tokenization methods (RQ-VAE, Improved VQGAN, GRVQ, SimRQ, RQ-KMeans, QuaSID), AdaSID reports consistent and significant improvements in Recall@3/5 and NDCG@3/5. On Toys, NDCG@3 improves from 0.0164 (best baseline) to 0.0175 and Recall@3 from 0.0195 to 0.0214, relative gains of about 6.7% and 9.7%, respectively.
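The relative gains follow directly from the reported Toys numbers; the short check below reproduces the arithmetic (the helper name `relative_gain` is illustrative).

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`."""
    return (new - baseline) / baseline


ndcg_gain = relative_gain(0.0164, 0.0175)    # NDCG@3 on Toys
recall_gain = relative_gain(0.0195, 0.0214)  # Recall@3 on Toys

print(f"NDCG@3:   {ndcg_gain:.1%}")    # NDCG@3:   6.7%
print(f"Recall@3: {recall_gain:.1%}")  # Recall@3: 9.7%
```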
Industrial-Scale Online Deployment
In a production-scale A/B test on Kuaishou's short-video commerce retrieval model covering >10 million users, AdaSID yields statistically significant uplifts:
- GMV (Gross Merchandise Volume): +0.98% gain
- Orders: +0.91%
- GPM (GMV per mille exposures): +1.16%
Offline ranking (CTCVR, cold-start CVR) shows consistent, albeit smaller, AUC improvements. These results indicate that improvements in SID representation and recommendation quality translate into tangible product value.
Ablation and Sensitivity Studies
Component-wise ablation reveals:
Sensitivity analysis confirms that AdaSID performs optimally with balanced adaptation; excessively high or low aggression in adaptive mechanisms suppresses gains.
Theoretical and Practical Implications
AdaSID offers a general template for adaptive regularization in discrete representation learning. Adaptive overlap regulation, load- and context-aware penalties, and training-stage-dependent weighting are complementary modules that can be repurposed for joint user-item discrete modeling and tightly coupled generative recommenders. The framework is directly compatible with residual quantization and can be extended to deeper codebooks and richer multimodal signal integration.
Practically, AdaSID enables more robust and reusable SID spaces for real-time retrieval, ranking, and LLM-driven generative recommendation (2604.23522).
Conclusion
AdaSID advances SID learning for multimodal recommendation by introducing adaptive, context- and progress-aware collision handling. Offline and online experiments confirm its ability to improve both ranking accuracy and SID-space utility. Future directions include joint user-item discrete modeling, tighter end-to-end generative integration, and application to at-scale multimodal recommendation frameworks.