Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale

Published 26 Apr 2026 in cs.IR and cs.MM | (2604.23522v1)

Abstract: Modern recommendation systems involve massive catalogs of multimodal items, where scalable item identification must balance compactness, semantic fidelity, and downstream effectiveness. Semantic IDs (SIDs) address this need by representing items as short discrete token sequences derived from multimodal signals, providing a compact interface for retrieval, ranking, and generative recommendation. However, effective SID learning is hindered by collisions, where different items are assigned identical or highly confusable codes. Existing methods mainly rely on improved quantization or fixed overlap regularization, but they do not adaptively distinguish whether an overlap should be suppressed or preserved. We propose AdaSID, an adaptive semantic ID learning framework for recommendation. AdaSID regulates SID overlaps through a two-stage process. First, it relaxes repulsion for observed overlaps when the involved items are semantically compatible, preserving admissible sharing rather than uniformly separating all collisions. Second, it allocates the remaining regulation pressure according to local collision load and training progress, strengthening control in congested regions while gradually rebalancing optimization toward recommendation alignment. This design adaptively decides which overlaps to penalize, how strongly to regulate them, and when to shift the learning focus. Extensive offline and online experiments validate AdaSID. On two public benchmarks, AdaSID improves Recall and NDCG by about 4.5% on average over strong baselines, while improving codebook utilization and SID diversity. In Kuaishou e-commerce, an online A/B test on short-video retrieval covering tens of millions of users achieves statistically significant gains, including a 0.98% GMV improvement, and industrial ranking evaluation shows consistent AUC improvements.


Authors (11)

Summary

The paper introduces a novel adaptive semantic ID framework that handles collisions through semantic-adaptive relaxation and load-adaptive strengthening.
It uses a two-stage regulation process to improve codebook utilization, raising Recall and NDCG by about 4.5% on average across two public benchmarks, with relative gains of roughly 6–10% on Recall@3 and NDCG@3 for the Toys dataset.
The framework’s industrial evaluation demonstrates significant uplifts in GMV, orders, and GPM in a production-scale A/B test for multimodal recommendation.

Adaptive Semantic ID Learning for Multimodal Recommendation: The AdaSID Framework

Introduction

Industrial-scale recommender systems increasingly depend on representing items with rich multimodal signals, including text, images, and video. Semantic IDs (SIDs), which discretize multimodal features into short token sequences, have emerged as a practical solution and offer a more semantically informed alternative to sparse ID embeddings. However, learning discriminative, recommendation-aligned SIDs remains challenging, primarily because of collisions in the discrete space, where distinct items share identical or overly similar SIDs. Addressing this requires adaptive collision handling and better codebook utilization in the SID space. This work presents AdaSID, an adaptive SID learning framework that introduces two-stage adaptive overlap regulation and demonstrates measurable gains in both offline and industrial online settings.

Problem Formulation and Motivation

The central issue in SID learning is that fixed, static treatments of SID collisions are overly rigid. Collisions may arise from semantic ambiguity or from benign multimodal similarity, necessitating differentiated regulation. Furthermore, uniform penalty allocation across collisions fails to model the highly heterogeneous collision load and evolving training dynamics in large-scale datasets.

Figure 1: Static overlap regulation applies fixed judgments and treatments; AdaSID instead executes adaptive semantic qualification and load-adaptive regulation.

AdaSID proposes that an SID collision should be adaptively (1) judged for semantic admissibility and (2) regulated based on its collision context and training progression. The overarching objective is to preserve multimodal semantics and collaborative signals while promoting a highly disentangled and uniformly utilized SID space.

The AdaSID Framework

AdaSID is designed as a two-stage adaptive regulation system atop conventional SID tokenization via residual quantization. The method operates as follows.
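For readers unfamiliar with the underlying tokenizer, the following is a minimal sketch of plain residual quantization, not AdaSID's actual implementation. The function names, the toy two-layer codebooks, and the use of squared Euclidean distance are illustrative assumptions; a learned tokenizer would train the codebooks rather than hard-code them.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(x: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def sq_dist(code: Vector) -> float:
        return sum((xi - ci) ** 2 for xi, ci in zip(x, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def residual_quantize(
    x: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize x layer by layer; each layer encodes the residual left by the previous one."""
    residual = list(x)
    quantized = [0.0] * len(x)
    sid: List[int] = []
    for codebook in codebooks:
        k = nearest_code(residual, codebook)
        code = codebook[k]
        sid.append(k)  # one SID token per layer
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return sid, quantized


# Toy codebooks (hypothetical values): a coarse layer followed by a finer one.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.25, 0.0], [0.0, 0.25]],
]
sid, quantized = residual_quantize([1.2, 0.1], codebooks)
```

Two items collide at depth d when their token sequences agree on the first d layers, which is the situation the two regulation stages described next are meant to handle.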

Figure 2: AdaSID architecture: collaborative item pairs are mapped to quantized embeddings and SIDs, then regulated via semantic-adaptive relaxation and spatial/temporal adaptive collision handling.

1. Semantic-Adaptive Overlap Relaxation

Every observed overlap (collision) between item SIDs is first semantically qualified. The cosine similarity between continuous multimodal embeddings determines if a discrete overlap should be relaxed (i.e., allowed). Semantic thresholds are depth-aware: shallow overlaps require only moderate semantic proximity, while deeper overlaps demand strict semantic equivalence for relaxation. This mechanism prevents unnecessary repulsion of semantically coherent items and over-segmentation of semantically admissible overlaps.
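The qualification step can be sketched as follows. This is an illustration under stated assumptions rather than the paper's formulation: overlap depth is taken to be the length of the shared SID prefix, and the threshold values in `DEPTH_THRESHOLDS` are hypothetical numbers chosen only to show thresholds tightening with depth.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def overlap_depth(sid_a: Sequence[int], sid_b: Sequence[int]) -> int:
    """Number of leading SID tokens the two items share (assumed overlap definition)."""
    depth = 0
    for tok_a, tok_b in zip(sid_a, sid_b):
        if tok_a != tok_b:
            break
        depth += 1
    return depth


# Hypothetical depth-aware thresholds: deeper overlaps need higher similarity.
DEPTH_THRESHOLDS: Dict[int, float] = {1: 0.5, 2: 0.7, 3: 0.9}


def is_relaxed(sid_a, sid_b, emb_a, emb_b, thresholds=DEPTH_THRESHOLDS) -> bool:
    """True if the overlap is semantically admissible and should not be repelled."""
    depth = overlap_depth(sid_a, sid_b)
    if depth == 0:
        return False  # no overlap, so there is nothing to relax
    threshold = thresholds[min(depth, max(thresholds))]
    return cosine_similarity(emb_a, emb_b) >= threshold


emb_a, emb_b = [1.0, 0.0], [0.8, 0.6]  # cosine similarity is about 0.8
shallow = is_relaxed((3, 7, 1), (3, 7, 4), emb_a, emb_b)  # depth 2, threshold 0.7
deep = is_relaxed((3, 7, 1), (3, 7, 1), emb_a, emb_b)     # depth 3, threshold 0.9
```

With the same embedding pair, the depth-2 overlap is relaxed while the full-depth overlap is not, which mirrors the stricter requirement for deeper overlaps.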

2. Adaptive Pressure Allocation

For overlaps not relaxed in the first stage, AdaSID distributes repulsion via two adaptive principles:

Load-Adaptive Collision Strengthening: Collisions in densely overloaded SID regions (high-frequency overlap signatures) receive amplified repulsion, scaling with their local "collision load". This spatial adaptivity mitigates region-specific collapse.
Progress-Adaptive Objective Rebalancing: Collision penalty and collaborative alignment weights are scheduled according to normalized training progress. Initially, collision penalties are strong to prevent SID-space collapse; over time, collaborative alignment dominates to optimize downstream recommendation performance.
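A rough sketch of the load-adaptive part is given below. The summary does not state the functional form, so the logarithmic amplification, the `alpha` parameter, and the use of a shared SID prefix as the "overlap signature" are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple


def collision_loads(sids: Sequence[Tuple[int, ...]], depth: int) -> Counter:
    """Count how many items share each length-`depth` SID prefix."""
    return Counter(tuple(sid[:depth]) for sid in sids)


def load_weight(load: int, alpha: float = 0.5) -> float:
    """Hypothetical amplification: 1.0 for an uncontended prefix, growing with load."""
    return 1.0 + alpha * math.log(load)


sids = [(3, 7, 1), (3, 7, 4), (3, 7, 9), (5, 2, 0)]
loads = collision_loads(sids, depth=2)
weights: Dict[Tuple[int, ...], float] = {
    prefix: load_weight(count) for prefix, count in loads.items()
}
```

Here the prefix `(3, 7)` is shared by three items and receives a larger repulsion weight than the uncontended prefix `(5, 2)`; any monotone function of the load would express the same idea.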

Together, these mechanisms lead to an overall training objective that combines reconstruction, residual quantization, adaptive collision, and collaborative alignment terms, weighted via training-stage awareness.
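One way to picture the progress-aware weighting is a simple schedule over normalized training progress. The linear form, the bounds `lo` and `hi`, and the unweighted reconstruction and quantization terms are assumptions for this sketch; the summary only states that collision penalties start strong and collaborative alignment dominates later.

```python
from typing import Tuple


def progress_weights(
    step: int, total_steps: int, lo: float = 0.1, hi: float = 1.0
) -> Tuple[float, float]:
    """Return (collision_weight, alignment_weight) for the current training step."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    t = min(max(step / total_steps, 0.0), 1.0)  # normalized progress in [0, 1]
    collision_w = hi - (hi - lo) * t            # decays from hi to lo
    alignment_w = lo + (hi - lo) * t            # grows from lo to hi
    return collision_w, alignment_w


def total_loss(
    recon: float, rq: float, collision: float, alignment: float,
    step: int, total_steps: int,
) -> float:
    """Combine the four loss terms with progress-dependent weights."""
    collision_w, alignment_w = progress_weights(step, total_steps)
    return recon + rq + collision_w * collision + alignment_w * alignment


early = progress_weights(0, 100)    # collision term dominates
late = progress_weights(100, 100)   # alignment term dominates
```

A cosine or step schedule would serve equally well; the point is only that the two weights move in opposite directions as training proceeds.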

Analyses of SID Space Structure

AdaSID directly optimizes for codebook utilization and SID space diversity. Comparative analyses highlight its effectiveness.

Figure 3: SID space landscape: AdaSID tokenizers yield higher entropy, reduced dominant-code concentration, and improved weakest-layer utilization compared to baselines.

AdaSID demonstrates:

Higher normalized minimum perplexity and SID entropy: Improved utilization across all codebook layers.
Reduced top-1 code load: Weaker index collapse, more evenly distributed SIDs.
Structural balance: Gains extend across all diversity/utilization axes, not arising from aggressive tuning of any single statistic.
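The statistics named above can be computed along the following lines. The exact definitions used in the paper are not given in this summary, so this sketch assumes that normalized perplexity is exp(entropy) divided by the codebook size, that the minimum is taken over layers, and that top-1 load is the share of items assigned to the single most common SID.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def entropy(counts: Iterable[int]) -> float:
    """Shannon entropy (in nats) of the distribution implied by the counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def normalized_perplexity(tokens: Sequence[int], codebook_size: int) -> float:
    """exp(entropy) / codebook size; 1.0 means perfectly uniform code usage."""
    return math.exp(entropy(Counter(tokens).values())) / codebook_size


def sid_space_stats(
    sids: Sequence[Tuple[int, ...]], codebook_size: int
) -> Dict[str, float]:
    """Summarize utilization and diversity of a set of SIDs (tuples of tokens)."""
    num_layers = len(sids[0])
    per_layer = [
        normalized_perplexity([sid[layer] for sid in sids], codebook_size)
        for layer in range(num_layers)
    ]
    sid_counts = Counter(sids)
    return {
        "min_normalized_perplexity": min(per_layer),  # weakest-layer utilization
        "sid_entropy": entropy(sid_counts.values()),
        "top1_load": max(sid_counts.values()) / len(sids),
    }


uniform = sid_space_stats([(0, 0), (0, 1), (1, 0), (1, 1)], codebook_size=2)
collapsed = sid_space_stats([(0, 0), (0, 0), (0, 0), (1, 1)], codebook_size=2)
```

In this toy comparison the uniform assignment has higher entropy and minimum perplexity and a lower top-1 load than the collapsed one, which is the direction of improvement attributed to AdaSID.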

Empirical Evaluation

Offline Benchmarks

AdaSID was evaluated on the Amazon Beauty and Toys multimodal datasets, using standardized item encoders and a downstream TIGER architecture. Compared with a range of recent quantization and tokenization methods (RQ-VAE, Improved VQGAN, GRVQ, SimRQ, RQ-KMeans, QuaSID), AdaSID reports consistent and significant improvements in Recall@3/5 and NDCG@3/5. On Toys, NDCG@3 improves from 0.0164 (best baseline) to 0.0175 (about 6.7%) and Recall@3 from 0.0195 to 0.0214 (about 9.7%), i.e., roughly 6–10% relative gains on these fine-grained ranking metrics.
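The relative gains follow directly from the reported Toys figures, as the short check below shows. The metric helpers are included only for orientation and assume a single held-out target item per user, which is a common next-item protocol but is not confirmed by this summary.

```python
import math
from typing import Sequence


def recall_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the held-out target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """NDCG with one relevant item: 1 / log2(rank + 1) if it is in the top k."""
    top_k = list(ranked[:k])
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position
    return 1.0 / math.log2(rank + 1)


def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`."""
    return (new - baseline) / baseline


# Figures reported for the Toys dataset.
ndcg3_gain = relative_gain(0.0164, 0.0175)    # about 0.067
recall3_gain = relative_gain(0.0195, 0.0214)  # about 0.097
```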

Industrial-Scale Online Deployment

In a production-scale A/B test on Kuaishou's short-video e-commerce retrieval model covering tens of millions of users, AdaSID yields statistically significant uplifts:

GMV (Gross Merchandise Volume): +0.98%
Orders: +0.91%
GPM (GMV per mille exposures): +1.16%

Offline ranking evaluation (CTCVR, cold-start CVR) shows consistent, albeit smaller, AUC improvements. These results indicate that improvements in both recommendation quality and SID representation translate into tangible product value.

Ablation and Sensitivity Studies

Component-wise ablation reveals:

Semantic-adaptive relaxation (SeAR): Most critical on datasets with high semantic variance (e.g., Toys), with removal causing the largest performance drop.
Progress-adaptive rebalancing (PAR): Especially crucial in datasets sensitive to training dynamics (e.g., Beauty).
Load-adaptive strengthening (LAS): Contributes consistently, stabilizing SID-space structure but with milder impact than SeAR/PAR.

Figure 4: AdaSID hyperparameter sensitivity (Beauty dataset): Performance is best with moderate adaptation strength; over-tuning reduces gains.

Sensitivity analysis confirms that AdaSID performs best with balanced adaptation; adaptive mechanisms that are either too weak or too aggressive reduce the gains.

Theoretical and Practical Implications

AdaSID offers a general template for adaptive regularization in discrete representation learning. Adaptive overlap regulation, load- and context-aware penalties, and training-stage-dependent weighting are complementary modules that can be repurposed for joint user-item discrete modeling and tightly coupled generative recommenders. The framework is directly compatible with residual quantization and can be extended to deeper codebooks and richer multimodal signal integration.

Practically, AdaSID enables more robust and reusable SID spaces for real-time retrieval, ranking, and LLM-driven generative recommendation (2604.23522).

Conclusion

AdaSID advances SID learning for multimodal recommendation by introducing adaptive, context- and progress-aware collision handling. Offline and online experiments confirm its ability to improve both ranking accuracy and SID-space utility. Future directions include joint user-item discrete modeling, tighter end-to-end generative integration, and application to at-scale multimodal recommendation frameworks.
