Thinking Broad, Acting Fast: Latent Reasoning Distillation from Multi-Perspective Chain-of-Thought for E-Commerce Relevance
Abstract: Effective relevance modeling is crucial for e-commerce search, as it aligns search results with user intent and enhances customer experience. Recent work has leveraged LLMs to address the limitations of traditional relevance models, especially for long-tail and ambiguous queries. By incorporating Chain-of-Thought (CoT) reasoning, these approaches improve both accuracy and interpretability through multi-step reasoning. However, two key limitations remain: (1) most existing approaches rely on single-perspective CoT reasoning, which fails to capture the multifaceted nature of e-commerce relevance (e.g., user intent vs. attribute-level matching vs. business-specific rules); and (2) although CoT-enhanced LLMs offer rich reasoning capabilities, their high inference latency necessitates knowledge distillation for real-time deployment, yet current distillation methods discard the CoT rationale structure at inference, using it as a transient auxiliary signal and forfeiting its reasoning utility. To address these challenges, we propose a novel framework that better exploits CoT semantics throughout the optimization pipeline. Specifically, the teacher model leverages Multi-Perspective CoT (MPCoT) to generate diverse rationales and combines Supervised Fine-Tuning (SFT) with Direct Preference Optimization (DPO) to construct a more robust reasoner. For distillation, we introduce Latent Reasoning Knowledge Distillation (LRKD), which endows a student model with a lightweight inference-time latent reasoning extractor, allowing efficient and low-latency internalization of the LLM's sophisticated reasoning capabilities. Evaluated in offline experiments and online A/B tests on an e-commerce search advertising platform serving tens of millions of users daily, our method delivers significant offline gains, showing clear benefits in both commercial performance and user experience.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise, actionable list of what remains uncertain or unexplored in the paper and could guide future research:
- Limited input modality and content scope: the student and teacher reason only over query and product title; it is unclear how MPCoT/LRKD perform when incorporating richer signals common in e-commerce (structured attributes, bullet points, descriptions, user reviews, seller metadata, images). Open question: how to extend MPCoT and LRKD to multi-modal, attribute-rich inputs while preserving latency budgets?
- Generalization across languages and locales is under-examined: results are reported for selected EN/ES/JP subsets, but there is no per-language analysis on the 6 AliExpress languages (EN/ES/KO/JP/PT/FR) nor assessment under code-switching, transliteration, or low-resource languages. Open question: how robust are MPCoT and LRKD to linguistic variability, morphology, tokenization, scripts, and translation noise?
- Distribution shift and sampling bias: the AliExpress sample “does not reflect actual online traffic patterns,” yet it drives training and evaluation. Open question: how do results change under realistic traffic distributions (e.g., heavy-tailed, seasonal, promotion periods) and long-tail queries?
- Ranking relevance vs. multi-class classification: the paper frames relevance as multi-class classification and reports ACC/F1; there is no evaluation on ranking metrics (e.g., NDCG, MAP) or integration into retrieval pipelines. Open question: how to adapt MPCoT/LRKD to ranking, and what is the impact on end-to-end search metrics?
- Online serving practicality at scale: inference latency is reported on an A100 for batches of 100 pairs, but real systems score thousands of candidates per query under tight SLAs, often on CPUs or mixed hardware. Missing are QPS, p95/p99 latencies, memory footprint, and throughput under typical candidate set sizes. Open question: can LRKD meet production SLAs across heterogeneous hardware and large candidate pools?
- Interpretability at inference time: LRKD retains a latent reasoning vector that is not human-readable; no mechanism is provided for generating concise, faithful explanations online. Open question: how to translate latent reasoning into actionable, auditable explanations without incurring generative latency?
- Faithfulness and validity of teacher rationales: the paper assumes LLM CoTs are reliable and does not evaluate hallucination rates, causal faithfulness, or consistency between rationales and predictions. Open question: how to measure and enforce rationale faithfulness in MPCoT and ensure students don’t learn spurious reasoning patterns?
- Portability and maintainability of “Business Rule” perspective: rules are platform-specific and not fully specified; it’s unclear how to transfer to new marketplaces or update when policies change. Open question: can business rules be learned or parameterized (e.g., via programmatic weak supervision) to ease maintenance and portability?
- Perspective set design and scalability: the three perspectives are hand-crafted; the paper does not explore automatic discovery of new perspectives, perspective selection, or mixture-of-experts routing. Open question: how to learn, expand, or prune perspectives adaptively, and how many perspectives are optimal under compute constraints?
- DPO preference construction biases: “chosen” responses come from pass@5 across perspectives, potentially favoring verbose or longer CoTs and introducing selection bias. Open question: how do alternative preference-construction schemes (e.g., human-in-the-loop, pairwise adversarial mining, calibration for length) affect outcomes?
- Distillation signal choice and objective: LRKD aligns student latent vectors to frozen BGE-M3 CoT embeddings via MSE; the paper does not explore alternative encoders (e.g., multilingual, domain-tuned), contrastive/self-supervised objectives, or alignment to teacher hidden states/attention. Open question: which representation and loss best capture reasoning semantics while remaining efficient?
- Sensitivity analyses are missing: key hyperparameters (e.g., λ for guidance loss), the choice of embedding model, extractor architecture trade-offs, and input length truncation are not systematically probed. Open question: how sensitive are gains to these choices and where are the robustness boundaries?
- Robustness to noise and adversarial inputs: there is no evaluation under typos, spammy titles, adversarial keyword stuffing, paraphrases, or domain drift (new brands/models). Open question: what defenses (data augmentation, adversarial training, consistency regularization) are effective for MPCoT/LRKD?
- Cross-dataset and zero-shot transfer: both training and evaluation rely on the same datasets/splits; there is no study of transfer from one marketplace or taxonomy to another, or to unseen categories. Open question: how well do models generalize across datasets, taxonomies, and geographies without retraining?
- Calibration and thresholding for business tiers: the online pipeline maps scores to discrete tiers but the work does not address calibration, stability, or drift of thresholds. Open question: how to calibrate multi-class outputs reliably across time and segments, and how does LRKD affect calibration error (e.g., ECE)?
- Teacher model choice and reproducibility: results hinge on Qwen3-14B with LoRA; there is no comparison across teacher families/sizes, nor release of prompts/rules/data. Open question: how dependent are gains on the teacher architecture and can smaller or open models achieve similar results?
- Student architecture scope: only cross-encoders are considered; dual-encoders or late-interaction models (e.g., ColBERT) might better fit high-throughput retrieval. Open question: can LRKD be adapted to dual-encoders or hybrid pipelines (dual-encoder recall + cross-encoder re-rank) while preserving reasoning benefits?
- Limited error analysis: while category-level insights are discussed qualitatively, there is no detailed error taxonomy per perspective, language, or class (e.g., confusion between attribute vs. model mismatch). Open question: what failure modes persist after MPCoT/LRKD and how can targeted data or rules address them?
- Short A/B test horizon and scope: the 7-day, 10% traffic A/B shows modest lifts without confidence intervals, segment analysis, or latency impact on user flow. Open question: do improvements persist over longer horizons and across user segments, and what are the causal pathways (e.g., better filtering vs. better ranking)?
- Data governance and bias: business-rule reasoning may encode platform-specific biases (e.g., against third-party accessories/brands). Open question: how to audit and mitigate unintended biases introduced by MPCoT rationales and their distillation, especially across languages and markets?
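To make the distillation-objective and sensitivity items above concrete, the following is a minimal sketch of an LRKD-style training loss: a classification term plus an MSE term that pulls the student's latent reasoning vector toward a frozen embedding of the teacher's CoT, weighted by λ. The function names, the additive form of the total loss, and the default λ = 0.5 are assumptions made for illustration and are not taken from the paper; plain Python lists stand in for tensors.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over relevance-class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Classification loss for one query-product pair."""
    return -math.log(softmax(logits)[label])


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def lrkd_style_loss(
    logits: Sequence[float],
    label: int,
    student_latent: Sequence[float],
    cot_embedding: Sequence[float],
    lam: float = 0.5,  # hypothetical guidance weight (the "λ" above)
) -> float:
    """Assumed form: cross-entropy + lam * MSE(student latent, frozen CoT embedding)."""
    return cross_entropy(logits, label) + lam * mse(student_latent, cot_embedding)


if __name__ == "__main__":
    # Toy values: 3 relevance classes, 2-dimensional latent vectors.
    loss = lrkd_style_loss([0.2, 1.5, -0.3], 1, [0.1, 0.4], [0.0, 0.5], lam=0.5)
    print(f"total loss: {loss:.4f}")
```

A sensitivity study of the kind the list calls for would sweep `lam` and swap the source of `cot_embedding` (for example, a different frozen encoder) while holding everything else fixed.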