Unified Supervision for Walmart's Sponsored Search Retrieval via Joint Semantic Relevance and Behavioral Engagement Modeling

Published 9 Apr 2026 in cs.IR | (2604.07930v2)

Abstract: Modern search systems rely on a fast first-stage retriever to fetch relevant items from a massive catalog. Deployed search systems often use user engagement signals to supervise bi-encoder retriever training at scale, because these signals are continuously logged from real traffic and require no additional annotation effort. However, engagement is an imperfect proxy for semantic relevance. Items may receive interactions due to popularity, promotion, attractive visuals, titles, or price, despite weak query-item relevance. These limitations are further accentuated in Walmart's e-commerce sponsored search. User engagement on ad items is often structurally sparse because the frequency with which an ad is shown depends on factors beyond relevance, such as whether the advertiser is currently running that ad, the outcome of the auction for available ad slots, bid competitiveness, and advertiser budget. Thus, even highly relevant query-ad pairs can have limited engagement signals simply due to limited impressions. We propose a bi-encoder training framework for Walmart's sponsored search retrieval in e-commerce that uses semantic relevance as the primary supervision signal, with engagement used only as a preference signal among relevant items. Concretely, we construct a context-rich training target by combining (1) graded relevance labels from a cascade of cross-encoder teacher models, (2) a multichannel retrieval prior score derived from the rank positions and cross-channel agreement of retrieval systems running in production, and (3) user engagement applied only to semantically relevant items to refine preferences. Our approach outperforms the current production system in both offline evaluation and online A/B tests, yielding consistent gains in average relevance and NDCG.

Authors (7)

Summary

The paper establishes a unified supervision framework that prioritizes semantic relevance over raw engagement to enhance sponsored search retrieval.
It combines graded relevance labels, multi-channel retrieval priors, and debiased user engagement into a single training target for the bi-encoder.
Offline evaluation shows gains over the production system in P@25, NDCG@25, and average relevance, and online A/B tests show lifts in impressions and add-to-cart rate.

Unified Supervision for Sponsored Search Retrieval: Integrating Semantic Relevance and Behavioral Engagement

Motivation and Problem Statement

Deployed e-commerce retrieval systems frequently rely on user engagement signals (e.g., clicks, orders) as supervision for large-scale bi-encoder retriever training. While user engagement is efficiently collectable, it is a notoriously noisy and biased proxy for semantic relevance, especially in sponsored ad search contexts. Engagement signals often capture influences beyond semantic match—such as item popularity, promotions, presentation, or pricing—and can be structurally sparse for candidate ads due to auction mechanics, advertiser budget constraints, and ad slot limitations. This sparsity and bias present critical challenges: retrievers supervised on engagement may favor popular but less semantically relevant items and exhibit weak generalization for cold-start and long-tail inventory. Prior frameworks have attempted to filter engagement supervision with relevance-based heuristics, but typically retain engagement as the dominant training signal. This paper proposes a fundamental shift by positioning semantic relevance as the primary supervision, using engagement strictly as a preferential signal among already-relevant items (2604.07930).

Unified Supervision Framework

The central contribution is a unified supervision paradigm for bi-encoder ad retrieval, which synthesizes training targets from three heterogeneous but complementary sources:

Graded Relevance Labels: Relevance is rated on a 5-point ordinal scale using a cascade of cross-encoder teacher models (including LLMs like Gemma and LLaMA-3) and available human annotations. Early exit based on confidence ensures computational efficiency; final labels are rescaled to $[0,1]$.
Multi-Channel Retrieval Prior: Rank positions from production channels are captured, normalized, and aggregated to yield a prior score and channel consensus that reflect system-level confidence and identify hard negatives (highly ranked but semantically irrelevant).
User Engagement: Historical engagement (orders, add-to-carts, clicks, views) is aggregated, debiased, and combined via learnable weights and non-linear transformations. Critically, engagement is applied only among semantically relevant pairs to refine preferences, preventing popularity-driven irrelevance.
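The first two signals can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the grade scale is taken to run from 1 to 5, the rescaling is linear, the confidence threshold of 0.9 and the retrieval depth of 100 are placeholders, and the names `cascade_label` and `retrieval_prior` are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: a teacher maps (query, item) to (grade in 1..5, confidence in [0, 1]).
Teacher = Callable[[str, str], Tuple[int, float]]


def cascade_label(query: str, item: str, teachers: List[Teacher],
                  min_confidence: float = 0.9) -> float:
    """Run teachers from cheapest to costliest, stop at the first confident
    one, and rescale the grade linearly from 1..5 to [0, 1]."""
    if not teachers:
        raise ValueError("at least one teacher is required")
    grade = 1
    for teacher in teachers:
        grade, confidence = teacher(query, item)
        if confidence >= min_confidence:
            break  # early exit: the costlier teachers are skipped
    # If no teacher was confident, the last teacher's grade is used.
    return (grade - 1) / 4.0


def retrieval_prior(ranks: Dict[str, Optional[int]],
                    depth: int = 100) -> Tuple[float, float]:
    """ranks maps channel name -> 1-based rank, or None if the channel did
    not retrieve the item. Returns (prior, consensus), both in [0, 1]."""
    # Normalize each rank so that rank 1 scores 1.0 and deeper ranks score less.
    scores = [1.0 - (r - 1) / depth
              for r in ranks.values() if r is not None and r <= depth]
    if not scores:
        return 0.0, 0.0
    prior = sum(scores) / len(scores)      # assumption: mean over channels
    consensus = len(scores) / len(ranks)   # share of channels that agree
    return prior, consensus
```

Under these assumptions, an item with a high prior and consensus but a low cascade label is a candidate hard negative: production channels rank it well even though the teachers judge it irrelevant.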

A formalized target is constructed, where positive and negative query-item pairs are scored with weighted combinations of these signals. Positives utilize relevance, rank prior, channel consensus, and a gated engagement boost. Negatives are prioritized using hard-negative mining—focusing on both deceptive (highly ranked) and lexically similar but irrelevant items.
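A minimal sketch of such a target follows, assuming all four signals have already been scaled to $[0,1]$. The fixed weights, the relevance threshold of 0.5, and the names `Signals`, `positive_target`, and `hard_negative_score` are hypothetical; the paper describes learnable weights and non-linear transformations that are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    relevance: float   # graded label rescaled to [0, 1]
    prior: float       # multi-channel rank prior in [0, 1]
    consensus: float   # share of channels that retrieved the item
    engagement: float  # debiased engagement score in [0, 1]


def positive_target(s: Signals, rel_threshold: float = 0.5,
                    w_rel: float = 0.70, w_prior: float = 0.15,
                    w_cons: float = 0.05, w_eng: float = 0.10) -> float:
    """Weighted target in which engagement only counts for relevant pairs."""
    # Gate: engagement contributes nothing below the relevance threshold.
    gate = 1.0 if s.relevance >= rel_threshold else 0.0
    score = (w_rel * s.relevance + w_prior * s.prior
             + w_cons * s.consensus + gate * w_eng * s.engagement)
    return min(1.0, score)


def hard_negative_score(s: Signals, lexical_overlap: float,
                        rel_threshold: float = 0.5) -> float:
    """Priority for negative sampling: high for irrelevant items that are
    either highly ranked in production or lexically close to the query."""
    if s.relevance >= rel_threshold:
        return 0.0  # relevant pairs are never used as negatives
    return max(s.prior, lexical_overlap)
```

Because relevance carries the largest weight and engagement is gated, a popular but irrelevant item cannot outscore a relevant item with no engagement in this sketch, which is the property the framework is designed to enforce.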

Model Training and Losses

A MiniLM-based bi-encoder architecture is fine-tuned under this unified supervision. Contrastive losses (via Cached Multiple Negatives Ranking) and cosine similarity loss are both employed. Hard-negative mining is supported by curriculum-based sampling, emphasizing challenging negatives as determined by the unified scoring function.
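The two loss types can be written out in plain Python to make the objectives concrete. This is a sketch of the standard in-batch formulations, not the paper's training code: the scale of 20 and the function names are assumptions, embeddings are assumed to be non-zero vectors, and the gradient caching used by the "cached" variant is omitted because it affects memory use rather than the loss value.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multiple_negatives_ranking_loss(queries: List[Vector], items: List[Vector],
                                    scale: float = 20.0) -> float:
    """In-batch contrastive loss: items[i] is the positive for queries[i],
    and every other item in the batch serves as a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, d) for d in items]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])  # cross-entropy with label i
    return sum(losses) / len(losses)


def cosine_similarity_loss(queries: List[Vector], items: List[Vector],
                           targets: List[float]) -> float:
    """Mean squared error between cosine(query, item) and a soft target,
    for example a unified supervision score in [0, 1]."""
    errors = [(cosine(q, d) - t) ** 2
              for q, d, t in zip(queries, items, targets)]
    return sum(errors) / len(errors)
```

The contrastive term pushes each query toward its positive and away from in-batch and mined negatives, while the regression term lets graded targets shape how close each pair should be.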

Empirical Evaluation

Quantitative Results

Offline evaluation is conducted across 30,303 queries spanning both head and tail segments, with top-25 retrieval evaluated for sponsored ad slot relevancy. Key findings:

Relevance-Only Supervision: Substantially outperforms production, improving P@25 from 0.794 to 0.873 (+10.0%), NDCG@25 from 0.867 to 0.913 (+5.4%), and average relevance from 3.040 to 3.263 (+7.3%).
Unified Supervision (Relevance+Engagement): Additional gains are realized: P@25 to 0.877 (+10.5%), NDCG@25 to 0.916 (+5.7%), and average relevance to 3.277 (+7.8%). This shows that incorporating engagement preferentially among relevant candidates improves ordering without diluting semantic quality.
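For reference, the metrics above can be computed as in the following sketch. The summary does not state the relevance threshold for P@25 or the gain function for NDCG@25, so the threshold of 3, the linear gain, and the choice to compute the ideal ordering from the judged list itself are all assumptions, as are the function names.

```python
import math
from typing import List


def precision_at_k(grades: List[int], k: int = 25,
                   min_relevant_grade: int = 3) -> float:
    """Share of the top-k slots filled by items graded as relevant.
    Missing slots (fewer than k results) count as not relevant."""
    return sum(g >= min_relevant_grade for g in grades[:k]) / k


def dcg_at_k(grades: List[int], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))


def ndcg_at_k(grades: List[int], k: int = 25) -> float:
    """DCG normalized by the DCG of the same grades in ideal order."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


def relative_lift(baseline: float, treatment: float) -> float:
    """Relative change in percent, as used for the reported lifts."""
    return 100.0 * (treatment - baseline) / baseline
```

As a check on the arithmetic, `relative_lift(0.794, 0.877)` evaluates to roughly 10.45, which is consistent with the reported +10.5% for P@25 under unified supervision.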

A/B testing on live traffic indicates significant positive lifts in business/engagement metrics, such as impressions (+0.60%, $p=0.03$) and add-to-cart rate (+0.99%, $p=0.009$).

Figure 1: Engagement supervision increases the share of highly engaged items retrieved in the Top 25 without relevance degradation.

Figure 1 shows the percentage-point gains in the share of highly engaged items within the Top-25 set, indicating that engagement-aware supervision brings preferred items into the retrieved slots while the relevance filter stays in place.

Qualitative Results

Analysis of sampled queries shows that unified supervision both resolves off-intent retrievals and promotes highly engaged options among plausible matches. For the query “cetaphil baby lotion,” the production system returns an irrelevant (“wash”) product, while the relevance+engagement model surfaces the intended (“lotion”) item with the highest engagement. The model also disambiguates brands and dietary preferences, and surfaces items favored by real user behavior.

Figure 2: Green highlights: phrases aligned with intent; red highlights: failures. Unified supervision retrieves both relevant and highly-engaged items; Relevance-only retrieves relevant but unpopular items.

Figure 2 illustrates that the unified approach balances semantic alignment and empirical preference, rectifying both relevance and engagement deficiencies.

Implications and Future Directions

This work demonstrates that for sponsored search retrieval, relevance-centric supervision with engagement as a secondary within-relevance discriminator yields robust, scalable retrieval models that generalize effectively under sparsity and exposure bias. The integration of production-channel priors serves not only as a mechanism for reinforced positives but, crucially, as a conduit for efficient hard-negative mining—guiding the model to address extant production failure modes.

Practically, these results indicate that production-scale business metrics in high-volume e-commerce can be improved by decoupling engagement from relevance in model supervision, rather than conflating the two. This paradigm enables retrieval systems to align with downstream ranking and business objectives without the detrimental effect of promoting purely popular but semantically irrelevant items.

Theoretically, the unified framework opens up avenues for curriculum learning, distillation from richer teacher architectures, and targeted intervention with new forms of engagement (e.g., dwell time, post-click satisfaction, or cross-session behavior). Ongoing research may extend these principles to unified retrieval-ranking cascades, multi-modal supervision, or even zero-shot/adaptive retrieval systems leveraging foundation models.

Conclusion

This paper establishes unified, relevance-primary supervision as an effective solution for sponsored search retrieval under sparse and biased engagement. Composite targets integrating relevance, retrieval priors, and engagement yield retrieval models that are both semantically robust and preference-aligned, achieving consistent improvements over the current production system in both offline and online evaluations. The general framework should carry over to other large-scale retrieval and recommendation settings where engagement signals are sparse or biased.
