Learning from Noisy Prompts: Saliency-Guided Prompt Distillation for Robust Segmentation with SAM

Published 25 Apr 2026 in cs.CV | (2604.23314v1)

Abstract: Segmentation is central to clinical diagnosis and monitoring, yet the reliability of modern foundation models in medical imaging still depends on the availability of precise prompts. The Segment Anything Model (SAM) offers powerful zero-shot capabilities, although it collapses under the weak, generic, and noisy prompts that dominate real clinical workflows. In practice, annotations such as centerline points are coarse and ambiguous, often drifting across neighboring anatomy and misguiding SAM toward inconsistent or incomplete masks. We introduce SPD, a Saliency-Guided Prompt Distillation framework that converts these unreliable cues into robust guidance. SPD first learns data-driven anatomical priors through a lightweight saliency head to obtain confident localization maps. These priors then drive Contextual Prompt Distillation, which validates and enriches noisy prompts using cues from anatomically adjacent slices, producing a consensus prompt set that matches the behavior of expert reasoning. A Pairwise Slice Consistency objective further enforces local anatomical coherence during segmentation. Experiments on four challenging MRI and CT benchmarks demonstrate that SPD consistently outperforms existing SAM adaptations and supervised baselines, delivering large gains in both region-based and boundary-based metrics. SPD provides a practical and principled path toward reliable foundation model deployment in clinical environments where only imperfect prompts are available.


Authors (12)

Summary

The paper presents a Saliency-Guided Prompt Distillation (SPD) framework that refines noisy clinical prompts by leveraging anatomical priors.
It employs a two-stage process, Anatomical Prior Learning followed by Prompt-Guided Segmentation with Contextual Prompt Distillation, to enhance segmentation accuracy; on the TI dataset, consensus prompts alone raise zero-shot Dice by 14.2%.
Empirical results on diverse MRI and CT datasets demonstrate SPD’s superiority over existing methods, ensuring robust segmentation despite prompt noise.

Saliency-Guided Prompt Distillation for Robust Medical Image Segmentation with SAM

Motivation and Problem Definition

Robust segmentation of anatomical structures is fundamental for clinical assessment and intervention planning, yet the clinical reality is that precise prompts—such as pixel-perfect points and boxes required by prompt-adapted segmentation foundation models—are rarely available. The Segment Anything Model (SAM) achieves state-of-the-art zero-shot segmentation results with high-quality prompts on natural images, but it fails in clinical workflows where annotation is noisy, coarse, or ambiguous. Centerline points provided by radiologists often follow broad anatomical paths, drifting between neighboring structures and only partially covering the region of interest. This prompt noise degrades SAM's adaptation and limits its reliability in medical deployment.

Figure 1: Examples of noisy clinical prompts—centerline annotations (red points) and ground-truth masks (green) for abdominal MRI—highlighting regions where ambiguous prompts mislead SAM away from ideal segmentation.

Methodology: Saliency-Guided Prompt Distillation Framework

The proposed Saliency-Guided Prompt Distillation (SPD) framework addresses the bottleneck imposed by noisy prompts, emulating expert reasoning through two stages: Anatomical Prior Learning and Prompt-Guided Segmentation.

Stage I: Anatomical Prior Learning. A lightweight saliency head, co-trained with LoRA-adapted encoder features, generates anatomical saliency maps using limited pixel-wise supervision. These maps function as anatomical priors, highlighting regions plausibly belonging to target structures even under sparse annotation. The saliency head is optimized with Dice and Focal losses, capturing coarse yet task-relevant localization information without requiring high segmentation fidelity.

Stage II: Prompt-Guided Segmentation. Contextual Prompt Distillation (CPD) uses the anatomical priors to refine prompts. The module first filters the local prompts, then enriches them with cues from adjacent slices, enforcing spatial consistency and anatomical coherence. This yields a “consensus prompt set” that mitigates the ambiguity of single-slice cues, paralleling how radiologists synthesize information across slices. The SAM decoder, adapted via LoRA, produces the final segmentation guided by these consensus prompts. A Pairwise Slice Consistency (PSC) loss constrains predictions on adjacent slices to agree, regularizing local anatomical continuity without imposing global smoothing.

Figure 2: SPD architecture: Stage I learns anatomical saliency; Stage II distills consensus prompts from noisy cues across neighboring slices, enforcing local spatial coherence for robust segmentation.
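
The Stage I objective combines a Dice term and a Focal term. The summary does not give the exact formulation, so the following is only a minimal sketch of the standard soft Dice and binary focal losses on flattened probability maps. The function names, the equal weighting (`focal_weight=1.0`), and the defaults `gamma=2.0` and `alpha=0.25` are assumptions for illustration, not values taken from the paper.

```python
import math

def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss: 1 - 2*sum(p*t) / (sum(p) + sum(t))."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss, averaged over pixels (gamma/alpha are assumed defaults)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)        # clip to keep log() finite
        p_t = p if t == 1 else 1.0 - p         # probability of the true class
        a_t = alpha if t == 1 else 1.0 - alpha # class-balancing weight
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)

def saliency_loss(probs, targets, focal_weight=1.0):
    """Hypothetical combined objective: Dice + weighted Focal."""
    return dice_loss(probs, targets) + focal_weight * focal_loss(probs, targets)
```

The Dice term rewards overlap with the sparse foreground, while the focal term down-weights the many easy background pixels, which suits a head that only needs coarse localization.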

Effectiveness of Consensus Prompts and Cross-Slice Enrichment

SPD’s CPD module performs local prompt validation followed by context-aware enrichment using dual saliency map validation. Prompts from neighboring slices, if anatomically consistent, are propagated to the current slice. This process consolidates sparse, noisy inputs into a consensus prompt set providing reliable guidance for SAM.
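
The validate-then-enrich flow described above can be sketched as follows. This is one plausible reading, not the authors' implementation: it assumes that "dual saliency map validation" means a neighboring prompt must lie in a salient region of both its own slice and the current slice, and that prompts are `(row, col)` points. The names `validate_local`, `consensus_prompts`, `tau`, and `context` are hypothetical.

```python
def validate_local(prompts, saliency, tau=0.5):
    """Keep only prompts that fall on salient pixels of their own slice."""
    return [(r, c) for r, c in prompts if saliency[r][c] >= tau]

def consensus_prompts(prompts, saliency, idx, tau=0.5, context=1):
    """Build a consensus prompt set for slice `idx`.

    prompts:  list (one entry per slice) of (row, col) points
    saliency: list (one entry per slice) of 2D saliency maps in [0, 1]
    context:  number of neighboring slices to consult on each side
    """
    consensus = validate_local(prompts[idx], saliency[idx], tau)
    for offset in range(1, context + 1):
        for j in (idx - offset, idx + offset):
            if not 0 <= j < len(prompts):
                continue  # neighbor lies outside the volume
            for r, c in prompts[j]:
                # Assumed dual validation: salient in the neighbor AND here.
                salient_there = saliency[j][r][c] >= tau
                salient_here = saliency[idx][r][c] >= tau
                if salient_there and salient_here and (r, c) not in consensus:
                    consensus.append((r, c))
    return consensus
```

In this sketch a drifting local prompt is dropped when it falls outside the salient region, and a neighbor's prompt is borrowed only when the current slice's own prior agrees with it.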

Figure 3: Visualization of consensus prompt integration—prompts from neighboring slices (yellow crosses) augment sparse local cues (red dots), closely aligning with ground truth regions (green masks).

Empirical evaluation shows that consensus prompts far outperform both randomly sampled noisy points and full original centerline sets. On the TI dataset, consensus prompts alone increase Dice by 14.2% and IoU by 13.6% in zero-shot frozen SAM inference.
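
For reference, the two overlap metrics quoted above follow their standard definitions. The snippet below is a minimal sketch on flattened binary masks and is not the paper's evaluation code; the function name `dice_iou` and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def dice_iou(pred, gt):
    """Return (Dice, IoU) for two flattened binary masks of equal length."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    pred_area = sum(1 for p in pred if p)
    gt_area = sum(1 for g in gt if g)
    union = pred_area + gt_area - inter
    if union == 0:
        return 1.0, 1.0  # assumed convention: two empty masks agree perfectly
    dice = 2.0 * inter / (pred_area + gt_area)
    iou = inter / union
    return dice, iou
```

The two metrics are monotonically related (IoU = Dice / (2 - Dice)), which is why gains in one usually track gains in the other.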

Figure 4: Comparative zero-shot segmentation accuracy using various prompt sources—consensus prompts yield strong improvements over all baselines, including full noisy annotation sets.

Empirical Results and Comparative Performance

SPD is benchmarked on four medical imaging datasets spanning MRI and CT, including scenarios with both real and synthetically noisy prompts. It demonstrates consistent superiority over both traditional supervised segmentation models (UNet, TransUNet, nnUNet) and recent SAM-based adaptations (SAM-Tuning, MedSAM, MSA, SAM-Refiner). For example, on the TI dataset—which features genuine clinical prompt noise—SPD achieves 73.58% DSC, surpassing the best baseline (SAM-Tuning) by +11.08%, while reducing HD95 by 6.28. Improvements are similarly strong on Scar, FUMPE, and KiTS datasets.
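
HD95, the boundary metric reported here, is the 95th percentile of the Hausdorff distance between predicted and reference boundaries. Conventions differ between toolkits (for example, whether the two directed distance sets are pooled before taking the percentile), and the summary does not say which one the authors use. The sketch below assumes the boundaries are already given as lists of point coordinates, takes the larger of the two directed 95th percentiles, and uses linear interpolation; all function names are hypothetical.

```python
import math

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def directed_distances(a, b):
    """Distance from each point in `a` to its nearest point in `b`."""
    return [min(math.dist(p, q) for q in b) for p in a]

def hd95(a, b):
    """Assumed convention: max of the two directed 95th-percentile distances."""
    return max(percentile(directed_distances(a, b), 95),
               percentile(directed_distances(b, a), 95))
```

Using the 95th percentile rather than the maximum makes the metric less sensitive to a handful of outlier boundary points.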

Qualitative analysis highlights SPD’s impact: traditional methods under-segment or mis-segment due to annotation scarcity, while SAM-based models are misled by noisy prompts into erroneous regions. SPD’s distilled guidance produces segmentations that are both more accurate and anatomically faithful.

Figure 5: Qualitative comparison demonstrating SPD's robust segmentation in challenging views and diverse datasets relative to competing methods.

Beyond 2D slices, SPD outperforms recent 3D SAM-based approaches (SAM2, MedSAM2) on volumetric metrics such as Average Dice and Volumetric Dice, underscoring SPD’s superior prompt robustness and spatial coherence across volume data.

Figure 6: Performance comparison with 3D SAM-based baselines on the TI dataset, showing consistent superiority in both region and volumetric overlap metrics.

Ablation Studies and Hyperparameter Robustness

SPD’s critical components are validated through ablation:

Local prompt validation enhances DSC by +4.45% over baseline.
CPD (without PSC) provides an additional +3.37% boost, confirming the value of contextual refinement.
PSC further improves DSC (to 73.58%), adding boundary precision and segmentation stability.

Analysis of PSC shows that bi-directional consistency does not improve on uni-directional enforcement and in fact performs marginally worse, likely due to conflicting anatomical context. SPD is also robust to hyperparameter choices, including the saliency threshold and the number of contextual slices.

Figure 7: Impact of guidance source in PSC—uni-directional slice consistency yields optimal segmentation; bi-directional enforcement introduces marginal degradation.
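
The summary does not state the exact form of the PSC loss, so the following is only an illustrative sketch of the uni-directional versus bi-directional distinction. It assumes that each neighboring prediction acts as a fixed guidance target and that disagreement is measured by mean squared error; both choices, along with the function names, are assumptions rather than the authors' formulation.

```python
def pairwise_consistency(current, guide):
    """Mean squared disagreement between two flattened probability maps."""
    return sum((c - g) ** 2 for c, g in zip(current, guide)) / len(current)

def psc_loss(current, guides):
    """Average disagreement between `current` and each guiding neighbor.

    Pass one neighbor for uni-directional guidance, or both the previous
    and next slice for bi-directional guidance.
    """
    return sum(pairwise_consistency(current, g) for g in guides) / len(guides)
```

With a single guide the current slice is pulled toward one coherent target. With two guides that disagree with each other, no prediction can drive the loss to zero, which is one way to picture the "conflicting anatomical context" noted above.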

Figure 8: Ablation on the number of contextual slices in CPD. Performance improves with moderate context and does not degrade as more slices are included.

SPD’s saliency map is highly confident and discriminative: background values dominate the prediction space and are clearly separated from foreground values, which minimizes sensitivity to the saliency threshold and simplifies deployment.

Figure 9: Distribution of predicted saliency values reveals a clear background/foreground separation, indicating robustness to threshold selection.
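
One simple way to check this kind of threshold robustness on a given saliency map is to sweep the threshold and see whether the selected foreground fraction changes. The sketch below is a hypothetical diagnostic, not part of SPD; the function names and the threshold values `0.3`, `0.5`, and `0.7` are assumptions.

```python
def foreground_fraction(saliency, tau):
    """Fraction of pixels in a 2D saliency map with value >= tau."""
    flat = [v for row in saliency for v in row]
    return sum(1 for v in flat if v >= tau) / len(flat)

def threshold_sweep(saliency, taus=(0.3, 0.5, 0.7)):
    """Map each candidate threshold to the foreground fraction it selects."""
    return {tau: foreground_fraction(saliency, tau) for tau in taus}
```

When the saliency values cluster near 0 and near 1, every threshold in the sweep selects the same pixels, so the downstream prompt filtering is unaffected by the exact choice.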

Practical and Theoretical Implications

SPD presents a practical solution for foundation model deployment in medical environments typified by prompt ambiguity and annotation scarcity. By emulating expert spatial reasoning and leveraging contextual anatomical priors, SPD achieves both prompt robustness and segmentation accuracy, bridging the gap between research-grade models and clinical applicability. The generalizable architecture supports integration with diverse promptable foundation models and can catalyze future research into cross-slice reasoning, uncertainty-aware distillation, and scalable medical model adaptation.

Conclusion

Saliency-Guided Prompt Distillation (SPD) enables robust adaptation of foundation models like SAM to medical image segmentation in the presence of inherently noisy prompts. Through anatomical prior learning, contextual prompt distillation, and spatial slice consistency, SPD achieves consistent improvement over state-of-the-art methods across several modalities and datasets. The effective handling of prompt ambiguity is pivotal for real-world clinical model deployment, and SPD sets a principled foundation for future advances in robust prompt-guided segmentation.