Robustness of existing radiology report evaluation approaches across modalities and anatomies

Determine whether large language model–based metrics and fine-tuned small-model evaluators developed primarily for chest X-ray report evaluation are robust when applied to radiology reports from other imaging modalities and anatomical regions.

Background

Prior work on radiology report evaluation has largely focused on chest X-rays, using LLM-based metrics and fine-tuned small models. Because medical imaging varies substantially by modality and anatomical focus, methods tuned for chest X-rays may not generalize well to MRI, CT, ultrasound, or other modalities and anatomical regions.

The paper notes that generalization beyond chest X-rays has not been adequately established. This gap motivates a systematic study on RadEval and RaTE-Eval of whether such approaches remain reliable across diverse settings.

References

However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies.

VERT: Reliable LLM Judges for Radiology Report Evaluation (2604.03376 - Bologna et al., 3 Apr 2026) in Abstract