Robustness of existing radiology report evaluation approaches across modalities and anatomies

Determine whether large language model–based metrics and fine-tuned small-model evaluators developed primarily for chest X-ray report evaluation are robust when applied to radiology reports from other imaging modalities and anatomical regions.

Background

Prior work on radiology report evaluation has focused largely on chest X-rays, relying on LLM-based metrics and fine-tuned small models. Because medical imaging varies substantially by modality and anatomical region, methods tuned for chest X-rays may not generalize to reports from MRI, CT, ultrasound, and other modalities.

The paper notes that generalization beyond chest X-rays has not been sufficiently established, and it motivates a systematic study on RadEval and RaTE-Eval to assess whether these approaches remain reliable across modalities and anatomies.

References

However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies.

— VERT: Reliable LLM Judges for Radiology Report Evaluation (2604.03376 - Bologna et al., 3 Apr 2026) in Abstract
