AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

Published 11 May 2026 in cs.AI | (2605.10286v1)

Abstract: Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health records data, medical images, radiology reports, and clinical notes. LLM-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single agent and multi-agent systems. Our findings highlight that single agent frameworks outperform naive multi-agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi-agent collaboration to better handle heterogeneous inputs. By open-sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.


Authors (2)

Summary

The paper rigorously benchmarks LLM agents in multimodal clinical prediction, comparing single-agent and multi-agent systems on real ICU datasets.
It demonstrates that unified, single-agent models benefit from cross-modal fusion, while decentralized multi-agent approaches suffer from poor calibration and reduced performance.
Experimental findings show that specialized supervised models currently outperform LLM agents for risk stratification, highlighting the need for improved coordination and uncertainty calibration.

AgentRx: Systematic Benchmarking of LLM Agents for Multimodal Clinical Prediction

Background and Motivation

Clinical decision support systems (CDSS) must synthesize heterogeneous clinical data streams, notably EHR time series, medical images, and unstructured text such as radiology reports and clinical notes. Deep fusion models achieve competitive performance on risk stratification and forecasting, but they are opaque and inflexible when data are complex, incomplete, or vary in which modalities are present. LLMs, especially those adapted for healthcare, bring natural language reasoning and interpretability, yet their efficacy remains inadequately characterized, particularly for multimodal, temporally anchored prediction tasks.

Agent-based architectures, where separate agents process modality-specific data and collaborate (analogous to federated or system-of-experts style inference), could in principle address data fragmentation and privacy issues that inhibit centralization. Yet the interaction of agentic system design choices with prediction accuracy, calibration, and robustness over varying data regimes had not been comprehensively benchmarked prior to this work.

AgentRx Benchmark: Design and Methodology

AgentRx introduces a multilayered benchmarking suite built on real-world ICU data from MIMIC-IV, MIMIC-CXR, and MIMIC-IV-Note, harmonized at the patient-admission level. The clinical endpoints evaluated are in-hospital mortality and prolonged length of stay, both framed as binary prediction tasks, with input features restricted to the first 48 hours of the ICU stay.
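The 48-hour feature window described above can be illustrated with a minimal sketch. It assumes events are stored as (timestamp, variable, value) tuples and that the window is half-open, starting at ICU admission; the function name `first_hours`, the tuple layout, and the boundary convention are illustrative assumptions, not details taken from the benchmark's code.

```python
from datetime import datetime, timedelta

def first_hours(events, icu_intime, hours=48):
    """Keep events whose timestamp falls in [icu_intime, icu_intime + hours).

    `events` is a list of (timestamp, variable, value) tuples (assumed layout).
    """
    cutoff = icu_intime + timedelta(hours=hours)
    return [e for e in events if icu_intime <= e[0] < cutoff]

# Hypothetical ICU stay with four charted heart-rate values.
intime = datetime(2020, 1, 1, 0, 0)
events = [
    (intime - timedelta(hours=1), "heart_rate", 88.0),   # before admission: dropped
    (intime + timedelta(hours=1), "heart_rate", 92.0),   # kept
    (intime + timedelta(hours=47), "heart_rate", 85.0),  # kept
    (intime + timedelta(hours=48), "heart_rate", 80.0),  # at the cutoff: dropped
]
window = first_hours(events, intime)
```

Restricting inputs this way keeps later observations, which could leak the outcome, out of the features.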

Four data modalities are included: Patient Summaries (structured demographics and history), EHR time series (17 clinical variables), CXR images, and radiology reports. Each admission is paired with whichever modalities are available for it, yielding a large cohort with heterogeneous modality coverage.

AgentRx formalizes three primary operating regimes:

Single agent unimodal: An LLM agent receives only Patient Summaries as input.
Single agent multimodal: A generalist agent processes all available modalities jointly, within one prompt context window.
Multi-agent multimodal: Specialized agents each process a single modality; their predictions (binary probabilities) are combined (via averaging, voting, or collaborative debate) to yield the system output.
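The simplest aggregation rules in the multi-agent regime, averaging and voting, can be sketched as follows. This is a minimal illustration that assumes each agent returns one probability for the positive class; the function names, the 0.5 threshold, and the tie-breaking rule (ties resolve to the negative class) are assumptions, not the benchmark's implementation.

```python
from statistics import mean

def average_probs(probs):
    """Mean of the per-agent positive-class probabilities."""
    return mean(probs)

def majority_vote(probs, threshold=0.5):
    """Return (share of agents voting positive, hard label).

    Each agent votes positive when its probability is >= threshold.
    A tie resolves to the negative class (an assumed convention).
    """
    votes = [p >= threshold for p in probs]
    share = sum(votes) / len(votes)
    return share, int(share > 0.5)

# Hypothetical outputs from three modality-specific agents.
agent_probs = {"ehr": 0.62, "cxr": 0.40, "report": 0.55}
avg = average_probs(list(agent_probs.values()))
share, label = majority_vote(list(agent_probs.values()))
```

Voting discards each agent's confidence: the vote share can only take a few discrete values, so it is a coarser probability estimate than the average.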

Comparison baselines include established supervised deep learning architectures (BioBERT for unimodal text, MedPatch for multimodal fusion), alongside a suite of agentic prompting strategies: zero-shot, few-shot, chain-of-thought (CoT), self-consistency, and self-refinement, as well as multi-agent decision aggregation protocols (majority vote, debate, meta-prompting, trajectory-based agents, MDAgents, MedAgents).

For LLM and VLM backbones, both general-domain (Qwen2.5-VL, InternVL2.5) and clinical-specialized (HuatuoGPT-Vision, LLaVA-Med) models are benchmarked at comparable parameter scales (~7–8B).

Key evaluation metrics are AUROC, AUPRC, and expected calibration error (ECE), quantifying discriminative utility, precision-recall tradeoff, and output probability reliability, respectively.
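As a rough illustration of two of these metrics, the sketch below computes AUROC by pairwise comparison and ECE with equal-width bins over the positive-class probability. The binning scheme and the choice of 10 bins are assumptions; the source does not state which ECE variant the benchmark uses.

```python
def auroc(y_true, y_score):
    """Probability that a random positive scores above a random negative.

    Ties count as half a win. Quadratic in sample count; fine for a sketch.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width binned ECE on the positive-class probability.

    For each bin, compare the mean predicted probability with the observed
    fraction of positives, weighted by the bin's share of samples.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        frac_pos = sum(y for y, _ in b) / len(b)
        mean_prob = sum(p for _, p in b) / len(b)
        ece += (len(b) / n) * abs(frac_pos - mean_prob)
    return ece

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
```

A model can rank patients well (high AUROC) while still producing probabilities far from observed event rates (high ECE), which is why the two are reported separately.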

Experimental Results and Analysis

Unimodal (Text) Prediction

Specialized medical LLMs outperform supervised text classifiers when predicting mortality from patient summaries, with HuatuoGPT-Vision achieving the highest AUROC (0.700, few-shot). However, LLM-derived probabilistic outputs are poorly calibrated relative to discriminative models (BioBERT ECE: 0.006; HuatuoGPT-Vision ECE: 0.093+; LLaVA-Med consistently shows ECE > 0.75). For prolonged length of stay, traditional supervised models yield better discrimination and calibration, suggesting that clinical language pretraining is more beneficial for diagnostic/etiological endpoints than for operational forecasting.

Multimodal Prediction

The best specialized multimodal fusion model (MedPatch) achieves substantially higher AUROC on both endpoints (mortality AUROC 0.877), outperforming all agentic LLM configurations. Nevertheless, generalist agents show marked gains in both AUROC and AUPRC as additional modalities are fused, confirming that unified-context LLMs can exploit cross-modal correlations, although they do not match state-of-the-art specialized fusion models.

In contrast, multi-agent protocols generally degrade both discrimination (e.g., Qwen Debate and Meta-Prompt: AUROC 0.631 and 0.599, respectively) and calibration relative to the corresponding single multimodal agent. Only in hybrid protocols with centralized inference over concatenated reasoning (e.g., Traj-CoA) does performance approach the best single agent baseline. Multi-agent consensus protocols (majority vote, debate) exhibit increasing ECE as modality count rises, indicating aggregation fails to exploit confidence gains from composite evidence.

Ablation and Failure Mode Analysis

Modality addition ablation reveals that single agent systems benefit monotonically in both AUROC and calibration as more modalities are included; multi-agent systems, by contrast, suffer degraded calibration, indicating loss of reliable uncertainty quantification with decentralized aggregation. Debate-based architectures tend toward rapid, uncritical consensus (echo chamber or sycophancy), particularly in agent ensembles lacking divergent initialization or without robust disagreement resolution, further depressing AUROC.
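One simple way to observe the rapid consensus described above is to track how far apart the agents' probabilities are after each debate round. The sketch below is a hypothetical diagnostic, not a metric reported in the paper; the function names and the example numbers are illustrative assumptions.

```python
def spread(probs):
    """Gap between the most and least confident agent in one round."""
    return max(probs) - min(probs)

def spread_by_round(rounds):
    """Per-round spread; `rounds` is a list of per-agent probability lists."""
    return [spread(r) for r in rounds]

# Hypothetical three-agent debate that collapses to agreement in two rounds.
debate = [
    [0.20, 0.70, 0.50],  # initial, independent predictions
    [0.45, 0.55, 0.50],  # after one exchange
    [0.50, 0.50, 0.50],  # full consensus
]
trajectory = spread_by_round(debate)
```

A spread that falls to near zero after a single exchange, without a matching gain in accuracy, is consistent with agents deferring to each other rather than weighing the evidence.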

Further tests on note-based disease detection reveal that with free-text inputs, single agent LLMs and multi-agent systems can, in some cases, surpass specialized supervised performance, highlighting task–modality specificity in the agentic performance gap.

Theoretical and Practical Implications

The benchmark isolates a key deficiency in current multi-agent LLM paradigms for clinical prediction: lack of robust cross-modal context integration and poor probabilistic calibration when aggregating independent unimodal inference. This sharply limits the translation of agent-based frameworks to high-stakes, heterogeneous real-world clinical settings, particularly for temporal risk forecasting rather than semantic understanding.

The results support several implications:

Interpretability alone is insufficient: LLM agents can offer semantic justifications, but without calibrated uncertainty and competitive accuracy, any clinical deployment is questionable.
Modality-specific strengths: LLMs’ effectiveness is strongly dependent on the nature of the data—free-text scenarios exhibit different dynamics compared to structured or temporal numerical data.
Multi-agent designs require nontrivial coordination: Naive aggregation (vote/average) and informal debate architectures lack the information coupling necessary for robust multimodal prediction; hybrid, centralized architectures fare better, but still trail tailored fusion models.
Calibration as a critical bottleneck: As calibration does not automatically improve with data quantity in agent systems, explicit design for reliable uncertainty aggregation is needed.

Future work must address data serialization bottlenecks (to enable processing of higher-frequency clinical time series), explore communication through shared latent representations rather than purely textual agent-to-agent messaging, and advance architectures that combine LLM interpretability with frozen, high-precision supervised encoders.

Conclusion

AgentRx provides a rigorous, open-source benchmark for LLM-based agentic systems in multimodal clinical prediction, quantifying the limitations and conditional strengths of single-agent and multi-agent LLMs across data regimes and tasks. The empirical findings underscore that current multi-agent approaches are suboptimal for multimodal, high-stakes clinical risk forecasting, primarily due to weak calibration and weak modality fusion. The benchmark delineates clear challenges and directions for future research: robust fusion, calibration, and domain-adapted agent collaboration for agentic AI in healthcare.

Reference: "AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks" (2605.10286)
