- The paper introduces a hybrid framework that integrates deep model outputs with LLM in-context reasoning to mitigate the LLM bias toward repeating historical diagnoses in diagnosis prediction.
- EviCare employs candidate selection, evidential prioritization, and relational evidence mining to achieve significant improvements in novel diagnosis prediction metrics.
- The framework enhances clinical interpretability by generating explicit reasoning chains and supports plug-and-play integration with various deep EHR models.
EviCare: Deep Model-Guided Evidence Integration for In-Context Diagnosis Prediction
Motivation and Limitations of LLM-Only Approaches
Electronic Health Records (EHRs) serve as high-dimensional, longitudinal repositories of clinical events critical for predictive modeling in healthcare. Recent adoption of LLMs as diagnostic assistants exploits their biomedical knowledge and reasoning capabilities for clinical inference. However, LLMs exhibit a strong bias toward repeating historical diagnoses: they frequently overpredict previously observed conditions and fail to surface novel, clinically emergent diagnoses that are essential for early intervention.
Empirical analysis shows LLMs’ next-token prediction objective fundamentally biases them to reiterate familiar codes from the patient’s history, resulting in high recall/precision for persistent or chronic diseases but low capacity for the detection of new comorbid patterns (Figure 1).
Figure 1: Comparative prediction behaviors: LLMs default to historical diagnoses, deep models surface novel codes without rationale, and EviCare integrates both for robust, explainable output.
Deep Model-Guided In-Context Reasoning: The EviCare Framework
Framework Overview
EviCare addresses these deficiencies by fusing the strengths of deep EHR models and LLMs in a unified, interpretable in-context learning framework (Figure 2). Rather than prompting LLMs with unstructured EHR records, EviCare extracts structured evidential signals from a deep model and incorporates them into a multi-component prompt. The framework consists of four orthogonal modules:
- Deep Model Inference for Candidate Selection: A deep model (e.g., BoxLM or RETAIN) is used to propose a top-K candidate CCS code set, effectively grounding the candidate search space with structured, high-signal outputs.
- Evidential Prioritization: Candidate and historical diagnoses are prioritized according to their logit contributions to the deep model's predictions, producing an interpretable, hierarchy-aware clinical summary.
- Relational Evidence Construction: Symbolic evidence chains linking historical diagnoses to novel candidates are generated via mining of CCS co-occurrence matrices and validated ontologies.
- LLM-Based Reasoning via Structured Prompting: The LLM receives a prompt encoding all of the above evidence and is instructed to generate ranked predictions with explicit justification spanning both recurrent and novel conditions.
Figure 2: Architecture of EviCare, outlining deep model inference, evidential prioritization, relational evidence extraction, and LLM prompt construction for diagnosis prediction.
This architectural modularity allows plug-and-play integration with any deep model that produces probabilistic/logit outputs, thus supporting adaptation to varying clinical coding systems, backbone architectures, and deployment constraints.
Methodological Detail
Candidate Selection Mechanism
Motivated by information-theoretic results, EviCare uses the ranked logit outputs of deep EHR prediction models to construct candidate sets. This reduces conditional entropy and acts as a denoising prior, confining the LLM’s generative space to high-likelihood diagnoses without inducing excessive undercoverage. The paper’s theoretical analysis states that top-K logit selection is Bayes-optimal under a budget of K candidates, and that rank-based logit weighting does not increase posterior risk relative to flat prompting, which supports robustness and clinical precision.
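The selection step can be illustrated with a minimal sketch. It assumes the deep model exposes one logit per diagnosis code; the function names, the reciprocal-rank weighting, and the toy code labels are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict, List, Tuple


def select_candidates(logits: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (code, logit) pairs, best first."""
    if k <= 0:
        return []
    # Sort by logit descending; break ties by code so the output is deterministic.
    ranked = sorted(logits.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


def rank_weights(candidates: List[Tuple[str, float]]) -> Dict[str, float]:
    """Map each candidate to a reciprocal-rank weight: 1, 1/2, 1/3, ...

    Reciprocal rank is an assumed weighting chosen for illustration only.
    """
    return {code: 1.0 / (rank + 1) for rank, (code, _) in enumerate(candidates)}


# Toy logits keyed by hypothetical CCS-style labels.
toy_logits = {"CCS_98": 2.1, "CCS_49": 1.7, "CCS_108": 0.4, "CCS_53": -0.3}
top2 = select_candidates(toy_logits, k=2)
weights = rank_weights(top2)
```

Any monotone function of rank could replace the reciprocal weighting; the point is only that the LLM sees an ordered, bounded candidate list rather than the full code vocabulary.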
Prioritization and Relational Evidence
Historical diagnoses are prioritized by marginal logit contribution to the deep model’s top candidates, propagating importance through both CCS and ICD ontologies. This transforms unordered, sparse EHR data into interpretable, relevance-ranked clinical narratives amenable to LLM reasoning.
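One way to read "marginal logit contribution" is as a leave-one-out attribution: remove a historical code, re-score, and measure how much the candidate logits drop. The sketch below follows that reading as an assumption; the scorer is a hypothetical additive stand-in for a deep model, the code labels are toy values, and ontology propagation is omitted.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A scorer maps a patient's historical codes to one logit per candidate code.
Scorer = Callable[[Sequence[str]], Dict[str, float]]


def prioritize_history(
    history: Sequence[str], candidates: Sequence[str], score_fn: Scorer
) -> List[Tuple[str, float]]:
    """Rank historical codes by how much removing each one lowers candidate logits."""
    full = score_fn(history)
    contributions: Dict[str, float] = {}
    for code in dict.fromkeys(history):  # de-duplicate while keeping order
        ablated = score_fn([h for h in history if h != code])
        contributions[code] = sum(full[c] - ablated[c] for c in candidates)
    # Largest contribution first; ties broken by code for determinism.
    return sorted(contributions.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical additive model: each (historical code, candidate) pair adds a
# fixed amount to that candidate's logit. A real deep model would replace this.
TOY_WEIGHTS = {
    ("HTN", "CKD"): 1.5,
    ("DM", "CKD"): 2.0,
    ("DM", "CHF"): 0.5,
    ("GERD", "CHF"): 0.25,
}


def toy_scorer(history: Sequence[str]) -> Dict[str, float]:
    scores = {"CKD": 0.0, "CHF": 0.0}
    for (hist_code, cand), weight in TOY_WEIGHTS.items():
        if hist_code in history:
            scores[cand] += weight
    return scores


ranking = prioritize_history(["HTN", "DM", "GERD"], ["CKD", "CHF"], toy_scorer)
```

Leave-one-out requires one extra forward pass per historical code; gradient-based or attention-based attributions would be cheaper alternatives, depending on the backbone.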
To surface and justify novel predictions, EviCare constructs explicit relational “support edges” by mining CCS-CCS co-occurrence statistics and ontology-based hierarchical links. For each candidate novel diagnosis, the supportive historical CCS code with maximal co-occurrence is identified and encoded, enabling the LLM to explicate latent statistical dependencies otherwise hidden in black-box deep models.
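The co-occurrence part of this step can be sketched as follows. The sketch assumes co-occurrence is counted within a single visit and covers only the statistical signal; the ontology-based links are left out, and all function names and toy codes are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Sequence, Tuple


def cooccurrence_counts(visits: Iterable[Sequence[str]]) -> Counter:
    """Count how often each unordered pair of codes appears in the same visit."""
    counts: Counter = Counter()
    for visit in visits:
        # Sorting makes (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(visit)), 2):
            counts[pair] += 1
    return counts


def support_edges(
    history: Sequence[str], novel_candidates: Sequence[str], counts: Counter
) -> Dict[str, Tuple[str, int]]:
    """For each novel candidate, pick the historical code it co-occurs with most."""
    edges: Dict[str, Tuple[str, int]] = {}
    for cand in novel_candidates:
        best_code, best_count = None, 0
        for hist_code in history:
            pair = tuple(sorted((hist_code, cand)))
            if counts[pair] > best_count:
                best_code, best_count = hist_code, counts[pair]
        if best_code is not None:  # skip candidates with no observed support
            edges[cand] = (best_code, best_count)
    return edges


# Toy training visits with hypothetical code labels.
toy_visits = [
    ["DM", "CKD"],
    ["DM", "CKD", "HTN"],
    ["HTN", "GERD"],
    ["DM", "HTN"],
    ["GERD"],
]
counts = cooccurrence_counts(toy_visits)
edges = support_edges(["DM", "HTN"], ["CKD", "CHF"], counts)
```

Raw counts favor frequent codes; a normalized statistic such as conditional probability or pointwise mutual information may be a better choice in practice.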
Evidence-Driven Prompting for LLMs
All evidence signals are composed into a structured, machine-interpretable prompt. The prompt includes prioritized diagnosis history, relational rationale for each novel candidate, and the candidate code set. This framework supports both “overall” and “novel” diagnosis tasks; the latter explicitly masks historical codes out of the candidate pool. Notably, EviCare does not require LLM retraining or fine-tuning, leveraging only in-context adaptation.
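A minimal sketch of how such a prompt might be assembled is shown below. The section titles, wording, and function signature are assumptions made for illustration; the paper's actual prompt template is not reproduced here. The "novel" branch shows the masking of historical codes from the candidate pool.

```python
from typing import Dict, List, Sequence, Tuple


def build_prompt(
    prioritized_history: Sequence[Tuple[str, float]],
    candidates: Sequence[str],
    support: Dict[str, str],
    task: str = "overall",
) -> str:
    """Compose history, candidates, and relational evidence into one prompt string."""
    if task not in ("overall", "novel"):
        raise ValueError("task must be 'overall' or 'novel'")
    seen = {code for code, _ in prioritized_history}
    if task == "novel":
        # Mask every code already present in the patient's history.
        candidates = [c for c in candidates if c not in seen]

    lines: List[str] = ["Prioritized history (most influential first):"]
    for i, (code, score) in enumerate(prioritized_history, start=1):
        lines.append(f"{i}. {code} (contribution {score:.2f})")
    lines.append("Candidate diagnoses:")
    lines += [f"- {c}" for c in candidates]
    lines.append("Relational evidence:")
    lines += [
        f"- {c} is supported by historical {support[c]}"
        for c in candidates
        if c in support
    ]
    lines.append("Instruction:")
    lines.append("Rank the candidates and justify each choice using the evidence above.")
    return "\n".join(lines)


# Toy inputs with hypothetical code labels.
history = [("DM", 2.5), ("HTN", 1.5)]
pool = ["DM", "CKD", "CHF"]
evidence = {"CKD": "DM"}
overall_prompt = build_prompt(history, pool, evidence, task="overall")
novel_prompt = build_prompt(history, pool, evidence, task="novel")
```

Because the prompt is plain text built from model outputs, swapping the deep model or the LLM changes only the inputs to this function, not the function itself.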
Experimental Evaluation
Comparative and Ablation Studies
EviCare was benchmarked on the full MIMIC-III and MIMIC-IV datasets on two main tasks: overall diagnosis prediction and the more challenging “novel” diagnosis prediction, which filters out any previously observed patient diagnoses. Metrics include P@10, P@20, and the corresponding code/visit-level accuracy.
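The summary does not spell out the metric definitions, so the sketch below follows one common convention in the diagnosis prediction literature as an assumption: visit-level precision divides hits by min(k, number of true codes), and code-level accuracy divides total hits by total true codes. The function names are hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], truth: Set[str], k: int) -> float:
    """Visit-level P@k: hits in the top k, divided by min(k, number of true codes)."""
    if not truth or k <= 0:
        return 0.0
    hits = len(set(ranked[:k]) & truth)
    return hits / min(k, len(truth))


def accuracy_at_k(
    ranked_lists: Sequence[Sequence[str]], truth_sets: Sequence[Set[str]], k: int
) -> float:
    """Code-level Acc@k: total hits over all visits, divided by total true codes."""
    hits = sum(len(set(r[:k]) & t) for r, t in zip(ranked_lists, truth_sets))
    total = sum(len(t) for t in truth_sets)
    return hits / total if total else 0.0


def novel_targets(truth: Set[str], history: Set[str]) -> Set[str]:
    """Keep only true codes the patient has not been diagnosed with before."""
    return truth - history


# Toy example with two visits and placeholder codes.
ranked_lists = [["A", "B", "C", "D"], ["X", "Y"]]
truth_sets = [{"A", "C", "E"}, {"Y"}]
p_at_2 = precision_at_k(ranked_lists[0], truth_sets[0], k=2)
acc_at_2 = accuracy_at_k(ranked_lists, truth_sets, k=2)
novel = novel_targets(truth_sets[0], history={"A"})
```

For the novel task, `novel_targets` would be applied to each visit's true codes before scoring, so that repeated diagnoses cannot inflate the metric.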
With a BoxLM backbone, EviCare attained a P@10 of 24.41 and an Acc@10 of 22.05 on novel diagnosis prediction, outperforming the BoxLM deep model alone by a substantial margin. The average relative improvement in precision and accuracy metrics exceeded 20.65% across tasks and reached up to 30.97% for novel conditions.
Ablation analysis (removing candidate selection, prioritization, and relational components) showed the most dramatic drop in performance upon removing relational evidence for novel codes, confirming the necessity of symbolic, model-extracted relational scaffolds (Figure 3).
Figure 3: P@10 improvement on novel (left) and overall (right) diagnosis with varying candidate set sizes, highlighting importance of evidence pruning.
Performance scaled with candidate set size, saturating at moderate K (e.g., K=50), and generalized well across different LLM backbones (Qwen3 and LLaMA3.1), confirming the decoupled design principles.
Figure 4: P@10 across varying training data sizes, demonstrating EviCare's robustness under low-resource supervision.
Interpretable Reasoning and Clinical Utility
EviCare’s outputs include explicit reasoning chains that link deep model signals to clinical interpretation, as illustrated in the case studies. In contrast to existing chain-of-thought and self-consistency prompting, where LLMs rely solely on textual heuristics, EviCare’s prompts incorporate statistical, structured, and ontological relationships, yielding patient-specific and clinically verifiable rationales. Comparative analysis reported substantially higher correctness and interpretability scores.
Implications and Future Directions
Pragmatically, EviCare bridges the black-box numerical output of EHR-trained deep models and the generative reasoning capacity of LLMs, closing the longstanding interpretability gap in clinical AI. The framework decouples model selection from reasoning strategy, allowing institutions to integrate proprietary or locally validated deep models without retraining those models or updating LLM parameters.
Theoretically, EviCare validates the utility of hybrid, evidence-driven in-context learning for sequential clinical decision-making, and suggests a broader avenue for LLM application in other structured sequence tasks where compositional generalization and interpretability are paramount.
Future work may extend EviCare to downstream clinical scenarios such as treatment recommendation, outcome risk stratification, or temporal event prediction, and may benefit from further optimization of the candidate selection and relational evidence mining components.
Conclusion
EviCare introduces a hybrid framework for medical diagnosis prediction that uses deep model-guided evidence to scaffold and constrain LLM in-context reasoning. The architecture offers substantial numerical improvements over LLM-only and deep-only paradigms, particularly in the detection of novel diagnoses, an area of critical clinical importance. The proposed evidence-driven prompting paradigm achieves generalization, scalability, and interpretability, establishing a foundation for future intelligent clinical decision support systems (2604.10455).