Evaluation Format, Not Model Capability, Drives Triage Failure in Consumer Health AI

This presentation reveals a critical methodological flaw in recent assessments of AI health triage systems. By systematically comparing exam-style forced-choice evaluation protocols against naturalistic patient-style interactions, the researchers demonstrate that reported emergency under-triage failures stem from evaluation design artifacts rather than model incapacity. Using over 2,000 inferences across five frontier language models and 17 clinical scenarios, the study shows that removing artificial constraints like forced discretization and knowledge suppression restores clinically appropriate recommendations, with some scenarios improving from 0% to 100% accuracy. The findings challenge high-profile claims about AI safety in consumer health applications and call for fundamental reforms in how we benchmark medical AI systems.
Script
A recent study claimed that ChatGPT Health under-triages emergencies over half the time, failing to recommend emergency care when patients desperately need it. But what if the problem wasn't the AI at all, but how we tested it?
The researchers identified three critical constraints in existing evaluations. Models were forced into exam-style multiple choice, prohibited from asking follow-up questions, and instructed to suppress their medical knowledge. None of these reflect how patients actually use health AI.
So they designed a head-to-head comparison.
Using 17 clinical scenarios and over 2,000 model inferences across five frontier language models, they tested both approaches. The constrained protocol mimicked traditional medical exams. The naturalistic protocol let models interact as they would with real patients.
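As a minimal sketch of how such a head-to-head comparison could be scored, the Python below computes an under-triage rate per protocol. The record layout, the model and scenario names, and the `under_triage_rate` helper are all hypothetical, and the sample values are placeholders rather than data from the study.

```python
from collections import defaultdict

# Hypothetical records: (model, scenario, protocol, recommended_emergency_care).
# Every scenario here is assumed to be a true emergency.
# Values are illustrative placeholders, NOT results from the study.
records = [
    ("model_a", "asthma_exacerbation", "constrained", False),
    ("model_a", "asthma_exacerbation", "naturalistic", True),
    ("model_b", "asthma_exacerbation", "constrained", False),
    ("model_b", "asthma_exacerbation", "naturalistic", True),
]


def under_triage_rate(rows):
    """Fraction of emergency cases, per protocol, where emergency care was not recommended."""
    missed = defaultdict(int)
    total = defaultdict(int)
    for _model, _scenario, protocol, recommended in rows:
        total[protocol] += 1
        if not recommended:
            missed[protocol] += 1
    return {protocol: missed[protocol] / total[protocol] for protocol in total}


rates = under_triage_rate(records)
for protocol, rate in sorted(rates.items()):
    print(f"{protocol}: under-triage rate = {rate:.0%}")
```

Grouping by protocol while holding the model and scenario fixed is what lets a difference in rates be attributed to the evaluation format rather than to the model.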
The transformation was striking. The same models that appeared to fail catastrophically under exam conditions provided correct emergency recommendations when allowed to function naturally. In asthma exacerbation scenarios, three models went from zero percent accuracy to perfect performance simply by removing forced-choice constraints.
This isn't just a technical quibble. It means widely cited safety failures may tell us more about our testing methods than about AI capabilities. If we evaluate health AI using protocols that suppress the very features that make it useful, we risk both rejecting safe systems and deploying unsafe ones whose safety profile we have mischaracterized.
The real danger in health AI evaluation may not be the models themselves, but the invisible scaffolding of our tests.