The Illusion of Clinical Readiness
A new study from Microsoft Research and Scripps Research, published in Nature Medicine, systematically applied adversarial stress tests to frontier AI models, including GPT-5 and Gemini 2.5. Although these models achieve near-expert scores on standard medical benchmarks, the research reveals severe robustness failures that call into question their readiness for real-world clinical deployment. The team designed six stress tests to probe beyond surface-level accuracy.
Six Failure Modes Exposed
Key findings include visual shortcutting: GPT-5 scored 67.41% on NEJM Image Challenge questions even after the images were removed entirely. On the 197 questions that require image interpretation, it still scored 41.32% against a random-chance baseline of 20%, indicating that it relied on memorized text patterns rather than genuine visual understanding.
Option order dependency was equally striking. Simply shuffling the multiple-choice answers caused GPT-4o's accuracy to crash from over 70% to 16.35%, showing that models had learned position-based heuristics.
Image substitution blindness appeared when a diagnostic image was replaced with one matching a different diagnosis while the question text remained identical: GPT-5's accuracy dropped from 84% to 35%.
Reasoning hallucination produced plausible but incorrect justifications, in three patterns: a correct answer supported by fabricated reasoning, compounding errors that stem from misidentified features, and vague, non-diagnostic output.
Finally, the study identified benchmark quality issues. When three physicians rated nine common medical benchmarks across ten clinical dimensions, the ratings revealed massive variation in complexity.
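
The option order test described above can be illustrated with a minimal Python sketch. It is not the study's code: the model interface (a function returning the index of the chosen option), the toy models, and the sample items are all assumptions made for illustration. The idea is to score the same questions before and after shuffling the answer options and compare the two accuracies.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (question text, answer options, index of the correct option)
Item = Tuple[str, Sequence[str], int]
# Hypothetical model interface: returns the index of the option it picks.
Model = Callable[[str, Sequence[str]], int]


def shuffle_options(
    options: Sequence[str], answer_index: int, rng: random.Random
) -> Tuple[List[str], int]:
    """Permute the options and return them with the answer's new index."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # order[j] is the original index now sitting at position j.
    return shuffled, order.index(answer_index)


def accuracy(
    model: Model, items: Sequence[Item], shuffle: bool = False, seed: int = 0
) -> float:
    """Fraction of items answered correctly, optionally after shuffling."""
    rng = random.Random(seed)
    correct = 0
    for question, options, answer_index in items:
        if shuffle:
            options, answer_index = shuffle_options(options, answer_index, rng)
        correct += int(model(question, options) == answer_index)
    return correct / len(items)


def position_biased(question: str, options: Sequence[str]) -> int:
    """Toy model that always picks the first option (a position heuristic)."""
    return 0


def content_based(question: str, options: Sequence[str]) -> int:
    """Toy model that finds the right option by its content."""
    return list(options).index("correct")


# Illustrative items: the correct option always sits in the first position.
ITEMS: List[Item] = [
    (f"Q{i}", ["correct", "w1", "w2", "w3", "w4"], 0) for i in range(100)
]

if __name__ == "__main__":
    for name, model in [("position_biased", position_biased),
                        ("content_based", content_based)]:
        before = accuracy(model, ITEMS)
        after = accuracy(model, ITEMS, shuffle=True)
        print(f"{name}: original={before:.2f} shuffled={after:.2f}")
```

A model that answers from content should score about the same in both conditions, while one that leans on option position loses accuracy once the order changes. A fixed seed keeps the perturbation reproducible across runs.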
Recommendations for Safer Medical AI
The researchers recommend that medical evaluation datasets include detailed metadata about the skills they test and their limitations. Model evaluation should break down results by clinical dimensions such as reasoning complexity, visual dependency, and uncertainty handling, rather than reporting a single accuracy score. Stress tests including input perturbation, modality conflict, and reasoning consistency should become mandatory in pre-release audits for medical AI.
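
As a rough sketch of what per-dimension reporting could look like, the Python snippet below groups per-question results by clinical dimension instead of collapsing them into one score. The record layout (a "dimensions" list and a "correct" flag) and the tag names are hypothetical; the study does not prescribe a schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by_dimension(results: Iterable[Mapping]) -> Dict[str, float]:
    """Compute accuracy separately for each dimension tag.

    Each result is assumed to look like
    {"dimensions": ["visual_dependency", ...], "correct": True}.
    A question tagged with several dimensions counts toward each of them.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for result in results:
        for dim in result["dimensions"]:
            totals[dim] += 1
            hits[dim] += int(result["correct"])
    return {dim: hits[dim] / totals[dim] for dim in totals}


# Hypothetical per-question results, for illustration only.
SAMPLE_RESULTS = [
    {"dimensions": ["visual_dependency"], "correct": True},
    {"dimensions": ["visual_dependency", "reasoning_complexity"], "correct": False},
    {"dimensions": ["reasoning_complexity"], "correct": True},
    {"dimensions": ["uncertainty_handling"], "correct": True},
]

if __name__ == "__main__":
    for dim, acc in sorted(accuracy_by_dimension(SAMPLE_RESULTS).items()):
        print(f"{dim}: {acc:.2f}")
```

Reporting the breakdown alongside the overall score makes it visible when a strong aggregate hides a weak dimension, which a single accuracy number cannot show.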
