The ChatGPT Health Study Has a Model Problem, and a Timing Problem

A widely covered study published in Nature Medicine this week found that ChatGPT Health under-triaged more than half of simulated medical emergencies. The headlines are alarming. The findings matter. But there is a significant methodological gap in this research that most coverage has ignored entirely: the AI model the researchers actually tested is already outdated.
The study deserves scrutiny not because its concerns are invalid, but because evaluating a fast-moving technology with slow-moving research methods risks producing conclusions that are already stale by the time they reach the public.
What the Study Found
Researchers at the Icahn School of Medicine at Mount Sinai conducted a structured stress test of triage recommendations: 60 clinician-authored vignettes spanning 21 medical specialties, each tested under 16 contextual conditions, for 960 total interactions with ChatGPT Health. The tool under-triaged 52 percent of cases that physicians deemed true emergencies. It also erred in the other direction, over-triaging 35 percent of non-urgent presentations by incorrectly recommending immediate care.
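The factorial design is worth pausing on: every vignette is crossed with every contextual condition, which is how 60 scenarios become 960 interactions. A minimal sketch of that crossing, using placeholder labels rather than the study's actual scenarios or conditions:

```python
# Illustrative sketch of a full factorial design like the study's:
# each clinical vignette is paired with each contextual condition.
# The labels below are placeholders, not the study's real materials.
from itertools import product

vignettes = [f"vignette_{i:02d}" for i in range(60)]    # 60 clinician-authored scenarios
conditions = [f"condition_{j:02d}" for j in range(16)]  # 16 contextual framings

interactions = list(product(vignettes, conditions))     # full factorial crossing
print(len(interactions))  # 60 * 16 = 960 total interactions
```

The design matters because it lets the researchers attribute shifts in triage recommendations to the contextual framing rather than to the underlying vignette.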
These are serious findings. No one should dismiss them. But the question healthcare providers should be asking is: which model produced these results?
The Model Matters More Than the Brand Name
ChatGPT Health launched in January 2026. The study, fast-tracked into the February 23, 2026 online issue of Nature Medicine, is the first independent safety evaluation of the tool. That speed is commendable. But ChatGPT Health at launch was powered by GPT-5 Mini, a smaller, cost-optimized model in OpenAI's lineup. It was not running GPT-5.1 or GPT-5.2, models that have demonstrated substantially improved reasoning, nuance, and clinical performance on benchmarks since their release.
This distinction is not trivial. AI models do not mature through experience the way clinicians do, so the gap between GPT-5 Mini and GPT-5.2 is less like a second-year medical student versus a senior resident and more like a basic four-function calculator versus a scientific graphing one: the Mini handles straightforward calculations well but buckles under multifaceted problems, while the larger model integrates more functions, handles variables with nuance, and delivers more reliable results in scenarios demanding precision, such as parsing subtle clinical cues. Testing the smaller model and drawing sweeping conclusions about "ChatGPT Health" as a product category is like evaluating a hospital's surgical outcomes by observing only its least experienced surgeon.
The Publication Cycle Cannot Keep Up
Here is the deeper structural problem. The study assessed the system at a single point in time. Because AI models are frequently updated, performance may change over time, underscoring the need for independent evaluation. As one researcher noted, "Starting medical training alongside tools that are evolving in real time makes it clear that today's results are not set in stone."
Even with fast-track publication, the traditional academic pipeline introduces weeks to months of delay between data collection and public dissemination. In most fields, that lag is acceptable. In AI, where frontier models improve on a cadence of weeks, it is a fundamental limitation. By the time this study was peer-reviewed, typeset, and published, the model it evaluated may have already been superseded.
This is not a criticism of the researchers. They did rigorous work under real constraints. It is a criticism of how the findings are being interpreted: as a definitive verdict on AI-powered triage rather than a snapshot of one model at one moment.
The Counterarguments Are Real
None of this means the study should be ignored. As Isaac Kohane of Harvard Medical School noted, "LLMs have become patients' first stop for medical advice, but in 2026 they are least safe at the clinical extremes." ECRI, an independent nonprofit patient safety organization that publishes an annual list of top health technology hazards, ranked misuse of AI chatbots in healthcare as the top health technology hazard of 2026, warning these tools "can provide false or misleading information that could result in significant patient harm."
Those concerns are valid regardless of model version. The anchoring bias finding alone, in which third-party symptom minimization shifted triage recommendations with an odds ratio of 11.7, points to architectural vulnerabilities that may persist across model generations.
But valid concerns and valid methodology are two different things. If the goal is to inform policy and practice, the evidence base needs to match the technology it evaluates.
The Findings That Survive the Model Question
Not every result in this study is model-dependent. Some findings point to structural vulnerabilities that better reasoning alone is unlikely to fix.
The anchoring bias result is the clearest example. When family members or friends minimized symptoms in the prompt, triage recommendations shifted dramatically toward less urgent care, with an odds ratio of 11.7. That is not a capability gap. That is a system absorbing social pressure from the input itself. A more powerful model may reason better in isolation. But in real households, someone is almost always saying "I'm sure it's nothing." If the system remains susceptible to that framing, improved benchmarks will not translate to improved safety where it matters most.
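For readers unfamiliar with the statistic, an odds ratio of 11.7 means the odds of under-triage were nearly twelve times higher when symptoms were minimized than when they were stated neutrally. A minimal sketch of how such a figure is computed from a 2x2 table, using hypothetical counts rather than the study's data:

```python
# Illustrative only: these counts are hypothetical, not from the study.
# An odds ratio compares the odds of an outcome (here, under-triage)
# between two conditions (symptoms minimized vs. stated neutrally).

def odds_ratio(exposed_yes, exposed_no, control_yes, control_no):
    """Odds ratio from a 2x2 table: (a/b) / (c/d) = (a*d) / (b*c)."""
    return (exposed_yes * control_no) / (exposed_no * control_yes)

# Hypothetical 2x2 table, 60 vignettes per condition:
#                       under-triaged   triaged appropriately
# symptoms minimized          45                 15
# neutral phrasing            12                 48
or_value = odds_ratio(45, 15, 12, 48)
print(f"odds ratio = {or_value:.1f}")  # prints "odds ratio = 12.0"
```

An odds ratio near 12, as in this made-up table, is in the same territory as the study's reported 11.7: a large effect for what is, on its face, a small change in phrasing.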
The suicide-alert finding raises similar concerns. The system's crisis safeguards triggered more reliably in lower-risk scenarios than when users described specific plans for self-harm. That inversion is not a reasoning failure. It is a calibration problem in the safety layer itself, and it may persist across model generations.
These results deserve at least as much attention as the triage accuracy numbers. They identify failure modes that scaling alone is unlikely to resolve.
What This Means for Healthcare Providers
Healthcare providers should take three things from this moment:
- AI triage tools are not ready to replace clinical judgment. That was true before this study and remains true after it. The tool "performed well in textbook emergencies such as stroke or severe allergic reactions" but "struggled in more nuanced situations where the danger is not immediately obvious." Nuance is where clinical expertise lives.
- Model version matters. When reading any study evaluating AI performance, the first question should be: which model, which version, and when? A study testing GPT-5 Mini tells us very little about GPT-5.2's capabilities. Future research needs to specify and justify model selection the same way drug trials specify dosing.
- The research community needs new frameworks. The Mount Sinai team plans to continue evaluating updated versions of ChatGPT Health and other consumer-facing AI tools. That iterative approach is exactly right. Single-snapshot studies will always lag behind the technology. Continuous evaluation pipelines, potentially with pre-registered protocols and rolling publication, are the only way to keep pace.
The study's findings are a useful signal, not a final answer. Treating them as definitive risks either overstating the danger of current-generation tools or, worse, creating a false sense of security when the next study shows improvement.
The real takeaway is not that AI triage failed. It is that rigorous research and rapid iteration are now on fundamentally different timelines. Until we solve that mismatch, every published evaluation will be a portrait of a system that no longer exists.