As artificial intelligence continues to reshape healthcare, researchers are turning their attention to how AI-driven tools like ChatGPT might assist clinicians in diagnosing and managing complex conditions.
A new study published in Scientific Reports evaluates ChatGPT-4o’s performance in the differential diagnosis and management of vertigo-related disorders—an area that demands nuanced clinical judgment and experience.
Evaluating ChatGPT-4o’s Performance on Clinical Questions
The study, led by Xu Liu and colleagues at Fudan University, posed 20 vertigo-related clinical questions to ChatGPT-4o. The questions covered four categories—diagnosis, treatment, daily lifestyle advice, and prognosis—and were assessed by three otologists using a 5-point Likert scale. Evaluation metrics included accuracy, comprehensiveness, clarity, practicality, and credibility.
ChatGPT-4o performed well overall, earning high marks across all categories. Credibility received the highest average score (4.78 out of 5), followed closely by clarity (4.73), practicality (4.68), accuracy (4.65), and comprehensiveness (4.55). Responses to daily lifestyle advice questions scored a perfect 5.0 across all dimensions, while scores for diagnosis-related responses ranged from 4.43 to 4.86 and those for treatment-related responses from 4.56 to 4.78.
Statistical analysis showed significant differences among the five scoring dimensions (F = 2.682, p = 0.038), suggesting that while ChatGPT-4o is generally strong, the quality of its responses is uneven across dimensions: aspects such as depth (comprehensiveness) and practical applicability scored lower than credibility and clarity.
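For readers unfamiliar with the F statistic reported above, the sketch below shows how such a value is computed, assuming the comparison was a one-way analysis of variance (the article does not name the test). The function name `one_way_anova_f` and the ratings in `toy_ratings` are hypothetical illustrations, not the study's code or data.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of scores per group
    (here, one list per scoring dimension).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical 5-point Likert ratings, NOT taken from the study.
toy_ratings = {
    "credibility": [5, 5, 4, 5],
    "clarity": [5, 4, 4, 5],
    "comprehensiveness": [4, 4, 5, 4],
}

f_stat, df_b, df_w = one_way_anova_f(list(toy_ratings.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

A larger F means the dimension averages differ more than the scatter within each dimension would explain. Turning F into a p-value additionally requires the F distribution, which a statistics package would normally supply.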
How Well Does ChatGPT Diagnose Vertigo?
To further test ChatGPT-4o’s clinical utility, researchers provided the AI model and two otologists with 15 anonymized outpatient vertigo cases. The cases included common and complex conditions such as benign paroxysmal positional vertigo (BPPV), vestibular migraine (VM), Ménière’s disease (MD), sudden sensorineural hearing loss (SSNHL), and vestibular schwannoma (VS).
ChatGPT-4o achieved an overall diagnostic accuracy of 67%, compared to 80% for the otologist with one year of experience and 93% for the more experienced otologist. While the model correctly diagnosed all BPPV and VS cases, its performance fell short on more challenging cases like vestibular migraine and SSNHL. For example, ChatGPT failed to correctly identify two of three SSNHL cases and struggled with the nuances of VM, which can often mimic or overlap with other vestibular conditions.
These findings suggest that while the model has strengths in common vestibular diagnoses, it does not yet match the diagnostic depth of an experienced clinician.
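The percentages above follow from simple arithmetic on the 15 cases. The short sketch below reproduces them. The correct-case counts (10, 12, and 14) are not stated in this article; they are inferred here as the only whole numbers out of 15 that round to 67%, 80%, and 93%.

```python
TOTAL_CASES = 15  # anonymized outpatient vertigo cases, per the article

# Assumption: counts inferred from the reported percentages.
correct_cases = {
    "ChatGPT-4o": 10,
    "Otologist (one year of experience)": 12,
    "Otologist (more experienced)": 14,
}


def accuracy_percent(n_correct, n_total):
    """Diagnostic accuracy as a whole-number percentage."""
    return round(100 * n_correct / n_total)


for rater, n_correct in correct_cases.items():
    pct = accuracy_percent(n_correct, TOTAL_CASES)
    print(f"{rater}: {n_correct}/{TOTAL_CASES} = {pct}%")
```

With so few cases, each diagnosis shifts accuracy by almost 7 percentage points, so the gap between the model and the less experienced otologist amounts to two cases.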
Readability: A Barrier for Patient Education?
The study also assessed the readability of ChatGPT’s responses using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKRGL) formulas. Diagnosis-related answers were found to be the most difficult to read, averaging a Flesch score of 51.9 and a grade level of 9.8, roughly a high school reading level.
In contrast, treatment and daily lifestyle advice were more accessible, with Flesch scores near 60 and grade levels around 7.5.
While the more demanding diagnostic answers may not pose a problem for trained professionals, they raise questions about accessibility for patients with lower health literacy. Improving readability will be critical if AI is to serve not only as a support tool for clinicians but also as a reliable source of patient education.
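For context on the two readability scores, the sketch below implements the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. It takes word, sentence, and syllable counts as inputs rather than raw text, because syllable counting depends on the tool used. The function names and the example counts are hypothetical, and the study's own tooling is not described in this article.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
fre = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
print(f"Reading ease: {fre:.1f}, grade level: {grade:.1f}")
```

Both formulas depend only on average sentence length and average syllables per word, so shorter sentences and plainer vocabulary are the two levers for making AI-generated answers easier to read.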

Clinical Implications and Cautions
The authors emphasize that while ChatGPT-4o holds promise, it should be viewed as a supplementary tool—not a replacement for clinical expertise. Its strengths lie in consistency, breadth of knowledge, and availability, but gaps remain in nuanced clinical reasoning, interpretation of imaging, and patient-specific contextualization.
Importantly, the study underscores that AI-generated information can appear highly credible even when incorrect. This makes human oversight essential, particularly when patients might consult these tools independently before seeking professional care.
From an educational perspective, ChatGPT-4o may help standardize knowledge dissemination and support early-career clinicians. However, the authors caution that integration into clinical workflows must be gradual, with continuous quality assessment and regulatory oversight.
Next Steps and Limitations
The researchers note several limitations in their methodology. The sample size was relatively small (20 questions, 15 cases), and the evaluation did not use standardized tools such as QAMAI or SMART prompts. Future studies should include a larger set of clinical scenarios and consider variations in patient language and symptom descriptions. Moreover, collaborations between AI developers and healthcare professionals will be crucial in refining chatbot training data, enhancing diagnostic accuracy, and improving readability.
Despite these limitations, the findings suggest a clear direction: AI chatbots like ChatGPT-4o are not ready to replace human clinicians, but they may serve as valuable allies—particularly in triage, education, and preliminary decision support.
Citation:
Liu, X., Shi, S., Zhang, X., Gao, Q., & Wang, W. (2025). The role of ChatGPT-4o in differential diagnosis and management of vertigo-related disorders. Scientific Reports, 15, Article 18688. https://doi.org/10.1038/s41598-025-96309-8







