ChatGPT and LLMs in Medicine: Insights and Lessons Learned
- Ozzie Paez
- Jan 20
- 2 min read
Updated: Feb 4
My first series of posts in 2025 will focus on practical lessons learned and insights gained while evaluating ChatGPT and LLMs in medicine. It should be helpful to clinicians eager to exploit these remarkable technologies safely and effectively to deliver more compelling patient value. A separate series will be aimed at patients who use ChatGPT-4 as a trusted, empathetic, and convenient source of medical advice, a controversial practice that continues to spark heated debate within the medical community.
Our work before ChatGPT-4’s release centered on advancing physiological sensor technologies used in clinical monitors and consumer devices like the iPhone. In this context, ChatGPT-4’s ability to quickly process, interpret, and contextually explain monitoring data felt game-changing. Still, while personally impressed, I had learned from experience that medicine and healthcare are demanding environments where many “transformative” technologies have failed. So we collaborated with clinicians and technologists to develop new, AI-specific evaluation and testing strategies; I will explain why this was necessary in a later post. Our objective was to assess how well ChatGPT-4 could safely and effectively serve doctors and patients.
The results after more than a year are decidedly mixed. They suggest that ChatGPT-4 is neither inherently safe and effective nor catastrophically risky. Our evaluation concluded that it is most likely to misinform poorly trained users without clinical experience, a group that includes patients and some clinicians. Patients could benefit from custom GPTs designed to improve their queries and ChatGPT-4’s responses. I also noted that the applicability and quality of ChatGPT-4’s answers improved when it was given quantitative monitoring data and clinical test results and instructed to consider them.
Finally, the results provided convincing evidence that ChatGPT and LLM testing must be ongoing, because responses can vary widely after updates to the underlying models and training data. For example, I’ve noticed that the latest version of ChatGPT-4 (4o) performs better in some applications and underperforms in others, including editing technical writing. Our continuing evaluations will expand in 2025 to include comparisons with Google’s Gemini and their implications for different specialties, including cardiology, orthopedics, and primary care.
Please reach out with your thoughts or questions. I can be reached directly at ozzie@oprhealth.com, 303-332-5363.