Artificial intelligence tools like ChatGPT have lately been hailed for their potential to lighten health workers' load by eventually triaging patients, gathering their medical histories, or even offering early diagnoses.
But new research suggests these tools have a long way to go before they can replicate real-world interactions between doctors and patients.
A team from Harvard Medical School and Stanford University put four large language models to the test using a framework called CRAFT-MD, or Conversational Reasoning Assessment Framework for Testing in Medicine.
The models performed well on medical exam questions, like the ones you take to become a doctor, but struggled when it came to conversations that resembled actual doctor-patient interactions. The study highlighted the gap between artificial intelligence's performance on standardized tests and its ability to handle the dynamic, often unpredictable nature of medical conversations.
In the real world, doctors must ask the right questions at the right time, connect scattered pieces of information, and reason through complex symptoms.
In contrast, when the AI tools were asked to gather a patient's history themselves or reason through open-ended answers, their diagnostic accuracy dropped significantly.
Incorporating both AI agents and human experts into the evaluation process could help move the needle toward a tool that is reliably accurate and helpful, since AI agents can work through thousands of simulated conversations far more quickly than humans can, leaving experts to check the results.
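To make that idea concrete, here is a minimal Python sketch of what such an evaluation loop might look like. It is an illustration, not the study's code: the doctor, patient, and grader functions below are hypothetical stubs that a real pipeline would replace with calls to language models and, for a sample of cases, reviews by human experts.

```python
def doctor_model(conversation):
    """Hypothetical stand-in for the clinical LLM being evaluated."""
    # A real system would send the conversation so far to a language model and
    # get back either a follow-up question or a final diagnosis.
    if len(conversation) > 6:
        return "FINAL DIAGNOSIS: migraine"
    return "When did the headaches start, and do you have any nausea?"

def patient_agent(question, case):
    """Hypothetical AI 'patient' that reveals details only when asked, like a real patient."""
    answers = [detail for symptom, detail in case["history"].items()
               if symptom in question.lower()]
    return " ".join(answers) or "I'm not sure."

def grade(diagnosis, case):
    """Automated check of the final answer; human experts could review a sample."""
    return case["ground_truth"].lower() in diagnosis.lower()

case = {
    "history": {
        "headache": "They started two weeks ago, mostly on one side of my head.",
        "nausea": "Yes, especially around bright lights.",
    },
    "ground_truth": "migraine",
}

conversation = ["Patient: I've been having bad headaches."]
for _ in range(10):  # cap the number of turns
    reply = doctor_model(conversation)
    conversation.append(f"Doctor: {reply}")
    if reply.startswith("FINAL DIAGNOSIS"):
        print("Correct diagnosis" if grade(reply, case) else "Incorrect diagnosis")
        break
    conversation.append(f"Patient: {patient_agent(reply, case)}")
```

The point of a loop like this is that the model has to ask for information it is not handed up front, which is exactly the kind of back-and-forth where the tested models stumbled.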
But for the time being, much work remains before these tools can come close to having the sort of chat you might have with your own flesh-and-blood doctor.