Answering Real-world Clinical Questions Using Large Language Model Based Systems

Low Yen Sia, Jackson Michael L., Hyde Rebecca J., Brown Robert E., Sanghavi Neil M., Baldwin Julian D., Pike C. William, Muralidharan Jananee, Hui Gavin, Alexander Natasha, Hassan Hadeel, Nene Rahul V., Pike Morgan, Pokrzywa Courtney J., Vedak Shivam, Yan Adam Paul, Yao Dong-han, Zipursky Amy R., Dinh Christina, Ballentine Philip, Derieg Dan C., Polony Vladimir, Chawdry Rehan N., Davies Jordan, Hyde Brigham B., Shah Nigam H., Gombar Saurabh. Arxiv 2024

[Paper]    
Agentic Applications GPT Model Architecture RAG Reinforcement Learning Survey Paper

Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature, as well as the difficulty of contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges, either by summarizing published literature or by generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems to answer 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2%–10%). In contrast, retrieval-augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions (65% vs. 0–9% for the other LLMs). These results suggest that, while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG, working synergistically with one for generating novel evidence, would improve the availability of pertinent evidence for patient care.
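For readers unfamiliar with the RAG pattern the abstract contrasts with general-purpose LLMs, the sketch below illustrates the general idea: retrieve passages relevant to a clinical question and ground the model's answer in them. This is a minimal illustration, not the pipeline used by OpenEvidence or ChatRWD; the toy `CORPUS`, the bag-of-words `embed` scorer, and the `generate_answer` stub are all hypothetical stand-ins.

```python
from collections import Counter
from math import sqrt

# Toy corpus standing in for a literature index; a real RAG system
# would search millions of indexed abstracts or RWD-derived studies.
CORPUS = [
    "Beta-blockers reduce mortality after myocardial infarction.",
    "Metformin is first-line therapy for type 2 diabetes.",
    "Anticoagulation lowers stroke risk in atrial fibrillation.",
]

def embed(text: str) -> Counter:
    """Hypothetical embedding: bag-of-words counts (real systems use dense vectors)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the question."""
    q = embed(question)
    return sorted(CORPUS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def generate_answer(question: str) -> str:
    """Assemble an evidence-grounded prompt; the LLM call itself is stubbed out."""
    evidence = "\n".join(f"- {p}" for p in retrieve(question))
    prompt = (
        "Answer the clinical question using ONLY the evidence below.\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    )
    return prompt  # in a real system, this prompt would be sent to an LLM

print(generate_answer("Does anticoagulation help in atrial fibrillation?"))
```

Grounding the prompt in retrieved passages is what lets RAG-based systems cite evidence rather than rely on the model's parametric memory, which is the behavior the study's physician reviewers scored as "relevant and evidence-based."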

Similar Work