P2207 - Assessing the Quality of AI Responses and Resistance to Sycophancy in Providing Patient-Centered Medical Advice on Gastroesophageal Reflux Disease
Benjamin Liu, MD1, Steve D'Souza, MD2, Melina Roy, MS, RDN1, Sandra Pichette, MS, RDN1, Sherif Saleh, MD1, Ronnie Fass, MD1, Gengqing Song, MD, PhD3 1MetroHealth Medical Center, Cleveland, OH; 2Case Western Reserve University / MetroHealth, Cleveland, OH; 3Metrohealth Medical Center, Cleveland, OH
Introduction: Substantial interest exists in using Large Language Models (LLMs), likeChatGPT and Google Gemini, for patient-oriented healthcare education. However, LLMs rely on human reinforcement training rather than established medical guidelines. This training has an underrecognized byproduct known as AI sycophancy where LLMs provide objectively false advice to please the human user. We conducted a novel study to evaluate the performance of freely available LLMs in providing patient education on gastroesophageal reflux disease (GERD) and how AI sycophancy may compromise the accuracy of their advice.
Methods: Gastroenterology fellows, attendings, and dietitians analyzed response quality from ChatGPT 3.5, and Gemini 1.0 to the top Google Trends searches for "acid reflux" in comparison to established sources. To assess sycophancy, we created 10 objectively false statements about GERD and input them into ChatGPT and Gemini with the words “My Doctor Said…” preceding each statement. LLM output was assessed for consistency and accuracy.
Results: ChatGPT 3.5 (p = 0.0039) and Gemini 1.0 (p = 0.0206) provided more comprehensive than both UpToDate® and publicly available hospital sources for the query “Acid Reflux Symptoms”. Both ChatGPT 3.5 (p < 0.001) and Gemini 1.0 (p < 0.001) provided answers at a higher reading level than UpToDate®. ChatGPT 3.5 agreed with 4/10 factually incorrect sycophantic prompts and did not change its stance even with prompting. Gemini 1.0 manufactured a false supporting statement in 1/10 prompts.
Discussion: Current LLMs are liable to provide misleading or false advice when prompted in a sycophantic manner. Importantly, sycophantic prompts can be unintentional. This vulnerability suggests that current models are not ready for integration into patient education frameworks.
Figure: Figure 1: Sycophantic Query with Example Answers from ChatGPT 3.5 and Google Gemini 1.0, with Follow-up Consistency Check and Answer. Answers Highlighted in Red are Factually Incorrect.
Note: The table for this abstract can be viewed in the ePoster Gallery section of the ACG 2024 ePoster Site or in The American Journal of Gastroenterology's abstract supplement issue, both of which will be available starting October 27, 2024.
Disclosures:
Benjamin Liu indicated no relevant financial relationships.
Steve D'Souza indicated no relevant financial relationships.
Melina Roy indicated no relevant financial relationships.
Sandra Pichette indicated no relevant financial relationships.
Sherif Saleh indicated no relevant financial relationships.
Ronnie Fass: Braintree laboratories/Sebela – Advisor or Review Panel Member. carnot – Advisor or Review Panel Member, Speaker. Daewoong – Speaker. Dexcal – Advisor or Review Panel Member. Eisai, Inc. – speaker. GERDCare/Celexio – Advisor or Review Panel Member. Intra-Sana – Speaker. Medicamenta – Speaker. Medtronic – Advisor or Review Panel Member. Phathom Pharmaceuticals – Advisor or Review Panel Member. Restech – Stock Options. Syneos – Advisor or Review Panel Member. Takeda – Speaker.
Gengqing Song indicated no relevant financial relationships.
Benjamin Liu, MD1, Steve D'Souza, MD2, Melina Roy, MS, RDN1, Sandra Pichette, MS, RDN1, Sherif Saleh, MD1, Ronnie Fass, MD1, Gengqing Song, MD, PhD3. P2207 - Assessing the Quality of AI Responses and Resistance to Sycophancy in Providing Patient-Centered Medical Advice on Gastroesophageal Reflux Disease, ACG 2024 Annual Scientific Meeting Abstracts. Philadelphia, PA: American College of Gastroenterology.