P4901 - Evaluating the Performance of Chat Generative Pretrained Transformer 4.0 in Advanced Medical Standardized Testing With Prompt-Engineered Question Responses and Non-Prompt-Engineered Question Responses
Faisal Mehmood, MD1, Collin J. Pitts, MD, MPH2, Hajra Jamil, MD3, Joseph Fares, MD2, Gavin Levinthal, MD2 1HonorHealth, Phoenix, AZ; 2HonorHealth, Scottsdale, AZ; 3Services Institute of Medical Sciences, Lahore, Punjab, Pakistan
Introduction: Prompt engineering is a process used in the field of artificial intelligence (AI), specifically with generative AI systems. It involves crafting well-defined, structured input queries or prompts that guide an AI model toward the desired responses and steer its behavior for specific tasks without modifying the underlying model.
Methods: We aimed to assess the performance of ChatGPT 4.0 in standardized testing, using the 2022 version of the ACG self-assessment question pool. Our primary objective was to determine whether providing ChatGPT 4.0 with additional prompt context changed its score. This self-assessment required a score of 70% to pass. We used the 160 of 300 questions that did not contain images. The question pool was divided into four equal sections and uploaded as .docx files. The test prompt began with “Provide just the answers to the attached multiple-choice questions.” A second test was prepared in the same way, with each question prefixed by “You are a gastroenterologist consulted to see” followed by the question prompt. The same question pool was used for both conditions. The two tests were administered in separate private chat sessions to minimize response variation from profile-specific customization. Responses were recorded directly as ChatGPT’s output. We also recorded the subcategory of each question (esophagus, stomach, colon, etc.) and the percentage of other test-takers who answered the question correctly. Low-difficulty questions were defined as those answered correctly by > 90% of prior test-takers, moderate-difficulty questions by 75%–90%, and challenging questions by ≤ 75%.
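For readers who wish to reproduce the two prompting conditions programmatically, a minimal Python sketch follows. The study itself used the ChatGPT web interface with .docx uploads; this sketch instead assumes the OpenAI API, and the model identifier, per-question prompt format, and helper names are illustrative assumptions rather than the authors' protocol.

```python
# Illustrative sketch only: the study used the ChatGPT web interface with .docx
# uploads, not the API. Model name and question format here are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(question_text: str, prompt_engineered: bool) -> str:
    """Return the model's answer to one multiple-choice question."""
    prefix = "You are a gastroenterologist consulted to see " if prompt_engineered else ""
    instruction = "Provide just the answer to the attached multiple-choice question."
    response = client.chat.completions.create(
        model="gpt-4",  # stand-in for the ChatGPT 4.0 interface used in the study
        messages=[{"role": "user", "content": f"{instruction}\n\n{prefix}{question_text}"}],
    )
    return response.choices[0].message.content.strip()


def difficulty(pct_prior_correct: float) -> str:
    """Bucket a question by the share of prior test-takers who answered it correctly."""
    if pct_prior_correct > 90:
        return "low"
    if pct_prior_correct >= 75:
        return "moderate"
    return "challenging"
```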
Results: Overall, ChatGPT 4.0 scored 61.8% on the 160 included questions with prompt engineering and 63.75% without prompt engineering. Furthermore, there was significant concordance between the two sets of responses.
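As an illustration of how the reported score and concordance could be derived from the recorded answer sheets, a short sketch follows; the function names and inputs are hypothetical and no study data are embedded.

```python
# Hypothetical illustration: compute overall score and agreement between the
# prompt-engineered and non-engineered runs from recorded answer lists.
def score(model_answers: list[str], answer_key: list[str]) -> float:
    """Percentage of questions answered correctly."""
    correct = sum(m == k for m, k in zip(model_answers, answer_key))
    return 100 * correct / len(answer_key)


def concordance(answers_a: list[str], answers_b: list[str]) -> float:
    """Percent agreement between the two runs on the same question set."""
    agree = sum(a == b for a, b in zip(answers_a, answers_b))
    return 100 * agree / len(answers_a)
```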
Discussion: ChatGPT has previously failed the ACG self-assessment tests, including the 2021 and 2022 versions, with ChatGPT-3 and ChatGPT-4 scoring 65.1% and 62.4%, respectively. In our study, ChatGPT 4.0 failed the exam regardless of prompt engineering. Additional investigation may be warranted to determine whether the current limitations stem from particular question types, answer groupings, or phrasings that lead to inaccurate responses. These limitations restrict the use of ChatGPT as an additional educational resource for trainees and warrant the development of more fine-tuned models focused on gastroenterology.
Note: The table for this abstract can be viewed in the ePoster Gallery section of the ACG 2024 ePoster Site or in The American Journal of Gastroenterology's abstract supplement issue, both of which will be available starting October 27, 2024.
Disclosures:
Faisal Mehmood indicated no relevant financial relationships.
Collin Pitts indicated no relevant financial relationships.
Hajra Jamil indicated no relevant financial relationships.
Joseph Fares indicated no relevant financial relationships.
Gavin Levinthal indicated no relevant financial relationships.
Faisal Mehmood, MD1, Collin J. Pitts, MD, MPH2, Hajra Jamil, MD3, Joseph Fares, MD2, Gavin Levinthal, MD2. P4901 - Evaluating the Performance of Chat Generative Pretrained Transformer 4.0 in Advanced Medical Standardized Testing With Prompt-Engineered Question Responses and Non-Prompt-Engineered Question Responses, ACG 2024 Annual Scientific Meeting Abstracts. Philadelphia, PA: American College of Gastroenterology.