University of California Los Angeles David Geffen School of Medicine Los Angeles, CA
Alexandra C. Greb, MD, Soonwook Hong, MD, Henry Zheng, BS, Vikram Sharma, , Berkeley Limketkai, MD, PhD University of California Los Angeles David Geffen School of Medicine, Los Angeles, CA
Introduction: Large language models (LLMs) are a new form of artificial intelligence that perform natural language processing tasks. Many open-source LLMs have been studied including ChatGPT (OpenAI) and BERT (Google). There are limited studies that have used LLMs and real clinical data in gastroenterology. Patients with inflammatory bowel disease (IBD) undergo routine endoscopy. Manual data abstraction of these endoscopy reports for clinical research and quality improvement can be both labor-intensive and time consuming. The aim of this study was to examine the utility of using Llama 3-8B to synthesize meaningful data from endoscopy reports.
Methods: This study included IBD patients that underwent endoscopy at a large quaternary care, academic medical center from August 30th, 2023 to November 20th, 2023. The text from their endoscopy reports was de-identified. All analyses were performed with a Llama 3-8B chat model, which was locally implemented on an air-gapped computer. The LLM was prompted to identify the presence and location of inflammation, presence of polyps and their number, size, location, and whether the polyp appeared malignant. The LLM’s accuracy, precision, recall, and F1 score were calculated. All reports were manually reviewed by a physician to verify results.
Results: The data included 253 endoscopy reports with a median length of text of 127 (interquartile range [IQR] 99) and 924 (IQR 749.75) character length. Of the total endoscopy reports, there were 119 (47%) with evidence of active inflammation and 76 (30%) with one more polyps. Llama 3-8B was able to identify the presence of polyps with an accuracy of 0.89, precision of 0.78, recall of 0.97 and F1 score of 0.86 (Table 1). It was also able to identify the presence of inflammatory changes with an accuracy of 0.85, precision of 0.75, recall of 0.93 and F1 score of 0.83.
Discussion: This study utilized Llama 3-8B to analyze patient data, including the abstraction of key endoscopy findings of inflammation and polyp characteristics. Though the LLM made some mistakes, it performed well at detecting the presence of polyps. In comparison, the LLM’s accuracy with detecting inflammation was lower though with a comparable F1 score. Notably, this was without any training. To expand the role of LLMs with use on real, clinical data, future research should focus on standardization of medical data input and fine-tuning of prompts.
Note: The table for this abstract can be viewed in the ePoster Gallery section of the ACG 2024 ePoster Site or in The American Journal of Gastroenterology's abstract supplement issue, both of which will be available starting October 27, 2024.
Disclosures:
Alexandra Greb indicated no relevant financial relationships.
Soonwook Hong indicated no relevant financial relationships.
Henry Zheng indicated no relevant financial relationships.
Vikram Sharma indicated no relevant financial relationships.
Berkeley Limketkai indicated no relevant financial relationships.
Alexandra C. Greb, MD, Soonwook Hong, MD, Henry Zheng, BS, Vikram Sharma, , Berkeley Limketkai, MD, PhD. P4323 - Summation and Interpretation of Endoscopy Reports of Patients with Inflammatory Bowel Disease Using Large Language Models, ACG 2024 Annual Scientific Meeting Abstracts. Philadelphia, PA: American College of Gastroenterology.