University of Chicago Medicine, Inflammatory Bowel Disease Center Chicago, IL
David T. Rubin, MD1, Walter Reinisch, MD, PhD2, Neeraj Narula, MD3, Daniel Colucci, BS4, William J. Eastman, MD5, Klaus Gottlieb, MD, PhD5, Ana Lacerda, MD, MSc6, Stephen Laroux, PhD7, Irene Modesto, MD8, Emma E. Navajas, BS4, Charles C. Owen, MD, MBA5, Yeli Wang, PhD4, Shrujal Baxi, MD9
1The University of Chicago Medicine, Chicago, IL; 2Medical University of Vienna, Vienna, Wien, Austria; 3McMaster University, Hamilton, ON, Canada; 4Iterative Health Inc., Cambridge, MA; 5Eli Lilly and Company, Indianapolis, IN; 6AbbVie, North Chicago, IL; 7AbbVie, Chicago, IL; 8Pfizer Inc., New York, NY; 9Iterative Health, Cambridge, MA
Introduction: The endoscopic Mayo Score (eMS) is intended to provide an objective measure of endoscopic disease activity and is a critical component of related endpoints in ulcerative colitis (UC) clinical trials. Wide variability in eMS grading has been reported among central readers, contributing to inconsistency in endoscopic results. Machine learning (ML) models offer a standardized solution. Adoption of an automated eMS model requires testing to demonstrate model performance and generalizability. This study systematically reviews the testing of ML eMS prediction models on full-length endoscopic video recordings from patients with UC.
Methods: Studies evaluating video-level eMS prediction models on UC endoscopy video datasets (independent of data used in model training) were included. We included all full-length manuscripts from human studies published in English. PubMed/MEDLINE, EMBASE, and Web of Science were systematically searched on December 31, 2023, and the search was supplemented by reference checks and a Google search. Three rounds of title screening were conducted. Information on test set characteristics and performance on clinically relevant endpoints was extracted independently by two authors, with discrepancies resolved through discussion.
Results: Five studies met criteria for inclusion, reporting data from six unique test cohorts. Five cohorts were internal test cohorts: two involved trial data (n=134-147 videos) and three involved data from routine care (n=27-51 videos). One cohort was an external test cohort involving trial data (n=264 videos). The definition of the reference standard (i.e., ground truth) varied across studies, with two cohorts reporting adjusted analyses based on a modified definition of the reference standard. Accuracy in predicting ordinal eMS grades (0, 1, 2, 3) ranged from 56.8% to 83.3%. Accuracy in predicting eMS 0, 1 vs 2, 3 (aligned with a trial definition of endoscopic improvement) ranged from 84% to 90.2%, and accuracy in predicting eMS 0 vs 1, 2, 3 (aligned with a trial definition of endoscopic remission) ranged from 90% to 95.5%.
Discussion: Several studies have reported promising data on the performance of ML models in predicting video-level eMS grades assigned by human readers in UC. This technology may ultimately provide less biased endoscopic assessments and improve standardization across clinical trials in UC. Further validation and more consistent test dataset characteristics are required to ensure model generalizability and to enable comparison across models.
Figure: Figure 1. Description of test dataset characteristics and model performance against key endpoints. Key endpoints include ordinal eMS, eMS 0, 1 vs 2, 3 (a definition of endoscopic improvement in trials), and eMS 0 vs 1, 2, 3 (a definition of endoscopic remission in trials). All test sets are independent of those used in model training. Red indicates testing on a clinical trial dataset. Blue indicates testing on a routine care dataset. * indicates testing on an external test set (relative to an internal test set from the same site or a holdout of the model training dataset). ^ indicates cohort results in an adjusted analysis based on modification to the definition of the reference standard. Acc, accuracy.
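Illustrative note: the three key endpoints can all be computed from the same paired video-level grades (reference standard vs model prediction). The Python sketch below shows one way to do this, assuming simple agreement-based accuracy. The function names and the toy grades are hypothetical, chosen only for illustration; they are not taken from any included study and do not reproduce any reported result.

```python
from typing import Dict, List, Sequence


def accuracy(reference: Sequence, predicted: Sequence) -> float:
    """Fraction of videos where the prediction matches the reference standard."""
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def dichotomize(grades: Sequence[int], cutoff: int) -> List[bool]:
    """Map ordinal eMS grades (0-3) to True when grade <= cutoff."""
    return [g <= cutoff for g in grades]


def endpoint_accuracies(reference: Sequence[int],
                        predicted: Sequence[int]) -> Dict[str, float]:
    """Accuracy on ordinal eMS and on the two dichotomized endpoints."""
    return {
        # Exact agreement on grades 0, 1, 2, 3
        "ordinal": accuracy(reference, predicted),
        # eMS 0, 1 vs 2, 3 (a definition of endoscopic improvement)
        "improvement": accuracy(dichotomize(reference, 1),
                                dichotomize(predicted, 1)),
        # eMS 0 vs 1, 2, 3 (a definition of endoscopic remission)
        "remission": accuracy(dichotomize(reference, 0),
                              dichotomize(predicted, 0)),
    }


# Hypothetical toy grades for eight videos (not study data)
toy_reference = [0, 1, 1, 2, 2, 3, 3, 0]
toy_predicted = [0, 1, 2, 2, 3, 3, 2, 1]
toy_results = endpoint_accuracies(toy_reference, toy_predicted)
print(toy_results)
```

In this toy example, errors between adjacent grades on the same side of a cutoff (for example, 2 vs 3) lower ordinal accuracy but not the dichotomized accuracies, which is one reason binary endpoints can show higher accuracy than ordinal grading.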
Emma Navajas: Iterative Health, Inc. – Employee, Stock Options.
Charles C. Owen: Eli Lilly and Company – Employee, Stock Options.
Yeli Wang: Iterative Health, Inc. – Employee, Stock Options.
Shrujal Baxi: Iterative Health, Inc. – Employee, Stock Options.
David T. Rubin, MD1, Walter Reinisch, MD, PhD2, Neeraj Narula, MD3, Daniel Colucci, BS4, William J. Eastman, MD5, Klaus Gottlieb, MD, PhD5, Ana Lacerda, MD, MSc6, Stephen Laroux, PhD7, Irene Modesto, MD8, Emma E. Navajas, BS4, Charles C. Owen, MD, MBA5, Yeli Wang, PhD4, Shrujal Baxi, MD9. P0654 - Evaluation of Machine Learning Models for the Assessment of the Endoscopic Mayo Score in Ulcerative Colitis: A Systematic Review, ACG 2024 Annual Scientific Meeting Abstracts. Philadelphia, PA: American College of Gastroenterology.