Diagnostic codes for Barrett's esophagus (BE), a precursor to esophageal
cancer, lack granularity and precision for many research or clinical use cases.
Laborious manual chart review is required to extract key diagnostic phenotypes
from BE pathology reports. We developed a generalizable transformer-based
method to automate data extraction. Using pathology reports from Columbia
University Irving Medical Center with gastroenterologist-annotated targets, we
performed binary dysplasia classification as well as fine-grained multi-class
BE-related diagnosis classification. We used two clinically pre-trained
large language models; the best model performed comparably to a highly
tailored rule-based system developed using the same data. Binary dysplasia
extraction achieved an F1-score of 0.964, while the multi-class model achieved an
F1-score of 0.911. Our method is generalizable and faster to implement than a
tailored rule-based approach.
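
For readers unfamiliar with the metric, the following is a minimal sketch of how an F1-score can be computed for a binary task and averaged across classes for a multi-class task. The abstract does not state which averaging scheme was used for the multi-class result, so the macro average here is an assumption, and the function names and toy labels are hypothetical illustrations, not the study's data or code.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy binary example (1 = dysplasia present, 0 = absent); values are made up.
binary_true = [1, 1, 0, 0, 1]
binary_pred = [1, 0, 0, 0, 1]
binary_f1 = f1_for_label(binary_true, binary_pred, label=1)

# Toy multi-class example with placeholder class names.
multi_true = ["A", "B", "C", "A"]
multi_pred = ["A", "B", "B", "A"]
multi_f1 = macro_f1(multi_true, multi_pred)
```

In the binary toy case there are two true positives, no false positives, and one false negative, so the score is 4/5. A weighted or micro average would give a different multi-class number on the same predictions, which is why the averaging scheme matters when comparing reported scores.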