This study used text processing and natural language processing (NLP) models
to mine clinical notes for diagnoses of periodontitis and evaluated the
performance of a named entity recognition (NER) model trained on data generated
by different regular expression (RE) methods. RE methods at two complexity levels were used
to extract and generate the training data. The spaCy package and RoBERTa
transformer models were used to build the NER model, whose performance was
evaluated against manually labeled gold standards. Comparing the RE methods
with the gold standard showed that the F1 score increased from 0.3-0.4 to
around 0.9 as the complexity of the RE algorithms increased. The NER
models performed well: those trained with the simple RE method scored
0.84-0.92 on the evaluation metrics, and those trained with the advanced and
combined RE method scored 0.95-0.99. This study provides an example of
the benefit of combining RE methods and NLP models to convert target
information in free text into structured data and to recover diagnoses
missing from unstructured notes.

Comment: IEEE ICHI 2023, see https://ieeeichi.github.io/ICHI2023/program.htm
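
As an illustration of the RE-based labeling step summarized above, the following minimal Python sketch shows how a simple and a more elaborate regular expression might produce character-offset entity spans and how those spans could be scored against a gold standard with an exact-match F1. The two patterns, the `PERIO_DX` label, the example note, and the helper names `label_spans` and `span_f1` are all hypothetical; the abstract does not give the study's actual expressions or scoring procedure, and the scores here do not reproduce the reported results.

```python
import re

# Hypothetical "simple" pattern: matches only the bare disease term.
SIMPLE_RE = re.compile(r"\bperiodontitis\b", re.IGNORECASE)

# Hypothetical "advanced" pattern: also captures optional extent, severity,
# and type modifiers that precede the disease term.
ADVANCED_RE = re.compile(
    r"\b(?:(?:generalized|localized)\s+)?"
    r"(?:(?:mild|moderate|severe)\s+)?"
    r"(?:(?:chronic|aggressive)\s+)?"
    r"periodontitis\b",
    re.IGNORECASE,
)


def label_spans(text, pattern, label="PERIO_DX"):
    """Return (start, end, label) character spans for every pattern match."""
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


def span_f1(predicted, gold):
    """Exact-match F1 between predicted and gold (start, end, label) spans."""
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up note and gold annotation, for illustration only.
note = (
    "Dx: generalized moderate chronic periodontitis. "
    "No localized periodontitis noted."
)
gold_phrases = (
    "generalized moderate chronic periodontitis",
    "localized periodontitis",
)
gold = [(note.index(p), note.index(p) + len(p), "PERIO_DX") for p in gold_phrases]

if __name__ == "__main__":
    for name, pattern in (("simple", SIMPLE_RE), ("advanced", ADVANCED_RE)):
        spans = label_spans(note, pattern)
        print(name, spans, span_f1(spans, gold))
```

Spans in this `(start, end, label)` form follow the character-offset convention that spaCy accepts for entity annotations, so output of this kind could plausibly serve as automatically generated training data for an NER model. On this toy note the simple pattern misses the modifiers and therefore fails the exact-match comparison, which is one way a more complex expression can raise agreement with the gold standard.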