Development of a Composite Health Index in Children with Cystic Fibrosis: A Pipeline for Data Processing, Machine Learning, and Model Implementation using Electronic Health Records
Cystic fibrosis (CF) is a heterogeneous, multifaceted genetic condition that primarily affects the lungs and digestive system. For children and young people living with CF, timely management is necessary to prevent the establishment of severe disease. Modern data capture through electronic health records (EHR) has created an opportunity to use machine learning algorithms to classify subgroups of disease in order to understand health status and prognosis. The overall aim of this thesis was to develop a composite health index in children with CF.
An iterative approach to unsupervised cluster analysis was developed to identify homogeneous clusters of children with CF in a pre-existing encounter-based CF database from Toronto, Canada. The model was externally validated in a historical CF dataset from Great Ormond Street Hospital (GOSH) in London, UK. The clusters were also re-created and validated using EHR data from GOSH when it first became accessible in 2021. The interpretability and sensitivity of the GOSH EHR model were explored. Lastly, a scoping review was carried out to investigate common barriers to the implementation of prognostic machine learning algorithms in paediatric respiratory care.
A cluster model was identified comprising four clusters that were associated with time to future hospitalisation, pulmonary exacerbation, and lung function. The clusters were also associated with different disease-related variables such as comorbidities, anthropometrics, microbiological infections, and treatment history. An app was developed to display individualised cluster assignment, providing a way to interpret the cluster model clinically. The review of prognostic machine learning algorithms identified a lack of reproducibility and validation as the major limitations of model reporting that impair clinical translation.
EHR systems facilitate point-of-care access to individualised data and integrated machine learning models. However, a gap remains in translating machine learning models into clinical implementation. With appropriate regulatory frameworks, the health index developed for children with CF could be implemented in CF care.