Personalising lung cancer screening with machine learning

Abstract

Personalised screening is based on a straightforward concept: repeated risk assessment linked to tailored management. However, delivering such programmes at scale is complex. In this work, I aimed to contribute to two areas: the simplification of risk assessment to facilitate the implementation of personalised screening for lung cancer; and, the use of synthetic data to support privacy-preserving analytics in the absence of access to patient records. I first present parsimonious machine learning models for lung cancer screening, demonstrating an approach that couples the performance of model-based risk prediction with the simplicity of risk-factor-based criteria. I trained models to predict the five-year risk of developing or dying from lung cancer using UK Biobank and US National Lung Screening Trial participants before external validation amongst temporally and geographically distinct ever-smokers in the US Prostate, Lung, Colorectal and Ovarian Screening trial. I found that three predictors – age, smoking duration, and pack-years – within an ensemble machine learning framework achieved or exceeded parity in discrimination, calibration, and net benefit with comparators. Furthermore, I show that these models are more sensitive than risk-factor-based criteria, such as those currently recommended by the US Preventive Services Taskforce. For the implementation of more personalised healthcare, researchers and developers require ready access to high-quality datasets. As such data are sensitive, their use is subject to tight control, whilst the majority of data present in electronic records are not available for research use. Synthetic data are algorithmically generated but can maintain the statistical relationships present within an original dataset. In this work, I used explicitly privacy-preserving generators to create synthetic versions of the UK Biobank before we performed exploratory data analysis and prognostic model development. Comparing results when using the synthetic against the real datasets, we show the potential for synthetic data in facilitating prognostic modelling

    Similar works