A pipeline to further enhance quality, integrity and reusability of the NCCID clinical data.
Acknowledgements: There is no direct funding for this study, but the authors are grateful for support from the EU/EFPIA Innovative Medicines Initiative project DRAGON (101005122) (A.B., I.S., M.R., J.B., E.G.-K., L.E.S., AIX-COVNET, J.W.M., E.S., C.-B.S.), FWF Austria (A.B.), the Trinity Challenge BloodCounts! project (M.R., C.-B.S.), the EPSRC Cambridge Mathematics of Information in Healthcare Hub EP/T017961/1 (M.R., J.H.F.R., J.A.D.A., C.-B.S.), the Cantab Capital Institute for the Mathematics of Information (C.-B.S.), the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014) (I.S., J.W.M., L.E.S., E.S.), the Wellcome Trust (J.H.F.R.), the British Heart Foundation (J.H.F.R.), the NIHR Cambridge Biomedical Research Centre (J.H.F.R.), the European Research Council under the European Union’s Horizon 2020 research and innovation programme, grant agreement no. 777826 (C.-B.S.), the Alan Turing Institute (C.-B.S.), and the Cancer Research UK Cambridge Centre (C9685/A25177) (C.-B.S.). In addition, C.-B.S. acknowledges support from the Leverhulme Trust project on ‘Breaking the non-convexity barrier’, the Philip Leverhulme Prize, the EPSRC grants EP/S026045/1 and EP/T003553/1, and the Wellcome Innovator Award RG98755. The AIX-COVNET collaboration is also grateful to Intel for financial support and to the CRUK National Cancer Imaging Translational Accelerator (NCITA) (C22479/A28667) for use of their data repository. Lastly, we want to thank the NHS AI Lab, the British Thoracic Society, the Royal Surrey NHS Foundation Trust and Faculty for their great work on the NCCID and the original cleaning pipeline.

The National COVID-19 Chest Imaging Database (NCCID) is a centralized UK database of thoracic imaging and corresponding clinical data. It is made available by the National Health Service Artificial Intelligence (NHS AI) Lab to support the development of machine learning tools focused on Coronavirus Disease 2019 (COVID-19). A bespoke cleaning pipeline for the NCCID, developed by NHSX, was introduced in 2021.
We present an extension to the original cleaning pipeline for the clinical data of the database. It has been adjusted to correct additional systematic inconsistencies in the raw data, such as errors in patient sex, oxygen levels and date values. The most important changes are discussed in this paper, whilst the code and further explanations are made publicly available on GitLab. The suggested cleaning allows global users to work with more consistent data when developing machine learning tools, without needing to be domain experts. In addition, this work highlights some of the challenges of working with clinical multi-centre data and includes recommendations for similar future initiatives.
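The kinds of corrections described above can be illustrated with a short sketch. This is not the NCCID pipeline itself; the column names (`sex`, `oxygen_saturation`, `swab_date`) and the specific rules (rescaling fractional saturations, discarding physiologically impossible values, coercing mixed date formats) are hypothetical examples of the systematic fixes such a pipeline might apply:

```python
import pandas as pd

def clean_clinical_data(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning of sex, oxygen-saturation and date columns."""
    df = df.copy()
    # Harmonise free-text sex codes to a single controlled vocabulary;
    # anything unrecognised becomes missing rather than a silent error.
    sex_map = {"m": "M", "male": "M", "f": "F", "female": "F"}
    df["sex"] = df["sex"].astype(str).str.strip().str.lower().map(sex_map)
    # Oxygen saturation is a percentage; values recorded as fractions
    # (e.g. 0.95) are rescaled, and impossible values are set to missing.
    o2 = pd.to_numeric(df["oxygen_saturation"], errors="coerce")
    o2 = o2.where(o2 > 1.0, o2 * 100)              # 0.95 -> 95.0
    df["oxygen_saturation"] = o2.where(o2.between(50, 100))
    # Dates arrive in mixed formats; unparseable entries become NaT
    # instead of raising, so the rest of the record is preserved.
    df["swab_date"] = pd.to_datetime(df["swab_date"], errors="coerce",
                                     dayfirst=True)
    return df
```

The pattern throughout is to make inconsistencies explicit (as missing values) rather than to guess, which is the safer default for multi-centre clinical data.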
The impact of imputation quality on machine learning classifiers for datasets with missing values.
BACKGROUND: Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data are found in most real-world datasets, and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier's performance. METHODS: We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data. RESULTS: The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised. CONCLUSIONS: It is imperative to consider the quality of the imputation when performing downstream classification, as the effects on the classifier can be considerable.
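The sliced Wasserstein distance underlying the discrepancy scores mentioned above has a simple Monte-Carlo form: project both point clouds onto random directions and average the resulting one-dimensional Wasserstein distances, which for sorted samples of equal size reduce to mean absolute differences of order statistics. The sketch below is a generic estimator of that distance, not the paper's specific discrepancy score; the function name and parameters are illustrative:

```python
import numpy as np

def sliced_wasserstein(x: np.ndarray, y: np.ndarray,
                       n_proj: int = 100, seed: int = 0) -> float:
    """Monte-Carlo estimate of the sliced 1-Wasserstein distance between
    two equally sized point clouds of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)        # random unit direction
        # In 1-D, the W1 distance between equal-size empirical measures
        # is the mean absolute difference of their sorted projections.
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.abs(px - py).mean()
    return total / n_proj
```

Used as a discrepancy score, one cloud would be the imputed data and the other a reference sample from the underlying distribution; a score near zero indicates the imputation has matched the distribution, not merely the individual missing entries.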