11 research outputs found

Your diffusion model secretly knows the dimension of the data manifold

Author: Batzolis Georgios
Schönlieb Carola-Bibiane
Stanczuk Jan
Publication date: 23/12/2022

In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A trained diffusion model approximates the gradient of the log density of a noise-corrupted version of the target distribution for varying levels of corruption. If the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximum likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space, and thus the intrinsic dimension of the data manifold. Our method outperforms linear methods for dimensionality detection such as PPCA in controlled experiments. Comment: arXiv admin note: text overlap with arXiv:2207.0978

arXiv.org e-Print Archive
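
The idea in the abstract above (score vectors at small noise levels span the normal space, so counting them reveals the intrinsic dimension) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `approx_score` is a hand-written small-noise stand-in for a trained score network on a unit sphere, and the largest-gap rule in `estimate_intrinsic_dim` is one simple way to count dominant singular values, not necessarily the criterion used by the authors.

```python
import numpy as np

def approx_score(x, sigma, sphere_dim):
    """Hypothetical stand-in for a trained score network.

    The data manifold is the unit sphere of dimension `sphere_dim`, living in
    the first sphere_dim + 1 coordinates of the ambient space. For small sigma
    the score is approximated by (projection onto manifold - x) / sigma^2.
    """
    m = sphere_dim + 1
    proj = np.zeros_like(x)
    u = x[..., :m]
    proj[..., :m] = u / np.linalg.norm(u, axis=-1, keepdims=True)
    return (proj - x) / sigma**2

def estimate_intrinsic_dim(score_fn, x0, sigma, n_samples, rng):
    """Estimate the manifold dimension at x0 from the spectrum of score vectors."""
    ambient_dim = x0.shape[0]
    # Perturb the base point with small Gaussian noise and evaluate the score.
    x = x0 + sigma * rng.standard_normal((n_samples, ambient_dim))
    scores = score_fn(x)
    # Singular values come back in descending order; the large ones
    # correspond to normal directions, the small ones to tangent directions.
    s = np.linalg.svd(scores, compute_uv=False)
    gaps = s[:-1] / s[1:]
    normal_dim = int(np.argmax(gaps)) + 1
    return ambient_dim - normal_dim

rng = np.random.default_rng(0)
ambient_dim, sphere_dim, sigma = 10, 2, 0.01
x0 = np.zeros(ambient_dim)
x0[0] = 1.0  # a point on the sphere
estimated_dim = estimate_intrinsic_dim(
    lambda x: approx_score(x, sigma, sphere_dim), x0, sigma, 2000, rng
)
print(estimated_dim)
```

With a real model, `approx_score` would be replaced by the network evaluated at a small diffusion time; the rest of the sketch would stay the same.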

Closing the ODE-SDE gap in score-based diffusion models through the Fokker-Planck equation

Author: Budd Chris
Deveney Teo
Kreusser Lisa Maria
Schönlieb Carola-Bibiane
Stanczuk Jan
Publication date: 27/11/2023

Score-based diffusion models have emerged as one of the most promising frameworks for deep generative modelling, due to their state-of-the-art performance in many generation tasks while relying on mathematical foundations such as stochastic differential equations (SDEs) and ordinary differential equations (ODEs). Empirically, it has been reported that ODE-based samples are inferior to SDE-based samples. In this paper we rigorously describe the range of dynamics and approximations that arise when training score-based diffusion models, including the true SDE dynamics, the neural approximations, the various approximate particle dynamics that result, as well as their associated Fokker-Planck equations and the neural network approximations of these Fokker-Planck equations. We systematically analyse the difference between the ODE and SDE dynamics of score-based diffusion models, and link it to an associated Fokker-Planck equation. We derive a theoretical upper bound on the Wasserstein 2-distance between the ODE- and SDE-induced distributions in terms of a Fokker-Planck residual. We also show numerically that conventional score-based diffusion models can exhibit significant differences between ODE- and SDE-induced distributions, which we demonstrate using explicit comparisons. Moreover, we show numerically that reducing the Fokker-Planck residual by adding it as an additional regularisation term leads to closing the gap between ODE- and SDE-induced distributions. Our experiments suggest that this regularisation can improve the distribution generated by the ODE, but that this can come at the cost of degraded SDE sample quality.

arXiv.org e-Print Archive

Closing the ODE-SDE gap in score-based diffusion models through the Fokker-Planck equation

Author: Budd Chris
Deveney Teo
Kreusser Lisa Maria
Schönlieb Carola-Bibiane
Stanczuk Jan
Publication venue: Center for Open Science
Publication date: 27/11/2023

OPUS

Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans

Author: Ai Tao
Ako Emmanuel
Aviles-Rivero Angelica I.
Azadbakht Hojjat
Babar Judith
Beer Lucian
Bradley Kyle
Cafolla Conor
Driggs Derek
Etmann Christian
Gilbey Julian
Gkrania-Klotsas Effrossyni
Gonzalez Paula Martin
Gozaliasl Ghassem
Hofmanninger Johannes
Holzer Markus
Jacob Joseph
Jefferson Emily
Ji Kangyu
Korhonen Anna
Langs Georg
Lin Weizhe
Lio Pietro
Lowe Josh
McCague Cathal
Niu Zhangming
Ortet Maria Delgado
Patel Mishal
Preller Jacobus
Prosch Helmut
Roberts Michael
Rudd James H. F.
Ruggiero Alessandro
Sala Evis
Schönlieb Carola-Bibiane
Shadbahr Tolou
Stanczuk Jan
Stranks Samuel
Sánchez Lorena Escudero
Tang Jing
Teare Philip
Teng Zhongzhao
Thillai Muhunthan
Thorpe Matthew
Ursprung Stephan
Walton Nicholas
Wassin Marcel
Weir-McCall Jonathan R.
Yang Guang
Yeung Michael
Zha Yunfei
Zhang Kang
Zhu Xiaoxiang
Publication venue: Nature Machine Intelligence
Publication date: 01/01/2021

Abstract: Machine learning methods offer great promise for fast and accurate detection and prognostication of coronavirus disease 2019 (COVID-19) from standard-of-care chest radiographs (CXR) and chest computed tomography (CT) images. Many articles have been published in 2020 describing new machine learning-based models for both of these tasks, but it is unclear which are of potential clinical utility. In this systematic review, we consider all published papers and preprints, for the period from 1 January 2020 to 3 October 2020, which describe new machine learning models for the diagnosis or prognosis of COVID-19 from CXR or CT images. All manuscripts uploaded to bioRxiv, medRxiv and arXiv along with all entries in EMBASE and MEDLINE in this timeframe are considered. Our search identified 2,212 studies, of which 415 were included after initial screening and, after quality screening, 62 studies were included in this systematic review. Our review finds that none of the models identified are of potential clinical use due to methodological flaws and/or underlying biases. This is a major weakness, given the urgency with which validated COVID-19 models are needed. To address this, we give many recommendations which, if followed, will solve these issues and lead to higher-quality model development and well-documented manuscripts

Institute of Transport Research:Publications

Apollo (Cambridge)

The impact of imputation quality on machine learning classifiers for datasets with missing values

Author: Aston John A. D.
Dittmer Sören
Gilbey Julian
Lió Pietro
Mirtti Tuomas
Patel Mishal
Preller Jacobus
Rannikko Antti Sakari
Roberts Michael
Rudd James H. F.
Sala Evis
Schönlieb Carola-Bibiane
Shadbahr Tolou
Stanczuk Jan
Tang Jing
Teare Philip
Thorpe Matthew
Torné Ramon Viñas
Publication venue: Communications Medicine
Publication date: 06/10/2023

Background: Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier’s performance. Methods: We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data. Results: The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised. Conclusions: It is imperative to consider the quality of the imputation when performing downstream classification as the effects on the classifier can be considerable.

Apollo (Cambridge)
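
The abstract above refers to discrepancy scores based on the sliced Wasserstein distance. As a rough illustration of the underlying quantity only, the sketch below computes a Monte Carlo sliced Wasserstein distance between two equally sized samples. The function name, its parameters, and the toy data are assumptions made for this example; it does not reproduce the authors' class of discrepancy scores.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections, rng, p=2):
    """Monte Carlo sliced Wasserstein-p distance between two samples.

    Assumes x and y have the same shape (n_samples, n_features). Each random
    unit direction reduces the problem to one dimension, where the optimal
    transport plan simply matches sorted values.
    """
    if x.shape != y.shape:
        raise ValueError("this sketch requires samples of identical shape")
    n_features = x.shape[1]
    # Draw random directions uniformly on the unit sphere.
    directions = rng.standard_normal((n_projections, n_features))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Project both samples and sort each projected column.
    proj_x = np.sort(x @ directions.T, axis=0)
    proj_y = np.sort(y @ directions.T, axis=0)
    return float(np.mean(np.abs(proj_x - proj_y) ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
x = rng.standard_normal((500, 3))          # stand-in for the observed data
y_same = rng.standard_normal((500, 3))     # stand-in for a faithful imputation
y_shifted = y_same + 2.0                   # stand-in for a biased imputation

d_same = sliced_wasserstein(x, y_same, 64, rng)
d_shifted = sliced_wasserstein(x, y_shifted, 64, rng)
print(d_same, d_shifted)
```

A sample drawn from the same distribution should score much closer to the reference than the shifted one, which is the behaviour a distribution-level imputation quality measure needs.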

The impact of imputation quality on machine learning classifiers for datasets with missing values.

Author: AIX-COVNET Collaboration
Aston John AD
Dittmer Sören
Gilbey Julian
Lió Pietro
Mirtti Tuomas
Patel Mishal
Preller Jacobus
Rannikko Antti Sakari
Roberts Michael
Rudd James HF
Sala Evis
Schönlieb Carola-Bibiane
Shadbahr Tolou
Stanczuk Jan
Tang Jing
Teare Philip
Thorpe Matthew
Torné Ramon Viñas
Publication venue: Commun Med (Lond)
Publication date: 17/11/2023

BACKGROUND: Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier's performance. METHODS: We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data. RESULTS: The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised. CONCLUSIONS: It is imperative to consider the quality of the imputation when performing downstream classification as the effects on the classifier can be considerable.

Apollo (Cambridge)

The impact of imputation quality on machine learning classifiers for datasets with missing values.

Author: AIX-COVNET Collaboration
Aston John AD
Dittmer Sören
Gilbey Julian
Lió Pietro
Mirtti Tuomas
Patel Mishal
Preller Jacobus
Rannikko Antti Sakari
Roberts Michael
Rudd James HF
Sala Evis
Schönlieb Carola-Bibiane
Shadbahr Tolou
Stanczuk Jan
Tang Jing
Teare Philip
Thorpe Matthew
Torné Ramon Viñas
Publication venue: Commun Med (Lond)
Publication date: 06/10/2023

Apollo (Cambridge)

A pipeline to further enhance quality, integrity and reusability of the NCCID clinical data

Author: Aston John A. D.
Babar Judith
Breger Anna
Collaboration AIX-COVNET
Dittmer Sören
Escudero Sánchez Lorena
Gilbey Julian
Gkrania-Klotsas Effrossyni
Holzer Markus
Jefferson Emily
Korhonen Anna
Langs Georg
Li Ming
Liò Pietro
Nan Yang
Patel Mishal
Preller Jacobus
Prosch Helmut
Roberts Michael
Rudd James H. F.
Sala Evis
Schönlieb Carola-Bibiane
Selby Ian
Shadbahr Tolou
Solares Eduardo González
Stanczuk Jan
Tang Jing
Teare Philip
Thorpe Matthew
Walton Nicholas
Wassink Marcel
Weir-McCall Jonathan R.
Xing Xiaodan
Yang Guang
Publication venue: Nature Publishing Group
Publication date: 27/07/2023

The National COVID-19 Chest Imaging Database (NCCID) is a centralized UK database of thoracic imaging and corresponding clinical data. It is made available by the National Health Service Artificial Intelligence (NHS AI) Lab to support the development of machine learning tools focused on Coronavirus Disease 2019 (COVID-19). A bespoke cleaning pipeline for NCCID, developed by the NHSx, was introduced in 2021. We present an extension to the original cleaning pipeline for the clinical data of the database. It has been adjusted to correct additional systematic inconsistencies in the raw data such as patient sex, oxygen levels and date values. The most important changes will be discussed in this paper, whilst the code and further explanations are made publicly available on GitLab. The suggested cleaning will allow global users to work with more consistent data for the development of machine learning tools without being an expert. In addition, it highlights some of the challenges when working with clinical multi-center data and includes recommendations for similar future initiatives. Peer reviewed

Helsingin yliopiston digitaalinen arkisto