Precision-Recall Curves Using Information Divergence Frontiers
Despite the tremendous progress in the estimation of generative models, the
development of tools for diagnosing their failures and assessing their
performance has advanced at a much slower pace. Recent developments have
investigated metrics that quantify which parts of the true distribution are
modeled well and, conversely, what the model fails to capture, akin to
precision and recall in information retrieval. In this paper, we present a
general evaluation framework for generative models that measures the trade-off
between precision and recall using R\'enyi divergences. Our framework provides
a novel perspective on existing techniques and extends them to more general
domains. As a key advantage, this formulation encompasses both continuous and
discrete models and allows for the design of efficient algorithms that do not
have to quantize the data. We further analyze the biases of the approximations
used in practice. Comment: Updated to the AISTATS 2020 version.
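For readers unfamiliar with the divergence family named in this abstract, the following is a minimal sketch of the standard Rényi divergence of order alpha between two discrete distributions. It is not the paper's frontier algorithm; the function name `renyi_divergence` is hypothetical, and the sketch assumes `q` is strictly positive wherever `p` is.

```python
import math


def renyi_divergence(p, q, alpha):
    """Standard Rényi divergence D_alpha(p || q) in nats, discrete case.

    Assumption: p and q are same-length probability vectors and q[i] > 0
    wherever p[i] > 0. This illustrates the definition only.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if abs(alpha - 1.0) < 1e-12:
        # The limit alpha -> 1 is the Kullback-Leibler divergence.
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Terms with p[i] == 0 contribute nothing to the sum, so skip them.
    total = sum(
        pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q) if pi > 0
    )
    return math.log(total) / (alpha - 1.0)


p = [0.5, 0.5]
q = [0.9, 0.1]
d_kl = renyi_divergence(p, q, 1.0)  # order 1 (KL divergence)
d_2 = renyi_divergence(p, q, 2.0)   # order 2
```

The divergence is zero when the two distributions coincide and is non-decreasing in alpha, so `d_2` is at least as large as `d_kl` here.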
Toward Sharing Brain Images: Differentially Private TOF-MRA Images With Segmentation Labels Using Generative Adversarial Networks
Sharing labeled data is crucial for acquiring large datasets for various Deep Learning applications. In medical imaging, this is often not feasible due to privacy regulations. Although anonymization would be a solution, standard techniques have been shown to be partially reversible. Here, synthetic data generated by a Generative Adversarial Network (GAN) with differential privacy guarantees could be a solution to ensure the patient's privacy while maintaining the predictive properties of the data. In this study, we implemented a Wasserstein GAN (WGAN) with and without differential privacy guarantees to generate privacy-preserving labeled Time-of-Flight Magnetic Resonance Angiography (TOF-MRA) image patches for brain vessel segmentation. The synthesized image-label pairs were used to train a U-Net, which was evaluated in terms of the segmentation performance on real patient images from two different datasets. Additionally, the Fréchet Inception Distance (FID) was calculated between the generated images and the real images to assess their similarity. During the evaluation using the U-Net and the FID, we explored the effect of different levels of privacy, represented by the parameter ϵ. With stricter privacy guarantees, the segmentation performance and the similarity to the real patient images in terms of FID decreased. Our best segmentation model, trained on synthetic and private data, achieved a Dice Similarity Coefficient (DSC) of 0.75 for ϵ = 7.4 compared to 0.84 for ϵ = ∞ in a brain vessel segmentation paradigm (DSC of 0.69 and 0.88 on the second test set, respectively). We identified a threshold of ϵ < 5 below which the performance (DSC < 0.61) became unstable and unusable. Our synthesized labeled TOF-MRA images with strict privacy guarantees retained predictive properties necessary for segmenting the brain vessels.
Although further research is warranted regarding generalizability to other imaging modalities and performance improvement, our results mark an encouraging first step for privacy-preserving data sharing in medical imaging.
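The Dice Similarity Coefficient reported in this abstract can be illustrated with a minimal sketch. It assumes the masks are flattened binary sequences, and it assumes that two empty masks count as perfect agreement; the function name `dice_coefficient` is hypothetical and is not taken from the study's code.

```python
def dice_coefficient(predicted, reference):
    """Dice Similarity Coefficient 2|A ∩ B| / (|A| + |B|) for binary masks.

    Assumption: both masks are equal-length flat sequences of 0/1 values.
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same length")
    # Count voxels labeled 1 in both masks, then in each mask separately.
    overlap = sum(1 for a, b in zip(predicted, reference) if a and b)
    size_sum = sum(1 for a in predicted if a) + sum(1 for b in reference if b)
    if size_sum == 0:
        # Convention chosen for this sketch: two empty masks agree fully.
        return 1.0
    return 2.0 * overlap / size_sum


predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 0, 0]
score = dice_coefficient(predicted_mask, reference_mask)
```

A score of 1 means the predicted and reference vessel masks coincide, and 0 means they do not overlap at all.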
Generating tabular datasets under differential privacy
Machine Learning (ML) is accelerating progress across fields and industries,
but relies on accessible and high-quality training data. Some of the most
important datasets are found in biomedical and financial domains in the form of
spreadsheets and relational databases. But this tabular data is often sensitive
in nature. Synthetic data generation offers the potential to unlock sensitive
data, but generative models tend to memorise and regurgitate training data,
which undermines the privacy goal. To remedy this, researchers have
incorporated the mathematical framework of Differential Privacy (DP) into the
training process of deep neural networks. But this creates a trade-off between
the quality and privacy of the resulting data. Generative Adversarial Networks
(GANs) are the dominant paradigm for synthesising tabular data under DP, but
suffer from unstable adversarial training and mode collapse, which are
exacerbated by the privacy constraints and challenging tabular data modality.
This work optimises the quality-privacy trade-off of generative models,
producing higher quality tabular datasets with the same privacy guarantees. We
implement novel end-to-end models that leverage attention mechanisms to learn
reversible tabular representations. We also introduce TableDiffusion, the first
differentially-private diffusion model for tabular data synthesis. Our
experiments show that TableDiffusion produces higher-fidelity synthetic
datasets, avoids the mode collapse problem, and achieves state-of-the-art
performance on privatised tabular data synthesis. By training
TableDiffusion to predict the added noise, we enable it to bypass the
challenges of reconstructing mixed-type tabular data. Overall, the diffusion
paradigm proves vastly more data- and privacy-efficient than the adversarial
paradigm, due to augmented re-use of each data batch and a smoother iterative
training process.
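The noise-prediction objective mentioned in this abstract can be sketched in the generic DDPM style. This is not TableDiffusion's implementation: the linear beta schedule, its default values, and the names `linear_alpha_bar`, `add_noise`, and `noise_prediction_loss` are all assumptions made for illustration, and the differential-privacy mechanism is omitted entirely.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule.

    Assumption: num_steps >= 2 and the schedule values are illustrative.
    """
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(x0, alpha_bar_t, rng):
    """Forward diffusion: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    xt = [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]
    return xt, eps


def noise_prediction_loss(predicted_eps, eps):
    """Mean squared error between predicted and true noise."""
    return sum((a - b) ** 2 for a, b in zip(predicted_eps, eps)) / len(eps)


schedule = linear_alpha_bar(10)
row = [1.0, -2.0, 0.5]  # a hypothetical already-encoded table row
noisy_row, true_noise = add_noise(row, schedule[5], random.Random(0))
```

Because the regression target is the Gaussian noise rather than the row itself, the loss is the same mean squared error regardless of which columns were originally categorical or numerical.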
The Representation Jensen-Shannon Divergence
Statistical divergences quantify the difference between probability
distributions and find many uses in machine learning. However, a fundamental
challenge is to estimate divergences from empirical samples, since the underlying
distributions of the data are usually unknown. In this work, we propose the
representation Jensen-Shannon Divergence, a novel divergence based on
covariance operators in reproducing kernel Hilbert spaces (RKHS). Our approach
embeds the data distributions in an RKHS and exploits the spectrum of the
covariance operators of the representations. We provide an estimator from
empirical covariance matrices by explicitly mapping the data to an RKHS using
Fourier features. This estimator is flexible, scalable, differentiable, and
suitable for minibatch-based optimization problems. Additionally, we provide an
estimator based on kernel matrices without having an explicit mapping to the
RKHS. We show that this quantity is a lower bound on the Jensen-Shannon
divergence, and we propose a variational approach to estimate it. We applied
our divergence to two-sample testing, outperforming related state-of-the-art
techniques on several datasets. We also used the representation Jensen-Shannon
divergence as a cost function to train generative adversarial networks, which
intrinsically avoids mode collapse and encourages diversity.
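For reference, the classical Jensen-Shannon divergence that the proposed quantity is said to lower-bound can be sketched for discrete distributions as follows. This is the textbook definition, not the RKHS-based estimator from the abstract; the function names are hypothetical.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def jensen_shannon_divergence(p, q):
    """Classical JSD: H((p + q) / 2) - (H(p) + H(q)) / 2.

    Assumption: p and q are same-length probability vectors.
    """
    mixture = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return shannon_entropy(mixture) - 0.5 * (
        shannon_entropy(p) + shannon_entropy(q)
    )


jsd_same = jensen_shannon_divergence([0.3, 0.7], [0.3, 0.7])
jsd_disjoint = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
```

The divergence is symmetric, vanishes for identical distributions, and reaches its maximum of ln 2 nats when the supports are disjoint.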