75 research outputs found

    Private and Collaborative Kaplan-Meier Estimators

    Kaplan-Meier estimators capture the survival behavior of a cohort and are one of the key statistics in survival analysis. As with any estimator, they become more accurate in the presence of larger datasets, which motivates multiple data holders to share their data in order to calculate a more accurate Kaplan-Meier estimator. However, these survival datasets often contain sensitive information about individuals, and it is the responsibility of the data holders to protect their data, so naive sharing is often not viable. In this work, we propose two novel differentially private schemes, facilitated by our novel synthetic dataset generation method. Based on these schemes, we propose various paths that allow a joint estimation of Kaplan-Meier curves with strict privacy guarantees. Our contribution includes a taxonomy of methods for this task and an extensive experimental exploration and evaluation based on this structure. We show that we can construct a joint, global Kaplan-Meier estimator that satisfies very tight privacy guarantees with no statistically significant utility loss compared to the non-private centralized setting.
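
    For readers unfamiliar with the estimator itself, the sketch below computes a basic Kaplan-Meier curve and adds naive Laplace noise to per-time event counts. This only illustrates the ingredients involved; it is not the paper's differentially private schemes, and the function names, noise placement, and sensitivity bound are assumptions.

        # A minimal sketch, not the paper's schemes: a basic Kaplan-Meier estimate
        # plus naive Laplace noise on per-time event counts. Names and the noise
        # placement are illustrative assumptions.
        import numpy as np

        def kaplan_meier(times, events):
            """Return distinct event times and the survival probabilities S(t)."""
            times, events = np.asarray(times, float), np.asarray(events, int)
            out_t, surv, s = [], [], 1.0
            for t in np.unique(times):
                n_at_risk = (times >= t).sum()          # subjects still under observation
                d = events[times == t].sum()            # observed events at time t
                if d > 0:
                    s *= 1.0 - d / n_at_risk            # product-limit update
                    out_t.append(t)
                    surv.append(s)
            return np.array(out_t), np.array(surv)

        def noisy_event_counts(counts, epsilon, rng=None):
            """Naive epsilon-DP perturbation of event counts (sensitivity 1 per count)."""
            rng = rng or np.random.default_rng(0)
            return np.asarray(counts, float) + rng.laplace(scale=1.0 / epsilon, size=len(counts))

        t, s = kaplan_meier([2, 4, 4, 5, 6, 6], [1, 1, 0, 1, 0, 1])
        print(dict(zip(t, np.round(s, 3))))             # {2.0: 0.833, 4.0: 0.667, ...}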

    Pinocchio-Based Adaptive zk-SNARKs and Secure/Correct Adaptive Function Evaluation

    Pinocchio is a practical zk-SNARK that allows a prover to perform cryptographically verifiable computations with verification effort sometimes less than performing the computation itself. A recent proposal showed how to make Pinocchio adaptive (or "hash-and-prove"), i.e., to enable proofs with respect to computation-independent commitments. This enables computations to be chosen after the commitments have been produced, and data to be shared across different computations in a flexible way. Unfortunately, this proposal is not zero-knowledge. In particular, it cannot be combined with Trinocchio, a system in which Pinocchio is outsourced to three workers that do not learn the inputs thanks to multi-party computation (MPC). In this paper, we show how to make Pinocchio adaptive in a zero-knowledge way; apply it to make Trinocchio work on computation-independent commitments; present tooling to easily program flexible verifiable computations (with or without MPC); and use it to build a prototype in a medical research case study.
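
    As rough intuition for the hash-and-prove idea (commit to data first, choose the computation later), the toy sketch below replaces the zk-SNARK machinery with plain re-execution; it is neither sound nor zero-knowledge and is not the paper's construction.

        # Toy illustration of "hash-and-prove" only: data is committed before any
        # computation is chosen, and several computations can later be checked
        # against the same commitment. The real proof is replaced by re-execution.
        import hashlib, json

        def commit(data, nonce):
            """Computation-independent commitment: hash of the data plus a nonce."""
            return hashlib.sha256(json.dumps([data, nonce]).encode()).hexdigest()

        def prove(data, nonce, computation):
            """Stand-in for a proof: reveal the data so the verifier can recompute."""
            return {"data": data, "nonce": nonce, "result": computation(data)}

        def verify(commitment, proof, computation):
            return (commit(proof["data"], proof["nonce"]) == commitment
                    and computation(proof["data"]) == proof["result"])

        c = commit([3, 1, 4, 1, 5], nonce=42)
        mean = lambda xs: sum(xs) / len(xs)     # computation chosen after committing
        print(verify(c, prove([3, 1, 4, 1, 5], 42, mean), mean))   # True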

    Scaling Survival Analysis in Healthcare with Federated Survival Forests: A Comparative Study on Heart Failure and Breast Cancer Genomics

    Survival analysis is a fundamental tool in medicine, modeling the time until an event of interest occurs in a population. However, in real-world applications, survival data are often incomplete, censored, distributed, and confidential, especially in healthcare settings where privacy is critical. The scarcity of data can severely limit the scalability of survival models to distributed applications that rely on large data pools. Federated learning is a promising technique that enables machine learning models to be trained on multiple datasets without compromising user privacy, making it particularly well suited for addressing the challenges of survival data and large-scale survival applications. Despite significant developments in federated learning for classification and regression, many directions remain unexplored in the context of survival analysis. In this work, we propose an extension of the Federated Survival Forest algorithm, called FedSurF++, a federated ensemble method that constructs random survival forests in heterogeneous federations. Specifically, we investigate several new methods for sampling trees from client forests and compare the results with state-of-the-art survival models based on neural networks. The key advantage of FedSurF++ is its ability to achieve performance comparable to existing methods while requiring only a single communication round to complete. Our extensive empirical investigation shows significant improvements from both the algorithmic and privacy-preservation perspectives, making the original FedSurF algorithm more efficient, robust, and private. We also present results on two real-world datasets demonstrating the success of FedSurF++ in real-world healthcare studies. Our results underscore the potential of FedSurF++ to improve the scalability and effectiveness of survival analysis in distributed settings while preserving user privacy.
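
    A minimal sketch of the single-round pattern described above (clients fit local survival forests, the server builds a global ensemble by sampling their trees) follows. The "trees" here are one-split survival stumps with Kaplan-Meier leaves rather than full random survival forests, and all names and the uniform sampling rule are illustrative assumptions, not the FedSurF++ implementation.

        # Single-round federated survival ensemble, heavily simplified.
        import numpy as np

        def km_curve(time, event, grid):
            """Kaplan-Meier survival probabilities evaluated on a common time grid."""
            s, out = 1.0, []
            for t in grid:
                n_at_risk = (time >= t).sum()
                d = event[time == t].sum()
                if n_at_risk > 0 and d > 0:
                    s *= 1.0 - d / n_at_risk
                out.append(s)
            return np.array(out)

        class SurvivalStump:
            """One random split; each side stores a Kaplan-Meier curve as its leaf."""
            def fit(self, X, time, event, grid, rng):
                self.feature = rng.integers(X.shape[1])
                self.threshold = np.median(X[:, self.feature])
                left = X[:, self.feature] <= self.threshold
                self.curves = (km_curve(time[left], event[left], grid),
                               km_curve(time[~left], event[~left], grid))
                return self

            def predict_survival(self, X):
                left = X[:, self.feature] <= self.threshold
                return np.where(left[:, None], self.curves[0], self.curves[1])

        def client_forest(X, time, event, grid, n_trees, seed):
            """Local training on one client's data, with bootstrap resampling."""
            rng = np.random.default_rng(seed)
            boot = lambda: rng.integers(len(X), size=len(X))
            return [SurvivalStump().fit(X[i], time[i], event[i], grid, rng)
                    for i in (boot() for _ in range(n_trees))]

        def server_sample(client_forests, n_global, seed=0):
            """Single communication round: sample trees uniformly from all clients."""
            pool = [t for forest in client_forests for t in forest]
            idx = np.random.default_rng(seed).choice(len(pool), n_global, replace=False)
            return [pool[i] for i in idx]

        def ensemble_survival(trees, X):
            return np.mean([t.predict_survival(X) for t in trees], axis=0)

        # Toy federation with three clients holding synthetic (X, time, event) data.
        rng = np.random.default_rng(1)
        grid = np.arange(1, 11)
        clients = []
        for seed in range(3):
            X = rng.normal(size=(80, 4))
            time = rng.integers(1, 11, size=80).astype(float)
            event = rng.integers(0, 2, size=80)
            clients.append(client_forest(X, time, event, grid, n_trees=10, seed=seed))
        global_model = server_sample(clients, n_global=15)
        print(ensemble_survival(global_model, rng.normal(size=(5, 4))).shape)   # (5, 10)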

    Personalising lung cancer screening with machine learning

    Personalised screening is based on a straightforward concept: repeated risk assessment linked to tailored management. However, delivering such programmes at scale is complex. In this work, I aimed to contribute to two areas: the simplification of risk assessment to facilitate the implementation of personalised screening for lung cancer, and the use of synthetic data to support privacy-preserving analytics in the absence of access to patient records. I first present parsimonious machine learning models for lung cancer screening, demonstrating an approach that couples the performance of model-based risk prediction with the simplicity of risk-factor-based criteria. I trained models to predict the five-year risk of developing or dying from lung cancer using UK Biobank and US National Lung Screening Trial participants, before external validation amongst temporally and geographically distinct ever-smokers in the US Prostate, Lung, Colorectal and Ovarian Screening trial. I found that three predictors (age, smoking duration, and pack-years) within an ensemble machine learning framework achieved or exceeded parity in discrimination, calibration, and net benefit with comparators. Furthermore, I show that these models are more sensitive than risk-factor-based criteria, such as those currently recommended by the US Preventive Services Task Force. For the implementation of more personalised healthcare, researchers and developers require ready access to high-quality datasets. As such data are sensitive, their use is subject to tight control, whilst the majority of data present in electronic records are not available for research use. Synthetic data are algorithmically generated but can maintain the statistical relationships present within an original dataset. In this work, I used explicitly privacy-preserving generators to create synthetic versions of the UK Biobank before performing exploratory data analysis and prognostic model development. Comparing results obtained on the synthetic and real datasets, I show the potential for synthetic data in facilitating prognostic modelling.
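
    To make the modelling setup concrete, the sketch below fits a small ensemble on the three predictors named in the abstract (age, smoking duration, pack-years). The data are synthetic and the ensemble composition is an assumption for illustration, not the thesis's actual model or cohorts.

        # Illustrative ensemble risk model on three predictors, using synthetic data.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import roc_auc_score

        rng = np.random.default_rng(0)
        n = 2000
        age = rng.uniform(50, 80, n)
        duration = rng.uniform(0, 50, n)                     # years smoked
        pack_years = duration * rng.uniform(0.2, 2.0, n)     # packs/day * years
        X = np.column_stack([age, duration, pack_years])
        # Synthetic five-year outcome: risk increases with all three predictors.
        logit = -8 + 0.05 * age + 0.03 * duration + 0.02 * pack_years
        y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

        ensemble = VotingClassifier(
            estimators=[("lr", LogisticRegression(max_iter=1000)),
                        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                        ("gb", GradientBoostingClassifier(random_state=0))],
            voting="soft",
        )
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        ensemble.fit(X_tr, y_tr)
        print("AUC:", roc_auc_score(y_te, ensemble.predict_proba(X_te)[:, 1]))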

    Capsule Network-based Radiomics: From Diagnosis to Treatment

    Recent advancements in signal processing and machine learning, coupled with developments in electronic medical record keeping in hospitals, have resulted in a surge of interest in "radiomics". Radiomics is an emerging and relatively new research field, which refers to semi-quantitative and/or quantitative features extracted from medical images with the goal of developing predictive and/or prognostic models. Radiomics is expected to become a critical component in the integration of image-derived information for personalized treatment in the near future. The conventional radiomics workflow is typically based on extracting pre-designed features (also referred to as hand-crafted or engineered features) from a segmented region of interest. Clinical application of hand-crafted radiomics is, however, limited by the fact that features are pre-defined and extracted without taking the desired outcome into account. This drawback has motivated trends towards the development of deep learning-based radiomics (also referred to as discovery radiomics), which has the advantage of learning the desired features on its own in an end-to-end fashion and has several applications in disease prediction/diagnosis. In this Ph.D. thesis, we develop deep learning-based architectures to address critical challenges identified within the radiomics domain. First, we cover the tumor type classification problem, which is of high importance for treatment selection. We address this problem by designing a capsule network-based architecture that has several advantages over existing solutions, such as eliminating the need for access to huge amounts of training data and the capability to learn input transformations on its own. We apply different modifications to the capsule network architecture to make it more suitable for radiomics: on the one hand, we equip the proposed architecture with access to the tumor bounding box, and on the other hand, we design a multi-scale capsule network architecture. Furthermore, capitalizing on the advantages of ensemble learning paradigms, we design a boosting capsule network and a mixture-of-experts capsule network. A Bayesian capsule network is also developed to capture the uncertainty of the tumor classification. Besides knowing the tumor type (through classification), predicting the patient's response to treatment plays an important role in treatment design. Predicting the patient's response, including survival and tumor recurrence, is another goal of this thesis, which we address by designing a deep learning-based model that takes not only medical images but also different clinical factors (such as age and gender) as inputs. Finally, COVID-19 diagnosis, another challenging and crucial problem within the radiomics domain, is addressed using both X-ray and Computed Tomography (CT) images (in particular low-dose ones), where two in-house datasets are collected for the latter and different capsule network-based models are developed for COVID-19 diagnosis.
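
    For context on the capsule-network building block referenced throughout, the sketch below shows the squash nonlinearity and dynamic routing-by-agreement of Sabour et al.; shapes and iteration counts are illustrative only, and this is not the thesis's modified architectures.

        # Minimal numpy sketch of the capsule "squash" nonlinearity and dynamic routing.
        import numpy as np

        def squash(s, axis=-1, eps=1e-8):
            """Shrink vector length into (0, 1) while preserving direction."""
            norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
            return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

        def dynamic_routing(u_hat, n_iter=3):
            """u_hat: predictions of shape (n_in, n_out, dim_out). Returns output capsules."""
            n_in, n_out, _ = u_hat.shape
            b = np.zeros((n_in, n_out))                                 # routing logits
            for _ in range(n_iter):
                c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)    # coupling coefficients
                s = (c[..., None] * u_hat).sum(axis=0)                  # weighted sum per output capsule
                v = squash(s)                                           # (n_out, dim_out)
                b += (u_hat * v[None]).sum(axis=-1)                     # agreement update
            return v

        u_hat = np.random.default_rng(0).normal(size=(32, 10, 16))
        print(dynamic_routing(u_hat).shape)   # (10, 16)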

    A comparison of the CAR and DAGAR spatial random effects models with an application to diabetics rate estimation in Belgium

    When hierarchically modelling an epidemiological phenomenon on a finite collection of sites in space, one must always take a latent spatial effect into account in order to capture the correlation structure that links the phenomenon to the territory. In this work, we compare two autoregressive spatial models that can be used for this purpose: the classical CAR model and the more recent DAGAR model. Unlike the former, the latter has a desirable property: its ρ parameter can be naturally interpreted as the average neighbor pair correlation and, in addition, can be directly estimated when the effect is modelled using a DAGAR rather than a CAR structure. As an application, we model the diabetics rate in Belgium in 2014 and show the adequacy of these models in predicting the response variable when no covariates are available.
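
    As a concrete reference point, the sketch below builds the standard proper-CAR precision matrix Q = τ(D − ρW) from a neighborhood matrix; the DAGAR precision, which is instead built from a directed acyclic ordering of the sites, is omitted. The example graph is a made-up toy, not the Belgian map used in the paper.

        # Proper CAR precision matrix from an adjacency (neighborhood) matrix W.
        import numpy as np

        def car_precision(W, rho, tau=1.0):
            """Proper CAR precision: tau * (D - rho * W), with D = diag(row sums of W)."""
            D = np.diag(W.sum(axis=1))
            return tau * (D - rho * W)

        # Toy map with 4 sites on a line: 1-2-3-4.
        W = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
        Q = car_precision(W, rho=0.9)
        print(np.linalg.eigvalsh(Q))   # all positive for |rho| < 1: Q is a valid precision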

    A Statistical Approach to the Alignment of fMRI Data

    Multi-subject functional Magnetic Resonance Imaging studies are critical: the anatomical and functional structure varies across subjects, so image alignment is necessary. We define a probabilistic model to describe functional alignment. By imposing a prior distribution, namely the matrix von Mises-Fisher distribution, on the orthogonal transformation parameter, anatomical information is embedded in the estimation of the parameters, i.e., combinations of spatially distant voxels are penalized. Real applications show an improvement in the classification and interpretability of the results compared to various functional alignment methods.
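
    For orientation, the sketch below shows the classical (non-Bayesian) orthogonal Procrustes alignment that underlies this kind of functional alignment: find the orthogonal R minimizing ||XR − Y||_F. The prior on R described in the abstract is omitted, and the data and shapes are synthetic placeholders.

        # Orthogonal Procrustes alignment of one subject's responses onto a reference.
        import numpy as np

        def procrustes_rotation(X, Y):
            """Orthogonal R (voxels x voxels) minimizing ||X @ R - Y||_F, via SVD."""
            U, _, Vt = np.linalg.svd(X.T @ Y)
            return U @ Vt

        rng = np.random.default_rng(0)
        Y = rng.normal(size=(100, 20))                        # reference: time points x voxels
        R_true, _ = np.linalg.qr(rng.normal(size=(20, 20)))   # hidden orthogonal transform
        X = Y @ R_true.T + 0.01 * rng.normal(size=Y.shape)    # second subject, rotated + noise
        R = procrustes_rotation(X, Y)
        print(np.linalg.norm(X @ R - Y))                      # small residual after alignment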

    Oesophageal cancer; staging, surgery and survival
