41 research outputs found

    A Bayesian Approach for Analysis of Whole-Genome Bisulfite Sequencing Data Identifies Disease-Associated Changes in DNA Methylation

    DNA methylation is a key epigenetic modification involved in gene regulation whose contribution to disease susceptibility remains to be fully understood. Here, we present a novel Bayesian smoothing approach (called ABBA) to detect differentially methylated regions (DMRs) from whole-genome bisulfite sequencing (WGBS). We also show how this approach can be leveraged to identify disease-associated changes in DNA methylation, suggesting mechanisms through which these alterations might affect disease. From a data modeling perspective, ABBA has the distinctive feature of automatically adapting to different correlation structures in CpG methylation levels across the genome while taking into account the distance between CpG sites as a covariate. Our simulation study shows that ABBA has greater power to detect DMRs than existing methods, providing an accurate identification of DMRs in the large majority of simulated cases. To empirically demonstrate the method’s efficacy in generating biological hypotheses, we performed WGBS of primary macrophages derived from an experimental rat system of glomerulonephritis and used ABBA to identify >1000 disease-associated DMRs. Investigation of these DMRs revealed differential DNA methylation localized to a 600 bp region in the promoter of the Ifitm3 gene. This was confirmed by ChIP-seq and RNA-seq analyses, showing differential transcription factor binding at the Ifitm3 promoter by JunD (an established determinant of glomerulonephritis), and a consistent change in Ifitm3 expression. Our ABBA analysis allowed us to propose a new role for Ifitm3 in the pathogenesis of glomerulonephritis via a mechanism involving promoter hypermethylation that is associated with Ifitm3 repression in the rat strain susceptible to glomerulonephritis.

    Statistical Methods for Bayesian Clinical Trial Design and DNA Methylation Deconvolution

    We consider the Bayesian clinical trial design problem in situations where a historical trial is available to inform the design and analysis of a future trial. Currently the FDA requires that all proposed designs exhibit reasonable type I error control. Traditionally, frequentist type I error control has been required. This is currently the case in the Center for Drug Evaluation and Research but no longer in the Center for Devices and Radiological Health. The requirement that a design exhibit frequentist type I error control necessitates that all prior information be discarded. We propose several Bayesian solutions that balance the need to control type I errors with the desire to utilize high quality prior information. For scenarios where the historical trial informs the parameter being tested, we propose Bayesian versions of the type I error rate and power that are defined with respect to the posterior distribution for the parameters given the historical data and conditional on the respective hypothesis being true. We demonstrate that in designs that control the Bayesian type I error rate, meaningful amounts of prior information can be borrowed but that the size of the new trial must be relatively large to justify borrowing a large amount of historical information. We tailor our design methodology for survival applications using proportional hazards and cure rate models. We also develop Bayesian adaptive designs for large cardiovascular outcomes trials (CVOTs) which incorporate control information from a historical CVOT conducted in a similar patient population. We propose an all-or-nothing adaptive design utilizing the power prior as well as a dynamic borrowing adaptive design utilizing a novel extension of the joint power prior. Separately, we present a statistical deconvolution method for DNA methylation data from bisulfite sequencing experiments. 
We propose a joint model for methylation data from a set of heterogeneous tissue samples and another set of reference tissue samples. Unlike other methylation deconvolution methods, our method allows one to estimate the heterogeneous tissue composition and provides improved estimates of cell type-specific methylation levels through the process of deconvolution. We demonstrate our method using data from DNA mixture tissues and simulation studies.

    COUPLING MACHINE LEARNING WITH FIDUCIAL INFERENCE, GENETICS AND EPIGENETICS

    This dissertation consists of three research topics. In the first part, we present deep fiducial inference and an approximate fiducial computation (AFC) algorithm. Since the mid-2000s, there has been a resurgence of interest in modern modifications of fiducial inference. To date, the main computational tool to extract a generalized fiducial distribution has been Markov chain Monte Carlo (MCMC). We propose an alternative way of computing a generalized fiducial distribution that could be used in complex situations. In particular, to overcome the difficulty when the unnormalized fiducial density (needed for MCMC) is intractable, we design a fiducial autoencoder (FAE). The fitted FAE is used to generate generalized fiducial samples of the unknown parameters. To increase accuracy, we then apply the AFC algorithm, rejecting samples that do not replicate the observed data well enough when plugged into a decoder. Our numerical experiments show the effectiveness of our FAE-based inverse solution and the excellent coverage performance of the AFC-corrected FAE solution. In the second part, we present SMNN, a supervised mutual nearest neighbor method, for batch effect correction in single-cell RNA-sequencing (scRNA-seq) data. Batch effect correction has been recognized to be indispensable when integrating scRNA-seq data from multiple batches. State-of-the-art methods ignore single-cell cluster label information, but such information can improve the effectiveness of batch effect correction, particularly under realistic scenarios where biological differences are not orthogonal to batch effects. To address this issue, we propose SMNN for batch effect correction of scRNA-seq data via supervised mutual nearest neighbor detection. 
Our extensive evaluations in simulated and real datasets show that SMNN provides improved merging within the corresponding cell types across batches, leading to reduced differentiation across batches compared with alternative methods including MNN, Seurat v3 and LIGER. Furthermore, SMNN retains more cell-type-specific features, partially manifested by differentially expressed genes identified between cell types after SMNN correction being biologically more relevant, with precision improving by up to 841.0%. In the third part, we present an ensemble imputation framework for DNA methylation across different platforms. DNA methylation at CpG dinucleotides is a biological process by which methyl groups are added to the DNA molecule. It is one of the most extensively studied epigenetic marks. With technological advancements, geneticists can profile DNA methylation with multiple reliable approaches. However, different profiling platforms can differ substantially in the density and measurements for the CpGs they assess, consequently hindering joint analysis across platforms. For this project, we focus on the two most commonly used commercial methylation platforms from Illumina, specifically aiming to impute from the HumanMethylation450 (HM450) BeadChip to ~850K CpG sites on the HumanMethylationEPIC (HM850) BeadChip. We present CUE, CpG imputation Ensemble, which ensembles multiple classical statistical and modern machine learning methods. Our results highlight CUE as a valuable tool for imputing from HM450 to HM850.

    Spatial statistical modelling of epigenomic variability

    Each cell in our body carries the same genetic information encoded in the DNA, yet the human organism contains hundreds of cell types which differ substantially in physiology and functionality. This variability stems from the existence of regulatory mechanisms that control gene expression, and hence phenotype. The field of epigenetics studies how changes in biochemical factors, other than the DNA sequence itself, might affect gene regulation. The advent of high throughput sequencing platforms has enabled the profiling of different epigenetic marks on a genome-wide scale; however, bespoke computational methods are required to interpret these high-dimensional data and investigate the coupling between the epigenome and transcriptome. This thesis contributes to the development of statistical models to capture spatial correlations of epigenetic marks, with the main focus being DNA methylation. To this end, we developed BPRMeth (Bayesian Probit Regression for Methylation), a probabilistic model for extracting higher order methylation features that precisely quantify the spatial variability of bulk DNA methylation patterns. Using such features, we constructed an accurate machine learning predictor of gene expression from DNA methylation and identified prototypical methylation profiles that explain most of the variability across promoter regions. The BPRMeth model and its algorithmic implementation were subsequently extended substantially, both to accommodate different data types and to improve the scalability of the algorithm. Bulk experiments have paved the way for mapping the epigenetic landscape; nonetheless, they fall short of explaining the epigenetic heterogeneity and quantifying its dynamics, which inherently occur at the single cell level. 
Single cell bisulfite sequencing protocols have been recently developed; however, due to intrinsic limitations of the technology, they result in extremely sparse coverage of CpG sites, effectively limiting the analysis repertoire to a semi-quantitative level. To overcome these difficulties, we developed Melissa (MEthyLation Inference for Single cell Analysis), a Bayesian hierarchical model that leverages local correlations between neighbouring CpGs and similarity between individual cells to jointly impute missing methylation states and cluster cells based on their genome-wide methylation profiles. A recent experimental innovation enables the parallel profiling of DNA methylation, transcription and chromatin accessibility (scNMT-seq), making it possible to link transcriptional and epigenetic heterogeneity at single cell resolution. For the scNMT-seq study, we applied the extended BPRMeth model to quantify cell-to-cell chromatin accessibility heterogeneity around promoter regions and subsequently link it to transcript abundance. This revealed that genes with conserved accessibility profiles are associated with higher average expression levels. In summary, this thesis proposes statistical methods to model and interpret epigenomic data generated from high throughput sequencing experiments. Due to their statistical power and flexibility, we anticipate that these methods will be applicable to future sequencing technologies and become widespread tools in the high throughput bioinformatics workbench for performing biomedical data analysis.

    Health privacy: methods for privacy-preserving data sharing of methylation, microbiome and eye tracking data

    This thesis studies the privacy risks of biomedical data and develops mechanisms for privacy-preserving data sharing. The contribution of this work is two-fold. First, we demonstrate the privacy risks of a variety of biomedical data types such as DNA methylation data, microbiome data and eye tracking data. Although these data are less stable than well-studied genome data and more prone to environmental changes, well-known privacy attacks can be adapted to them and threaten the privacy of data donors. Nevertheless, data sharing is crucial to advance biomedical research, given that collecting data from a sufficiently large population is complex and costly. Therefore, as a second step, we develop privacy-preserving tools that enable researchers to share such biomedical data. These tools are mostly based on differential privacy, machine learning techniques and adversarial examples, and are carefully tuned to the concrete use case to maintain data utility while preserving privacy.

    The Reasonable Effectiveness of Randomness in Scalable and Integrative Gene Regulatory Network Inference and Beyond

    Gene regulation is orchestrated by a vast number of molecules, including transcription factors and co-factors, chromatin regulators, as well as epigenetic mechanisms, and it has been shown that transcriptional misregulation, e.g., caused by mutations in regulatory sequences, is responsible for a plethora of diseases, including cancer and developmental or neurological disorders. As a consequence, decoding the architecture of gene regulatory networks has become one of the most important tasks in modern (computational) biology. However, to advance our understanding of the mechanisms involved in the transcriptional apparatus, we need scalable approaches that can deal with the increasing number of large-scale, high-resolution, biological datasets. In particular, such approaches need to be capable of efficiently integrating and exploiting the biological and technological heterogeneity of such datasets in order to best infer the underlying, highly dynamic regulatory networks, often in the absence of sufficient ground truth data for model training or testing. With respect to scalability, randomized approaches have proven to be a promising alternative to deterministic methods in computational biology. As an example, one of the top performing algorithms in a community challenge on gene regulatory network inference from transcriptomic data is based on a random forest regression model. In this concise survey, we aim to highlight how randomized methods may serve as a highly valuable tool, in particular as increasing amounts of large-scale biological experiments and datasets are being collected. Given the complexity and interdisciplinary nature of the gene regulatory network inference problem, we hope our survey may be helpful to both computational and biological scientists. 
It is our aim to provide a starting point for a dialogue about the concepts, benefits, and caveats of the toolbox of randomized methods, since unravelling the intricate web of highly dynamic, regulatory events will be one fundamental step in understanding the mechanisms of life and eventually developing efficient therapies to treat and cure diseases.

    Quantifying and mitigating privacy risks in biomedical data

    The decreasing costs of molecular profiling have supplied the biomedical research community with a plethora of new types of biomedical data, allowing for a breakthrough towards a more precise and personalized medicine. However, the release of these intrinsically highly sensitive, interdependent data poses a severe new privacy threat. So far, the security community has mostly focused on privacy risks arising from genomic data. 
However, the manifold privacy risks stemming from other types of biomedical data, and epigenetic data in particular, have been largely overlooked. In this thesis, we provide means to quantify and protect the privacy of individuals’ biomedical data. Besides the genome, we specifically focus on two of the most important epigenetic elements influencing human health: microRNAs and DNA methylation. We quantify the privacy for multiple realistic attack scenarios, namely, (1) linkability attacks along the temporal dimension, between different types of data, and between related individuals, (2) membership attacks, and (3) inference attacks. Our results underline that the privacy risks inherent to biomedical data have to be taken seriously. Moreover, we present and evaluate solutions to preserve the privacy of individuals. Our mitigation techniques stretch from the differentially private release of epigenetic data, considering its utility, up to cryptographic constructions to securely and privately evaluate a random forest on a patient’s data.

    Development of digital PCR DNA methylation assays for blood plasma-based diagnosis of lung cancer

    Lung cancer is the leading cause of cancer-related death and is usually diagnosed at an advanced stage, leading to poor patient survival. Therefore, there is a pressing need for early detection of disease. DNA methylation is an early event in carcinogenesis, and a limited number of diagnostic markers have been developed for clinical use. This thesis seeks to address whether the development and application of novel DNA methylation assays can diagnose lung cancer at an early stage. Previously identified DNA methylation biomarkers, along with novel targets identified by methylation microarray, were developed in multiplex assay format. Twelve markers were used to screen 417 bronchoalveolar lavage specimens from Liverpool Lung Project (LLP) subjects divided into training and validation sets. The optimal biomarker panel (CDKN2A, RARB and TERT) demonstrated improved clinical sensitivity and specificity (Sensitivity/Specificity: 85.7%/93.8%, AUC: 0.91) compared to previous studies. The optimal methylation algorithm detected more than 60% of stage T1 tumours and 93 cytologically occult lung cancer cases. Eight methylated DNA assays were optimised for use with the newly developed Droplet Digital™ PCR (ddPCR) platform and a targeted pre-amplification technique, MethPlex enrichment, was developed. I established a comprehensive analytical framework to compare performance of methylation-specific ddPCR and quantitative methylation-specific PCR directly and in combination with MethPlex enrichment. ddPCR demonstrated greater precision and linearity, lower limit of detection (WT1 MethPlex ddPCR LOD95 = 1.86 GE), and discriminated twofold differences in methylated DNA input. MethPlex ddPCR detected DNA methylation more frequently in lung cancer patient plasma than in controls in a retrospective case-control study. Technical methylation controls were consistently and precisely detected at inputs as low as 3 methylated copies. 
Discriminatory efficiency of marker combinations was inadequate, presumably due to limitations in DNA extraction methodology. DNA methylation biomarker diagnostic performance in bronchoalveolar lavage merits further validation in a prospective trial. MethPlex ddPCR analysis showed great promise, demonstrating highly sensitive DNA methylation detection in technical assessment. It is expected that appropriate DNA extraction procedures and higher cfDNA yields will lead to much improved clinical discriminatory efficiency.