155 research outputs found

    Nonparametric false discovery rate control for identifying simultaneous signals

    It is frequently of interest to jointly analyze multiple sequences of multiple tests in order to identify simultaneous signals, defined as features tested in multiple studies whose test statistics are non-null in each. In many problems, however, the null distributions of the test statistics may be complicated or even unknown, and there do not currently exist any procedures that can be employed in these cases. This paper proposes a new nonparametric procedure that can identify simultaneous signals across multiple studies even without knowing the null distributions of the test statistics. The method is shown to asymptotically control the false discovery rate, and in simulations had excellent power and error control. In an analysis of gene expression and histone acetylation patterns in the brains of mice exposed to a conspecific intruder, it identified genes that were both differentially expressed and next to differentially accessible chromatin. The proposed method is available in the R package github.com/sdzhao/ssa

    Nonparametric False Discovery Rate Control for Identifying Simultaneous Signals

    It is frequently of interest to identify simultaneous signals, defined as features that exhibit statistical significance across each of several independent experiments. For example, genes that are consistently differentially expressed across experiments in different animal species can reveal evolutionarily conserved biological mechanisms. However, in some problems the test statistics corresponding to these features can have complicated or unknown null distributions. This paper proposes a novel nonparametric false discovery rate control procedure that can identify simultaneous signals even without knowing these null distributions. The method is shown, theoretically and in simulations, to asymptotically control the false discovery rate. It was also used to identify genes that were both differentially expressed and proximal to differentially accessible chromatin in the brains of mice exposed to a conspecific intruder. The proposed method is available in the R package github.com/sdzhao/ssa
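For orientation, the idea of a simultaneous signal can be sketched with a standard baseline. This is not the nonparametric procedure proposed in the paper: it assumes valid p-values are available, which means the null distributions are known, and that the two studies are independent. Each feature gets the larger of its two p-values, and the Benjamini-Hochberg step-up rule is applied to those maxima. The function names and the p-values are made up for illustration.

```python
from typing import List, Sequence


def max_p_values(p_study1: Sequence[float], p_study2: Sequence[float]) -> List[float]:
    """Combine two studies feature by feature using the larger p-value."""
    if len(p_study1) != len(p_study2):
        raise ValueError("both studies must test the same features")
    return [max(a, b) for a, b in zip(p_study1, p_study2)]


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return one rejection flag per p-value using the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that the k-th smallest p-value <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Made-up p-values for five features tested in two independent studies.
study1 = [0.001, 0.002, 0.40, 0.025, 0.0005]
study2 = [0.004, 0.60, 0.001, 0.02, 0.0001]

combined = max_p_values(study1, study2)
simultaneous = benjamini_hochberg(combined, alpha=0.05)
```

A feature is flagged only when both of its p-values are small, so features 2 and 3 (strong in one study only) are not reported. This baseline tends to be conservative, and it is unavailable when the null distributions are unknown, which is the setting the paper addresses.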

    A physical basis for quantitative ChIP-sequencing

    ChIP followed by next-generation sequencing (ChIP-Seq) is a key technique for mapping the distribution of histone posttranslational modifications (PTMs) and chromatin-associated factors across genomes. There is a perceived challenge to define a quantitative scale for ChIP-Seq data, and as such, several approaches making use of exogenous additives, or "spike-ins," have recently been developed. Herein, we report on the development of a quantitative, physical model defining ChIP-Seq. The quantitative scale on which ChIP-Seq results should be compared emerges from the model. To test the model and demonstrate the quantitative scale, we examine the impacts of an EZH2 inhibitor through the lens of ChIP-Seq. We report a significant increase in immunoprecipitation of presumed off-target histone PTMs after inhibitor treatment, a trend predicted by the model but contrary to spike-in-based indications. Our work also identifies a sensitivity issue in spike-in normalization that has not been considered in the literature, placing limitations on its utility and trustworthiness. We call our new approach the sans-spike-in method for quantitative ChIP-sequencing (siQ-ChIP). A number of changes in community practice of ChIP-Seq, data reporting, and analysis are motivated by this work

    Homologous recombination-deficient cancers: approaches to improve treatment and patient selection

    In order for cells to divide, all the DNA in a cell must be copied and divided into two new cells. However, DNA in our cells is constantly dealing with different types of damage, either from factors outside (e.g. UV rays in sunlight) or inside the body (e.g. due to errors that occur during the copying of the DNA). To ensure that this damage does not lead to permanent changes, cells have DNA damage repair mechanisms. An important mechanism is homologous recombination (HR), which repairs double-stranded DNA breaks. Without this mechanism, cells cannot survive. However, some cancers have a defect in HR. This is a paradox, because healthy cells do not survive without HR, while these cancer cells apparently do. BRCA1 and BRCA2 are two important genes in HR, and a BRCA1/2 mutation is associated with an increased risk of developing breast and ovarian cancer. In this thesis, models are used in which a BRCA1/2 defect is induced to study an HR defect in cancer. Since 2013, PARP inhibitors have been approved for the treatment of patients with BRCA1/2-mutated breast and ovarian cancer. However, a defect in HR can also be caused by other gene mutations, and these patients could also benefit from PARP inhibitors. In this thesis, a test is validated to select the right patients for PARP inhibitor treatment. The working mechanisms of PARP inhibitors are also being investigated to make treatment even more effective. In addition, the immune system plays an important role in cancers with an HR defect. Some of these mechanisms are described and investigated

    Uncovering rare genetic variants predisposing to coeliac disease

    PhD thesis. Coeliac disease is a common (1% prevalence) inflammatory disease of the small intestine, involving the role of tissue transglutaminase and HLA-DQ binding immuno-dominant wheat peptides. The disease is highly heritable; however, at most only 40% of this heritability is explained by HLA-DQ and risk variants from genome wide association and fine mapping studies. The hypothesis of the research in this thesis is that rare (minor allele frequency <0.5%) mutations of large effect size (odds ratios ~2 – 5) exist, especially in multiply affected pedigrees, which account for the missing heritability of disease. NimbleGen exome capture and Illumina GAIIx high throughput sequencing were performed in 75 coeliac disease individuals from 55 multiply affected families. Candidate genes were chosen from various analytical strategies: linkage, shared variants between multiple related subjects and gene burden tests for multiple potentially causal variants. Highly multiplexed amplicon sequencing, using Fluidigm technology, of all RefSeq exons from 24 candidate genes in 2,304 coeliac cases and 2,304 controls was performed to locate further rare variation. Gene burden tests on a highly stringent post quality control dataset identified no significant associations (P < 1 × 10^-3) at the resequenced candidate genes. The strategy of sequencing multiply affected families, and deep follow up of candidate genes, has not identified new disease risk mutations. Common variants (and other factors, e.g. environmental) may instead account for familial clustering in this common autoimmune disease
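As a rough illustration of what a gene burden test does, the sketch below uses one of the simplest variants: collapse all rare variants in a gene into a per-person carrier flag, then compare carrier counts between cases and controls with a two-sided Fisher exact test. The abstract does not say which burden test the thesis used, so this collapsing approach is an assumption, and the function names and genotypes are hypothetical.

```python
from math import comb
from typing import Sequence


def count_carriers(genotypes: Sequence[Sequence[int]]) -> int:
    """Count individuals carrying at least one rare allele in the gene.

    Each inner sequence holds one person's alternate-allele counts (0, 1 or 2)
    at the rare variants of a single gene.
    """
    return sum(1 for person in genotypes if any(g > 0 for g in person))


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x carriers falling in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Made-up genotypes at three rare variants of one gene.
cases = [[0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 2]]
controls = [[0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]

case_carriers = count_carriers(cases)
control_carriers = count_carriers(controls)
p_burden = fisher_exact_two_sided(
    case_carriers, len(cases) - case_carriers,
    control_carriers, len(controls) - control_carriers,
)
```

Collapsing trades per-variant resolution for power: individual rare variants are usually too infrequent to test one at a time, whereas the pooled carrier count per gene can be.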

    Transcription Factor-Mediated Epigenetic Regulation in the Healthy Brain and Neurological Disease

    Proper cellular development and function is a complex process established by elaborate gene expression networks. These networks are regulated by epigenetic processes, which alter chromatin states and coordinate the binding of transcription factors (TFs) to regulatory elements (REs), such as enhancers, across the genome to facilitate gene expression. It follows then that a major experimental effort is to profile and understand the binding patterns of TFs to REs in various cellular types and contexts. Critically however, current TF profiling techniques are limited in their abilities to profile TF occupancy in targeted cellular populations and temporal windows, hindering investigations into epigenetic control in complex, multicellular systems, such as the brain. This dissertation focuses on two related areas: firstly, the design of new tools for profiling TF genome occupancy in the mouse brain in specific cellular populations and time periods, and secondly, investigating TF-mediated mechanisms of disease pathogenesis in animal models. In Chapter 2, we describe the development of a novel, viral-mediated method, termed adeno-associated virus (AAV) calling cards, for profiling binding sites of TFs across the genome in the live mouse brain. The AAV calling cards approach allows unique access to TF occupancy information that is inaccessible with other existing techniques, including cell type specificity (through Cre-mediated conditional expression) and historical binding (through longitudinal occupancy recording). Then, in Chapters 3 and 4, we apply this new technique to mouse models to investigate epigenetic misregulation in disease. Previous studies have demonstrated that a large portion of genetic variation associated with cellular dysfunction or disease exists in TF-bound enhancers, demonstrating the criticality of proper TF binding in maintaining cellular homeostasis. However, whether these elements are misregulated more broadly in disease contexts is unclear. 
In Chapter 3, we apply AAV calling cards to a model of acute seizure and uncover aberrant epigenetic regulation which is predictive of phenotypic outcomes. Particularly important in this study is the ability of AAV calling cards to record and integrate historical TF binding information, allowing linkage of antecedent epigenetic events to eventual seizure outcomes. Here, we longitudinally recorded prodromal enhancer activity to identify loci which are predictive of seizure severity. Next, in Chapter 4, we investigate epigenetic regulation in animal models and postmortem tissues from individuals with amyotrophic lateral sclerosis (ALS). In this study, we focus on a subset of ALS caused by a large hexanucleotide (G4C2) repeat expansion in the gene chromosome 9 open reading frame 72 (C9orf72), which is the most common genetic cause of ALS (C9ALS). Utilizing AAV calling cards as well as other established epigenomic profiling techniques, we observe broad epigenetic misregulation both in C9ALS mouse models and human tissues at the transcriptional and translational levels. Importantly, the C9ALS mouse models used in this study do not develop motor neuron degeneration or ALS-like phenotypes and were profiled at an early age, suggesting that these changes occur early in the disease process and are likely driven by C9orf72-related pathologic species, such as dipeptide repeat proteins (DPRs). Finally, in Chapter 5 we investigate the characteristic properties of C9orf72-specific pathologies, including DPRs, in human C9ALS. We probed size and abundance of DNA expansions and DPRs in blood, cerebrospinal fluid, and postmortem tissues from C9ALS and sporadic ALS (sALS) individuals and identified novel correlations of C9ALS patient pathologies with clinical and demographic data. Moving forward, these data will facilitate mechanistic studies and clinical trials aimed at reducing or altering C9ALS pathologies in the central nervous system (CNS).
In summary, the body of work detailed here extends our knowledge of TFs in both the healthy and diseased central nervous system (CNS), providing new insights into the role of epigenetic regulation in disease pathogenesis. Further, the establishment of AAV calling cards as a widely applicable epigenomic tool will empower innovative new studies in a variety of tissue and model systems

    Investigating the role of nuclear encoded mitochondrial genes in the onset of type 2 diabetes

    Mitochondrial dysfunction has long been implicated in Type 2 diabetes (T2D). This relationship appears to be bidirectional, with evidence that mitochondrial dysfunction is both caused by and causal of T2D-related phenotypes. A potential causal role in T2D onset would be supported by evidence of a genetic predisposition to mitochondrial dysfunction, since inherited genetic risk factors precede and contribute to disease onset. Here, a genetic study design is used to investigate the potential role of T2D-associated genetic risk loci (T2D loci) in disrupting mitochondrial function through the altered expression of nuclear-encoded mitochondrial genes (NEMGs). The mitochondria are targeted by multiple T2D drugs and therefore such loci may be informative for effective treatment and prevention measures. The functional cis-genes regulated by T2D loci were identified based on the co-location of T2D loci with adipose tissue expression quantitative trait loci (eQTL) within a genetic distance of 1 LDU. T2D loci and eQTL were previously mapped using LDU-based gene mapping, which is compared and contrasted in this thesis to other popular tests of association. 50 of the identified T2D cis-genes were NEMGs and implicated a number of pathways in the inherited risk of T2D, including the relevant pathway of branched-chain amino acid catabolism. These same 50 genes were enriched for decreased expression in T2D cases compared to controls in independent gene expression datasets. Compared to the total known NEMGs, the 50 cis-NEMGs showed further enrichment for decreased expression, suggesting that T2D-eQTL co-location may identify specific subsets of causal genes. Finally, a candidate T2D locus associated with the cis-NEMG ACAD11 was fine-mapped using targeted sequence data for 94 T2D cases and 94 controls. Several candidate causal variants were identified, including two low-frequency haplotypes, one of which contained both an ACAD11 splicing mutation and a mutation predicted to disrupt the observed binding of HNF4A and COUP-TFII within the ACAD11 promoter region.

    Studying the effects of genetic factors on the female reproductive lifespan

    The objective of my research was to investigate the rare and very-rare genetic factors influencing female reproductive ageing in humans using large-scale population exome-sequencing data. Over the past decade, most studies have relied on non-sequencing genomic data, which only allowed analysis of common genomic variants. However, these genome-wide array studies have limitations in capturing the complete range of genetic variation. Consequently, our understanding of the role of rare genomic variants, which may have a significant impact on menopause timing, has been limited. Furthermore, comprehensive studies exploring genetic factors associated with menopause age, particularly early and very early menopause, have been limited by the lack of large-scale sequencing genomic data, such as population-based datasets. Most of the previously published research has been derived from clinical and family studies, and there has been a dearth of population-based studies that can validate and identify novel genomic factors using a cohort of healthy individuals. Consequently, my aim was to utilise population whole-exome sequencing data for the first time to advance our understanding of genomic factors that impact female reproductive lifespan. In Chapter 1, I provide an introduction to the biology of menopause. I emphasise the importance of studying menopause timing and the revolutionary impact of using population sequencing genomic data to improve our understanding of the underlying genomic causes of menopause timing. Chapter 2 comprises analysis focusing on the correlation between bone morphogenetic protein 15 (BMP15) and its previously reported variants in relation to menopause timing. The BMP15 gene and its missense variants have been identified as a potential candidate for premature ovarian insufficiency (POI) based on prior investigations. However, our study revealed no evidence of the previously reported variants being causative factors for POI. 
Furthermore, when conducting a gene burden association test, we found no significant association between various types of BMP15 variants and early menopause. Chapter 3 builds on the previous chapter and presents an in-depth analysis aimed at assessing the penetrance of over 100 genes associated with premature ovarian insufficiency (POI). The findings of this investigation provide limited evidence supporting the existence of autosomal dominant effects in the reported POI genes. Surprisingly, the vast majority of heterozygous effects on these genes were ruled out, with 99.9% of all protein-truncating variants being observed in women with normal reproductive health. However, we did observe evidence of haploinsufficiency effects in certain genes, including TWNK and SOHLH2. Chapter 4 is an exome-wide association study to identify rare genetic variants associated with menopause timing. We identified effects ~5 times larger than previously discovered in analyses of common variants, highlighting protein-coding variants in ETAA1, ZNF518A, PNPLA8, PALB2 and SAMHD1. We found rare loss-of-function variants in the ZNF518A gene, which reduced menopause age by approximately six years. Chapter 5 concludes by assessing the contributions made by this study in advancing our comprehension of the variation in genetic risk factors associated with female reproductive lifespan. Additionally, it outlines potential directions for future research in this field, highlighting areas that warrant further exploration and investigation

    Discovering pathways to autism spectrum disorder by using functional and integrative genomics approaches to assess monozygotic twin differences

    Autism spectrum disorder (ASD) is a common developmental disorder typified by deficits in social communication and stereotyped behaviours. Despite evidence of a strong genetic basis to the disorder, molecular studies have thus far had little success in identifying risk variants or other biomarkers, and presently there is no unified pathomechanistic explanation. Monozygotic (MZ) twins show incomplete concordance in autistic traits, which suggests that alternative risk pathways involving non-shared environmental (NSE) factors could also have an important role to play in ASD. In this thesis, we describe microarray and RNA-seq studies characterising gene expression in a sample of 53 ASD MZ twin pairs from TEDS. The overall aims were to: 1) establish convergent evidence for genes and pathways involved in the etiology of ASD by comparing affected and unaffected subjects across the sample; and 2) identify those responsive to the environment by examining differences within the discordant pairs. We found that a number of genes were differentially expressed, including DEPDC1B, the most significant finding in cases vs controls, which also showed consistent downregulation within pairs. We further identified IGHG4, IGHG3, IGHV3-66, HSPA8P14, HSPA13, SLC15A2, and found that these results were enriched for transcriptional control, immune, and PI3K/AKT signalling pathways. We suggest that as these were found to be perturbed in the discordant twins, they could represent ASD risk pathways sensitive to the NSE. Next, we investigated integrative genomics methods for performing meta-dimensional analysis using the expression data along with methylation data on the same cohort. After applying regression-based joint analysis methods, and meta-analysis p-value combination methods to our datasets, a number of genes obtained nominal significance across the datasets, including potential genes of interest: NLGN2, UBE3A, OXTR.
We suggest these represent genes with evidence for being functionally relevant to ASD
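The abstract does not name the p-value combination method that was applied. One widely used choice is Fisher's method, sketched here as an assumption: for k independent p-values, the statistic -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom under the null, and for even degrees of freedom the tail probability has a closed form that needs no external library. The function name and the input values are hypothetical.

```python
import math
from typing import Sequence


def fisher_combine(p_values: Sequence[float]) -> float:
    """Combine independent p-values for one gene with Fisher's method."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Half of Fisher's statistic X = -2 * sum(log p_i).
    half_stat = -sum(math.log(p) for p in p_values)
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half_stat / i
        series += term
    return min(1.0, math.exp(-half_stat) * series)


# Made-up per-gene p-values from an expression test and a methylation test.
combined_p = fisher_combine([0.04, 0.03])
```

Fisher's method assumes the combined tests are independent. Expression and methylation measured on the same twins may be correlated, in which case the combined p-value would be anti-conservative.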

    Spatial statistical modelling of epigenomic variability

    Each cell in our body carries the same genetic information encoded in the DNA, yet the human organism contains hundreds of cell types which differ substantially in physiology and functionality. This variability stems from the existence of regulatory mechanisms that control gene expression, and hence phenotype. The field of epigenetics studies how changes in biochemical factors, other than the DNA sequence itself, might affect gene regulation. The advent of high throughput sequencing platforms has enabled the profiling of different epigenetic marks on a genome-wide scale; however, bespoke computational methods are required to interpret these high-dimensional data and investigate the coupling between the epigenome and transcriptome. This thesis contributes to the development of statistical models to capture spatial correlations of epigenetic marks, with the main focus being DNA methylation. To this end, we developed BPRMeth (Bayesian Probit Regression for Methylation), a probabilistic model for extracting higher order methylation features that precisely quantify the spatial variability of bulk DNA methylation patterns. Using such features, we constructed an accurate machine learning predictor of gene expression from DNA methylation and identified prototypical methylation profiles that explain most of the variability across promoter regions. The BPRMeth model and its algorithmic implementation were subsequently extended substantially, both to accommodate different data types and to improve the scalability of the algorithm. Bulk experiments have paved the way for mapping the epigenetic landscape; nonetheless, they fall short of explaining the epigenetic heterogeneity and quantifying its dynamics, which inherently occur at the single cell level.
Single cell bisulfite sequencing protocols have been recently developed; however, due to intrinsic limitations of the technology, they result in extremely sparse coverage of CpG sites, effectively limiting the analysis repertoire to a semi-quantitative level. To overcome these difficulties we developed Melissa (MEthyLation Inference for Single cell Analysis), a Bayesian hierarchical model that leverages local correlations between neighbouring CpGs and similarity between individual cells to jointly impute missing methylation states, and cluster cells based on their genome-wide methylation profiles. A recent experimental innovation enables the parallel profiling of DNA methylation, transcription and chromatin accessibility (scNMT-seq), making it possible to link transcriptional and epigenetic heterogeneity at the single cell resolution. For the scNMT-seq study, we applied the extended BPRMeth model to quantify cell-to-cell chromatin accessibility heterogeneity around promoter regions and subsequently link it to transcript abundance. This revealed that genes with conserved accessibility profiles are associated with higher average expression levels. In summary, this thesis proposes statistical methods to model and interpret epigenomic data generated from high throughput sequencing experiments. Due to their statistical power and flexibility we anticipate that these methods will be applicable to future sequencing technologies and become widespread tools in the high throughput bioinformatics workbench for performing biomedical data analysis