103 research outputs found

    Biological Role and Disease Impact of Copy Number Variation in Complex Disease

    In the human genome, DNA variants give rise to a variety of complex phenotypes. Ranging from single base mutations to copy number variations (CNVs), many of these variants are neutral in selection and disease etiology, which makes it difficult to detect true disease-causing mutations, whether common or rare. However, allele frequency comparisons in cases, controls, and families may reveal disease associations. Single nucleotide polymorphism (SNP) arrays and exome sequencing are popular assays for genome-wide variant identification. To limit bias between samples, uniform testing is crucial, including standardized platform versions and sample processing. Bases occupy single points while copy variants occupy segments. Bases are bi-allelic while copies are multi-allelic. One genome also encodes many different cell types. In this study, we investigate how CNV impacts different cell types, including heart, brain and blood cells, all of which serve as models of complex disease. Here, we describe ParseCNV, a systematic algorithm developed as part of this project to perform more accurate disease associations using CNV calls generated from SNP arrays or exome sequencing, with quality tracking of the variants contributing to each significant overlap signal. Red flags of variant quality, genomic region, and overlap profile are assessed in a continuous score and shown to correlate over 90% with independent verification methods. We compared these data with our large internal cohort of 68,000 subjects, with carefully mapped CNVs, which gave a robust rare variant frequency in unaffected populations. In these investigations, we uncovered a number of loci in which CNVs are significantly enriched in non-coding RNA (ncRNA), Online Mendelian Inheritance in Man (OMIM), and genome-wide association study (GWAS) regions, impacting complex disease. 
After thoroughly evaluating variant frequencies in pediatric individuals, we compared them with the frequencies in geriatric individuals to gain insight into these variants' impact on lifespan. Longevity-associated CNVs enriched in pediatric patients were found to aggregate in alternative splicing genes. Congenital heart disease is the most common birth defect and cause of infant mortality. When comparing congenital heart disease families, with cases and controls genotyped both on SNP arrays and exome sequencing, we uncovered significant and confident loci that provide insight into the molecular basis of disease. Neurodevelopmental disease affects the quality of life and cognitive potential of many children. In the neurodevelopmental and psychiatric diseases, the CACNA, GRM, CNTN, and SLIT gene families show multiple significant signals impacting a large number of developmental and psychiatric disease traits, with the potential of informing therapeutic decision-making. Through new tool development and analysis of large disease cohorts genotyped on a variety of assays, I have uncovered an important biological role and disease impact of CNV in complex disease.

    Advancing the analysis of bisulfite sequencing data in its application to ecological plant epigenetics

    The aim of this thesis is to bridge the gap between the state-of-the-art bioinformatic tools and resources, currently at the forefront of epigenetic analysis, and their emerging applications to non-model species in the context of plant ecology. New, high-resolution research tools are presented; first in a specific sense, by providing new genomic resources for a selected non-model plant species, and also in a broader sense, by developing new software pipelines to streamline the analysis of bisulfite sequencing data, in a manner which is applicable to a wide range of non-model plant species. The selected species is the annual field pennycress, Thlaspi arvense, which belongs in the same lineage of the Brassicaceae as the closely-related model species, Arabidopsis thaliana, and yet does not benefit from such extensive genomic resources. It is one of three key species in a Europe-wide initiative to understand how epigenetic mechanisms contribute to natural variation, stress responses and long-term adaptation of plants. To this end, this thesis provides a high-quality, chromosome-level assembly for T. arvense, alongside a rich complement of feature annotations of particular relevance to the study of epigenetics. The genome assembly encompasses a hybrid approach, involving both PacBio continuous long reads and circular consensus sequences, alongside Hi-C sequencing, PCR-free Illumina sequencing and genetic maps. The result is a significant improvement in contiguity over the existing draft state from earlier studies. Much of the basis for building an understanding of epigenetic mechanisms in non-model species centres around the study of DNA methylation, and in particular the analysis of bisulfite sequencing data to bring methylation patterns into nucleotide-level resolution. In order to maintain a broad level of comparison between T. 
arvense and the other selected species under the same initiative, a suite of software pipelines has also been developed, covering mapping, the quantification of methylation values, differential methylation between groups, and epigenome-wide association studies. Furthermore, presented herein is a novel algorithm which can facilitate accurate variant calling from bisulfite sequencing data using conventional approaches, such as FreeBayes or the Genome Analysis ToolKit (GATK), which until now was feasible only with specifically-adapted software. This enables researchers to obtain high-quality genetic variants, often essential for contextualising the results of epigenetic experiments, without the need for additional sequencing libraries alongside. Each of these aspects is thoroughly benchmarked, integrated into a robust workflow management system, and adheres to the principles of FAIR (Findability, Accessibility, Interoperability and Reusability). Finally, further consideration is given to the unique difficulties presented by population-scale data, and a number of concepts and ideas are explored in order to improve the feasibility of such analyses. In summary, this thesis introduces new high-resolution tools to facilitate the analysis of epigenetic mechanisms, specifically relating to DNA methylation, in non-model plant data. In addition, thorough benchmarking standards are applied, showcasing the range of technical considerations which are of principal importance when developing new pipelines and tools for the analysis of bisulfite sequencing data. 
The complete “EpiDiverse Toolkit” is available at https://github.com/EpiDiverse and will continue to be updated and improved in the future.

Table of contents:
ABSTRACT
ACKNOWLEDGEMENTS
1 INTRODUCTION 1.1 ABOUT THIS WORK 1.2 BIOLOGICAL BACKGROUND 1.2.1 Epigenetics in plant ecology 1.2.2 DNA methylation 1.2.3 Maintenance of 5mC patterns in plants 1.2.4 Distribution of 5mC patterns in plants 1.3 TECHNICAL BACKGROUND 1.3.1 DNA sequencing 1.3.2 The case for a high-quality genome assembly 1.3.3 Sequence alignment for NGS 1.3.4 Variant calling approaches
2 BUILDING A SUITABLE REFERENCE GENOME 2.1 INTRODUCTION 2.2 MATERIALS AND METHODS 2.2.1 Seeds for the reference genome development 2.2.2 Sample collection, library preparation, and DNA sequencing 2.2.3 Contig assembly and initial scaffolding 2.2.4 Re-scaffolding 2.2.5 Comparative genomics 2.3 RESULTS 2.3.1 An improved reference genome sequence 2.3.2 Comparative genomics 2.4 DISCUSSION
3 FEATURE ANNOTATION FOR EPIGENOMICS 3.1 INTRODUCTION 3.2 MATERIALS AND METHODS 3.2.1 Tissue preparation for RNA sequencing 3.2.2 RNA extraction and sequencing 3.2.3 Transcriptome assembly 3.2.4 Genome annotation 3.2.5 Transposable element annotations 3.2.6 Small RNA annotations 3.2.7 Expression atlas 3.2.8 DNA methylation 3.3 RESULTS 3.3.1 Transcriptome assembly 3.3.2 Protein-coding genes 3.3.3 Non-coding loci 3.3.4 Transposable elements 3.3.5 Small RNA 3.3.6 Pseudogenes 3.3.7 Gene expression atlas 3.3.8 DNA Methylation 3.4 DISCUSSION
4 BISULFITE SEQUENCING METHODS 4.1 INTRODUCTION 4.2 PRINCIPLES OF BISULFITE SEQUENCING 4.3 EXPERIMENTAL DESIGN 4.4 LIBRARY PREPARATION 4.4.1 Whole Genome Bisulfite Sequencing (WGBS) 4.4.2 Reduced Representation Bisulfite Sequencing (RRBS) 4.4.3 Target capture bisulfite sequencing 4.5 BIOINFORMATIC ANALYSIS OF BISULFITE DATA 4.5.1 Quality Control 4.5.2 Read Alignment 4.5.3 Methylation Calling 4.6 ALTERNATIVE METHODS
5 FROM READ ALIGNMENT TO DNA METHYLATION ANALYSIS 5.1 INTRODUCTION 5.2 MATERIALS AND METHODS 5.2.1 Reference species 5.2.2 Natural accessions 5.2.3 Read simulation 5.2.4 Read alignment 5.2.5 Mapping rates 5.2.6 Precision-recall 5.2.7 Coverage deviation 5.2.8 DNA methylation analysis 5.3 RESULTS 5.4 DISCUSSION 5.5 A PIPELINE FOR WGBS ANALYSIS
6 THERE AND BACK AGAIN: INFERRING GENOMIC INFORMATION 6.1 INTRODUCTION 6.1.1 Implementing a new approach 6.2 MATERIALS AND METHODS 6.2.1 Validation datasets 6.2.2 Read processing and alignment 6.2.3 Variant calling 6.2.4 Benchmarking 6.3 RESULTS 6.4 DISCUSSION 6.5 A PIPELINE FOR SNP VARIANT ANALYSIS
7 POPULATION-LEVEL EPIGENOMICS 7.1 INTRODUCTION 7.2 CHALLENGES IN POPULATION-LEVEL EPIGENOMICS 7.3 DIFFERENTIAL METHYLATION 7.3.1 A pipeline for case/control DMRs 7.3.2 A pipeline for population-level DMRs 7.4 EPIGENOME-WIDE ASSOCIATION STUDIES (EWAS) 7.4.1 A pipeline for EWAS analysis 7.5 GENOTYPING-BY-SEQUENCING (EPIGBS) 7.5.1 Extending the epiGBS pipeline 7.6 POPULATION-LEVEL HAPLOTYPES 7.6.1 Extending the EpiDiverse/SNP pipeline
8 CONCLUSION
APPENDICES
A. SUPPLEMENT: BUILDING A SUITABLE REFERENCE GENOME
B. SUPPLEMENT: FEATURE ANNOTATION FOR EPIGENOMICS
C. SUPPLEMENT: FROM READ ALIGNMENT TO DNA METHYLATION ANALYSIS
D. SUPPLEMENT: INFERRING GENOMIC INFORMATION
BIBLIOGRAPH

    Gene expression analysis and transcriptome evolution in apomicts: a case study in Boechera and Ranunculus

    Apomixis refers specifically to asexual reproduction through seed in plants. Like other modes of asexual reproduction, it has received much attention from evolutionary biologists and has been the subject of many studies over recent decades. At the same time, it attracts interest from an economic point of view, as the long-term goal of technologically inducing apomixis in major crop plants offers the prospect of a potential agricultural revolution. Hence, interest has been growing in the scientific community in elucidating the evolution and underlying molecular genetic mechanisms of apomixis. Here I present a multifaceted approach to the problem by (1) the development of biotechnological tools in order to (2) apply molecular evolutionary methods to narrow down the possible causes and consequences of asexual reproduction in plants. In this work, representatives of two genera (Ranunculus L. and Boechera Á. LÖVE & D. LÖVE) were studied in order to advance current apomixis knowledge from different perspectives. In the framework of a microarray-based transcriptomic analysis of ovules extracted from sexual and apomictic Boechera, a list of housekeeping genes (HKGs) was selected based on a stability of expression analysis subsequently conducted in different vegetative and reproductive tissues of apomictic and sexual species. Using the geNorm algorithm, different combinations of HKGs were identified, including Ribosomal Subunit 18 (BoechRPS18), Elongation Factor Alpha 1 (BoechEfα1), Actin 2 (BoechACT2) and Ubiquitin (BoechUBQ), based on their pairwise stability in ovules, anthers, and vegetative tissue in apomictic and sexual species. 
These genes, specifically chosen to be reproduction mode- and tissue-specific, have subsequently been used for normalization in the experimental validation of two candidate genes related to apomixis in Boechera. Next, molecular evolutionary causes and consequences of apomixis were investigated by analyzing the transcriptomic effects of asexual reproduction and its correlated traits (i.e. hybridization, polyploidy and mutation accumulation). Flower-specific RNA from sexual and apomictic species of the wild apomictic Ranunculus auricomus complex was used for high-throughput transcriptome sequencing (RNAseq). The first de novo assembled transcriptome for these species was used as a reference sequence for mining Single Nucleotide Polymorphism (SNP) and insertion-deletion (indel) polymorphisms. The data were further used to design and manufacture a custom 3 x 1.4 million spot expression microarray. Comparative SNP analysis between apomictic and sexual individuals (specifically, two apomicts from two populations and three sexuals from three populations) corroborated the hybrid origin of apomictic Ranunculus, as proposed by Paun et al. (2006b), and could furthermore elucidate the Pleistocene origin and subsequent divergence of the apomictic individuals. In addition, sites of divergent selection were detected with the analysis of non-synonymous (dN) to synonymous (dS) substitution rates, strengthening the idea of rapid divergence in the hybrids. Finally, the custom microarray was used for the hybridization of RNA from live-microdissected ovules (four developmental stages) from the three apomictic and four sexual individuals used in the SNP analysis. The comparative stage-specific transcriptome analysis was used to detect stage-specific differentially expressed genes in ovules, in order to identify signatures of apomixis and to produce a list of potential candidates underlying the reproductive switch. 
555 stage-specific genes were found to be differentially expressed throughout ovule development, and eight genes showed a significant shift in expression pattern throughout ovule development in apomicts compared to sexuals. A further classification was conducted following the predictions made from Nogler’s extensive work in Ranunculus, in which different genetic factors were proposed for the induction and penetrance of apomixis. In that light, differentially expressed homoeologous genes were classified into three categories based on their relative expression in apomicts compared to their phylogenetic sexual parent, with the final aim of classifying the number of genes potentially responsible for apomixis. In doing so, we have provided a solid base for future studies in wild (i.e. with little or no genetic information available) Ranunculus species. By developing biotechnical tools for their study, identifying genes potentially involved in the establishment of apomixis, and analyzing their evolutionary history, this work presents an important step towards a more comprehensive understanding of the processes and patterns connected to apomixis in model and non-model plants, and has far-reaching potential for agricultural use.

    Computational approaches in infectious disease research: Towards improved diagnostic methods

    Thesis advisor: Kenneth Williams. Due to overuse and misuse of antibiotics, the global threat of antibiotic resistance is a growing crisis. Three critical issues surrounding antibiotic resistance are the lack of rapid testing, treatment failure, and evolution of resistance. However, with new technology facilitating data collection and powerful statistical learning advances, our understanding of the bacterial stress response to antibiotics is rapidly expanding. With a recent influx of omics data, it has become possible to develop powerful computational methods that make the best use of growing systems-level datasets. In this work, I present several such approaches that address the three challenges around resistance. While this body of work was motivated by the antibiotic resistance crisis, the approaches presented here favor generalization, that is, applicability beyond just one context. First, I present ShinyOmics, a web-based application that allows visualization, sharing, exploration and comparison of systems-level data. An overview of transcriptomics data in the bacterial pathogen Streptococcus pneumoniae led to the hypothesis that stress-susceptible strains have more chaotic gene expression patterns than stress-resistant ones. This hypothesis was supported by data from multiple strains, species, antibiotics and non-antibiotic stress factors, leading to the development of a general, transcriptomic entropy-based predictor of bacterial fitness. I show the potential utility of this predictor in predicting antibiotic susceptibility phenotype and drug minimum inhibitory concentrations, which can be applied to bacterial isolates from patients in the near future. Predictors for antibiotic susceptibility are of great value when there is large phenotypic variability across isolates from the same species. Phenotypic variability is accompanied by genomic diversity harbored within a species. 
I address the genomic diversity by developing BFClust, a software package that for the first time enables pan-genome analysis with confidence scores. Using pan-genome level information, I then develop predictors of essential genes unique to certain strains and predictors for genes that acquire adaptive mutations under prolonged stress exposure. Genes that are essential offer attractive drug targets, and those that are essential only in certain strains would make great targets for very narrow-spectrum antibiotics, potentially leading the way to personalized therapies in infectious disease. Finally, the prediction of adaptive outcome can lead to predictions of future cross-resistance or collateral sensitivities. Overall, this body of work exemplifies how computational methods can complement the increasingly rapid data generation in the lab, and pave the way to the development of more effective antibiotic stewardship practices. Thesis (PhD), Boston College, 2020. Submitted to: Boston College. Graduate School of Arts and Sciences. Discipline: Biology

    The best treatment for every patient: New algorithms to predict treatment benefit in cancer using genomics and transcriptomics

    Many cancer drugs only benefit a subset of the patients that receive them. Because these drugs are often associated with serious side effects, it is very important to be able to predict who will benefit and who will not. This thesis presents several algorithms for building models that predict whether a patient will benefit more from a drug of interest than from an alternative treatment. We show that these algorithms can be used for various types of cancer and different datatype

    CRISPRing the Human Genome for Functional Regulatory Elements

    The sequence of DNA is a code that contains all the information that is required for life (as we know it). DNA is stored inside the nucleus of cells and its sequence is replicated during cell division to ensure that the genetic information is transmitted to the daughter cells. The information contained in DNA is copied into RNA by a process called transcription. RNA acts as a messenger (mRNA) to carry the information between the nucleus and the cytoplasm, where it is used as a template to produce proteins through a process called translation. Proteins are the main effectors of all biological functions in the cell. However, the information required to make proteins (called “coding DNA sequence”) comprises only a small portion (~2%) of the entire human genome sequence. For several decades, it was generally accepted that the remaining 98% of the genome sequence had no biological function and, because of that, it was dubbed “junk DNA”. The discovery of non-coding DNA sequences that control the expression of genes challenged this idea, and revealed that there is biological function beyond protein-coding sequences. These non-coding sequences are called “regulatory elements” and they are classified into four classes according to their function: promoters, enhancers, insulators and silencers. Among them, enhancers play a critical role in activating the expression of genes in response to intra- and extra-cellular stimuli – which is essential for the development of complex organisms. Previous studies suggest that the human genome might contain more than one million enhancers – a much higher number compared to the

    Statistical Population Genomics

    This open access volume presents state-of-the-art inference methods in population genomics, focusing on data analysis based on rigorous statistical techniques. After introducing general concepts related to the biology of genomes and their evolution, the book covers state-of-the-art methods for the analysis of genomes in populations, including demography inference, population structure analysis and detection of selection, using both model-based inference and simulation procedures. Last but not least, it offers an overview of the current knowledge acquired by applying such methods to a large variety of eukaryotic organisms. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, pointers to the relevant literature, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls. Authoritative and cutting-edge, Statistical Population Genomics aims to promote and ensure successful applications of population genomic methods to an increasing number of model systems and biological questions

    EVOLUTION OF WOLBACHIA SYMBIOSIS IN ARTHOPODS AND NEMATODES: INSIGTHS FROM PHYLOGENETICS AND COMPARATIVE GENOMICS

    Wolbachia is a bacterium found in association with a wide array of arthropod and nematode species. It is an obligate intracellular symbiont, maternally transferred through the host oocytes. In arthropods Wolbachia is able to manipulate reproduction, using multiple strategies to increase the fitness of infected females. In nematodes the bacterium has a fundamental, and not completely understood, role in larval development. Wolbachia infects ~50% of all the arthropod species worldwide, and in some of them it can be considered the most important sex determination factor. In contrast, Wolbachia is much less widespread in nematodes, being present in a limited number of filarial species. The taxonomic status within the Wolbachia genus is highly debated, with the current classification dividing all strains into 14 'supergroups'. During my Ph.D. I studied the evolution of the symbiotic relationship between Wolbachia and its arthropod and nematode hosts, using genomic approaches. Indeed, during the evolution of the Wolbachia-host relationship, genetic signs have been left in the Wolbachia genomes. I worked to identify these genomic signs and to evaluate them within an evolutionary frame, in order to obtain a better understanding of how the Wolbachia-host symbiosis evolved. The work presented here can be organized in three major sections: i) the sequencing and analysis of the genome of the filarial nematode Dirofilaria immitis and of its symbiotic Wolbachia strain, wDi; ii) the sequencing of the genome of the Wolbachia endosymbiont of Litomosoides sigmodontis, and the phylogenomic reconstruction of the Wolbachia supergroups A-D; iii) a comparison of the genomes of 26 Wolbachia strains spanning the A to F supergroups. A schematic summary of the results follows:
1. Dirofilaria immitis and the Wolbachia symbiont wDi show metabolic complementarity for fundamental pathways.
2. The metabolic pathway for the synthesis of wDi membrane proteins is among the fastest evolving in the genome of the bacterium.
3. Nematode Wolbachia belonging to supergroups C and D are monophyletic, indicating that a single transition to mutualism likely occurred during the evolution of Wolbachia.
4. Wolbachia strains of the C supergroup show genomic features that are unique in the genus, such as a much higher level of synteny compared to the rest of the Wolbachia supergroups, and a newly generated pattern of GC skew curves, typically observed in the genomes of free-living bacteria.
5. Wolbachia supergroups show conserved genomic features, which suggest genomic isolation among them.
