9 research outputs found

    Strategies for analyzing bisulfite sequencing data

    Get PDF
    DNA methylation is one of the main epigenetic modifications in the eukaryotic genome; it has been shown to play a role in cell-type specific regulation of gene expression, and therefore cell-type identity. Bisulfite sequencing is the gold-standard for measuring methylation over the genomes of interest. Here, we review several techniques used for the analysis of high-throughput bisulfite sequencing. We introduce specialized short-read alignment techniques as well as pre/post-alignment quality check methods to ensure data quality. Furthermore, we discuss subsequent analysis steps after alignment. We introduce various differential methylation methods and compare their performance using simulated and real bisulfite sequencing datasets. We also discuss the methods used to segment methylomes in order to pinpoint regulatory regions. We introduce annotation methods that can be used for further classification of regions returned by segmentation and differential methylation methods. Finally, we review software packages that implement strategies to efficiently deal with large bisulfite sequencing datasets locally and we discuss online analysis workflows that do not require any prior programming skills. The analysis strategies described in this review will guide researchers at any level to the best practices of bisulfite sequencing analysis

    Objective and Comprehensive Evaluation of Bisulfite Short Read Mapping Tools

    Get PDF
    Background. Large-scale bisulfite treatment and short reads sequencing technology allow comprehensive estimation of methylation states of Cs in the genomes of different tissues, cell types, and developmental stages. Accurate characterization of DNA methylation is essential for understanding genotype phenotype association, gene and environment interaction, diseases, and cancer. Aligning bisulfite short reads to a reference genome has been a challenging task. We compared five bisulfite short read mapping tools, BSMAP, Bismark, BS-Seeker, BiSS, and BRAT-BW, representing two classes of mapping algorithms (hash table and suffix/prefix tries). We examined their mapping efficiency (i.e., the percentage of reads that can be mapped to the genomes), usability, running time, and effects of changing default parameter settings using both real and simulated reads. We also investigated how preprocessing data might affect mapping efficiency. Conclusion. Among the five programs compared, in terms of mapping efficiency, Bismark performs the best on the real data, followed by BiSS, BSMAP, and finally BRAT-BW and BS-Seeker with very similar performance. If CPU time is not a constraint, Bismark is a good choice of program for mapping bisulfite treated short reads. Data quality impacts a great deal mapping efficiency. Although increasing the number of mismatches allowed can increase mapping efficiency, it not only significantly slows down the program, but also runs the risk of having increased false positives. Therefore, users should carefully set the related parameters depending on the quality of their sequencing data

    Erregulatzaile genetiko eta epigenetikoen identifikazioa eritasun zeliakoan hurbilketa konputazional eta esperimentalen bidez

    Get PDF
    181 p. (eng.) 184 p. (eus.)La enfermedad celíaca (EC) es una enfermedad autoinmune que se desarrolla en personas con susceptibilidad genética. Los haplotipos HLA-DQ2 y HLA-DQ8 representan alrededor del 40% de la contribución genética a la EC. En total, junto con otras variaciones genéticas comunes identificadas, podemos explicar el 50% de la heritabilidad de la enfermedad. Sin embargo, otras capas de información genómica independientes de la variación de la secuencia del ADN también podrían contribuir a la patogénesis de la EC.En esta tesis, los datos públicos de diferentes capas ómicas se han utilizado para identificar los mecanismos genéticos y epigenéticos de regulación génica que podrían estar involucrados en la patogénesis de la EC. Los experimentos previos de microarrays de expresión de nuestro grupo, los resultados del proyecto Immunochip, la información publicada sobre TADs y los datos de metilación y expresión generados recientemente se han recopilado y vuelto a analizar. Esto nos ha permitido identificar nuevos elementos reguladores y regiones genómicas involucradas en el desarrollo de la CD. Así, hemos podido observar que la gliadina influye en los cambios de co-expresión observados en la ECmediante los factores de transcripción CREB1 y IRF1. Además, la estructura 3D del genoma podría regular la co-expresión local, y la gliadina provoca cambios agudos en la metilación del ADN incluso en individuos no celíacos. En conclusión, los reguladores genéticos y epigenéticos identificados podrían estar involucrados en la patogénesis de la enfermedad y constituir nuevas dianas de intervención

    Advancing the analysis of bisulfite sequencing data in its application to ecological plant epigenetics

    Get PDF
    The aim of this thesis is to bridge the gap between the state-of-the-art bioinformatic tools and resources, currently at the forefront of epigenetic analysis, and their emerging applications to non-model species in the context of plant ecology. New, high-resolution research tools are presented; first in a specific sense, by providing new genomic resources for a selected non-model plant species, and also in a broader sense, by developing new software pipelines to streamline the analysis of bisulfite sequencing data, in a manner which is applicable to a wide range of non-model plant species. The selected species is the annual field pennycress, Thlaspi arvense, which belongs in the same lineage of the Brassicaceae as the closely-related model species, Arabidopsis thaliana, and yet does not benefit from such extensive genomic resources. It is one of three key species in a Europe-wide initiative to understand how epigenetic mechanisms contribute to natural variation, stress responses and long-term adaptation of plants. To this end, this thesis provides a high-quality, chromosome-level assembly for T. arvense, alongside a rich complement of feature annotations of particular relevance to the study of epigenetics. The genome assembly encompasses a hybrid approach, involving both PacBio continuous long reads and circular consensus sequences, alongside Hi-C sequencing, PCR-free Illumina sequencing and genetic maps. The result is a significant improvement in contiguity over the existing draft state from earlier studies. Much of the basis for building an understanding of epigenetic mechanisms in non-model species centres around the study of DNA methylation, and in particular the analysis of bisulfite sequencing data to bring methylation patterns into nucleotide-level resolution. In order to maintain a broad level of comparison between T. arvense and the other selected species under the same initiative, a suite of software pipelines which include mapping, the quantification of methylation values, differential methylation between groups, and epigenome-wide association studies, have also been developed. Furthermore, presented herein is a novel algorithm which can facilitate accurate variant calling from bisulfite sequencing data using conventional approaches, such as FreeBayes or Genome Analysis ToolKit (GATK), which until now was feasible only with specifically-adapted software. This enables researchers to obtain high-quality genetic variants, often essential for contextualising the results of epigenetic experiments, without the need for additional sequencing libraries alongside. Each of these aspects are thoroughly benchmarked, integrated to a robust workflow management system, and adhere to the principles of FAIR (Findability, Accessibility, Interoperability and Reusability). Finally, further consideration is given to the unique difficulties presented by population-scale data, and a number of concepts and ideas are explored in order to improve the feasibility of such analyses. In summary, this thesis introduces new high-resolution tools to facilitate the analysis of epigenetic mechanisms, specifically relating to DNA methylation, in non-model plant data. In addition, thorough benchmarking standards are applied, showcasing the range of technical considerations which are of principal importance when developing new pipelines and tools for the analysis of bisulfite sequencing data. The complete “Epidiverse Toolkit” is available at https://github.com/EpiDiverse and will continue to be updated and improved in the future.:ABSTRACT ACKNOWLEDGEMENTS 1 INTRODUCTION 1.1 ABOUT THIS WORK 1.2 BIOLOGICAL BACKGROUND 1.2.1 Epigenetics in plant ecology 1.2.2 DNA methylation 1.2.3 Maintenance of 5mC patterns in plants 1.2.4 Distribution of 5mC patterns in plants 1.3 TECHNICAL BACKGROUND 1.3.1 DNA sequencing 1.3.2 The case for a high-quality genome assembly 1.3.3 Sequence alignment for NGS 1.3.4 Variant calling approaches 2 BUILDING A SUITABLE REFERENCE GENOME 2.1 INTRODUCTION 2.2 MATERIALS AND METHODS 2.2.1 Seeds for the reference genome development 2.2.2 Sample collection, library preparation, and DNA sequencing 2.2.3 Contig assembly and initial scaffolding 2.2.4 Re-scaffolding 2.2.5 Comparative genomics 2.3 RESULTS 2.3.1 An improved reference genome sequence 2.3.2 Comparative genomics 2.4 DISCUSSION 3 FEATURE ANNOTATION FOR EPIGENOMICS 3.1 INTRODUCTION 3.2 MATERIALS AND METHODS 3.2.1 Tissue preparation for RNA sequencing 3.2.2 RNA extraction and sequencing 3.2.3 Transcriptome assembly 3.2.4 Genome annotation 3.2.5 Transposable element annotations 3.2.6 Small RNA annotations 3.2.7 Expression atlas 3.2.8 DNA methylation 3.3 RESULTS 3.3.1 Transcriptome assembly 3.3.2 Protein-coding genes 3.3.3 Non-coding loci 3.3.4 Transposable elements 3.3.5 Small RNA 3.3.6 Pseudogenes 3.3.7 Gene expression atlas 3.3.8 DNA Methylation 3.4 DISCUSSION 4 BISULFITE SEQUENCING METHODS 4.1 INTRODUCTION 4.2 PRINCIPLES OF BISULFITE SEQUENCING 4.3 EXPERIMENTAL DESIGN 4.4 LIBRARY PREPARATION 4.4.1 Whole Genome Bisulfite Sequencing (WGBS) 4.4.2 Reduced Representation Bisulfite Sequencing (RRBS) 4.4.3 Target capture bisulfite sequencing 4.5 BIOINFORMATIC ANALYSIS OF BISULFITE DATA 4.5.1 Quality Control 4.5.2 Read Alignment 4.5.3 Methylation Calling 4.6 ALTERNATIVE METHODS 5 FROM READ ALIGNMENT TO DNA METHYLATION ANALYSIS 5.1 INTRODUCTION 5.2 MATERIALS AND METHODS 5.2.1 Reference species 5.2.2 Natural accessions 5.2.3 Read simulation 5.2.4 Read alignment 5.2.5 Mapping rates 5.2.6 Precision-recall 5.2.7 Coverage deviation 5.2.8 DNA methylation analysis 5.3 RESULTS 5.4 DISCUSSION 5.5 A PIPELINE FOR WGBS ANALYSIS 6 THERE AND BACK AGAIN: INFERRING GENOMIC INFORMATION 6.1 INTRODUCTION 6.1.1 Implementing a new approach 6.2 MATERIALS AND METHODS 6.2.1 Validation datasets 6.2.2 Read processing and alignment 6.2.3 Variant calling 6.2.4 Benchmarking 6.3 RESULTS 6.4 DISCUSSION 6.5 A PIPELINE FOR SNP VARIANT ANALYSIS 7 POPULATION-LEVEL EPIGENOMICS 7.1 INTRODUCTION 7.2 CHALLENGES IN POPULATION-LEVEL EPIGENOMICS 7.3 DIFFERENTIAL METHYLATION 7.3.1 A pipeline for case/control DMRs 7.3.2 A pipeline for population-level DMRs 7.4 EPIGENOME-WIDE ASSOCIATION STUDIES (EWAS) 7.4.1 A pipeline for EWAS analysis 7.5 GENOTYPING-BY-SEQUENCING (EPIGBS) 7.5.1 Extending the epiGBS pipeline 7.6 POPULATION-LEVEL HAPLOTYPES 7.6.1 Extending the EpiDiverse/SNP pipeline 8 CONCLUSION APPENDICES A. SUPPLEMENT: BUILDING A SUITABLE REFERENCE GENOME B. SUPPLEMENT: FEATURE ANNOTATION FOR EPIGENOMICS C. SUPPLEMENT: FROM READ ALIGNMENT TO DNA METHYLATION ANALYSIS D. SUPPLEMENT: INFERRING GENOMIC INFORMATION BIBLIOGRAPH

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Get PDF
    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred of published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm and only little attention has been paid to non-standard mapping approaches. Here, we propound the so-called dynamic mapping that we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in the online fashion. We provide Ococo, the first online consensus caller that implements a smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Having a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge amount of NGS reads to tree nodes, and possibly estimate the relative abundance of involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up

    Erregulatzaile genetiko eta epigenetikoen identifikazioa eritasun zeliakoan hurbilketa konputazional eta esperimentalen bidez

    Get PDF
    181 p. (eng.) 184 p. (eus.)La enfermedad celíaca (EC) es una enfermedad autoinmune que se desarrolla en personas con susceptibilidad genética. Los haplotipos HLA-DQ2 y HLA-DQ8 representan alrededor del 40% de la contribución genética a la EC. En total, junto con otras variaciones genéticas comunes identificadas, podemos explicar el 50% de la heritabilidad de la enfermedad. Sin embargo, otras capas de información genómica independientes de la variación de la secuencia del ADN también podrían contribuir a la patogénesis de la EC.En esta tesis, los datos públicos de diferentes capas ómicas se han utilizado para identificar los mecanismos genéticos y epigenéticos de regulación génica que podrían estar involucrados en la patogénesis de la EC. Los experimentos previos de microarrays de expresión de nuestro grupo, los resultados del proyecto Immunochip, la información publicada sobre TADs y los datos de metilación y expresión generados recientemente se han recopilado y vuelto a analizar. Esto nos ha permitido identificar nuevos elementos reguladores y regiones genómicas involucradas en el desarrollo de la CD. Así, hemos podido observar que la gliadina influye en los cambios de co-expresión observados en la ECmediante los factores de transcripción CREB1 y IRF1. Además, la estructura 3D del genoma podría regular la co-expresión local, y la gliadina provoca cambios agudos en la metilación del ADN incluso en individuos no celíacos. En conclusión, los reguladores genéticos y epigenéticos identificados podrían estar involucrados en la patogénesis de la enfermedad y constituir nuevas dianas de intervención
    corecore