103 research outputs found

    Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNA-seq reads

    Background: The main challenge in de novo genome assembly of DNA-seq data is certainly to deal with repeats that are longer than the reads. In de novo transcriptome assembly of RNA-seq reads, on the other hand, this problem has been underestimated so far. Even though we have fewer and shorter repeated sequences in transcriptomics, they do create ambiguities and confuse assemblers if not addressed properly. Most transcriptome assemblers of short reads are based on de Bruijn graphs (DBG) and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them. Results: The results of this work are threefold. First, we introduce a formal model for representing high copy-number and low-divergence repeats in RNA-seq data and exploit its properties to infer a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying such subgraphs in a DBG is NP-complete. Second, we show that in the specific case of local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs, and we present an efficient algorithm to enumerate AS events that are not included in repeats. Using simulated data, we show that this strategy is significantly more sensitive and precise than the previous version of KisSplice (Sacomoto et al. in WABI, pp 99–111, 1), Trinity (Grabherr et al. in Nat Biotechnol 29(7):644–652, 2), and Oases (Schulz et al. in Bioinformatics 28(8):1086–1092, 3), for the specific task of calling AS events. Third, we turn our focus to full-length transcriptome assembly, and we show that exploring the topology of DBGs can improve de novo transcriptome evaluation methods. Based on the observation that repeats create complicated regions in a DBG, and that assemblers can infer erroneous transcripts when they try to traverse these regions, we propose a measure to flag transcripts traversing such troublesome regions, thereby giving a confidence level for each transcript. The originality of our work when compared to other transcriptome evaluation methods is that we use only the topology of the DBG, and neither read nor coverage information. We show that our simple method gives better results than Rsem-Eval (Li et al. in Genome Biol 15(12):553, 4) and TransRate (Smith-Unna et al. in Genome Res 26(8):1134–1144, 5) on both real and simulated datasets for detecting chimeras, and is therefore able to capture assembly errors missed by these methods.
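    As an illustration of why repeats complicate a DBG, the following minimal sketch builds a node-centric de Bruijn graph from two toy reads that share a repeated stretch and lists the branching k-mers that result. The function names, the toy reads, and the choice of k are assumptions made for this example; this is not the model or the algorithm of the paper.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Node-centric de Bruijn graph: nodes are k-mers, and an edge u -> v
    exists when u and v are consecutive k-mers in at least one read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph[kmer]  # make sure every k-mer is a node, even without successors
        for u, v in zip(kmers, kmers[1:]):
            graph[u].add(v)
    return dict(graph)


def branching_nodes(graph):
    """k-mers with more than one predecessor or more than one successor."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for v in successors:
            indegree[v] += 1
    return sorted(n for n in graph if len(graph[n]) > 1 or indegree[n] > 1)


# Two hypothetical reads from different transcripts sharing the repeat "CGTTG".
reads = ["ACGTTGCA", "TCGTTGAA"]
dbg = build_dbg(reads, k=3)
print(branching_nodes(dbg))  # the k-mers where the shared repeat starts and ends
```

    In this toy case the two transcripts merge at the start of the shared stretch and split again at its end, so an assembler walking the graph could join the beginning of one read to the end of the other.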

    Cloning of the first cDNA encoding a putative CCRFamide precursor: identification of the brain, eyestalk ganglia, and cardiac ganglion as sites of CCRFamide expression in the American lobster, Homarus americanus

    Over the past decade, many new peptide families have been identified via in silico analyses of genomic and transcriptomic datasets. While various molecular and biochemical methods have confirmed the existence of some of these new groups, others remain in silico discoveries of computationally assembled sequences only. An example of the latter is the CCRFamides, named for the predicted presence of two pairs of disulfide-bonded cysteine residues and an amidated arginine-phenylalanine carboxyl-terminus in family members, which have been identified from annelid, molluscan, and arthropod genomes/transcriptomes, but for which no precursor protein-encoding cDNAs have been cloned. Using routine transcriptome mining methods, we identified four Homarus americanus (American lobster) CCRFamide transcripts that share high sequence identity across the predicted open reading frames but more limited conservation in their 5′ terminal ends, suggesting the Homarus gene undergoes alternative splicing. RT-PCR profiling using primers designed to amplify an internal fragment common to all of the transcripts revealed expression in the supraoesophageal ganglion (brain), eyestalk ganglia, and cardiac ganglion. Variant-specific profiling revealed a similar profile for variant 1, eyestalk ganglia-specific expression of variant 2, and an absence of variant 3 expression in the cDNAs examined. The broad distribution of CCRFamide transcript expression in the H. americanus nervous system suggests a potential role as a locally released and/or circulating neuropeptide. This is the first report of the cloning of a CCRFamide-encoding cDNA from any species, and as such, provides the first non-in silico support for the existence of this invertebrate peptide family.

    Advancing the analysis of bisulfite sequencing data in its application to ecological plant epigenetics

    The aim of this thesis is to bridge the gap between the state-of-the-art bioinformatic tools and resources, currently at the forefront of epigenetic analysis, and their emerging applications to non-model species in the context of plant ecology. New, high-resolution research tools are presented; first in a specific sense, by providing new genomic resources for a selected non-model plant species, and also in a broader sense, by developing new software pipelines to streamline the analysis of bisulfite sequencing data, in a manner which is applicable to a wide range of non-model plant species. The selected species is the annual field pennycress, Thlaspi arvense, which belongs to the same lineage of the Brassicaceae as the closely related model species, Arabidopsis thaliana, and yet does not benefit from such extensive genomic resources. It is one of three key species in a Europe-wide initiative to understand how epigenetic mechanisms contribute to natural variation, stress responses and long-term adaptation of plants. To this end, this thesis provides a high-quality, chromosome-level assembly for T. arvense, alongside a rich complement of feature annotations of particular relevance to the study of epigenetics. The genome assembly encompasses a hybrid approach, involving both PacBio continuous long reads and circular consensus sequences, alongside Hi-C sequencing, PCR-free Illumina sequencing and genetic maps. The result is a significant improvement in contiguity over the existing draft state from earlier studies. Much of the basis for building an understanding of epigenetic mechanisms in non-model species centres around the study of DNA methylation, and in particular the analysis of bisulfite sequencing data to bring methylation patterns into nucleotide-level resolution. In order to maintain a broad level of comparison between T. arvense and the other selected species under the same initiative, a suite of software pipelines, covering mapping, the quantification of methylation values, differential methylation between groups, and epigenome-wide association studies, has also been developed. Furthermore, presented herein is a novel algorithm which can facilitate accurate variant calling from bisulfite sequencing data using conventional approaches, such as FreeBayes or Genome Analysis ToolKit (GATK), which until now was feasible only with specifically adapted software. This enables researchers to obtain high-quality genetic variants, often essential for contextualising the results of epigenetic experiments, without the need for additional sequencing libraries alongside. Each of these aspects is thoroughly benchmarked, integrated into a robust workflow management system, and adheres to the principles of FAIR (Findability, Accessibility, Interoperability and Reusability). Finally, further consideration is given to the unique difficulties presented by population-scale data, and a number of concepts and ideas are explored in order to improve the feasibility of such analyses. In summary, this thesis introduces new high-resolution tools to facilitate the analysis of epigenetic mechanisms, specifically relating to DNA methylation, in non-model plant data. In addition, thorough benchmarking standards are applied, showcasing the range of technical considerations which are of principal importance when developing new pipelines and tools for the analysis of bisulfite sequencing data.
The complete “Epidiverse Toolkit” is available at https://github.com/EpiDiverse and will continue to be updated and improved in the future.

    A Family of Tree-Based Generators for Bubbles in Directed Graphs

    Bubbles are pairs of internally vertex-disjoint (s, t)-paths in a directed graph. In de Bruijn graphs built from reads of RNA and DNA data, bubbles represent interesting biological events, such as alternative splicing (AS) and allelic differences (SNPs and indels). However, the set of all bubbles in a de Bruijn graph built from real data is usually too large to be efficiently enumerated and analysed in practice. In particular, despite significant research done in this area, listing bubbles still remains the main bottleneck for tools that detect AS events in a reference-free context. Recently, in [1], the concept of a bubble generator was introduced as a way of obtaining a compact representation of the bubble space of a graph. Although this generator was quite effective in finding AS events, preliminary experiments showed that it is about 5 times slower than state-of-the-art methods. In this paper we propose a new family of bubble generators which improve substantially on the previous generator: generators in this new family are about two orders of magnitude faster and are still able to achieve similar precision in identifying AS events. To highlight the practical value of our new generators, we also report some experimental results on a real dataset.
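    To make the definition of a bubble concrete, here is a brute-force sketch that lists every pair of internally vertex-disjoint (s, t)-paths in a small directed graph. It assumes s and t are distinct, the helper names and the toy graph are hypothetical, and the exhaustive path enumeration is exactly the kind of exponential approach the abstract describes as impractical on real data. It is not a bubble generator.

```python
from itertools import combinations


def simple_paths(graph, source, target):
    """Yield all simple paths from source to target (assumes source != target)."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt == target:
                yield path + [nxt]
            elif nxt not in path:
                stack.append((nxt, path + [nxt]))


def bubbles(graph, s, t):
    """All pairs of (s, t)-paths that share no vertex other than s and t."""
    paths = list(simple_paths(graph, s, t))
    return [
        (p, q)
        for p, q in combinations(paths, 2)
        if not set(p[1:-1]) & set(q[1:-1])
    ]


# Hypothetical toy graph: two branches leave s, rejoin at c, then continue to t.
graph = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["t"]}
print(bubbles(graph, "s", "c"))  # one bubble: the two branches between s and c
print(bubbles(graph, "s", "t"))  # none: both (s, t)-paths pass through c
```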

    A family of tree-based generators for bubbles in directed graphs

    Bubbles are pairs of internally vertex-disjoint (s, t)-paths in a directed graph. In de Bruijn graphs built from reads of RNA and DNA data, bubbles represent interesting biological events, such as alternative splicing (AS) and allelic differences (SNPs and indels). However, the set of all bubbles in a de Bruijn graph built from real data is usually too large to be efficiently enumerated and analysed in practice. In particular, despite significant research done in this area, listing bubbles still remains the main bottleneck for tools that detect AS events in a reference-free context. Recently, the concept of a bubble generator was introduced as a way of obtaining a compact representation of the bubble space of a graph. Although this bubble generator was quite effective in finding AS events, preliminary experiments showed that it is about 5 times slower than state-of-the-art methods. In this paper we propose a new family of bubble generators which improve substantially on previous work: bubble generators in this new family are about two orders of magnitude faster and are still able to achieve similar precision in identifying AS events. To highlight the practical value of our new bubble generators, we also report some experimental results on real datasets.
    Acuña, Vicente; Soares de Lima, Leandro Ishi; Italiano, Giuseppe F.; Pepè Sciarria, Luca; Sagot, Marie-France; Sinaimeri, Blerina

    Compressed weighted de Bruijn graphs

    We propose a new compressed representation for weighted de Bruijn graphs, which is based on the idea of delta-encoding the variations of k-mer abundances on a spanning branching of the graph. Our new data structure is likely to be of practical value: to give an idea, when combined with the compressed BOSS de Bruijn graph representation, it encodes the weighted de Bruijn graph of a 16x-covered DNA read-set (60M distinct k-mers, k = 28) within 4.15 bits per distinct k-mer and can answer abundance queries in about 60 microseconds on a standard machine. In contrast, state-of-the-art tools declare a space usage of at least 30 bits per distinct k-mer for the same task, which is confirmed by our experiments. As a by-product of our new data structure, we exhibit efficient compressed data structures for answering partial sums on edge-weighted trees, which might be of independent interest.
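    The idea of delta-encoding abundances along a tree can be sketched as follows. This simplified example assumes a single rooted tree given as parent pointers rather than a spanning branching of a real de Bruijn graph, stores the deltas in a plain dictionary with no compression, and answers a query by walking to the root. All names and abundance values are made up for illustration, and the paper's actual data structure is not reproduced here.

```python
def delta_encode(parent, abundance, root):
    """Store each node's abundance as a difference from its parent's abundance.
    The root keeps its absolute value."""
    return {
        node: value - (0 if node == root else abundance[parent[node]])
        for node, value in abundance.items()
    }


def query_abundance(parent, deltas, root, node):
    """Recover an abundance as the partial sum of deltas on the path to the root."""
    total = deltas[node]
    while node != root:
        node = parent[node]
        total += deltas[node]
    return total


# Hypothetical abundances along a path of consecutive k-mers.
parent = {"CGT": "ACG", "GTT": "CGT", "TTG": "GTT"}
abundance = {"ACG": 16, "CGT": 17, "GTT": 17, "TTG": 15}
deltas = delta_encode(parent, abundance, root="ACG")
print(deltas)
print(query_abundance(parent, deltas, "ACG", "TTG"))
```

    Neighbouring k-mers tend to have similar abundances, so most deltas are small integers, which is what makes them cheaper to store than the raw counts.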

    Improving anti-cancer therapies through a better identification and characterization of non-canonical MHC-I associated peptides

    Increasing evidence of non-canonical protein translation has sparked interest in the identification and characterization of the resulting proteins for use in immunotherapy. In addition, recent studies on the repertoire of major histocompatibility complex class I (MHC-I) associated peptides (MAPs, or the immunopeptidome) have suggested that MAPs derived from these translations are potential targets for cancer immunotherapy. Therefore, the aim of this study was to assess the impact of these MAPs in cancer by developing methods to facilitate their identification and their validation as potential targets for immunotherapy. To facilitate the identification of non-canonical proteins, we developed Ribo-db, a proteogenomic approach that combines RNA sequencing, ribosome profiling and mass spectrometry. This approach enables the generation of specific databases aimed at including protein diversity. The use of Ribo-db to analyze diffuse large B-cell lymphoma (DLBCL) samples revealed that approximately 10% of MAPs were derived from non-canonical proteins. These proteins had distinct properties compared to those derived from canonical proteins. They had shorter lengths and lower stability, but greater efficiency in generating MAPs. Importantly, we found limited overlap between the non-canonical proteins detected in the immunopeptidome and those detected in the whole proteome, suggesting the existence of two distinct non-canonical protein repertoires. Knowing that non-canonical MAPs can be effective targets for cancer immunotherapy, we developed BamQuery, a tool to assess their expression in tissues to determine whether they can be used in a vaccine. BamQuery aims to predict the probability of MHC-I presentation of each peptide in different tissues based on its RNA expression. Using BamQuery, we found that previously identified tumor antigens (TA) would be highly expressed in healthy tissues, making them poor candidates for immunotherapy. In addition, we identified high-potential immunotherapeutic targets in DLBCL that were derived from non-canonical translations. These targets showed promise as they were poorly expressed in normal tissues but highly expressed and shared in tumor samples. Thus, BamQuery proved to be a useful tool for identifying and prioritizing potential immunotherapeutic targets. Overall, our research indicated that non-canonical regions of the genome increase the diversity of MAPs that can be recognized by T cells. Furthermore, the expression of MAPs in tissues can be used as a predictor of their MHC-I presentation to identify reliable targets for immunotherapy, for which BamQuery is an effective tool.

    On Bubble Generators in Directed Graphs

    Bubbles are pairs of internally vertex-disjoint (s, t)-paths with applications in the processing of DNA and RNA data. For example, enumerating alternative splicing events in a reference-free context can be done by enumerating all bubbles in a de Bruijn graph built from RNA-seq reads [16]. However, listing and analysing all bubbles in a given graph is usually infeasible in practice, due to the exponential number of bubbles present in real data graphs. In this paper, we propose a notion of a bubble generator set, i.e. a polynomial-sized subset of bubbles from which all the others can be obtained through the application of a specific symmetric difference operator. This set provides a compact representation of the bubble space of a graph, which can be useful in practice since some pertinent information about all the bubbles can be more conveniently extracted from this compact set. Furthermore, we provide a polynomial-time algorithm to decompose any bubble of a graph into the bubbles of such a generator in a tree-like fashion.