12 research outputs found

    Helmsman: fast and efficient mutation signature analysis for massive sequencing datasets

    Background: The spectrum of somatic single-nucleotide variants in cancer genomes often reflects the signatures of multiple distinct mutational processes, which can provide clinically actionable insights into cancer etiology. Existing software tools for identifying and evaluating these mutational signatures do not scale to analyze large datasets containing thousands of individuals or millions of variants. Results: We introduce Helmsman, a program designed to perform mutation signature analysis on arbitrarily large sequencing datasets. Helmsman is up to 300 times faster than existing software. Helmsman’s memory usage is independent of the number of variants, resulting in a small enough memory footprint to analyze datasets that would otherwise exceed the memory limitations of other programs. Conclusions: Helmsman is a computationally efficient tool that enables users to evaluate mutational signatures in massive sequencing datasets that are otherwise intractable with existing software. Helmsman is freely available at https://github.com/carjed/helmsman. https://deepblue.lib.umich.edu/bitstream/2027.42/146537/1/12864_2018_Article_5264.pd
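    The claim that memory usage can be independent of the number of variants can be illustrated with a minimal sketch. This is not Helmsman's actual code. It assumes the pyrimidine-normalized trinucleotide labelling (e.g. `A[C>T]G`) that is conventional in mutation signature analysis, and the function names `subtype` and `count_spectra` and the input tuple layout are hypothetical. Variants are streamed one at a time and only a per-sample table of subtype counts is kept, so memory grows with the number of samples and subtypes rather than with the number of variants.

```python
from collections import defaultdict

# Base complements, used to express every SNV relative to a pyrimidine (C or T).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def subtype(ref, alt, context):
    """Return a pyrimidine-normalized label such as 'A[C>T]G'.

    `context` is the 3-base reference sequence centred on the variant
    (hypothetical input layout, chosen for this illustration).
    """
    if len(context) != 3 or context[1] != ref:
        raise ValueError("context must be 3 bases with ref in the middle")
    if ref in ("A", "G"):
        # Purine reference: switch to the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        context = "".join(COMPLEMENT[b] for b in reversed(context))
    return f"{context[0]}[{ref}>{alt}]{context[2]}"


def count_spectra(variants):
    """Stream (sample, ref, alt, context) tuples into per-sample counts.

    Only the count table is held in memory; each variant is discarded
    after it has been tallied.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sample, ref, alt, context in variants:
        counts[sample][subtype(ref, alt, context)] += 1
    return counts


# Toy input; a real run would read these from a VCF-like stream.
toy = iter([
    ("s1", "C", "T", "ACG"),
    ("s1", "G", "A", "CGT"),  # same event seen on the opposite strand
    ("s2", "T", "G", "ATA"),
])
spectra = count_spectra(toy)
```

    The resulting sample-by-subtype count table is the kind of input that signature extraction methods typically factorize; that step is outside the scope of this sketch.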

    genomepy: genes and genomes at your fingertips

    Analyzing a functional genomics experiment, such as ATAC-, ChIP- or RNA-sequencing, requires reference data including a genome assembly and gene annotation. These resources can generally be retrieved from different organizations and in different versions. Most bioinformatic workflows require the user to supply this genomic data manually, which can be a tedious and error-prone process. Here we present genomepy, which can search, download, and preprocess the right genomic data for your analysis. Genomepy can search genomic data on NCBI, Ensembl, UCSC and GENCODE, and compare available gene annotations to enable an informed decision. The selected genome and gene annotation can be downloaded and preprocessed with sensible, yet controllable, defaults. Additional supporting data can be automatically generated or downloaded, such as aligner indexes, genome metadata and blacklists. Genomepy is freely available at https://github.com/vanheeringen-lab/genomepy under the MIT license and can be installed through pip or bioconda.

    Diet assessment of two land planarian species using high-throughput sequencing data

    Geoplanidae (Platyhelminthes: Tricladida) feed on soil invertebrates. Observations of their predatory behavior in nature are scarce, and most of the information has been obtained from food preference experiments. Although these experiments are based on a wide variety of prey, this catalog is often far from being representative of the fauna present in the natural habitat of planarians. As some geoplanid species have recently become invasive, obtaining accurate knowledge about their feeding habits is crucial for the development of plans to control and prevent their expansion. Using high-throughput sequencing data, we perform a metagenomic analysis to identify the in situ diet of two endemic and codistributed species of geoplanids from the Brazilian Atlantic Forest: Imbira marcusi and Cephaloflexa bergi. We tested four different methods of taxonomic assignment and found that phylogenetic-based assignment methods outperform those based on similarity. The results show that the diet of I. marcusi is restricted to earthworms, whereas C. bergi preys on spiders, harvestmen, woodlice, grasshoppers, Hymenoptera, Lepidoptera and possibly other geoplanids. Furthermore, both species change their feeding habits among the different sample locations. In conclusion, the integration of metagenomics with phylogenetics should be considered when establishing studies on the feeding habits of invertebrates.

    The Development of a Functional Annotation Pipeline to Characterise Metagenome-Assembled Genomes of Microorganisms Found in Anaerobic Digestion

    Anaerobic digestion (AD) involves the conversion of organic waste into biogas and biofertilisers. Anaerobic digesters are commonly found within the wastewater treatment process in the UK, converting waste sludge into methane. Higher yields of methane are required for AD to become a favourable renewable energy source. The AD process consists of four steps (hydrolysis, acidogenesis, acetogenesis, and methanogenesis) that are driven by complex microbial communities. Hydrogenotrophs and methanogens are rate-determining factors, highlighting the significance of these microbial communities within these dynamic AD environments. Research into these microbial communities will ultimately result in greater yields of methane in AD. A greater understanding of the microbial communities can be achieved via metagenomics, which involves the study of genomes recovered from environmental samples. Metagenomics involves the use of shotgun sequencing. Environmental DNA is sequenced, followed by binning and assembly into metagenome-assembled genomes (MAGs). Functional annotation is carried out to predict the gene function within the MAGs. However, the quality and completeness of MAGs vary greatly due to the nature of shotgun sequencing. Large datasets of metagenomic data require large-scale data manipulation and bioinformatic analysis. Genome annotation pipelines (via workflow management tools, e.g. Snakemake) allow automation and ensure reproducibility of the genome annotation. A genome annotation pipeline was developed, using Snakemake, to predict the gene function of MAGs recovered from AD. This pipeline was developed to provide an automated tool to functionally annotate MAGs, in order to discover more about the metabolic processes and relationships between microbes that drive the AD process. A confidence system was devised to indicate the quality of annotations provided by the orthology-based tools EggNOG and KofamScan, allowing further analysis of low-quality ORFs. Reproducibility and reference databases continue to be limitations of bioinformatic pipelines. However, approximately 50% of ORFs are annotated with high confidence.

    The transcriptional landscape of polyploid wheat


    Mapping the Landscape of Mutation Rate Heterogeneity in the Human Genome: Approaches and Applications

    All heritable genetic variation is ultimately the result of mutations that have occurred in the past. Understanding the processes which determine the rate and spectra of new mutations is therefore fundamentally important in efforts to characterize the genetic basis of heritable disease, infer the timing and extent of past demographic events (e.g., population expansion, migration), or identify signals of natural selection. This dissertation aims to describe patterns of mutation rate heterogeneity in detail, identify factors contributing to this heterogeneity, and develop methods and tools to harness such knowledge for more effective and efficient analysis of whole-genome sequencing data. In Chapters 2 and 3, we catalog granular patterns of germline mutation rate heterogeneity throughout the human genome by analyzing extremely rare variants ascertained from large-scale whole-genome sequencing datasets. In Chapter 2, we describe how mutation rates are influenced by local sequence context and various features of the genomic landscape (e.g., histone marks, recombination rate, replication timing), providing detailed insight into the determinants of single-nucleotide mutation rate variation. We show that these estimates reflect genuine patterns of variation among de novo mutations, with broad potential for improving our understanding of the biology of underlying mutation processes and the consequences for human health and evolution. These estimated rates are publicly available at http://mutation.sph.umich.edu/. In Chapter 3, we introduce a novel statistical model to elucidate the variation in rate and spectra of multinucleotide mutations throughout the genome. We catalog two major classes of multinucleotide mutations: those resulting from error-prone translesion synthesis, and those resulting from repair of double-strand breaks. In addition, we identify specific hotspots for these unique mutation classes and describe the genomic features associated with their spatial variation. We show how these multinucleotide mutation processes, along with sample demography and mutation rate heterogeneity, contribute to the overall patterns of clustered variation throughout the genome, promoting a more holistic approach to interpreting the source of these patterns. In Chapter 4, we develop Helmsman, a computationally efficient software tool to infer mutational signatures in large samples of cancer genomes. By incorporating parallelization routines and efficient programming techniques, Helmsman performs this task up to 300 times faster and with a memory footprint 100 times smaller than existing mutation signature analysis software. Moreover, Helmsman is the only such program capable of directly analyzing arbitrarily large datasets. The Helmsman software can be accessed at https://github.com/carjed/helmsman. Finally, in Chapter 5, we present a new method for quality control in large-scale whole-genome sequencing datasets, using a combination of dimensionality reduction algorithms and unsupervised anomaly detection techniques. Just as the mutation spectrum can be used to infer the presence of underlying mechanisms, we show that the spectrum of rare variation is a powerful and informative indicator of sample sequencing quality. Analyzing three large-scale datasets, we demonstrate that our method is capable of identifying samples affected by a variety of technical artifacts that would otherwise go undetected by standard ad hoc filtering criteria. We have implemented this method in a software package, Doomsayer, available at https://github.com/carjed/doomsayer. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/147537/1/jedidiah_1.pd

    The Development and Application of Computational Methods for Genome Annotation

    Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however, many genomes, especially eukaryotic genomes, have not yet been annotated. Ab initio gene prediction is notoriously hard in eukaryotic genomes due to the sparse gene content and introns interrupting genes. Two more promising strategies for annotating eukaryotic genomes are RNA-sequencing followed by transcriptome assembly and/or mapping genes from a closely related species. Current transcriptome assembly methods can assemble either short or long RNA-sequencing reads, each of which has its own weaknesses that limit assembly accuracy. Additionally, there are no standalone tools that can accurately map gene annotations from one assembly to another. Therefore, in this work we first present hybrid-read transcriptome assembly with StringTie, where we combine long and short reads to mitigate the weaknesses of each data type. We show that hybrid-read assembly achieves better accuracy than long-read-only or short-read-only assembly on simulated as well as real RNA-sequencing data from human, Mus musculus, and Arabidopsis thaliana. We then introduce Liftoff, which is a standalone tool that can map gene annotations between assemblies of the same or closely related species. As a proof of concept, we map genes between two versions of the human reference genome and then between the human reference genome and the chimpanzee reference genome. We then describe the results of using Liftoff to annotate three new reference-quality human genome assemblies and a new assembly of the bread wheat genome. Lastly, we introduce LiftoffTools, which is a toolkit that compares the sequence, synteny, and copy number of genes lifted from one assembly to another.