5 research outputs found

    Applied Randomized Algorithms for Efficient Genomic Analysis

    The scope and scale of biological data continue to grow at an exponential clip, driven by advances in genetic sequencing and annotation and by the widespread adoption of surveillance efforts. For instance, the Sequence Read Archive (SRA) now contains more than 25 petabases of public data, while RefSeq, a collection of reference genomes, recently surpassed 100,000 complete genomes. In the process, this data has outgrown the practical reach of many traditional algorithmic approaches in both time and space. Motivated by this extreme scale, this thesis details efficient methods for clustering and summarizing large collections of sequence data. While our primary area of interest is biological sequences, these approaches largely apply to sequence collections of any type, including natural language, software source code, and graph-structured data. We applied recent advances in randomized algorithms to practical problems. We used MinHash and HyperLogLog, both examples of Locality-Sensitive Hashing, as well as coresets, which are approximate representations for finite sum problems, to build methods capable of scaling to billions of items. Ultimately, these are all derived from variations on sampling. We combined these advances with hardware-based optimizations and incorporated them into free and open-source software libraries (sketch, frp, libsimdsampling) and practical software tools built on these libraries (Dashing, Minicore, Dashing 2), empowering users to interact practically with colossal datasets on commodity hardware.
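    As a rough illustration of the MinHash idea named in the abstract above, the following is a minimal Python sketch that estimates the Jaccard similarity of two k-mer sets from fixed-size signatures. It is not the code of sketch, Dashing, or any other library listed; the function names, the choice of BLAKE2b as the hash, and the toy sequences are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(item, seed):
    """Deterministic 64-bit hash of a string under a given seed.

    BLAKE2b is an illustrative choice; any well-mixed hash would do.
    """
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """Keep the minimum hash value under each of num_hashes seeded hashes.

    items must be non-empty, since min() of an empty sequence raises.
    """
    return [min(hash64(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard index."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard index, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Toy sequences (hypothetical), differing by a single substitution.
    a = kmers("ACGTACGTGGCATTACGATCGATCGGCTA", 4)
    b = kmers("ACGTACGTGGCATTACGTTCGATCGGCTA", 4)
    sig_a = minhash_signature(a)
    sig_b = minhash_signature(b)
    print("exact:   ", exact_jaccard(a, b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

    The signature length is fixed regardless of how large the input sets are, which is what lets sketches of this kind scale to very large collections; the estimate's error shrinks roughly with the square root of the number of hashes.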

    J Open Source Softw

    CC999999/ImCDC/Intramural CDC HHS/United States/2022-08-16T00:00:00Z/35978566/PMC9380445/1179

    Mottle: Accurate pairwise substitution distance at high divergence through the exploitation of short-read mappers and gradient descent

    Current tools for estimating the substitution distance between two related sequences struggle to remain accurate at high divergence. Difficulties at distant homologies, such as false seeding and over-alignment, create a high barrier to the development of a stable estimator. This is especially true for viral genomes, which combine a high mutation rate, small size, and sparse taxonomy. Developing an accurate substitution distance measure would help to elucidate the relationship between highly divergent sequences, interrogate their evolutionary history, and better facilitate the discovery of new viral genomes. To tackle these problems, we propose an approach that uses short-read mappers to create whole-genome maps, and gradient descent to isolate the homologous fraction and calculate the final distance value. We implement this approach as Mottle. On simulated and biological sequences, Mottle remained stable up to 0.66–0.96 substitutions per base pair and identified viral outgroup genomes with 95% accuracy at the family-order level. Our results indicate that Mottle performs as well as existing programs in identifying taxonomic relationships, with more accurate numerical estimation of genomic distance over greater divergences. One limitation, compared to alternative approaches, is reduced numerical accuracy at low divergences and on genomes where insertions and deletions are uncommon. We propose that Mottle may therefore be of particular interest in the study of viruses and viral relationships, and notably for viral discovery platforms, helping to benchmark homology search tools and to define the limits of taxonomic classification methods. The code for Mottle is available at https://github.com/tphoward/Mottle_Repo

    Role of mobile genetic elements in the global network of bacterial horizontal gene transfer

    Many bacteria can exchange genetic material through horizontal gene transfer (HGT) mediated by plasmids and plasmid-borne transposable elements. One grave consequence of this exchange is the rapid spread of antibiotic resistance determinants among bacterial communities across the world. In this thesis, I make use of large datasets of publicly available bacterial genomes and various analytical approaches to improve our understanding of the nature and the impact of HGT at a global scale. In the first part, I study the population structure and dynamics of over 10,000 bacterial plasmids. By reconstructing and analysing a network of plasmids based on their shared k-mer content, I was able to sort them into biologically meaningful clusters. This network-based analysis allowed me to make further inferences about the global network of HGT and opened up the prospect of a natural and exhaustive classification framework for bacterial plasmids. The second part focuses on the global spread of blaNDM, an important antibiotic resistance gene. To this end, I compiled a dataset of over 6000 bacterial genomes harbouring this element and developed a novel computational approach to track structural variants surrounding blaNDM across bacterial genomes. This facilitated the identification of prevalent genomic contexts of blaNDM and the reconstruction of key mobile genetic elements and events which led to its global dissemination. Taken together, my results highlight transposable elements as the main drivers of HGT at broad phylogenetic and geographical scales, with plasmid exchange being much more spatially restricted due to adaptation to specific bacterial hosts and evolutionary pressures.
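    To make the idea of a network built from shared k-mer content more concrete, here is a small Python sketch under stated assumptions. The abstract does not say which similarity measure or clustering method the thesis used; the Jaccard index, the fixed threshold, and the use of connected components as clusters are all illustrative choices, and the sequence names and sequences are hypothetical toy data.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(seqs, k, threshold):
    """Link every pair of sequences whose k-mer Jaccard index meets the threshold."""
    sets = {name: kmers(s, k) for name, s in seqs.items()}
    return [
        (u, v)
        for u, v in combinations(sorted(sets), 2)
        if jaccard(sets[u], sets[v]) >= threshold
    ]


def connected_components(nodes, edges):
    """Group nodes into clusters using a simple union-find structure."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=lambda g: sorted(g))


if __name__ == "__main__":
    # Hypothetical toy "plasmids"; real inputs would be full sequences.
    seqs = {
        "p1": "ACGTACGTAC",
        "p2": "ACGTACGTAA",
        "p3": "TTTTGGGGCC",
    }
    edges = similarity_edges(seqs, k=3, threshold=0.5)
    print(edges)
    print(connected_components(seqs, edges))
```

    All-pairs comparison is quadratic in the number of sequences, so at the scale of thousands of plasmids an approximate method would likely be needed in place of the exact set operations shown here.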