HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors.
MOTIVATION: Genomic distance estimation is a critical workload because exact computation of whole-genome similarity metrics such as Average Nucleotide Identity (ANI) incurs prohibitive runtime overhead. Genome sketching is a fast and memory-efficient way to estimate ANI similarity by distilling representative k-mers from the original sequences. In this work, we present HyperGen, which improves accuracy, runtime performance, and memory efficiency for large-scale ANI estimation. Unlike existing genome sketching algorithms that convert large genome files into discrete k-mer hashes, HyperGen leverages emerging hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (hypervectors, HVs) in high-dimensional space. HVs are compact and can preserve more information, allowing for accurate ANI estimation while reducing required sketch sizes. In particular, the HV sketch representation in HyperGen allows efficient ANI estimation using vector multiplication, which naturally benefits from highly optimized general matrix multiply (GEMM) routines. As a result, HyperGen enables efficient sketching and ANI estimation for massive genome collections. RESULTS: We evaluate HyperGen's sketching and database search performance using several genome datasets at various scales. HyperGen achieves comparable or superior ANI estimation error and linearity compared to other sketch-based counterparts. The measurement results show that HyperGen is one of the fastest tools for both genome sketching and database search. Meanwhile, HyperGen produces memory-efficient sketch files while ensuring high ANI estimation accuracy. AVAILABILITY: A Rust implementation of HyperGen is freely available under the MIT license as an open-source software project at https://github.com/wh-xu/Hyper-Gen. The scripts to reproduce the experimental results can be accessed at https://github.com/wh-xu/experiment-hyper-gen.
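The abstract does not describe HyperGen's encoding in detail, so the following is only a generic illustration of the idea it names: representing a k-mer set as a single high-dimensional vector and comparing genomes with dot products. It is a minimal sketch under stated assumptions, not HyperGen's implementation. The dimension `DIM`, the function names (`kmer_hv`, `hv_sketch`, `hv_jaccard`, `hv_ani`), the use of bipolar (+1/-1) vectors, and the Mash-style conversion from Jaccard similarity to ANI are all assumptions made for illustration.

```python
import hashlib
import math
import random

DIM = 2048  # hypervector dimension (an assumed value, chosen for illustration)


def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hv(kmer):
    """Map a k-mer to a pseudo-random bipolar (+1/-1) vector seeded by its hash."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    bits = random.Random(int.from_bytes(digest, "big")).getrandbits(DIM)
    return [1 if (bits >> i) & 1 else -1 for i in range(DIM)]


def hv_sketch(seq, k=8):
    """Sum the vectors of all distinct k-mers into one DIM-length sketch."""
    hv = [0] * DIM
    for kmer in kmer_set(seq, k):
        for i, v in enumerate(kmer_hv(kmer)):
            hv[i] += v
    return hv


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hv_jaccard(a, b):
    """Estimate Jaccard similarity of the underlying k-mer sets.

    Random bipolar vectors are nearly orthogonal, so in expectation
    dot(a, b) / DIM is the intersection size and dot(a, a) / DIM the set size.
    """
    inter = dot(a, b) / DIM
    union = dot(a, a) / DIM + dot(b, b) / DIM - inter
    if union <= 0:
        return 0.0
    return min(1.0, max(0.0, inter / union))


def hv_ani(j, k=8):
    """Mash-style conversion (an assumption here): ANI ~ 1 + ln(2j / (1 + j)) / k."""
    if j <= 0:
        return 0.0
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k)
```

For a collection of genomes, stacking such sketches as rows of a matrix would turn all pairwise dot products into a single matrix product, which is the kind of operation that GEMM routines accelerate.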
Applied Randomized Algorithms for Efficient Genomic Analysis
The scope and scale of biological data continue to grow at an exponential clip, driven by advances in genetic sequencing and annotation and by the widespread adoption of surveillance efforts. For instance, the Sequence Read Archive (SRA) now contains more than 25 petabases of public data, while RefSeq, a collection of reference genomes, recently surpassed 100,000 complete genomes. In the process, this data has outgrown the practical reach of many traditional algorithmic approaches in both time and space.
Motivated by this extreme scale, this thesis details efficient methods for clustering and summarizing large collections of sequence data. While our primary area of interest is biological sequences, these approaches largely apply to sequence collections of any type, including natural language, software source code, and graph structured data.
We applied recent advances in randomized algorithms to practical problems. We used MinHash and HyperLogLog, both examples of Locality-Sensitive Hashing, as well as coresets, which are approximate representations for finite sum problems, to build methods capable of scaling to billions of items. Ultimately, these are all derived from variations on sampling.
We combined these advances with hardware-based optimizations and incorporated them into free and open-source software libraries (sketch, frp, libsimdsampling) and practical software tools built on these libraries (Dashing, Minicore, Dashing 2), empowering users to interact practically with colossal datasets on commodity hardware.
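As an illustration of the sampling idea behind MinHash mentioned in this abstract, the following is a minimal bottom-k MinHash sketch for estimating the Jaccard similarity of two sequences' k-mer sets. It is a generic textbook variant written for this listing, not the code of the libraries or tools named above. The function names, the BLAKE2b hash, and the default parameters `k=8` and `size=64` are assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def bottom_k_sketch(seq, k=8, size=64):
    """Keep only the `size` smallest k-mer hashes as the sketch."""
    return sorted({hash64(x) for x in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=64):
    """Estimate Jaccard similarity from two bottom-k sketches.

    The `size` smallest hashes of the union form a random sample of the
    union; the fraction of them present in both sketches estimates the
    Jaccard similarity.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)
```

Each sequence is reduced to at most `size` integers regardless of its length, so comparing two sketches costs the same whether the inputs are short plasmids or whole genomes.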
J Open Source Softw
CC999999/ImCDC/Intramural CDC HHS/United States; 2022-08-16; PMID 35978566; PMC9380445; 1179
Mottle: Accurate pairwise substitution distance at high divergence through the exploitation of short-read mappers and gradient descent
Current tools for estimating the substitution distance between two related sequences struggle to remain accurate at high divergence. Difficulties at distant homologies, such as false seeding and over-alignment, create a high barrier to the development of a stable estimator. This is especially true for viral genomes, which combine a high mutation rate, small size, and sparse taxonomy. An accurate substitution distance measure would help to elucidate the relationship between highly divergent sequences, interrogate their evolutionary history, and better facilitate the discovery of new viral genomes. To tackle these problems, we propose an approach that uses short-read mappers to create whole-genome maps, and gradient descent to isolate the homologous fraction and calculate the final distance value. We implement this approach as Mottle. On simulated and biological sequences, Mottle remained stable up to 0.66–0.96 substitutions per base pair and identified viral outgroup genomes with 95% accuracy at the family-order level. Our results indicate that Mottle performs as well as existing programs in identifying taxonomic relationships, with more accurate numerical estimation of genomic distance over greater divergences. One limitation, compared to alternative approaches, is reduced numerical accuracy at low divergences and on genomes where insertions and deletions are uncommon. We propose that Mottle may therefore be of particular interest in the study of viruses and viral relationships, and notably for viral discovery platforms, helping to benchmark homology search tools and to define the limits of taxonomic classification methods. The code for Mottle is available at https://github.com/tphoward/Mottle_Repo.
Role of mobile genetic elements in the global network of bacterial horizontal gene transfer
Many bacteria can exchange genetic material through horizontal gene transfer (HGT) mediated by plasmids and plasmid-borne transposable elements. One grave consequence of this exchange is the rapid spread of antibiotic resistance determinants among bacterial communities across the world. In this thesis, I make use of large datasets of publicly available bacterial genomes and various analytical approaches to improve our understanding of the nature and the impact of HGT at a global scale. In the first part, I study the population structure and dynamics of over 10,000 bacterial plasmids. By reconstructing and analysing a network of plasmids based on their shared k-mer content, I was able to sort them into biologically meaningful clusters. This network-based analysis allowed me to make further inferences about the global network of HGT and opened up the prospect of a natural and exhaustive classification framework for bacterial plasmids. The second part focuses on the global spread of blaNDM, an important antibiotic resistance gene. To this end, I compiled a dataset of over 6000 bacterial genomes harbouring this element and developed a novel computational approach to track structural variants surrounding blaNDM across bacterial genomes. This facilitated the identification of prevalent genomic contexts of blaNDM and the reconstruction of key mobile genetic elements and events that led to its global dissemination. Taken together, my results highlight transposable elements as the main drivers of HGT at broad phylogenetic and geographical scales, with plasmid exchange being much more spatially restricted due to adaptation to specific bacterial hosts and evolutionary pressures.