9 research outputs found

    Computational Methods for Strain-Level Microbial Detection in Colony and Metagenome Sequencing Data

    Metagenomic sequencing is a powerful tool for examining the diversity and complexity of microbial communities. Most widely used tools for taxonomic profiling of metagenomic sequence data allow for a species-level overview of the composition. However, individual strains within a species can differ greatly in key genotypic and phenotypic characteristics, such as drug resistance, virulence and growth rate. Therefore, the ability to resolve microbial communities down to the level of individual strains within a species is critical to interpreting metagenomic data for clinical and environmental applications, where identifying a particular strain, or tracking a particular strain across a set of samples, can aid in clinical diagnosis and treatment, or in characterizing as-yet unstudied strains across novel environmental locations. Recently published approaches have begun to tackle the problem of resolving strains within a particular species in metagenomic samples. In this review, we present an overview of these new algorithms and their uses, including methods based on assembly reconstruction and methods operating with or without a reference database. While existing metagenomic analysis methods show reasonable performance at the species and higher taxonomic levels, identifying closely related strains within a species presents a bigger challenge, due to the diversity of databases, genetic relatedness, and goals when conducting these analyses. Selection of which metagenomic tool to employ for a specific application should be performed on a case-by-case basis, as these tools have strengths and weaknesses that affect their performance on specific tasks. A comprehensive benchmark across different use case scenarios is vital to validate performance of these tools on microbial samples. Because strain-level metagenomic analysis is still in its infancy, development of more fine-grained, high-resolution algorithms will continue to be in demand for the future.
    Pattern Recognition and Bioinformatics

    SAFPred: Synteny-aware gene function prediction for bacteria using protein embeddings

    Motivation: Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for bacteria. Recently, transformer-based language models, adopted from the natural language processing field, have been used to obtain new representations of proteins, to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications to bacterial genomes. Results: To predict gene functions in bacteria, we developed SAFPred, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAFPred also leverages the unique operon structure of bacteria through conserved synteny. SAFPred outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAFPred to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health. Availability and implementation: https://github.com/AbeelLab/safpred
    Pattern Recognition and Bioinformatics

    Deciphering drug resistance in Mycobacterium tuberculosis using whole-genome sequencing: Progress, promise, and challenges

    Tuberculosis (TB) is a global infectious threat that is intensified by an increasing incidence of highly drug-resistant disease. Whole-genome sequencing (WGS) studies of Mycobacterium tuberculosis, the causative agent of TB, have greatly increased our understanding of this pathogen. Since the first M. tuberculosis genome was published in 1998, WGS has provided a more complete account of the genomic features that cause resistance in populations of M. tuberculosis, has helped to fill gaps in our knowledge of how both classical and new antitubercular drugs work, and has identified specific mutations that allow M. tuberculosis to escape the effects of these drugs. WGS studies have also revealed how resistance evolves both within an individual patient and within patient populations, including the important roles of de novo acquisition of resistance and clonal spread. These findings have informed decisions about which drug-resistance mutations should be included on extended diagnostic panels. From its origins as a basic science technique, WGS of M. tuberculosis is becoming part of the modern clinical microbiology laboratory, promising rapid and improved detection of drug resistance, and detailed and real-time epidemiology of TB outbreaks. We review the successes and highlight the challenges that remain in applying WGS to improve the control of drug-resistant TB through monitoring its evolution and spread, and to inform more rapid and effective diagnostic and therapeutic strategies.
    Pattern Recognition and Bioinformatics

    QuantTB- A method to classify mixed Mycobacterium tuberculosis infections within whole genome sequencing data

    Background: Mixed infections of Mycobacterium tuberculosis and antibiotic heteroresistance continue to complicate tuberculosis (TB) diagnosis and treatment. Detection of mixed infections has been limited to molecular genotyping techniques, which lack the sensitivity and resolution to accurately estimate the multiplicity of TB infections. In contrast, whole genome sequencing offers sensitive views of the genetic differences between strains of M. tuberculosis within a sample. Although metagenomic tools exist to classify strains in a metagenomic sample, most tools have been developed for more divergent species, and therefore cannot provide the sensitivity required to disentangle strains within closely related bacterial species such as M. tuberculosis. Here we present QuantTB, a method to identify and quantify individual M. tuberculosis strains in whole genome sequencing data. QuantTB uses SNP markers to determine the combination of strains that best explains the allelic variation observed in a sample. QuantTB outputs a list of identified strains, their corresponding relative abundances, and a list of drugs for which resistance-conferring mutations (or heteroresistance) have been predicted within the sample. Results: We show that QuantTB has a high degree of resolution and is capable of differentiating communities differing by less than 25 SNPs and identifying strains down to 1× coverage. Using simulated data, we found QuantTB outperformed other metagenomic strain identification tools at detecting strains and quantifying strain multiplicity. In a real-world scenario, using a dataset of 50 paired clinical isolates from a study of patients with either reinfections or relapses, we found that QuantTB could detect mixed infections and reinfections at rates concordant with a manually curated approach. Conclusion: QuantTB can determine infection multiplicity, identify heteroresistance patterns, enable differentiation between relapse and reinfection, and clarify transmission events across seemingly unrelated patients, even in low-coverage (1×) samples. QuantTB outperforms existing tools and promises to serve as a valuable resource for both clinicians and researchers working with clinical TB samples.
    Pattern Recognition and Bioinformatics; Electrical Engineering, Mathematics and Computer Science

    Reply to Lee and Howden

    Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project (https://www.openaccess.nl/en/you-share-we-take-care). Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.
    Pattern Recognition and Bioinformatics

    Extensive global movement of multidrug-resistant M. tuberculosis strains revealed by whole-genome analysis

    Background: While the international spread of multidrug-resistant (MDR) Mycobacterium tuberculosis strains is an acknowledged public health threat, a broad and more comprehensive examination of the global spread of MDR-tuberculosis (TB) using whole-genome sequencing has not yet been performed. Methods: In a global dataset of 5310 M. tuberculosis whole-genome sequences isolated from five continents, we performed a phylogenetic analysis to identify and characterise clades of MDR-TB with respect to geographic dispersion. Results: Extensive international dissemination of MDR-TB was observed, with identification of 32 migrant MDR-TB clades with descendants isolated in 17 unique countries. Relatively recent movement of strains from both Beijing and non-Beijing lineages indicated successful global spread of varied genetic backgrounds. Migrant MDR-TB clade members shared relatively recent common ancestry, with a median estimate of divergence of 13-27 years. Migrant extensively drug-resistant (XDR)-TB clades were not observed, although development of XDR-TB within migratory MDR-TB clades was common. Conclusions: Application of genomic techniques to investigate global MDR migration patterns revealed extensive global spread of MDR clades between countries of varying TB burden. Further expansion of genomic studies to incorporate isolates from diverse global settings into a single analysis, as well as data sharing platforms that facilitate genomic data sharing across country lines, may allow for future epidemiological analyses to monitor for international transmission of MDR-TB. In addition, efforts to perform routine whole-genome sequencing on all newly identified M. tuberculosis, as in England, will serve to better our understanding of the transmission dynamics of MDR-TB globally.
    Pattern Recognition and Bioinformatics

    SynerClust: A highly scalable, synteny-aware orthologue clustering tool

    Accurate orthologue identification is a vital component of bacterial comparative genomic studies, but many popular sequence-similarity-based approaches do not scale well to the large numbers of genomes that are now generated routinely. Furthermore, most approaches do not take gene synteny into account, which is useful information for disentangling paralogues. Here, we present SynerClust, a user-friendly synteny-aware tool based on synergy that can process thousands of genomes. SynerClust was designed to analyse genomes with high levels of local synteny, particularly prokaryotes, which have operon structure. SynerClust’s run-time is optimized by selecting cluster representatives at each node in the phylogeny, thus avoiding the need for exhaustive pairwise similarity searches. In benchmarking against Roary, Hieranoid2, PanX and Reciprocal Best Hit, SynerClust was able to more completely identify sets of core genes for datasets that included diverse strains, while using substantially less memory, and with scalability comparable to the fastest tools. Due to its scalability, ease of installation and use, and suitability for a variety of computing environments, orthogroup clustering using SynerClust will enable many large-scale prokaryotic comparative genomics efforts.
    Pattern Recognition and Bioinformatics

    Mycobacterium tuberculosis Whole Genome Sequences From Southern India Suggest Novel Resistance Mechanisms and the Need for Region-Specific Diagnostics

    Background. India is home to 25% of all tuberculosis cases and the second highest number of multidrug resistant cases worldwide. However, little is known about the genetic diversity and resistance determinants of Indian Mycobacterium tuberculosis, particularly for the primary lineages found in India, lineages 1 and 3. Methods. We whole genome sequenced 223 randomly selected M. tuberculosis strains from 196 patients within the Tiruvallur and Madurai districts of Tamil Nadu in Southern India. Using comparative genomics, we examined genetic diversity, transmission patterns, and evolution of resistance. Results. Genomic analyses revealed (1) prevalence of strains from lineages 1 and 3, (2) recent transmission of strains among patients from the same treatment centers, (3) emergence of drug resistance within patients over time, (4) resistance gained in an order typical of strains from different lineages and geographies, (5) underperformance of known resistance-conferring mutations to explain phenotypic resistance in Indian strains relative to studies focused on other geographies, and (6) the possibility that resistance arose through mutations not previously implicated in resistance, or through infections with multiple strains that confound genotype-based prediction of resistance. Conclusions. In addition to substantially expanding the genomic perspectives of lineages 1 and 3, sequencing and analysis of M. tuberculosis whole genomes from Southern India highlight challenges of infection control and rapid diagnosis of resistant tuberculosis using current technologies. Further studies are needed to fully explore the complement of diversity and resistance determinants within endemic M. tuberculosis populations.
    Pattern Recognition and Bioinformatics

    StrainGE: a toolkit to track and characterize low-abundance strains in complex microbial communities

    Human-associated microbial communities comprise not only complex mixtures of bacterial species, but also mixtures of conspecific strains, the implications of which are mostly unknown since strain level dynamics are underexplored due to the difficulties of studying them. We introduce the Strain Genome Explorer (StrainGE) toolkit, which deconvolves strain mixtures and characterizes component strains at the nucleotide level from short-read metagenomic sequencing with higher sensitivity and resolution than other tools. StrainGE is able to identify strains at 0.1x coverage and detect variants for multiple conspecific strains within a sample from coverages as low as 0.5x.
    Pattern Recognition and Bioinformatics