142 research outputs found

    Recent advances in inferring viral diversity from high-throughput sequencing data

    Rapidly evolving RNA viruses prevail within a host as a collection of closely related variants, referred to as viral quasispecies. Advances in high-throughput sequencing (HTS) technologies have facilitated the assessment of the genetic diversity of such virus populations at an unprecedented level of detail. However, analysis of HTS data from virus populations is challenging due to short, error-prone reads. In order to account for uncertainties originating from these limitations, several computational and statistical methods have been developed for studying the genetic heterogeneity of virus populations. Here, we review methods for the analysis of HTS reads, including approaches to local diversity estimation and global haplotype reconstruction. Challenges posed by aligning reads, as well as the impact of reference biases on diversity estimates, are also discussed. In addition, we address some of the experimental approaches designed to improve the biological signal-to-noise ratio. In the future, computational methods for the analysis of heterogeneous virus populations are likely to continue being complemented by technological developments. ISSN: 0168-170

    Analysis of Next-generation Sequencing Data in Virology - Opportunities and Challenges

    Viruses are the most abundant and the smallest organisms, which are relatively simple to sequence. Genome sequence data of viruses for individual species to populations outnumber that of other species. Although this offers an opportunity to study viral diversity at varying levels of taxonomic hierarchy, it also poses challenges for systematic and structured organization of data and its downstream processing. Extensive computational analyses using a number of algorithms and programs have opened exciting opportunities for virus discovery and diagnostics, apart from augmenting our understanding of the intriguing world of viruses. Unravelling evolutionary dynamics of viruses permits improved understanding of phenomena such as quasispecies diversity, role of mutations in host switching and drug resistance, which enables the tangible measurements of genotype and phenotype of viruses. Improved understanding of geno-/serotype diversity in correlation with antigenic diversity will facilitate rational design and development of efficacious vaccines against emerging and re-emerging viruses. Mathematical models developed using the genomic data could be used to predict the spread of viruses due to vector switching and the (re)emergence due to host switching and, thereby, contribute towards designing public health policies for disease management and control

    Accurate reconstruction of viral quasispecies spectra through improved estimation of strain richness

    Background: Estimating the number of different species (richness) in a mixed microbial population has been a main focus in metagenomic research. Existing methods of species richness estimation ride on the assumption that the reads in each assembled contig correspond to only one of the microbial genomes in the population. This assumption and the underlying probabilistic formulations of existing methods are not useful for quasispecies populations where the strains are highly genetically related. The lack of knowledge on the number of different strains in a quasispecies population is observed to hinder the precision of existing Viral Quasispecies Spectrum Reconstruction (QSR) methods due to the uncontrolled reconstruction of a large number of in silico false positives. In this work, we formulated a novel probabilistic method for strain richness estimation specifically targeting viral quasispecies. By using this approach we improved our recently proposed spectrum reconstruction pipeline ViQuaS to achieve higher levels of precision in reconstructed quasispecies spectra without compromising the recall rates. We also discuss how one other existing popular QSR method named ShoRAH can be improved using this new approach. Results: On benchmark data sets, our estimation method provided accurate richness estimates (< 0.2 median estimation error) and improved the precision of ViQuaS by 2%-13% and F-score by 1%-9% without compromising the recall rates. We also demonstrate that our estimation method can be used to improve the precision and F-score of ShoRAH by 0%-7% and 0%-5% respectively. Conclusions: The proposed probabilistic estimation method can be used to estimate the richness of viral populations with a quasispecies behavior and to improve the accuracy of the quasispecies spectra reconstructed by the existing methods ViQuaS and ShoRAH in the presence of a moderate level of technical sequencing errors

    Accurate Viral Population Assembly From Ultra-Deep Sequencing Data

    Motivation: Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors. Results: In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation–maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads

    Algorithms for analysis of next-generation viral sequencing data

    RNA viruses mutate at extremely high rates, forming an intra-host viral population of closely related variants, which allows them to evade the host’s immune system and makes them particularly dangerous. Viral outbreaks pose a significant threat to public health. Progress in sequencing technologies has made it possible to identify and sample intra-host viral populations at great depth. Consequently, the contribution of sequencing technologies to molecular surveillance of viral outbreaks is becoming more and more substantial. Genome sequencing of viral populations reveals similarities between samples, allows viral genetic distance to be measured, and facilitates outbreak identification and isolation. Computational methods can be used to infer transmission characteristics from sequencing data. However, due to the specifics of next-generation sequencing (NGS) approaches and the limited availability of viral data, existing methods lack accuracy and efficiency. In this dissertation, I present novel, flexible methods that allow tackling crucial epidemiological problems, such as identification of transmission clusters, sources of infection, and transmission direction

    Algorithms for Viral Population Analysis

    The genetic structure of an intra-host viral population has an effect on many clinically important phenotypic traits such as escape from vaccine induced immunity, virulence, and response to antiviral therapies. Next-generation sequencing provides read-coverage sufficient for genomic reconstruction of a heterogeneous, yet highly similar, viral population; and more specifically, for the detection of rare variants. Admittedly, while depth is less of an issue for modern sequencers, the short length of generated reads complicates viral population assembly. This task is worsened by the presence of both random and systematic sequencing errors in huge amounts of data. In this dissertation I present completed work for reconstructing a viral population given next-generation sequencing data. Several algorithms are described for solving this problem under the error-free amplicon (or sliding-window) model. In order for these methods to handle actual real-world data, an error-correction method is proposed. A formal derivation of its likelihood model along with optimization steps for an EM algorithm are presented. Although these methods perform well, they cannot take into account paired-end sequencing data. In order to address this, a new method is detailed that works under the error-free paired-end case along with maximum a-posteriori estimation of the model parameters

    Analysis of NGS Data from Immune Response and Viral Samples

    This thesis is devoted to designing and applying advanced algorithmic and statistical tools for analysis of NGS data related to cancer and infectious diseases. NGS data under investigation are obtained either from host samples or viral variants. Recently, random peptide phage display libraries (RPPDL) were applied to studies of the host's antibody response to different diseases. We study human antibody response to breast cancer and mouse antibody response to Lyme disease by sequencing the whole antibody repertoire profiles, which are represented by RPPDL. Alternatively, instead of sequencing the immune response, NGS can be applied directly to a viral population within an infected host. Specifically, we analyze the following RNA viruses: the human immunodeficiency virus (HIV) and the infectious bronchitis virus (IBV). Sequencing of RNA viruses is challenging because there are many variants within a population due to the high mutation rate. Our results show that NGS helps to understand RNA viruses and explore their interaction with infected hosts. NGS also helps to analyze the immune response to different diseases and to trace changes in the immune response at different disease stages

    SAMFIRE: multi-locus variant calling for time-resolved sequence data

    An increasingly common method for studying evolution is the collection of time-resolved short-read sequence data. Such datasets allow for the direct observation of rapid evolutionary processes, as might occur in natural microbial populations and in evolutionary experiments. In many circumstances, evolutionary pressure acting upon single variants can cause genomic changes at multiple nearby loci. SAMFIRE is an open-access software package for processing and analysing sequence reads from time-resolved data, calling important single- and multi-locus variants over time, identifying alleles potentially affected by selection, calculating linkage disequilibrium statistics, performing haplotype reconstruction, and exploiting time-resolved information to estimate the extent of uncertainty in reported genomic data. CI was supported by a Sir Henry Dale Fellowship, jointly funded by the Wellcome Trust and the Royal Society (Grant Number 101239/Z/13/Z). This is the author accepted manuscript. The final version is available from Oxford University Press via http://dx.doi.org/10.1093/bioinformatics/btw20