22 research outputs found

    Inferring Genomic Sequences

    Recent advances in next-generation sequencing have provided unprecedented opportunities for high-throughput genomic research, inexpensively producing millions of genomic sequences in a single run. Analysis of massive volumes of data results in a more accurate picture of genome complexity and requires adequate bioinformatics support. We explore computational challenges of applying next-generation sequencing to particular applications, focusing on the problem of reconstructing the viral quasispecies spectrum from pyrosequencing shotgun reads and the problem of inferring informative single nucleotide polymorphisms (SNPs) that statistically cover the genetic variation of a genome region in genome-wide association studies. The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software cannot be used to simultaneously assemble and estimate the abundance of multiple closely related (but non-identical) quasispecies sequences. Here, we introduce a new Viral Spectrum Assembler (ViSpA) for inferring the quasispecies spectrum and compare it with the state-of-the-art ShoRAH tool on both synthetic and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. While ShoRAH has an advanced error correction algorithm, ViSpA is better at quasispecies assembly, producing a more accurate reconstruction of a viral population. We also foresee applying ViSpA to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations. Due to the large data volume in genome-wide association studies, it is desirable to find a small subset of SNPs (tags) that covers the genetic variation of the entire set. We explore the trade-off between the number of tags used per non-tagged SNP and possible overfitting and propose an efficient 2LR-Tagging heuristic

    Individual-specific changes in the human gut microbiota after challenge with enterotoxigenic Escherichia coli and subsequent ciprofloxacin treatment

    Acknowledgements: The authors wish to thank Mark Stares, Richard Rance, and other members of the Wellcome Trust Sanger Institute’s 454 sequencing team for generating the 16S rRNA gene data. Lili Fox Vélez provided editorial support. Funding: IA, JNP, and MP were partly supported by the NIH, grants R01-AI-100947 to MP, and R21-GM-107683 to Matthias Chung, subcontract to MP. JNP was partly supported by an NSF graduate fellowship, number DGE750616. IA, JNP, BRL, OCS, and MP were supported in part by the Bill and Melinda Gates Foundation, award number 42917 to OCS. JP and AWW received core funding support from The Wellcome Trust (grant number 098051). AWW and the Rowett Institute of Nutrition and Health, University of Aberdeen, receive core funding support from the Scottish Government Rural and Environmental Science and Analysis Service (RESAS). Peer reviewed.

    De novo likelihood-based measures for comparing genome assemblies

    The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments “read” by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. These “gold standards” can be expensive to produce and may only cover a small fraction of the genome, which limits their applicability to newly generated genome sequences. Here we introduce a de novo probabilistic measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics. We demonstrate that our de novo score can be computed quickly and accurately in a practical setting even for large datasets, by estimating the score from a relatively small sample of the reads. To demonstrate the benefits of our score, we measure the quality of the assemblies generated in the GAGE and Assemblathon 1 assembly “bake-offs” with our metric. Even without knowledge of the true reference sequence, our de novo metric closely matches the reference-based evaluation metrics used in the studies and outperforms other de novo metrics traditionally used to measure assembly quality (such as N50). Finally, we highlight the application of our score to optimize assembly parameters used in genome assemblers, which enables better assemblies to be produced, even without prior knowledge of the genome being assembled. 
Likelihood-based measures, such as the one proposed here, will become the new standard for de novo assembly evaluation. https://doi.org/10.1186/1756-0500-6-33
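
The idea of scoring an assembly by the conditional probability of the reads can be illustrated with a minimal sketch. This is not the authors' formulation: it assumes a toy read model (uniform start position, independent substitution errors at a fixed `error_rate`, forward strand only, no indels), and every function name below is hypothetical.

```python
import math

def read_log_prob(read, assembly, error_rate=0.01):
    """log P(read | assembly) under a toy model (assumption, not the paper's):
    the read starts at a uniformly random position and each base is
    independently miscalled with probability error_rate."""
    n_starts = len(assembly) - len(read) + 1
    if n_starts <= 0:
        return float("-inf")  # read cannot be placed on this assembly
    total = 0.0
    for start in range(n_starts):
        p = 1.0
        for i, base in enumerate(read):
            if assembly[start + i] == base:
                p *= 1.0 - error_rate
            else:
                p *= error_rate / 3.0  # one of three wrong bases
        total += p
    return math.log(total / n_starts)

def assembly_log_likelihood(reads, assembly, error_rate=0.01):
    """Sum of per-read log-probabilities, treating reads as independent."""
    return sum(read_log_prob(r, assembly, error_rate) for r in reads)

# Toy data: error-free reads drawn from a made-up "true" sequence.
true_genome = "ACGTTGCAAGGCTTAACCGT"
misassembled = "ACGTTGCATGGCTTAACCGT"  # single substitution at position 8
reads = [true_genome[i:i + 8] for i in range(0, 13)]

score_true = assembly_log_likelihood(reads, true_genome)
score_bad = assembly_log_likelihood(reads, misassembled)
```

In this toy setting the sequence that generated the reads receives the higher score, which mirrors the property the abstract highlights. Scoring only a random subset of `reads` would give a rough estimate in the spirit of the sampling described above.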

    Original Article

    The use of malaria therapy for neurosyphilis has declined since penicillin and other antibiotics appeared, and the number of neurosyphilis patients has recently decreased. Nevertheless, malaria therapy remains one of the most effective therapies for neurosyphilis, so a simple way of keeping malaria blood alive outside the body is needed. The results were as follows: 1) The temperature at which malaria blood was kept determined its fate. Preservation at 4℃ or -20℃ was not suitable for keeping malaria blood alive for long. 2) Glycerin (1.2 to 2.5 mol.) was added to a mixture of 4 parts malaria blood to 1 part ACD solution (anticoagulant), which was then frozen rapidly at -79℃, thawed quickly, and injected intramuscularly into 65 subjects; infection was successfully established in 54 subjects with no history of malaria. The storage period was 3 to 242 days. The incubation period was 12 to 28 days, with an average of 14.6 days. At present, the longest preservation period is 242 days. Because incubation was only slightly prolonged after long preservation, and judging from the parasite figures in Giemsa-stained smears, preservation for longer than 242 days may be possible. This method is simple and practical for malaria preservation. Whether the blood remained effective depended on the number of parasites in the blood before freezing. 3) Although the freeze-drying method did not succeed this time, it may prove feasible, as suggested by observations of the reconstruction of malaria parasites in glycerin. 4) As described above, glycerin is also effective for the frozen preservation of malaria protozoa

    Inferring viral quasispecies spectra from 454 pyrosequencing reads

    Background: RNA viruses infecting a host usually exist as a set of closely related sequences, referred to as quasispecies. The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Experimental results show that ViSpA outperforms ShoRAH on simulated error-free reads, correctly assembling 10 out of 10 quasispecies and 29 sequences out of 40 quasispecies. While ShoRAH has a significant advantage over ViSpA on reads simulated with sequencing errors due to its advanced error correction algorithm, ViSpA is better at assembling the simulated reads after they have been corrected by ShoRAH. ViSpA also outperforms ShoRAH on real 454 reads. Indeed, the 7 most frequent sequences reconstructed by ViSpA from a real HCV dataset are viable (do not contain internal stop codons), and the most frequent sequence was within 1% of the actual open reading frame obtained by cloning and Sanger sequencing. In contrast, only one of the sequences reconstructed by ShoRAH is viable. On a real HIV dataset, ShoRAH correctly inferred only 2 quasispecies sequences with at most 4 mismatches, whereas ViSpA correctly reconstructed 5 quasispecies with at most 2 mismatches, and 2 out of 5 sequences were inferred without any mismatches. ViSpA source code is available at http://alla.cs.gsu.edu/~software/VISPA/vispa.html. Conclusions: ViSpA enables accurate viral quasispecies spectrum reconstruction from 454 pyrosequencing reads. We are currently exploring extensions applicable to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations.
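
The viability criterion mentioned above (no internal stop codons) can be checked with a short sketch. It is not taken from ViSpA: it assumes the standard genetic code, a known reading frame, and that a stop codon in the final position is acceptable. The function names are hypothetical.

```python
# Stop codons of the standard genetic code (assumed to apply here).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def has_internal_stop(seq, frame=0):
    """Return True if any in-frame codon other than the last is a stop codon."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    # Split into complete codons only; trailing partial codons are ignored.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])

def is_viable(seq, frame=0):
    """A reconstructed sequence counts as viable if it has no internal stop."""
    return not has_internal_stop(seq, frame)
```

Only in-frame codons are examined, so a sequence such as `ATGAAATAA` passes even though `TGA` occurs out of frame, while `ATGTAGAAATAA` fails because `TAG` is its second codon.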