96 research outputs found

    Nephele: genotyping via complete composition vectors and MapReduce

    Background: Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism, mainly due to the presence of insertions, deletions and rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for the rapidly growing sequence datasets generated by next-generation DNA sequencing machines, because their computational complexity increases quickly with the number of sequences. Results: Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours. Conclusions: We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome-scale sequence coverage.
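The alignment-free idea in the Nephele abstract (turn each sequence into a vector of k-mer statistics, then compare vectors) can be illustrated with a minimal sketch. This is an assumption-laden simplification, not Nephele's code: it uses plain k-mer frequencies rather than the complete composition vector, cosine distance as a stand-in for the unspecified distance measure, and the function names `kmer_vector` and `cosine_distance` are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq, k=3):
    """Map a sequence to a sparse vector of overlapping k-mer frequencies.

    Simplified stand-in for a composition vector (assumption): raw
    frequencies only, with no background-model correction.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two sparse vectors (dicts)."""
    dot = sum(x * v.get(kmer, 0.0) for kmer, x in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy sequences; no alignment is needed to compare them.
a = kmer_vector("ACGTACGTACGT")
b = kmer_vector("ACGTACGAACGT")
print(cosine_distance(a, b))
```

A pairwise matrix of such distances is the kind of input a clustering step, such as the affinity propagation mentioned in the abstract, would consume.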

    A phylogeny of birds based on over 1,500 loci collected by target enrichment and high-throughput sequencing

    Evolutionary relationships among birds in Neoaves, the clade comprising the vast majority of avian diversity, have vexed systematists due to the ancient, rapid radiation of numerous lineages. We applied a new phylogenomic approach to resolve relationships in Neoaves using target enrichment (sequence capture) and high-throughput sequencing of ultraconserved elements (UCEs) in avian genomes. We collected sequence data from UCE loci for 32 members of Neoaves and one outgroup (chicken) and analyzed data sets that differed in their amount of missing data. An alignment of 1,541 loci that allowed missing data was 87% complete and resulted in a highly resolved phylogeny with broad agreement between the Bayesian and maximum-likelihood (ML) trees. Although results from the 100% complete matrix of 416 UCE loci were similar, the Bayesian and ML trees differed to a greater extent in this analysis, suggesting that increasing from 416 to 1,541 loci led to increased stability and resolution of the tree. Novel results of our study include surprisingly close relationships between phenotypically divergent bird families, such as tropicbirds (Phaethontidae) and the sunbittern (Eurypygidae) as well as between bustards (Otididae) and turacos (Musophagidae). This phylogeny bolsters support for monophyletic waterbird and landbird clades and also strongly supports controversial results from previous studies, including the sister relationship between passerines and parrots and the non-monophyly of raptorial birds in the hawk and falcon families. Although significant challenges remain to fully resolving some of the deep relationships in Neoaves, especially among lineages outside the waterbirds and landbirds, this study suggests that increased data will yield an increasingly resolved avian phylogeny. Comment: 30 pages, 1 table, 4 figures, 1 supplementary table, 3 supplementary figures

    Open Reading Frame Phylogenetic Analysis on the Cloud

    MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees

    Background: MapReduce is a parallel framework that has been used effectively to design large-scale parallel applications for large computing clusters. In this paper, we evaluate the viability of the MapReduce framework for designing phylogenetic applications. The problem of interest is generating the all-to-all Robinson-Foulds distance matrix, which has many applications for visualizing and clustering large collections of evolutionary trees. We introduce MrsRF (MapReduce Speeds up RF), a multi-core algorithm to generate a t × t Robinson-Foulds distance matrix between t trees using the MapReduce paradigm. Results: We studied the performance of our MrsRF algorithm on two large biological tree sets consisting of 20,000 trees of 150 taxa each and 33,306 trees of 567 taxa each. Our experiments show that MrsRF is a scalable approach reaching a speedup of over 18 on 32 total cores. Our results also show that achieving top speedup on a multi-core cluster requires different cluster configurations. Finally, we show how to use an RF matrix to summarize collections of phylogenetic trees visually. Conclusion: Our results show that MapReduce is a promising paradigm for developing multi-core phylogenetic applications. The results also demonstrate that different multi-core configurations must be tested in order to obtain optimum performance. We conclude that RF matrices play a critical role in developing techniques to summarize large collections of trees.
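For readers unfamiliar with the t × t Robinson-Foulds matrix mentioned in the MrsRF abstract, the following is a small serial sketch of the underlying distance. It is not the MrsRF MapReduce algorithm. It assumes trees are given as nested tuples of leaf names over the same taxon set, and the helper names (`leaf_sets`, `bipartitions`, `rf_distance`, `rf_matrix`) are hypothetical.

```python
from itertools import combinations


def leaf_sets(tree, out):
    """Return the leaves under `tree`, adding every internal clade to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.add(leaves)
    return leaves


def bipartitions(tree):
    """Return the non-trivial splits of a tree, treated as unrooted.

    Each split is stored as the side that does NOT contain a fixed
    anchor taxon, so the same split always has the same representation.
    """
    clades = set()
    all_leaves = leaf_sets(tree, clades)
    anchor = min(all_leaves)
    splits = set()
    for clade in clades:
        side = all_leaves - clade if anchor in clade else clade
        # Skip trivial splits (empty side, single leaf, or all but one leaf).
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
    return splits


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in only one tree."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


def rf_matrix(trees):
    """Build the symmetric t x t matrix of pairwise RF distances."""
    splits = [bipartitions(t) for t in trees]
    matrix = [[0] * len(trees) for _ in trees]
    for i, j in combinations(range(len(trees)), 2):
        matrix[i][j] = matrix[j][i] = len(splits[i] ^ splits[j])
    return matrix


# Hypothetical five-taxon trees that differ in one grouping.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_matrix([t1, t2, t1]))
```

Each matrix entry depends only on two trees' split sets, so the pairwise work is independent across pairs, which is the property a parallel framework can exploit.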

    A Low-pass sequencing approach to phylogenetic analysis: reconstructing Sardinian and European demographic history with a panel of 1200 Y-chromosome samples

    Aim: The origins of contemporary populations can be clarified by studying the genetic variation within the male-specific portion of the Y chromosome (MSY). The phylogeny of this region has been the subject of several studies in recent years. In the present study we took advantage of the large-scale whole-genome sequencing studies we have been carrying out to build a phylogenetic map of the Y chromosome with unprecedented resolution, over which we calculated the putative coalescence age for our samples. Methods: The study involves 1,204 male samples from Sardinia. Complete variant calling is performed on the samples; a first-stage statistical approach and a second-stage hierarchical approach are then applied to discard or correct errors and to select informative variants, respectively. A phylogenetic tree is built and TMRCA calculations are performed. Results: The haplogroups A, E, F, G, I, J, K, P and R have been unambiguously detected across the 1,204 samples, and 11,763 informative markers have been discovered, of which 6,751 have not previously been observed. We calibrated the tree with archaeological data and used it to calculate a putative coalescence age of (190±10)·10^3 years ago. Conclusion: This study shows that the Sardinian population carries most of European variability, doubles the number of previously known phylogenetically informative markers for the human Y chromosome, and provides a coalescence estimate that is closer to previous mitochondrial DNA estimates than those of previous studies on the MSY.

    An Overview of Multiple Sequence Alignments and Cloud Computing in Bioinformatics
