46 research outputs found

    Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Phylogenetic Trees

    Phylogenetic trees are commonly reconstructed based on hard optimization problems such as maximum parsimony (MP) and maximum likelihood (ML). Conventional MP heuristics produce good solutions within reasonable time on small datasets (up to a few thousand sequences), while ML heuristics are limited to smaller datasets (up to a few hundred sequences). However, since MP (and presumably ML) is NP-hard, such approaches do not scale when applied to large datasets. In this paper, we present a new technique called Recursive-Iterative-DCM3 (Rec-I-DCM3), which belongs to our family of disk-covering methods (DCMs). We tested this new technique on ten large biological datasets ranging from 1,322 to 13,921 sequences and obtained dramatic speedups as well as significant improvements in accuracy (better than 99.99%) in comparison to existing approaches. Thus, high-quality reconstructions can be obtained for datasets at least ten times larger than was previously possible.

    An efficient and extensible approach for compressing phylogenetic trees

    Biologists require new algorithms to efficiently compress and store their large collections of phylogenetic trees. TreeZip is a novel method for compressing phylogenetic trees. Recently, we extended our TreeZip algorithm to support branch lengths and show how it can be used to extract sets of trees of interest quickly. The key advantage of TreeZip over standard compression methods like 7zip is its ability to interpret and compress tree collections semantically, making it immune to branch rotations and allowing key operations (such as calculating a consensus tree) to be performed quickly and without a loss of space savings. On unweighted phylogenetic trees, TreeZip is able to compress Newick files in excess of 98%. On weighted phylogenetic trees, TreeZip is able to compress a Newick file by at least 73%. TreeZip can be combined with 7zip with little overhead, allowing space savings in excess of 99% (unweighted) and 92% (weighted). Unlike TreeZip, 7zip is not immune to branch rotations, and performs worse as the level of variability in the Newick string representation increases. Finally, since the TreeZip compressed text (TRZ) file contains all the semantic information in a collection of trees, we can easily filter and decompress a subset of trees of interest (such as the set of unique trees), or build the resulting consensus tree in a matter of seconds. We also show the ease with which set operations can be performed on TRZ files, at speeds quicker than those performed on Newick or 7zip compressed Newick files, and without loss of space savings. TreeZip is an efficient approach for compressing large collections of phylogenetic trees. The semantic and compact nature of the TRZ file allows it to be operated upon directly and quickly, without a need to decompress the original Newick file. We believe that TreeZip will be vital for compressing and archiving trees in the biological community.
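
    The remark about branch rotations can be illustrated with a small sketch. This is not TreeZip's TRZ encoding, which the abstract does not describe; it only shows, under the assumption of topology-only Newick strings (no branch lengths or internal labels), why a representation derived from tree content rather than from the string stays the same when children are reordered. The helper names `parse_newick` and `canonical` are hypothetical.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of leaf names.

    Assumption: no branch lengths, no internal node labels, well-formed input.
    """
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            pos += 1  # consume ")"
            return tuple(children)
        # Otherwise read a leaf name up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return parse_node()


def canonical(node):
    """Return a rotation-independent string by sorting children at every node."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(sorted(canonical(child) for child in node)) + ")"


# Two Newick strings for the same tree, differing only by branch rotations.
first = "((A,B),(C,D));"
second = "((D,C),(B,A));"

print(first == second)  # the raw strings differ
print(canonical(parse_newick(first)) == canonical(parse_newick(second)))
```

    A byte-level compressor sees `first` and `second` as different inputs, whereas the sorted form is identical for both. The sketch handles reordering of children only; it does not treat differently rooted versions of the same unrooted tree as equal.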

    MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees

    Background: MapReduce is a parallel framework that has been used effectively to design large-scale parallel applications for large computing clusters. In this paper, we evaluate the viability of the MapReduce framework for designing phylogenetic applications. The problem of interest is generating the all-to-all Robinson-Foulds distance matrix, which has many applications for visualizing and clustering large collections of evolutionary trees. We introduce MrsRF (MapReduce Speeds up RF), a multi-core algorithm to generate a t × t Robinson-Foulds distance matrix between t trees using the MapReduce paradigm. Results: We studied the performance of our MrsRF algorithm on two large biological tree sets consisting of 20,000 trees of 150 taxa each and 33,306 trees of 567 taxa each. Our experiments show that MrsRF is a scalable approach reaching a speedup of over 18 on 32 total cores. Our results also show that achieving top speedup on a multi-core cluster requires different cluster configurations. Finally, we show how to use an RF matrix to summarize collections of phylogenetic trees visually. Conclusion: Our results show that MapReduce is a promising paradigm for developing multi-core phylogenetic applications. The results also demonstrate that different multi-core configurations must be tested in order to obtain optimum performance. We conclude that RF matrices play a critical role in developing techniques to summarize large collections of trees.
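
    For readers unfamiliar with the quantity being computed, the following is a minimal sequential sketch of a t × t Robinson-Foulds matrix. It is not the MrsRF algorithm and uses no MapReduce; it assumes trees are given as nested tuples of leaf names over the same taxon set, and takes the RF distance as the number of non-trivial bipartitions found in exactly one of two trees. The function names are hypothetical.

```python
from itertools import combinations


def leaf_set(node):
    """Collect the leaf names under a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def bipartitions(tree, taxa):
    """Return the non-trivial bipartitions (splits) induced by a tree's edges."""
    ref = min(taxa)  # reference taxon used to store each split from one fixed side
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Non-trivial split: at least two taxa on each side of the edge.
        if 1 < len(below) < len(taxa) - 1:
            splits.add(below if ref not in below else taxa - below)
        return below

    visit(tree)
    return splits


def rf_matrix(trees):
    """Build the symmetric matrix of pairwise RF distances between trees."""
    taxa = leaf_set(trees[0])
    splits = [bipartitions(tree, taxa) for tree in trees]
    t = len(trees)
    matrix = [[0] * t for _ in range(t)]
    for i, j in combinations(range(t), 2):
        # Splits present in exactly one of the two trees.
        matrix[i][j] = matrix[j][i] = len(splits[i] ^ splits[j])
    return matrix


trees = [
    ((("A", "B"), "C"), ("D", "E")),
    (("E", "D"), ("C", ("B", "A"))),   # same topology as the first, children reordered
    ((("A", "C"), "B"), ("D", "E")),   # B and C swapped relative to the first
]
for row in rf_matrix(trees):
    print(row)
```

    Each matrix entry depends on only two trees, so the pairwise loop is the part a parallel framework could distribute. How MrsRF partitions that work is described in the paper itself, not here.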

    A General-Purpose Model for Heterogeneous Computation

    Heterogeneous computing environments are becoming an increasingly popular platform for executing parallel applications. Such environments consist of a diverse set of machines and offer considerably more computational power at a lower cost than a parallel computer. Efficient heterogeneous parallel applications must account for the differences inherent in such an environment. For example, faster machines should possess more data items than their slower counterparts and communication should be minimized over slow network links. Current parallel applications are not designed with such heterogeneity in mind. Thus, a new approach is necessary for designing efficient heterogeneous parallel programs. We propos