
    Adapting the Phylogenetic Program FITCH for Distributed Processing

    The ability to reconstruct optimal phylogenies (evolutionary trees) based on objective criteria directly impacts our understanding of the relationships among organisms, including human evolution, as well as of the spread of infectious disease. Numerous tree construction methods have been implemented for execution on single processors; however, inferring large phylogenies with computationally intensive algorithms can be beyond the practical capacity of a single processor. Distributed and parallel processing provides a means of overcoming this hurdle. FITCH is a freely available, single-processor implementation of a distance-based tree-building algorithm commonly used by the biological community. Through an alternating least-squares approach to branch-length optimization and tree comparison, FITCH iteratively builds evolutionary trees through species addition and branch rearrangement. To extend the utility of this program, I describe the design, implementation, and performance of mpiFITCH, a parallel version of FITCH developed using the Message Passing Interface (MPI) for message exchange. Balanced load distribution required converting tree generation from recursive linked-list traversal to iterative, array-based traversal. Execution of mpiFITCH on a 64-processor Beowulf cluster yielded a maximum speedup of ~28-fold at an efficiency of ~40%.
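    The conversion to iterative, array-based traversal is what enables a balanced, static work split: once the candidate trees sit in a flat array, each processor can index directly into its share. The sketch below illustrates that partitioning pattern with mpi4py; the candidate list, the score function, and all names are illustrative placeholders, not the mpiFITCH source.

```python
# A minimal sketch (not mpiFITCH itself): statically partitioning a flat
# array of candidate tree rearrangements across MPI ranks.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

candidates = list(range(1000))      # placeholder: one entry per candidate topology

def score(tree_id):                 # placeholder for FITCH's least-squares fit
    return (tree_id * 2654435761) % 1000003

# A flat array lets each rank index straight into its share (round-robin),
# giving a static load balance; a recursive linked-list traversal reveals
# work only as it unfolds, which is much harder to split evenly.
my_share = candidates[rank::size]
local_best = min((score(t), t) for t in my_share)
best_score, best_tree = comm.allreduce(local_best, op=MPI.MIN)

if rank == 0:
    print(f"best candidate {best_tree} with score {best_score}")
```

    Run with, e.g., `mpiexec -n 64 python sketch.py`; with one rank it degenerates to the serial search.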

    Analyze Large Multidimensional Datasets Using Algebraic Topology

    This paper presents an efficient algorithm for extracting knowledge from high-dimensionality, high-complexity datasets using algebraic topology, namely simplicial complexes. Based on the concept of isomorphism of relations, our method turns a relational table into a geometric object (a simplicial complex is a polyhedron), so that association rule searching becomes, conceptually, a geometric traversal problem. By leveraging the core concepts behind simplicial complexes, we use a technique new to computer science that improves performance over existing methods and uses far less memory. It was designed and developed with a strong emphasis on scalability, reliability, and extensibility. This paper also investigates the possibility of Hadoop integration and the challenges that come with that framework.
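    To make the table-to-complex construction concrete, the toy sketch below treats each row of a relational table as a maximal simplex over (attribute, value) vertices; faces shared by many rows then correspond to co-occurring values, a rudimentary stand-in for the geometric association-rule traversal described above. The data, names, and support threshold are illustrative, not the paper's implementation.

```python
# Toy sketch: a relational table as a simplicial complex.
from itertools import combinations
from collections import Counter

rows = [
    {"bread": 1, "milk": 1, "eggs": 1},
    {"bread": 1, "milk": 1},
    {"milk": 1, "eggs": 1},
]

# One maximal simplex per row; vertices are (attribute, value) pairs.
maximal = [frozenset(r.items()) for r in rows]

# Count, for every face of dimension 0 or 1, how many maximal simplices
# contain it; frequently shared faces are candidate associations.
face_counts = Counter()
for simplex in maximal:
    for k in (1, 2):
        for face in combinations(sorted(simplex), k):
            face_counts[face] += 1

for face, n in sorted(face_counts.items()):
    if n >= 2:                       # support threshold: at least 2 rows
        print(dict(face), "appears in", n, "rows")
```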

    Inference of Many-Taxon Phylogenies

    Phylogenetic trees are tree topologies that represent the evolutionary history of a set of organisms. In this thesis, we address computational challenges related to the analysis of large-scale datasets with Maximum Likelihood-based phylogenetic inference. We approach this with three strategies: reducing memory requirements, reducing running time, and reducing man-hours.

    Multi-agent based beam search for intelligent production planning and scheduling

    Production planning and scheduling is a long-standing research area of great practical value, and industrial demand for production planning and scheduling systems is acute. Regrettably, most research results are seldom applied in industry because existing planning and scheduling methods can barely meet the requirements of practical applications. This paper identifies four major requirements for practical production planning and scheduling methods, namely generality, solution quality, computational efficiency, and implementation difficulty. Based on these requirements, a multi-agent based beam search (MABBS) method is developed. It seamlessly integrates the multi-agent system (MAS) and beam search (BS) methods into a generic multi-stage, multi-level decision making (MSMLDM) model to systematically address all four requirements within a unified framework. A script language, called EXASL, and an open software platform are developed to simplify the implementation of the MABBS method. For solving complex real-world problems, an MABBS-based prototype production planning, scheduling, and execution system is developed. The feasibility and effectiveness of this study are demonstrated with the prototype system and computational experiments. © 2010 Taylor & Francis.
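    For readers unfamiliar with BS, the skeleton below shows the generic beam search procedure that MABBS builds on: at each decision stage, only the `width` most promising partial solutions survive. The toy single-machine scheduling problem and the `expand` and `score` functions are illustrative placeholders, not the paper's MSMLDM model or EXASL.

```python
# Generic beam search skeleton with a toy scheduling example.
def beam_search(initial, expand, score, width, depth):
    """Keep the `width` best partial solutions at each decision stage."""
    beam = [initial]
    for _ in range(depth):
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break
        beam = sorted(candidates, key=score)[:width]   # prune to the beam width
    return min(beam, key=score)

# Toy usage: sequence five jobs (processing times below), scoring a partial
# schedule by the sum of job completion times (lower is better).
jobs = [3, 1, 4, 1, 5]
expand = lambda seq: [seq + (j,) for j in range(len(jobs)) if j not in seq]
score = lambda seq: sum(sum(jobs[j] for j in seq[:i + 1]) for i in range(len(seq)))

print(beam_search((), expand, score, width=3, depth=len(jobs)))
```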

    Compressing Massive Sequencing Data with Multiple Attribute Tree

    The significant drop in DNA sequencing costs brought about by Next-Generation Sequencing has led to the production of massive amounts of raw sequencing data. This data is stored in FASTQ files, which are text files containing a large number of reads, each composed of a short DNA sequence, an associated identifier, and a quality score. The DNA sequence is a string of fixed length over the alphabet Σ = {A, C, T, G, N}; the identifier is an arbitrary, sequencer-dependent string; and the quality score is a string of the same length as the DNA sequence, indicating for each base how confident the sequencer was when determining it. These files can range from a few gigabytes to hundreds of gigabytes, which poses a Big Data challenge, as the growth of generated sequencing data now outpaces the decline in storage hardware prices. Storing and transmitting such data therefore requires compression algorithms that outperform general-purpose compressors such as gzip, the de facto standard, and many specialized compressors have been proposed to tackle this problem. In this thesis, we review existing compressors for FASTQ files and propose a novel compression algorithm for DNA sequences: MATC, short for Multiple Attribute Tree Compression. Our algorithm divides DNA sequences into k-mers, i.e., substrings of length k, and performs column-wise compression using a multiple attribute tree. In our case, the multiple attribute tree is a complete tree in which each node is a k-mer and each leaf represents the sequence formed by concatenating the k-mers on its root-to-leaf path. The tree is stored using level-order traversal, and the k-mers are compressed using Huffman encoding. We show that our algorithm offers compression ratios comparable to those of current specialized compressors. Moreover, we propose a distributed version of our algorithm that allows larger files to be compressed across a cluster of machines. Compression can thus be processed in the cloud rather than on commodity hardware, which is becoming less and less suited to handling the growing size of generated sequencing data.
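    The column-wise idea is easy to picture on fixed-length reads: cut every read into k-mers, so that column i holds the i-th k-mer of each read, and entropy-code each column separately. The sketch below does exactly that with a textbook heapq-based Huffman coder; the reads, the value of k, and all names are toy placeholders, and the sketch omits MATC's tree construction and level-order serialization.

```python
# Toy sketch: column-wise k-mer grouping plus per-column Huffman coding.
import heapq
from collections import Counter
from itertools import count

def huffman_code(freqs):
    """Build a prefix code from symbol frequencies (classic heap merge)."""
    tie = count()                                   # tiebreaker: dicts don't compare
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                              # degenerate one-symbol column
        return {sym: "0" for sym in freqs}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

reads = ["ACGTACGT", "ACGTTGCA", "TTGAACGT"]        # toy fixed-length reads
k = 4

# Column i holds the i-th k-mer of every read.
columns = [[read[i:i + k] for read in reads]
           for i in range(0, len(reads[0]), k)]

for i, col in enumerate(columns):
    code = huffman_code(Counter(col))
    print(f"column {i}:", "".join(code[kmer] for kmer in col))
```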

    Compressing DNA sequence databases with coil

    Background: Publicly available DNA sequence databases such as GenBank are large and growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression, an approach that rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression; the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.
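    The abstract does not spell out edit-tree coding, but the underlying intuition of edit-based database compression, storing one sequence verbatim and near-duplicates as small edit lists against it, can be illustrated with the standard library's difflib. This is a hedged sketch of that general idea under stated assumptions, not coil's actual format; the sequences are toy data.

```python
# Hedged sketch of edit coding in general (not coil's edit-tree format):
# encode a sequence as difflib opcodes against a similar reference.
import difflib

reference = "ACGTACGTGGATTACA"
sequence  = "ACGTACGAGGATTACA"

matcher = difflib.SequenceMatcher(a=reference, b=sequence, autojunk=False)
edits = [(tag, i1, i2, sequence[j1:j2])
         for tag, i1, i2, j1, j2 in matcher.get_opcodes()
         if tag != "equal"]          # only the differences need storing
print(edits)                         # e.g. [('replace', 7, 8, 'A')]

# Decoding replays the stored edits left to right over the reference.
out, pos = [], 0
for tag, i1, i2, repl in edits:
    out.append(reference[pos:i1])    # copy the unchanged stretch
    out.append(repl)                 # apply the edit (empty for deletions)
    pos = i2
out.append(reference[pos:])
assert "".join(out) == sequence
```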

    Exploring New Search Algorithms and Hardware for Phylogenetics: RAxML Meets the IBM Cell

    Phylogenetic inference is considered one of the grand challenges in Bioinformatics due to its immense computational requirements. RAxML is currently among the fastest and most accurate programs for phylogenetic tree inference under the Maximum Likelihood (ML) criterion. First, we introduce new tree search heuristics that accelerate RAxML by a factor of 2.43 while returning equally good trees. The performance of the new search algorithm has been assessed on 18 real-world datasets comprising 148 to 4,843 DNA sequences. We then present the implementation, optimization, and evaluation of RAxML on the IBM Cell Broadband Engine. We address problems and provide solutions pertaining to the optimization of floating-point code, control flow, communication, and the scheduling of multi-level parallelism on the Cell.