2 research outputs found

    Murasaki: A Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes

    BACKGROUND: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow.
METHODOLOGY/PRINCIPAL FINDINGS: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near-linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.
CONCLUSIONS/SIGNIFICANCE: Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net
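The spaced-seed idea mentioned in the abstract can be illustrated with a small sketch. This is not Murasaki's adaptive hash function or its actual code; it is a simplified illustration. It assumes a pattern string in which `1` marks a position that must match and `0` a position that may differ. It looks only at the forward strand and uses plain strings as table keys instead of hashed integers. The function names `seed_key`, `index_seeds`, and `candidate_anchors` are hypothetical.

```python
from collections import defaultdict


def seed_key(seq, start, pattern):
    """Bases at the '1' positions of the window, or None if unusable."""
    window = seq[start:start + len(pattern)]
    if len(window) < len(pattern):
        return None
    # Keep only the positions the pattern says must match.
    key = "".join(base for base, p in zip(window, pattern) if p == "1")
    # Skip windows whose compared positions hold anything other than A, C, G, T.
    return key if set(key) <= set("ACGT") else None


def index_seeds(seq, pattern):
    """Map each spaced-seed key to the start positions where it occurs."""
    table = defaultdict(list)
    for i in range(len(seq) - len(pattern) + 1):
        key = seed_key(seq, i, pattern)
        if key is not None:
            table[key].append(i)
    return table


def candidate_anchors(seqs, pattern):
    """Keys present in every sequence, with their positions per sequence."""
    tables = [index_seeds(s, pattern) for s in seqs]
    shared = set(tables[0]).intersection(*tables[1:])
    return {key: [t[key] for t in tables] for key in shared}


# The two sequences differ at the third base, which the pattern ignores.
hits = candidate_anchors(["ACGTAC", "ACTTAC"], "1101")
print(hits)
```

With the pattern `1101`, the windows `ACGT` and `ACTT` both reduce to the key `ACT`, so they are reported as a candidate match despite the mismatch. The all-ones pattern `1111` requires exact 4-mers and finds nothing in common between these two sequences.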

    DNA Compression

    DNA compression is considered a challenging task. Its value is not limited to saving disk space and network bandwidth when transferring genome files; it also helps in recognizing patterns within biological sequences and in measuring the evolutionary distance between organisms. This thesis begins with an overview of the relevant biology and the basic structure of DNA sequences. It continues with a chronologically ordered analysis of selected compression programs, describing their strategies, algorithms, and differing solutions to common subproblems. Based on the knowledge gained and on findings from testing, an original compression method is designed and implemented in the program DNAcod. The results achieved are compared with those of other compression tools.
    460 - Katedra informatiky (Department of Computer Science)