1,626 research outputs found

    High-Performance Computing Frameworks for Large-Scale Genome Assembly

    Get PDF
    Genome sequencing technology has witnessed tremendous progress in terms of throughput and cost per base pair, resulting in an explosion in the size of data. Typical de Bruijn graph-based assembly tools demand a lot of processing power and memory and cannot assemble big datasets unless running on a scaled-up server with terabytes of RAMs or scaled-out cluster with several dozens of nodes. In the first part of this work, we present a distributed next-generation sequence (NGS) assembler called Lazer, that achieves both scalability and memory efficiency by using partitioned de Bruijn graphs. By enhancing the memory-to-disk swapping and reducing the network communication in the cluster, we can assemble large sequences such as human genomes (~400 GB) on just two nodes in 14.5 hours, and also scale up to 128 nodes in 23 minutes. We also assemble a synthetic wheat genome with 1.1 TB of raw reads on 8 nodes in 18.5 hours and on 128 nodes in 1.25 hours. In the second part, we present a new distributed GPU-accelerated NGS assembler called LaSAGNA, which can assemble large-scale sequence datasets using a single GPU by building string graphs from approximate all-pair overlaps in quasi-linear time. To use the limited memory on GPUs efficiently, LaSAGNA uses a two-level semi-streaming approach from disk through host memory to device memory with restricted access patterns on both disk and host memory. Using LaSAGNA, we can assemble the human genome dataset on a single NVIDIA K40 GPU in 17 hours, and in a little over 5 hours on an 8-node cluster of NVIDIA K20s. In the third part, we present the first distributed 3rd generation sequence (3GS) assembler which uses a map-reduce computing paradigm and a distributed hash-map, both built on a high-performance networking middleware. Using this assembler, we assembled an Oxford Nanopore human genome dataset (~150 GB) in just over half an hour using 128 nodes whereas existing 3GS assemblers could not assemble it because of memory and/or time limitations

    Combinatorics on Words. New Aspects on Avoidability, Defect Effect, Equations and Palindromes

    Get PDF
    In this thesis we examine four well-known and traditional concepts of combinatorics on words. However the contexts in which these topics are treated are not the traditional ones. More precisely, the question of avoidability is asked, for example, in terms of k-abelian squares. Two words are said to be k-abelian equivalent if they have the same number of occurrences of each factor up to length k. Consequently, k-abelian equivalence can be seen as a sharpening of abelian equivalence. This fairly new concept is discussed broader than the other topics of this thesis. The second main subject concerns the defect property. The defect theorem is a well-known result for words. We will analyze the property, for example, among the sets of 2-dimensional words, i.e., polyominoes composed of labelled unit squares. From the defect effect we move to equations. We will use a special way to define a product operation for words and then solve a few basic equations over constructed partial semigroup. We will also consider the satisfiability question and the compactness property with respect to this kind of equations. The final topic of the thesis deals with palindromes. Some finite words, including all binary words, are uniquely determined up to word isomorphism by the position and length of some of its palindromic factors. The famous Thue-Morse word has the property that for each positive integer n, there exists a factor which cannot be generated by fewer than n palindromes. We prove that in general, every non ultimately periodic word contains a factor which cannot be generated by fewer than 3 palindromes, and we obtain a classification of those binary words each of whose factors are generated by at most 3 palindromes. Surprisingly these words are related to another much studied set of words, Sturmian words.Siirretty Doriast

    ALFALFA : fast and accurate mapping of long next generation sequencing reads

    Get PDF
    • …
    corecore