13,057 research outputs found

    A new algorithm for de novo genome assembly

    Get PDF
    The enormous amount of short reads produced by next generation sequencing (NGS) techniques such as Roche/454, Illumina/Solexa and SOLiD sequencing opened the possibility of de novo genome assembly. Some of the de novo genome assemblers (e.g., Edena, SGA) use an overlap graph approach to assemble a genome, while others (e.g., ABySS and SOAPdenovo) use a de Bruijn graph approach. Currently, the approaches based on the de Bruijn graph are the most successful, yet their performance is far from being able to assemble entire genomic sequences. We developed a new overlap graph based genome assembler called Paired-End Genome ASsembly Using Short-sequences (PEGASUS) for paired-end short reads produced by NGS techniques. PEGASUS uses a minimum cost network flow approach to predict the copy count of the input reads more precisely than other algorithms. With the help of accurate copy count and mate pair support, PEGASUS can accurately unscramble the paths in the overlap graph that correspond to DNA sequences. PEGASUS exhibits comparable and in many cases better performance than the leading genome assemblers

    Design and Analysis of Cryptographic Pseudorandom Number/Sequence Generators with Applications in RFID

    Get PDF
    This thesis is concerned with the design and analysis of strong de Bruijn sequences and span n sequences, and nonlinear feedback shift register (NLFSR) based pseudorandom number generators for radio frequency identification (RFID) tags. We study the generation of span n sequences using structured searching in which an NLFSR with a class of feedback functions is employed to find span n sequences. Some properties of the recurrence relation for the structured search are discovered. We use five classes of functions in this structured search, and present the number of span n sequences for 6 <= n <= 20. The linear span of a new span n sequence lies between near-optimal and optimal. According to our empirical studies, a span n sequence can be found in the structured search with a better probability of success. Newly found span n sequences can be used in the composited construction and in designing lightweight pseudorandom number generators. We first refine the composited construction based on a span n sequence for generating long de Bruijn sequences. A de Bruijn sequence produced by the composited construction is referred to as a composited de Bruijn sequence. The linear complexity of a composited de Bruijn sequence is determined. We analyze the feedback function of the composited construction from an approximation point of view for producing strong de Bruijn sequences. The cycle structure of an approximated feedback function and the linear complexity of a sequence produced by an approximated feedback function are determined. A few examples of strong de Bruijn sequences with the implementation issues of the feedback functions of an (n+16)-stage NLFSR are presented. We propose a new lightweight pseudorandom number generator family, named Warbler family based on NLFSRs for smart devices. Warbler family is comprised of a combination of modified de Bruijn blocks (CMDB) and a nonlinear feedback Welch-Gong (WG) generator. We derive the randomness properties such as period and linear complexity of an output sequence produced by the Warbler family. Two instances, Warbler-I and Warbler-II, of the Warbler family are proposed for passive RFID tags. The CMDBs of both Warbler-I and Warbler-II contain span n sequences that are produced by the structured search. We analyze the security properties of Warbler-I and Warbler-II by considering the statistical tests and several cryptanalytic attacks. Hardware implementations of both instances in VHDL show that Warbler-I and Warbler-II require 46 slices and 58 slices, respectively. Warbler-I can be used to generate 16-bit random numbers in the tag identification protocol of the EPC Class 1 Generation 2 standard, and Warbler-II can be employed as a random number generator in the tag identification as well as an authentication protocol for RFID systems.1 yea

    On Greedy Algorithms for Binary de Bruijn Sequences

    Full text link
    We propose a general greedy algorithm for binary de Bruijn sequences, called Generalized Prefer-Opposite (GPO) Algorithm, and its modifications. By identifying specific feedback functions and initial states, we demonstrate that most previously-known greedy algorithms that generate binary de Bruijn sequences are particular cases of our new algorithm

    Jabba: hybrid error correction for long sequencing reads

    Get PDF
    Background: Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. Results: In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is the use of a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third generation reads are presented. Conclusion: Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using a very low amount of CPU time. From this we conclude that pseudo alignment with MEMs is a fast and reliable method to map long highly erroneous sequences on a de Bruijn graph

    Extreme Scale De Novo Metagenome Assembly

    Full text link
    Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into the accurate representation of the underlying microbiomes's genomes. State-of-the-art tools require big shared memory machines and cannot handle contemporary metagenome datasets that exceed Terabytes in size. In this paper, we introduce the MetaHipMer pipeline, a high-quality and high-performance metagenome assembler that employs an iterative de Bruijn graph approach. MetaHipMer leverages a specialized scaffolding algorithm that produces long scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is end-to-end parallelized using the Unified Parallel C language and therefore can run seamlessly on shared and distributed-memory systems. Experimental results show that MetaHipMer matches or outperforms the state-of-the-art tools in terms of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and is able to assemble previously intractable grand challenge metagenomes. We demonstrate the unprecedented capability of MetaHipMer by computing the first full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion reads - size 2.6 TBytes.Comment: Accepted to SC1
    corecore