189 research outputs found

    On Greedy Algorithms for Binary de Bruijn Sequences

    Full text link
    We propose a general greedy algorithm for binary de Bruijn sequences, called Generalized Prefer-Opposite (GPO) Algorithm, and its modifications. By identifying specific feedback functions and initial states, we demonstrate that most previously-known greedy algorithms that generate binary de Bruijn sequences are particular cases of our new algorithm

    A maximum likelihood approach to genome assembly

    Get PDF
    De novo genome assembly is the bioinformatics' problem to reconstruct the original molecule from its sub-sequences, with no previous knowledge on DNA. Inspired by the maximum likelihood approach, recently a new experimental approach was developed. In this thesis, for the first time this new stochastic approach has been implemented into a software assembler. A parallel software has been developed in order to obtain a first experimental validation of the model, testing also some artificial dataope

    Synteny Paths for Assembly Graphs Comparison

    Get PDF
    Despite the recent developments of long-read sequencing technologies, it is still difficult to produce complete assemblies of eukaryotic genomes in an automated fashion. Genome assembly software typically output assembled fragments (contigs) along with assembly graphs, that encode all possible layouts of these contigs. Graph representation of the assembled genome can be useful for gene discovery, haplotyping, structural variations analysis and other applications. To facilitate the development of new graph-based approaches, it is important to develop algorithms for comparison and evaluation of assembly graphs produced by different software. In this work, we introduce synteny paths: maximal paths of homologous sequence between the compared assembly graphs. We describe Asgan - an algorithm for efficient synteny paths decomposition, and use it to evaluate assembly graphs of various bacterial assemblies produced by different approaches. We then apply Asgan to discover structural variations between the assemblies of 15 Drosophila genomes, and show that synteny paths are robust to contig fragmentation. The Asgan tool is freely available at: https://github.com/epolevikov/Asgan

    SOPRA: Scaffolding algorithm for paired reads via statistical optimization

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>High throughput sequencing (HTS) platforms produce gigabases of short read (<100 bp) data per run. While these short reads are adequate for resequencing applications, <it>de novo </it>assembly of moderate size genomes from such reads remains a significant challenge. These limitations could be partially overcome by utilizing mate pair technology, which provides pairs of short reads separated by a known distance along the genome.</p> <p>Results</p> <p>We have developed SOPRA, a tool designed to exploit the mate pair/paired-end information for assembly of short reads. The main focus of the algorithm is selecting a sufficiently large subset of simultaneously satisfiable mate pair constraints to achieve a balance between the size and the quality of the output scaffolds. Scaffold assembly is presented as an optimization problem for variables associated with vertices and with edges of the contig connectivity graph. Vertices of this graph are individual contigs with edges drawn between contigs connected by mate pairs. Similar graph problems have been invoked in the context of shotgun sequencing and scaffold building for previous generation of sequencing projects. However, given the error-prone nature of HTS data and the fundamental limitations from the shortness of the reads, the ad hoc greedy algorithms used in the earlier studies are likely to lead to poor quality results in the current context. SOPRA circumvents this problem by treating all the constraints on equal footing for solving the optimization problem, the solution itself indicating the problematic constraints (chimeric/repetitive contigs, etc.) to be removed. The process of solving and removing of constraints is iterated till one reaches a core set of consistent constraints. For SOLiD sequencer data, SOPRA uses a dynamic programming approach to robustly translate the color-space assembly to base-space. For assessing the quality of an assembly, we report the no-match/mismatch error rate as well as the rates of various rearrangement errors.</p> <p>Conclusions</p> <p>Applying SOPRA to real data from bacterial genomes, we were able to assemble contigs into scaffolds of significant length (N50 up to 200 Kb) with very few errors introduced in the process. In general, the methodology presented here will allow better scaffold assemblies of any type of mate pair sequencing data.</p

    Towards a theory of patches

    Get PDF
    AbstractMany applications have a need for indexing unstructured data. It turns out that a similar ad-hoc method is being used in many of them – that of considering small particles of the data.In this paper we formalize this concept as a tiling problem and consider the efficiency of dealing with this model in the pattern matching setting.We present an efficient algorithm for the one-dimensional tiling problem, and the one-dimensional tiled pattern matching problem. We prove the two-dimensional problem is hard and then develop an approximation algorithm with an approximation ratio converging to 2. We show that other two-dimensional versions of the problem are also hard, regardless of the number of neighbors a tile has
    • …
    corecore