18 research outputs found

    Safe and complete contig assembly via omnitigs

    Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs -- a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph G (e.g. a de Bruijn, or a string graph), what are all the strings that can be safely reported from G as contigs? In this paper we finally answer this question, and also give a polynomial time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs. Comment: Full version of the paper in the proceedings of RECOMB 201
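
The abstract above uses unitigs as its baseline without defining them. As a rough illustration only, the sketch below assumes the common definition of a unitig as a maximal non-branching path of a directed graph (every internal node has exactly one incoming and one outgoing edge). The function name `unitigs` and the toy graph are hypothetical, and this is not the omnitig algorithm of the paper.

```python
from collections import defaultdict

def unitigs(edges):
    """Return maximal non-branching paths (as node lists) of a directed graph.

    Assumption: a unitig is a path whose internal nodes all have
    in-degree 1 and out-degree 1. Isolated cycles are reported once.
    """
    out = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        out[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    def one_in_one_out(v):
        return indeg[v] == 1 and len(out[v]) == 1

    paths = []
    visited = set()
    # Start a path at every edge leaving a branching (or source/sink) node.
    for v in sorted(nodes):
        if not one_in_one_out(v):
            for w in out[v]:
                path = [v, w]
                while one_in_one_out(w):
                    w = out[w][0]
                    path.append(w)
                visited.update(path)
                paths.append(path)
    # Any remaining 1-in-1-out node lies on an isolated cycle.
    for v in sorted(nodes):
        if one_in_one_out(v) and v not in visited:
            cycle = [v]
            w = out[v][0]
            while w != v:
                cycle.append(w)
                visited.add(w)
                w = out[w][0]
            cycle.append(v)
            visited.add(v)
            paths.append(cycle)
    return paths

# Toy graph: a -> b -> c, then c branches to d and e.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")]
print(unitigs(toy_edges))
```

On the toy graph the path a, b, c is a single unitig because b is non-branching, while the two edges leaving c each form their own unitig.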

    Safe solutions for walks on graphs

    In this thesis we study the concept of “safe solutions” in different problems whose solutions are walks on graphs. A safe solution to a problem X can be understood as a partial solution common to all solutions to problem X. In problems whose solutions are walks on graphs, safe solutions refer to walks common to all walks which are solutions to the problem. In this thesis, we focused on formulating four main graph traversal problems and finding characterizations for those walks contained in all their solutions. We give formulations for these graph traversal problems, we prove some of their combinatorial and structural properties, and we give safe and complete algorithms for finding their safe solutions based on their characterizations. We use the genome assembly problem and its applications as our main motivating example for finding safe solutions in these graph traversal problems. We begin by motivating and exemplifying the notion of safe solutions through a problem on s-t paths in undirected graphs with at least two non-trivial biconnected components S and T and with s ∈ S, t ∈ T. We continue by reviewing similar and related notions in other fields, especially in combinatorial optimization and previous work on the bioinformatics problem of genome assembly. We then proceed to characterize the safe solutions to the Eulerian cycle problem, where one must find a circular walk in a graph G which traverses each edge exactly once. We suggest a characterization for them by improving on (Nagarajan, Pop, JCB 2009) and a polynomial-time algorithm for finding them. We then study edge-covering circular walks in a graph G. We look at the characterization from (Tomescu, Medvedev, JCB 2017) for their safe solutions and their suggested polynomial-time algorithm and we show an optimal O(mn)-time algorithm that we proposed in (Cairo et al. CPM 2017). Finally, we generalize this to edge-covering collections of circular walks. We characterize safe solutions in an edge-covering setting and provide a polynomial-time algorithm for computing them. We suggested these originally in (Obscura et al. ALMOB 2018).
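
To make the Eulerian cycle problem mentioned in the abstract above concrete, here is a minimal sketch that finds one circular walk traversing each edge exactly once, using Hierholzer's algorithm. It assumes the input is a directed graph that actually has an Eulerian cycle (connected, with in-degree equal to out-degree at every node) and does not check this. The name `eulerian_cycle` is hypothetical, and the sketch finds a single solution; it does not compute the safe walks that the thesis characterizes.

```python
from collections import defaultdict

def eulerian_cycle(edges):
    """Return one Eulerian cycle as a list of nodes (first node == last node).

    Assumption (not checked): the directed graph given by `edges`
    is Eulerian, so such a cycle exists.
    """
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)

    stack = [edges[0][0]]  # start anywhere; here, the tail of the first edge
    cycle = []
    while stack:
        u = stack[-1]
        if out[u]:
            # Follow an unused edge out of u.
            stack.append(out[u].pop())
        else:
            # u has no unused edges left: it is finished, emit it.
            cycle.append(stack.pop())
    cycle.reverse()
    return cycle

# Two circular walks sharing node "a": a-b-c-a and a-d-a.
euler_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d"), ("d", "a")]
walk = eulerian_cycle(euler_edges)
print(walk)
```

A graph like this one typically has several Eulerian cycles, which is why the question of which sub-walks are shared by all of them is not trivial.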

    An Optimal O(nm) Algorithm for Enumerating All Walks Common to All Closed Edge-covering Walks of a Graph

    In this article, we consider the following problem. Given a directed graph G, output all walks of G that are sub-walks of all closed edge-covering walks of G. This problem was first considered by Tomescu and Medvedev (RECOMB 2016), who characterized these walks through the notion of omnitig. Omnitigs were shown to be relevant for the genome assembly problem from bioinformatics, where a genome sequence must be assembled from a set of reads from a sequencing experiment. Tomescu and Medvedev (RECOMB 2016) also proposed an algorithm for listing all maximal omnitigs, by launching an exhaustive visit from every edge. In this article, we prove new insights about the structure of omnitigs and solve several open questions about them. We combine these to achieve an O(nm)-time algorithm for outputting all the maximal omnitigs of a graph (with n nodes and m edges). This is also optimal, as we show families of graphs whose total omnitig length is Ω(nm). We implement this algorithm and show that it is 9-12 times faster in practice than the one of Tomescu and Medvedev (RECOMB 2016). Peer reviewed

    Safety in s-t Paths, Trails and Walks

    Given a directed graph G and a pair of nodes s and t, an s-t bridge of G is an edge whose removal breaks all s-t paths of G (and thus appears in all s-t paths). Computing all s-t bridges of G is a basic graph problem, solvable in linear time. In this paper, we consider a natural generalisation of this problem, with the notion of “safety” from bioinformatics. We say that a walk W is safe with respect to a set W' of s-t walks, if W is a subwalk of all walks in W'. We start by considering the maximal safe walks when W' consists of: all s-t paths, all s-t trails, or all s-t walks of G. We show that the solutions for the first two problems immediately follow from finding all s-t bridges after incorporating simple characterisations. However, solving the third problem requires non-trivial techniques for incorporating its characterisation. In particular, we show that there exists a compact representation computable in linear time, that allows outputting all maximal safe walks in time linear in their length. Our solutions also directly extend to multigraphs, except for the second problem, which requires a more involved approach. We further generalise these problems, by assuming that safety is defined only with respect to a subset of visible edges. Here we prove a dichotomy between the s-t paths and s-t trails cases, and the s-t walks case: the former two are NP-hard, while the latter is solvable with the same complexity as when all edges are visible. We also show that the same complexity results hold for the analogous generalisations of s-t articulation points (nodes appearing in all s-t paths). We thus obtain the best possible results for natural “safety”-generalisations of these two fundamental graph problems. Moreover, our algorithms are simple and do not employ any complex data structures, making them ideal for use in practice. Peer reviewed
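
The definition of an s-t bridge in the abstract above can be checked directly with a naive sketch: delete each edge in turn and test whether t is still reachable from s. This takes O(m(n + m)) time, so it is only an illustration of the definition and not the linear-time method the abstract refers to. The names `reachable` and `st_bridges` are hypothetical, and the sketch assumes that a graph with no s-t path at all has no s-t bridges.

```python
from collections import deque

def reachable(edges, s, t, skip=None):
    """Breadth-first search from s to t, ignoring the edge at index `skip`."""
    adj = {}
    for i, (u, v) in enumerate(edges):
        if i != skip:
            adj.setdefault(u, []).append(v)
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

def st_bridges(edges, s, t):
    """Return the edges whose removal disconnects t from s.

    Edges are removed by index, so parallel edges in a multigraph
    are handled separately.
    """
    if not reachable(edges, s, t):
        return []  # assumption: no s-t path means no s-t bridges
    return [e for i, e in enumerate(edges)
            if not reachable(edges, s, t, skip=i)]

# s -> a, then two parallel routes a -> b -> d and a -> c -> d, then d -> t.
st_edges = [("s", "a"), ("a", "b"), ("a", "c"),
            ("b", "d"), ("c", "d"), ("d", "t")]
print(st_bridges(st_edges, "s", "t"))  # [('s', 'a'), ('d', 't')]
```

Only the first and last edges appear in every s-t path of the toy graph; the edges through b and c can each be bypassed via the other route.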