154 research outputs found

    A note on the shortest common superstring of NGS reads

    The Shortest Superstring Problem (SSP) is, given a set of strings S = {s_1,...,s_n}, to find a minimum-length string that contains every s_i, 1 <= i <= n, as a substring. The problem is known to be NP-complete and APX-hard. Approximation algorithms with guaranteed ratios have been proposed; the current best ratio, 2+11/23, was achieved after a long and difficult quest. SSP is nevertheless widely used in practice on next generation sequencing (NGS) data, which plays an increasingly important role in sequencing. In this note, we show that the SSP approximation ratio can be improved on NGS reads by assuming specific characteristics of NGS data that are experimentally verified on a very large sampling set.

    Collapsing Superstring Conjecture

    In the Shortest Common Superstring (SCS) problem, one is given a collection of strings and needs to find a shortest string containing each of them as a substring. SCS admits a 2 11/23-approximation in polynomial time (Mucha, SODA'13). While this algorithm and its analysis are technically involved, the 30-year-old Greedy Conjecture claims that the trivial and efficient Greedy Algorithm gives a 2-approximation for SCS. We develop a graph-theoretic framework for studying approximation algorithms for SCS. The framework is reminiscent of the classical 2-approximation for Traveling Salesman: take two copies of an optimal solution, apply a trivial edge-collapsing procedure, and get an approximate solution. In this framework, we observe two surprising properties of SCS solutions, and we conjecture that they hold for all input instances. The first conjecture, which we call the Collapsing Superstring conjecture, claims that there is an elementary way to transform any solution repeated twice into the same graph G. This conjecture would give an elementary 2-approximation algorithm for SCS. The second conjecture claims not only that the resulting graph G is the same for all solutions, but also that G can be computed by an elementary greedy procedure called the Greedy Hierarchical Algorithm. While the second conjecture clearly implies the first one, perhaps surprisingly we prove their equivalence. We support these equivalent conjectures by giving a proof for the special case where all input strings have length at most 3 (which until recently had been the only case where the Greedy Conjecture was proven). We also tested our conjectures on millions of instances of SCS. We prove that the standard Greedy Conjecture implies the Greedy Hierarchical Conjecture, while the latter is sufficient for an efficient greedy 2-approximation of SCS.
Apart from its (conjectured) good approximation ratio, the Greedy Hierarchical Algorithm provably finds a 3.5-approximation, and finds exact solutions for the special cases where polynomial-time (non-greedy) exact algorithms are known: (1) when the input strings form a spectrum of a string, and (2) when all input strings have length at most 2.
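
For readers unfamiliar with the classical Greedy Algorithm that the Greedy Conjecture refers to, the following is a minimal Python sketch: repeatedly merge the two strings with the largest suffix-prefix overlap until one string remains. The function names `overlap` and `greedy_superstring`, the quadratic pair search, and the tie-breaking rule (first pair found) are assumptions made for illustration; this is not the Greedy Hierarchical Algorithm from the abstract and not an optimized implementation.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Classical greedy heuristic: merge the pair with maximum overlap."""
    # Remove duplicates, then drop strings contained in another input string.
    uniq = list(dict.fromkeys(strings))
    pool = [s for s in uniq if not any(s != t and s in t for t in uniq)]

    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Scan all ordered pairs; ties are broken by first occurrence.
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)

    return pool[0] if pool else ""


if __name__ == "__main__":
    reads = ["abc", "bcd", "cde"]
    result = greedy_superstring(reads)
    print(result, len(result))
```

The output is always a superstring of the input, but it is not guaranteed to be a shortest one; the Greedy Conjecture concerns how far from optimal it can be.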

    On the Greedy Algorithm for the Shortest Common Superstring Problem with Reversals

    We study a variation of the classical Shortest Common Superstring (SCS) problem in which a shortest superstring of a finite set of strings S is sought containing as a factor every string of S or its reversal. We call this problem Shortest Common Superstring with Reversals (SCS-R). This problem was introduced by Jiang et al., who designed a greedy-like algorithm with length approximation ratio 4. In this paper, we show that a natural adaptation of the classical greedy algorithm for SCS has (optimal) compression ratio 1/2, i.e., the sum of the overlaps in the output string is at least half the sum of the overlaps in an optimal solution. We also provide a linear-time implementation of our algorithm. Comment: Published in Information Processing Letters
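
The difference between a compression ratio and a length ratio can be made explicit with a short derivation. The notation below is an assumption introduced for illustration: $\|S\|$ is the total length of the input strings (assumed factor-free), $w_g$ is the greedy output, $w^*$ is an optimal solution, and $\mathrm{ov}(w)$ is the sum of the overlaps between consecutive strings in $w$. Reversing a string does not change its length, so the same accounting applies to SCS-R.

```latex
\begin{align*}
|w| &= \|S\| - \mathrm{ov}(w)
  && \text{(length = total input length minus overlaps)} \\
\mathrm{ov}(w_g) &\ge \tfrac12\,\mathrm{ov}(w^*)
  && \text{(compression ratio } \tfrac12\text{)} \\
|w_g| &= \|S\| - \mathrm{ov}(w_g)
  \le \|S\| - \tfrac12\bigl(\|S\| - |w^*|\bigr)
  = \tfrac12\bigl(\|S\| + |w^*|\bigr)
\end{align*}
```

Since $\|S\|$ can be much larger than $|w^*|$, a compression ratio of 1/2 does not by itself bound the length ratio $|w_g| / |w^*|$ by a constant, which is why the two ratios are stated separately.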

    On improving the approximation ratio of the r-shortest common superstring problem

    The Shortest Common Superstring problem (SCS) consists, for a set of strings S = {s_1,...,s_n}, in finding a minimum-length string that contains all s_i, 1 <= i <= n, as substrings. While an algorithm with approximation ratio 2+11/30 has recently been published, the general objective is now to break the conceptual lower-bound barrier of 2. This paper is a step in that direction. We focus on a particular case of the SCS problem, namely the r-SCS problem, which requires all input strings to have the same length r. Golovnev et al. proved an approximation ratio that is better than the general one for r <= 6. Here we extend their approach and improve their approximation ratio, which is now better than the general one for r <= 7, and less than or equal to 2 up to r = 6.

    Parallel and sequential approximation of shortest superstrings


    A 2-2/3 Approximation for the Shortest Superstring Problem

    Given a collection of strings S={s_1, ..., s_n} over an alphabet \Sigma, a superstring \alpha of S is a string containing each s_i as a substring; that is, for each i, 1<=i<=n, \alpha contains a block of |s_i| consecutive characters that match s_i exactly. The shortest superstring problem is the problem of finding a superstring \alpha of minimum length. The shortest superstring problem has applications in both data compression and computational biology. In data compression, the problem is a part of a general model of string compression proposed by Gallant, Maier and Storer (JCSS '80). Much of the recent interest in the problem is due to its application to DNA sequence assembly. The problem has been shown to be NP-hard; in fact, it was shown by Blum et al. (JACM '94) to be MAX SNP-hard. The first O(1)-approximation was also due to Blum et al., who gave an algorithm that always returns a superstring no more than 3 times the length of an optimal solution. Several researchers have published results that improve on the approximation ratio; of these, the best previous result is our algorithm ShortString, which achieves a 2 3/4-approximation (WADS '95). We present our new algorithm, G-ShortString, which achieves a ratio of 2 2/3. It generalizes the ShortString algorithm, but the analysis differs substantially from that of ShortString. Our previous work identified classes of strings that have a nested periodic structure, and which must be present in the worst case for our algorithms. We introduced machinery to describe these strings and proved strong structural properties about them. In this paper we extend this study to strings that exhibit a more relaxed form of the same structure, and we use this understanding to obtain our improved result.
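
The definition of a superstring above can be checked mechanically, and for very small inputs a shortest superstring can be found by exhaustive search. The sketch below is only an illustration under stated assumptions: the names `is_superstring`, `merge`, and `shortest_superstring_bruteforce` are hypothetical, and the search tries every ordering of the input, so its running time grows factorially. It is unrelated to the ShortString and G-ShortString algorithms, whose details the abstract does not give.

```python
from itertools import permutations


def is_superstring(alpha: str, strings: list[str]) -> bool:
    """True if every string in `strings` occurs as a substring of `alpha`."""
    return all(s in alpha for s in strings)


def merge(a: str, b: str) -> str:
    """Append `b` to `a`, sharing the longest suffix of `a` that prefixes `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return a + b


def shortest_superstring_bruteforce(strings: list[str]) -> str:
    """Exhaustive search over orderings; feasible only for a handful of strings."""
    # Remove duplicates, then drop strings contained in another input string.
    uniq = list(dict.fromkeys(strings))
    pool = [s for s in uniq if not any(s != t and s in t for t in uniq)]
    best = None
    for order in permutations(pool):
        candidate = ""
        for s in order:
            candidate = merge(candidate, s)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best if best is not None else ""


if __name__ == "__main__":
    strings = ["cabab", "baba", "ababc"]
    alpha = shortest_superstring_bruteforce(strings)
    print(alpha, len(alpha), is_superstring(alpha, strings))
```

Comparing the length returned here with the output of an approximation algorithm on the same small instance gives an empirical approximation ratio for that instance.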