49 research outputs found

    Greedy Shortest Common Superstring Approximation in Compact Space

    Given a set of strings, the shortest common superstring problem is to find the shortest possible string that contains all the input strings. The problem is NP-hard, but a lot of work has gone into designing approximation algorithms for solving the problem. We present the first time- and space-efficient implementation of the classic greedy heuristic, which merges strings in decreasing order of overlap length. Our implementation works in O(n log σ) time and bits of space, where n is the total length of the input strings in characters, and σ is the size of the alphabet. After index construction, a practical implementation of our algorithm uses roughly 5n log σ bits of space and reasonable time for a real dataset that consists of DNA fragments.
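
    The greedy heuristic described above can be sketched in a few lines. This is a naive quadratic-time illustration only, not the compact-space implementation from the paper; the function names `overlap` and `greedy_superstring`, the tie-breaking rule (first pair found wins), and the sample input are assumptions made for the example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Repeatedly merge the ordered pair with the largest overlap."""
    # Remove duplicates and strings already contained in another input string.
    unique = list(dict.fromkeys(strings))
    pool = [s for s in unique if not any(s != t and s in t for t in unique)]

    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:  # ties: keep the first pair found (assumption)
                        best_k, best_i, best_j = k, i, j
        # Merge the chosen pair, writing the shared part only once.
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)

    return pool[0] if pool else ""


print(greedy_superstring(["ACGT", "GTAC", "TACG"]))  # GTACGT
```

    Each pass scans all ordered pairs, so this sketch is far from the O(n log σ) bound quoted in the abstract; it is meant only to make the merge order concrete.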

    Collapsing Superstring Conjecture

    In the Shortest Common Superstring (SCS) problem, one is given a collection of strings, and needs to find a shortest string containing each of them as a substring. SCS admits a (2 + 11/23)-approximation in polynomial time (Mucha, SODA'13). While this algorithm and its analysis are technically involved, the 30-year-old Greedy Conjecture claims that the trivial and efficient Greedy Algorithm gives a 2-approximation for SCS. We develop a graph-theoretic framework for studying approximation algorithms for SCS. The framework is reminiscent of the classical 2-approximation for Traveling Salesman: take two copies of an optimal solution, apply a trivial edge-collapsing procedure, and get an approximate solution. In this framework, we observe two surprising properties of SCS solutions, and we conjecture that they hold for all input instances. The first conjecture, which we call the Collapsing Superstring conjecture, claims that there is an elementary way to transform any solution repeated twice into the same graph G. This conjecture would give an elementary 2-approximation algorithm for SCS. The second conjecture claims that not only is the resulting graph G the same for all solutions, but that G can be computed by an elementary greedy procedure called the Greedy Hierarchical Algorithm. While the second conjecture clearly implies the first one, perhaps surprisingly we prove their equivalence. We support these equivalent conjectures by giving a proof for the special case where all input strings have length at most 3 (which until recently had been the only case where the Greedy Conjecture was proven). We also tested our conjectures on millions of instances of SCS. We prove that the standard Greedy Conjecture implies the Greedy Hierarchical Conjecture, while the latter is sufficient for an efficient greedy 2-approximation of SCS. Apart from its (conjectured) good approximation ratio, the Greedy Hierarchical Algorithm provably finds a 3.5-approximation, and finds exact solutions for the special cases where we know polynomial-time (non-greedy) exact algorithms: (1) when the input strings form a spectrum of a string, and (2) when all input strings have length at most 2.

    Superstrings with multiplicities

    A superstring of a set of words P = {s1, …, sp} is a string that contains each word of P as a substring. Given P, the well-known Shortest Linear Superstring problem (SLS) asks for a shortest superstring of P. In a variant of SLS, called Multi-SLS, each word si comes with an integer m(i), its multiplicity, that sets a constraint on its number of occurrences, and the goal is to find a shortest superstring that contains at least m(i) occurrences of si. Multi-SLS generalizes SLS and is obviously as hard to solve, but it has been studied only in special cases (with words of length 2 or with a fixed number of words). The approximability of Multi-SLS in the general case remains open. Here, we study the approximability of Multi-SLS and that of the companion problem Multi-SCCS, which asks for a shortest cyclic cover instead of a shortest superstring. First, we investigate the approximation of a greedy algorithm for maximizing the compression offered by a superstring or by a cyclic cover: the approximation ratio is 1/2 for Multi-SLS and 1 for Multi-SCCS. Then, we exhibit a linear-time approximation algorithm, Concat-Greedy, and show it achieves a ratio of 4 regarding the superstring length. This demonstrates that for both measures Multi-SLS belongs to the class of APX problems. © 2018 Yoshifumi Sakai; licensed under Creative Commons License CC-BY.

    Approximation algorithms for the shortest common superstring problem

    The object of the shortest common superstring problem (SCS) is to find the shortest possible string that contains every string in a given set as a substring. As the problem is NP-complete, approximation algorithms are of interest. The value of an approximate solution to SCS is normally taken to be its length, and we seek algorithms that make the length as small as possible. A different measure is given by the sum of the overlaps between consecutive strings in a candidate solution. When considering this measure, the object is to find solutions that make it as large as possible. These two measures offer different ways of viewing the problem. While the two viewpoints are equivalent with respect to optimal solutions, they differ with respect to approximate solutions. We describe several approximation algorithms that produce solutions that are always within a factor of two of optimum with respect to the overlap measure. We also describe an efficient implementation of one of these, using McCreight's compact suffix tree construction algorithm. The worst-case running time is O(m log n) for small alphabets, where m is the sum of the lengths of all the strings in the set and n is the number of strings. For large alphabets, the algorithm can be implemented in O(m log m) time by using Sleator and Tarjan's lexicographic splay tree data structure.
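
    A short worked example may help show why the length and overlap measures agree on optimal solutions but not on approximate ones. The symbols L and ov and all numbers below are illustrative assumptions, not values from the paper. The identity assumes a superstring S built by merging substring-free strings in a fixed order, using the maximal overlap between consecutive strings.

```latex
% Length and total overlap are complementary for a fixed total input length L:
\[
  |S| \;=\; L \;-\; \sum_{i=1}^{n-1} \mathrm{ov}(s_i, s_{i+1}),
  \qquad L \;=\; \sum_{i=1}^{n} |s_i| .
\]
% Maximizing the overlap sum therefore minimizes |S|, so optima coincide.

% Hypothetical instance A: L = 100, optimal overlap 60, optimal length 40.
% An overlap within a factor of two of optimum only guarantees:
\[
  \text{overlap} \ge 30
  \;\Longrightarrow\;
  |S| \le 100 - 30 = 70,
  \qquad \frac{70}{40} = 1.75 .
\]

% Hypothetical instance B: L = 100, optimal overlap 90, optimal length 10.
\[
  \text{overlap} \ge 45
  \;\Longrightarrow\;
  |S| \le 100 - 45 = 55,
  \qquad \frac{55}{10} = 5.5 .
\]
```

    In these two hypothetical instances the same factor-two guarantee on overlap gives length ratios of 1.75 and 5.5, which illustrates why a bound under one measure does not transfer directly to the other.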

    Approximating $(k,\ell)$-center clustering for curves

    The Euclidean $k$-center problem is a classical problem that has been extensively studied in computer science. Given a set $\mathcal{G}$ of $n$ points in Euclidean space, the problem is to determine a set $\mathcal{C}$ of $k$ centers (not necessarily part of $\mathcal{G}$) such that the maximum distance between a point in $\mathcal{G}$ and its nearest neighbor in $\mathcal{C}$ is minimized. In this paper we study the corresponding $(k,\ell)$-center problem for polygonal curves under the Fréchet distance, that is, given a set $\mathcal{G}$ of $n$ polygonal curves in $\mathbb{R}^d$, each of complexity $m$, determine a set $\mathcal{C}$ of $k$ polygonal curves in $\mathbb{R}^d$, each of complexity $\ell$, such that the maximum Fréchet distance of a curve in $\mathcal{G}$ to its closest curve in $\mathcal{C}$ is minimized. In this paper, we substantially extend and improve the known approximation bounds for curves in dimension $2$ and higher. We show that, if $\ell$ is part of the input, then there is no polynomial-time approximation scheme unless $\mathsf{P}=\mathsf{NP}$. Our constructions yield different bounds for one- and two-dimensional curves and the discrete and continuous Fréchet distance. In the case of the discrete Fréchet distance on two-dimensional curves, we show hardness of approximation within a factor close to $2.598$. This result also holds when $k=1$, and the $\mathsf{NP}$-hardness extends to the case that $\ell=\infty$, i.e., for the problem of computing the minimum-enclosing ball under the Fréchet distance. Finally, we observe that a careful adaptation of Gonzalez' algorithm in combination with a curve simplification yields a $3$-approximation in any dimension, provided that an optimal simplification can be computed exactly. We conclude that our approximation bounds are close to being tight. Comment: 24 pages; results on minimum-enclosing ball added, additional author added, general revision.
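
    For orientation, Gonzalez' algorithm in its original setting (points rather than curves) is a farthest-first traversal. The sketch below covers only that classical point version under the Euclidean distance; it does not implement the paper's adaptation to polygonal curves, the Fréchet distance, or curve simplification. The function names `gonzalez_k_center` and `covering_radius`, the choice of the first point as the initial center, and the sample data are assumptions for illustration. Note that this variant picks centers from the input set.

```python
import math


def gonzalez_k_center(points, k):
    """Farthest-first traversal: pick k centers from `points`."""
    if not points or k <= 0:
        return []
    centers = [points[0]]  # arbitrary first center (assumption: first input point)
    # dist[i] = distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        far = max(range(len(points)), key=dist.__getitem__)
        if dist[far] == 0:  # every point already coincides with a center
            break
        centers.append(points[far])
        dist = [min(d, math.dist(p, points[far])) for d, p in zip(dist, points)]
    return centers


def covering_radius(points, centers):
    """Maximum distance from any point to its nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)


pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
chosen = gonzalez_k_center(pts, 2)
print(chosen, covering_radius(pts, chosen))  # [(0, 0), (11, 0)] 1.0
```

    Each new center is the point currently farthest from all chosen centers, so the covering radius never increases as centers are added.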

    Study of the NP-completeness of the compact table problem

    ISBN 978-5-94057-377-7. The problem of compact tables is to maximise the overlap when building a word that must include a permutation of each of the given words (all the words having the same length). This problem is shown to be NP-complete in the general case, and some specific restrictions are studied.

    Compressed multiple pattern matching
