18 research outputs found
Approximation algorithms for the shortest common superstring problem
Abstract: The object of the shortest common superstring problem (SCS) is to find the shortest possible string that contains every string in a given set as a substring. As the problem is NP-complete, approximation algorithms are of interest. The value of an approximate solution to SCS is normally taken to be its length, and we seek algorithms that make the length as small as possible. A different measure is given by the sum of the overlaps between consecutive strings in a candidate solution. When considering this measure, the object is to find solutions that make it as large as possible. These two measures offer different ways of viewing the problem. While the two viewpoints are equivalent with respect to optimal solutions, they differ with respect to approximate solutions. We describe several approximation algorithms that produce solutions that are always within a factor of two of optimum with respect to the overlap measure. We also describe an efficient implementation of one of these, using McCreight's compact suffix tree construction algorithm. The worst-case running time is O(m log n) for small alphabets, where m is the sum of the lengths of all the strings in the set and n is the number of strings. For large alphabets, the algorithm can be implemented in O(m log m) time by using Sleator and Tarjan's lexicographic splay tree data structure.
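To make the two measures concrete, here is a minimal sketch of a naive greedy merge, assuming exact overlaps only. It is not the suffix-tree implementation the abstract mentions, and it is not claimed to be one of the paper's algorithms; the function names `overlap` and `greedy_superstring` are hypothetical, and the pairwise search is far slower than the stated O(m log n) bound. It returns both the superstring and the total overlap, so the relation between the two measures (length = sum of input lengths minus total overlap) can be checked directly.

```python
from typing import List, Tuple


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: List[str]) -> Tuple[str, int]:
    """Repeatedly merge the pair with the largest overlap (illustrative only).

    Returns (superstring, total_overlap).
    """
    # Drop duplicates, then drop strings already contained in another string.
    uniq = list(dict.fromkeys(strings))
    items = [s for s in uniq if not any(s != t and s in t for t in uniq)]

    total_overlap = 0
    while len(items) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = items[i] + items[j][k:]
        total_overlap += k
        items = [s for idx, s in enumerate(items) if idx not in (i, j)]
        items.append(merged)

    return (items[0] if items else ""), total_overlap


if __name__ == "__main__":
    pieces = ["abc", "bcd", "cde"]
    superstring, saved = greedy_superstring(pieces)
    print(superstring, saved)
    # Length measure and overlap measure describe the same solution:
    print(len(superstring) == sum(map(len, pieces)) - saved)
```

The quadratic pair search is chosen for readability; the abstract's efficient version replaces it with suffix-tree machinery.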
Algorithms for constructing a consensus sequence
Biological and physical limitations require that DNA be sequenced in fragments. There are several approaches to obtaining appropriately sized fragments of DNA to sequence. The method of sequencing that we are interested in is loosely referred to as shotgun sequencing. Many copies of the genomic DNA to be sequenced are cleaved by one or more restriction endonucleases, resulting in a multiset, S, of DNA fragments that are not ordered. DNA fragments are essentially selected at random from this multiset and sequenced. A consensus sequence is constructed by joining together fragments that overlap. (One hopes that the consensus sequence is very close to the original sequence.) Since errors occur in reading the sequences, the overlaps must be approximate, not exact.
This process of reassembly is similar to the NP-complete shortest common superstring problem [GMS80]. To simplify the problem we make the following assumptions.
• An integer k can be supplied that defines the minimum acceptable overlap between two sequences.
• There is a unique alignment of the sequence fragments such that all suffix/prefix overlaps are of length k or greater.
• All suffix/prefix overlaps are exact (log inexact) matches.
We define the string consensus problem and give three algorithms to solve it. We then define the log inexact string consensus problem and give three algorithms to solve it. We believe that the log inexact string consensus problem is closer to the problem of constructing a consensus sequence from shotgun data that biochemists are trying to solve than the problems previous approximation algorithms for the shortest common superstring problem
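The simplifying assumptions listed above (a minimum overlap k, a unique alignment, exact matches) can be illustrated with a small sketch. This is not one of the three algorithms the abstract refers to; it is a naive quadratic chaining procedure written only to show how the assumptions constrain the problem. The names `overlap_at_least` and `chain_fragments` are hypothetical, k is assumed to be at least 1, and the log inexact case is not handled.

```python
from typing import Dict, List, Set, Tuple


def overlap_at_least(a: str, b: str, k: int) -> int:
    """Longest suffix of `a` equal to a prefix of `b`; 0 if shorter than k."""
    for n in range(min(len(a), len(b)), k - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def chain_fragments(fragments: List[str], k: int) -> str:
    """Join fragments along exact suffix/prefix overlaps of length >= k.

    Raises ValueError when the uniqueness assumption does not hold.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    successor: Dict[int, Tuple[int, int]] = {}  # i -> (j, overlap length)
    has_predecessor: Set[int] = set()

    for i, a in enumerate(fragments):
        for j, b in enumerate(fragments):
            if i == j:
                continue
            n = overlap_at_least(a, b, k)
            if n:
                if i in successor or j in has_predecessor:
                    raise ValueError("alignment is not unique")
                successor[i] = (j, n)
                has_predecessor.add(j)

    starts = [i for i in range(len(fragments)) if i not in has_predecessor]
    if len(starts) != 1:
        raise ValueError("fragments do not form a single chain")

    # Walk the chain from the only fragment that has no predecessor.
    i = starts[0]
    result = fragments[i]
    visited = 1
    while i in successor:
        j, n = successor[i]
        result += fragments[j][n:]
        i = j
        visited += 1

    if visited != len(fragments):
        raise ValueError("fragments do not form a single chain")
    return result


if __name__ == "__main__":
    reads = ["TACGGA", "GGATTC", "ACGTAC"]  # unordered, as in shotgun data
    print(chain_fragments(reads, k=3))
```

With k set too low, spurious short overlaps would create competing successors and the uniqueness check would fail, which is one way to see why the minimum-overlap parameter matters.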
Computational Molecular Biology
Computational Biology is a fairly new subject that arose in response to the computational problems posed by the analysis and the processing of biomolecular sequence and structure data. The field was initiated in the late 60's and early 70's largely by pioneers working in the life sciences. Physicists and mathematicians entered the field in the 70's and 80's, while Computer Science became involved with the new biological problems in the late 1980's. Computational problems have gained further importance in molecular biology through the various genome projects which produce enormous amounts of data. For this bibliography we focus on those areas of computational molecular biology that involve discrete algorithms or discrete optimization. We thus neglect several other areas of computational molecular biology, like most of the literature on the protein folding problem, as well as databases for molecular and genetic data, and genetic mapping algorithms. Due to the availability of review papers and a bibliography this bibliography