74 research outputs found
Towards a better solution to the shortest common supersequence problem: the deposition and reduction algorithm
BACKGROUND: The problem of finding a Shortest Common Supersequence (SCS) of a set of sequences is an important problem with applications in many areas. It is a key problem in biological sequences analysis. The SCS problem is well-known to be NP-complete. Many heuristic algorithms have been proposed. Some heuristics work well on a few long sequences (as in sequence comparison applications); others work well on many short sequences (as in oligo-array synthesis). Unfortunately, most do not work well on large SCS instances where there are many, long sequences. RESULTS: In this paper, we present a Deposition and Reduction (DR) algorithm for solving large SCS instances of biological sequences. There are two processes in our DR algorithm: deposition process, and reduction process. The deposition process is responsible for generating a small set of common supersequences; and the reduction process shortens these common supersequences by removing some characters while preserving the common supersequence property. Our evaluation on simulated data and real DNA and protein sequences show that our algorithm consistently produces the best results compared to many well-known heuristic algorithms, and especially on large instances. CONCLUSION: Our DR algorithm provides a partial answer to the open problem of designing efficient heuristic algorithm for SCS problem on many long sequences. Our algorithm has a bounded approximation ratio. The algorithm is efficient, both in running time and space complexity and our evaluation shows that it is practical even for SCS problems on many long sequences
Weighted Shortest Common Supersequence problem revisited
A weighted string, also known as a position weight matrix, is a sequence of probability distributions over some alphabet. We revisit the Weighted Shortest Common Supersequence (WSCS) problem, introduced by Amir et al. [SPIRE 2011], that is, the SCS problem on weighted strings. In the WSCS problem, we are given two weighted strings (Formula presented) and (Formula presented) and a threshold (Formula presented) on probability, and we are asked to compute the shortest (standard) string S such that both (Formula presented) and (Formula presented) match subsequences of S (not necessarily the same
Analysis of the Relationships among Longest Common Subsequences, Shortest Common Supersequences and Patterns and its application on Pattern Discovery in Biological Sequences
For a set of mulitple sequences, their patterns,Longest Common Subsequences
(LCS) and Shortest Common Supersequences (SCS) represent different aspects of
these sequences profile, and they can all be used for biological sequence
comparisons and analysis. Revealing the relationship between the patterns and
LCS,SCS might provide us with a deeper view of the patterns of biological
sequences, in turn leading to better understanding of them. However, There is
no careful examinaton about the relationship between patterns, LCS and SCS. In
this paper, we have analyzed their relation, and given some lemmas. Based on
their relations, a set of algorithms called the PALS (PAtterns by Lcs and Scs)
algorithms are propsoed to discover patterns in a set of biological sequences.
These algorithms first generate the results for LCS and SCS of sequences by
heuristic, and consequently derive patterns from these results. Experiments
show that the PALS algorithms perform well (both in efficiency and in accuracy)
on a variety of sequences. The PALS approach also provides us with a solution
for transforming between the heuristic results of SCS and LCS.Comment: Extended version of paper presented in IEEE BIBE 2006 submitted to
journal for revie
Subsequences and Supersequences of Strings
Stringology - the study of strings - is a branch of algorithmics which been the sub-ject of mounting interest in recent years. Very recently, two books [M. Crochemore and W. Rytter, Text Algorithms, Oxford University Press, 1995] and [G. Stephen, String Searching Algorithms, World Scientific, 1994] have been published on the subject and at least two others are known to be in preparation. Problems on strings arise in information retrieval, version control, automatic spelling correction, and many other domains. However the greatest motivation for recent work in stringology has come from the field of molecular biology. String problems occur, for example, in genetic sequence construction, genetic sequence comparison, and phylogenetic tree construction. In this thesis we study a variety of string problems from a theoretical perspective. In particular, we focus on problems involving subsequences and supersequences of strings
- …