21,353 research outputs found
Pattern-matching with bounded gaps in genomic sequences
Recently, some pattern matching algorithms allowing gaps were introduced in Crochemore et al. [Approximate string matching with gaps. Nordic Journal of Computing, 9 (2002):54–65, 2002], where upper-bounded, strict-bounded and unbounded gaps were considered. In this paper we further extend these restrictions on the gaps to permit lower-bounded and (lower-upper)-bounded gaps that we simply refer to as (a,b)-bounded gaps. We give formal definitions for these problems as well as their respective algorithmic solutions
String Indexing for Patterns with Wildcards
We consider the problem of indexing a string of length to report the
occurrences of a query pattern containing characters and wildcards.
Let be the number of occurrences of in , and the size of
the alphabet. We obtain the following results.
- A linear space index with query time .
This significantly improves the previously best known linear space index by Lam
et al. [ISAAC 2007], which requires query time in the worst case.
- An index with query time using space , where is the maximum number of wildcards allowed in the pattern.
This is the first non-trivial bound with this query time.
- A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].
We also show that these indexes can be generalized to allow variable length
gaps in the pattern. Our results are obtained using a novel combination of
well-known and new techniques, which could be of independent interest
String Matching with Variable Length Gaps
We consider string matching with variable length gaps. Given a string and
a pattern consisting of strings separated by variable length gaps
(arbitrary strings of length in a specified range), the problem is to find all
ending positions of substrings in that match . This problem is a basic
primitive in computational biology applications. Let and be the lengths
of and , respectively, and let be the number of strings in . We
present a new algorithm achieving time and space , where is the sum of the lower bounds of the lengths of the gaps in
and is the total number of occurrences of the strings in
within . Compared to the previous results this bound essentially achieves
the best known time and space complexities simultaneously. Consequently, our
algorithm obtains the best known bounds for almost all combinations of ,
, , , and . Our algorithm is surprisingly simple and
straightforward to implement. We also present algorithms for finding and
encoding the positions of all strings in for every match of the pattern.Comment: draft of full version, extended abstract at SPIRE 201
Palindromic Decompositions with Gaps and Errors
Identifying palindromes in sequences has been an interesting line of research
in combinatorics on words and also in computational biology, after the
discovery of the relation of palindromes in the DNA sequence with the HIV
virus. Efficient algorithms for the factorization of sequences into palindromes
and maximal palindromes have been devised in recent years. We extend these
studies by allowing gaps in decompositions and errors in palindromes, and also
imposing a lower bound to the length of acceptable palindromes.
We first present an algorithm for obtaining a palindromic decomposition of a
string of length n with the minimal total gap length in time O(n log n * g) and
space O(n g), where g is the number of allowed gaps in the decomposition. We
then consider a decomposition of the string in maximal \delta-palindromes (i.e.
palindromes with \delta errors under the edit or Hamming distance) and g
allowed gaps. We present an algorithm to obtain such a decomposition with the
minimal total gap length in time O(n (g + \delta)) and space O(n g).Comment: accepted to CSR 201
Finding approximate palindromes in strings
We introduce a novel definition of approximate palindromes in strings, and
provide an algorithm to find all maximal approximate palindromes in a string
with up to errors. Our definition is based on the usual edit operations of
approximate pattern matching, and the algorithm we give, for a string of size
on a fixed alphabet, runs in time. We also discuss two
implementation-related improvements to the algorithm, and demonstrate their
efficacy in practice by means of both experiments and an average-case analysis
- …