2,953 research outputs found
Linear Algorithm for Conservative Degenerate Pattern Matching
A degenerate symbol x* over an alphabet A is a non-empty subset of A, and a
sequence of such symbols is a degenerate string. A degenerate string is said to
be conservative if its number of non-solid symbols is upper-bounded by a fixed
positive constant k. We consider here the matching problem of conservative
degenerate strings and present the first linear-time algorithm that can find,
for given degenerate strings P* and T* of total length n containing k non-solid
symbols in total, the occurrences of P* in T* in O(nk) time
A practical index for approximate dictionary matching with few mismatches
Approximate dictionary matching is a classic string matching problem
(checking if a query string occurs in a collection of strings) with
applications in, e.g., spellchecking, online catalogs, geolocation, and web
searchers. We present a surprisingly simple solution called a split index,
which is based on the Dirichlet principle, for matching a keyword with few
mismatches, and experimentally show that it offers competitive space-time
tradeoffs. Our implementation in the C++ language is focused mostly on data
compaction, which is beneficial for the search speed (e.g., by being cache
friendly). We compare our solution with other algorithms and we show that it
performs better for the Hamming distance. Query times in the order of 1
microsecond were reported for one mismatch for the dictionary size of a few
megabytes on a medium-end PC. We also demonstrate that a basic compression
technique consisting in -gram substitution can significantly reduce the
index size (up to 50% of the input text size for the DNA), while still keeping
the query time relatively low
The k-mismatch problem revisited
We revisit the complexity of one of the most basic problems in pattern
matching. In the k-mismatch problem we must compute the Hamming distance
between a pattern of length m and every m-length substring of a text of length
n, as long as that Hamming distance is at most k. Where the Hamming distance is
greater than k at some alignment of the pattern and text, we simply output
"No".
We study this problem in both the standard offline setting and also as a
streaming problem. In the streaming k-mismatch problem the text arrives one
symbol at a time and we must give an output before processing any future
symbols. Our main results are as follows:
1) Our first result is a deterministic time offline algorithm for k-mismatch on a text of length n. This is a
factor of k improvement over the fastest previous result of this form from SODA
2000 by Amihood Amir et al.
2) We then give a randomised and online algorithm which runs in the same time
complexity but requires only space in total.
3) Next we give a randomised -approximation algorithm for the
streaming k-mismatch problem which uses
space and runs in worst-case time per
arriving symbol.
4) Finally we combine our new results to derive a randomised
space algorithm for the streaming k-mismatch problem
which runs in worst-case time per
arriving symbol. This improves the best previous space complexity for streaming
k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We
also improve the time complexity of this previous result by an even greater
factor to match the fastest known offline algorithm (up to logarithmic
factors)
Efficient Computation of Sequence Mappability
Sequence mappability is an important task in genome re-sequencing. In the
-mappability problem, for a given sequence of length , our goal
is to compute a table whose th entry is the number of indices such
that length- substrings of starting at positions and have at
most mismatches. Previous works on this problem focused on heuristic
approaches to compute a rough approximation of the result or on the case of
. We present several efficient algorithms for the general case of the
problem. Our main result is an algorithm that works in time and space for
. It requires a carefu l adaptation of the technique of Cole
et al.~[STOC 2004] to avoid multiple counting of pairs of substrings. We also
show -time algorithms to compute all results for a fixed
and all or a fixed and all . Finally we show
that the -mappability problem cannot be solved in strongly subquadratic
time for unless the Strong Exponential Time Hypothesis
fails.Comment: Accepted to SPIRE 201
The streaming -mismatch problem
We consider the streaming complexity of a fundamental task in approximate
pattern matching: the -mismatch problem. It asks to compute Hamming
distances between a pattern of length and all length- substrings of a
text for which the Hamming distance does not exceed a given threshold . In
our problem formulation, we report not only the Hamming distance but also, on
demand, the full \emph{mismatch information}, that is the list of mismatched
pairs of symbols and their indices. The twin challenges of streaming pattern
matching derive from the need both to achieve small working space and also to
guarantee that every arriving input symbol is processed quickly.
We present a streaming algorithm for the -mismatch problem which uses
bits of space and spends \ourcomplexity time on
each symbol of the input stream, which consists of the pattern followed by the
text. The running time almost matches the classic offline solution and the
space usage is within a logarithmic factor of optimal.
Our new algorithm therefore effectively resolves and also extends an open
problem first posed in FOCS'09. En route to this solution, we also give a
deterministic -bit encoding of all
the alignments with Hamming distance at most of a length- pattern within
a text of length . This secondary result provides an optimal solution to
a natural communication complexity problem which may be of independent
interest.Comment: 27 page
- …