Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index
Finding approximate occurrences of a pattern in a text using a full-text
index is a central problem in bioinformatics and has been extensively
researched. Bidirectional indices have opened new possibilities in this regard,
allowing the search to start from anywhere within the pattern and extend in
both directions. In particular, use of search schemes (partitioning the pattern
and searching the pieces in certain orders with given bounds on errors) can
yield significant speed-ups. However, finding optimal search schemes is a
difficult combinatorial optimization problem.
Here, for the first time, we propose a mixed integer program (MIP) capable of
solving this optimization problem for Hamming distance with a given number of
pieces. Our experiments show that the optimal search schemes found by our MIP
significantly improve the performance of search in bidirectional FM-index upon
previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina
reads (with two errors) becomes 35 times faster than standard backtracking.
Moreover, despite being performed purely in the index, the running time of
search using our optimal schemes (for up to two errors) is comparable to the
best state-of-the-art aligners, which benefit from combining search in index
with in-text verification using dynamic programming. As a result, we anticipate
that a full-fledged aligner employing an intelligent combination of search in the
bidirectional FM-index using our optimal search schemes and in-text
verification using dynamic programming will outperform today's best aligners. The
development of such an aligner, called FAMOUS (Fast Approximate string Matching
using OptimUm search Schemes), is ongoing as our future work.
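The idea of partitioning the pattern and bounding the errors per piece can be illustrated with the simplest such partition, the pigeonhole filter: if a pattern is cut into k + 1 pieces, any occurrence with at most k mismatches must contain at least one piece exactly. The sketch below is only an illustration under stated assumptions: it uses Python's `str.find` in place of a bidirectional FM-index, it verifies candidates in the text rather than purely in the index, and it is not one of the optimal schemes produced by the MIP. The function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def pigeonhole_search(text: str, pattern: str, k: int) -> list[int]:
    """Start positions where pattern occurs in text with at most k mismatches.

    Assumption: len(pattern) >= k + 1, so that every piece is non-empty.
    """
    m = len(pattern)
    pieces = k + 1
    if m < pieces:
        raise ValueError("pattern must have at least k + 1 symbols")
    # Piece i covers pattern[bounds[i]:bounds[i + 1]].
    bounds = [i * m // pieces for i in range(pieces + 1)]
    hits = set()
    for i in range(pieces):
        lo, hi = bounds[i], bounds[i + 1]
        piece = pattern[lo:hi]
        pos = text.find(piece)
        while pos != -1:
            start = pos - lo  # where the whole pattern would begin
            if 0 <= start <= len(text) - m:
                if hamming(text[start:start + m], pattern) <= k:
                    hits.add(start)
            pos = text.find(piece, pos + 1)
    return sorted(hits)


if __name__ == "__main__":
    print(pigeonhole_search("ACGTACGTTACGAACGT", "ACGT", 1))
```

Searching each piece exactly and only then checking the rest keeps the number of candidates small, which is the same intuition behind starting the search at different places within the pattern.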
Faster Approximate Pattern Matching: A Unified Approach
Approximate pattern matching is a natural and well-studied problem on
strings: Given a text $T$, a pattern $P$, and a threshold $k$, find (the
starting positions of) all substrings of $T$ that are at distance at most $k$
from $P$. We consider the two most fundamental string metrics: the Hamming
distance and the edit distance. Under the Hamming distance, we search for
substrings of $T$ that have at most $k$ mismatches with $P$, while under the
edit distance, we search for substrings of $T$ that can be transformed to $P$
with at most $k$ edits.
Exact occurrences of $P$ in $T$ have a very simple structure: If we assume
for simplicity that $|T| \le 3|P|/2$ and trim $T$ so that $P$ occurs both as a
prefix and as a suffix of $T$, then both $P$ and $T$ are periodic with a common
period. However, an analogous characterization for the structure of occurrences
with up to $k$ mismatches was proved only recently by Bringmann et al.
[SODA'19]: Either there are $O(k^2)$ $k$-mismatch occurrences of $P$ in $T$, or
both $P$ and $T$ are at Hamming distance $O(k)$ from strings with a common
period $O(m/k)$. We tighten this characterization by showing that there are
$O(k)$ $k$-mismatch occurrences in the case when the pattern is not
(approximately) periodic, and we lift it to the edit distance setting, where we
tightly bound the number of $k$-edit occurrences by $O(k^2)$ in the
non-periodic case. Our proofs are constructive and let us obtain a unified
framework for approximate pattern matching for both considered distances. We
showcase the generality of our framework with results for the fully-compressed
setting (where $T$ and $P$ are given as a straight-line program) and for the
dynamic setting (where we extend a data structure of Gawrychowski et al.
[SODA'18]). Comment: 74 pages, 7 figures, FOCS'2
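The structure of exact occurrences described in this abstract can be checked on a small example. The sketch below is an illustration only: the strings are chosen by hand so that the text is short relative to the pattern (9 versus 6 symbols) and has the pattern as both prefix and suffix, and the helper names are hypothetical.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Start positions of all exact occurrences of pattern in text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def has_period(s: str, p: int) -> bool:
    """True if s[i] == s[i + p] for every valid i, i.e. p is a period of s."""
    return all(s[i] == s[i + p] for i in range(len(s) - p))


if __name__ == "__main__":
    pattern = "abaaba"
    text = "abaabaaba"  # pattern is both a prefix and a suffix of text
    occ = occurrences(text, pattern)
    gaps = {b - a for a, b in zip(occ, occ[1:])}
    print("occurrences:", occ)
    for p in sorted(gaps):
        # The gap between consecutive occurrences is a period of both strings.
        print(p, has_period(pattern, p), has_period(text, p))
```

In this example the occurrences form an arithmetic progression, and its difference is a common period of the pattern and the text.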
Evaluation of approximate pattern matching algorithms for OCR texts
In recent years, there has been a large-scale effort to digitise old books, articles and
newspapers. These documents are scanned and then processed with Optical Character Recognition
(OCR) software to obtain their text equivalent. However, due to the (usually) poor quality of the
original papers, the OCR software produces text which is not 100% accurate. A simple search for a
pattern in the resulting text would only retrieve those occurrences that were accurately interpreted, but
will ignore incorrectly spelled or distorted variations. In this paper we make use of the recently
devised algorithm by Christodoulakis and Brey (2008), on the edit distance with combinations and
splits, to perform approximate pattern matching for OCR texts. We then compare its performance
against classic general-purpose approximate matching algorithms.
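As an illustration of the kind of classic general-purpose baseline mentioned here, the following is a minimal sketch of approximate matching under plain edit distance by column-wise dynamic programming (often attributed to Sellers). It does not implement the edit distance with combinations and splits of Christodoulakis and Brey; the function name and the example strings are assumptions made for this sketch.

```python
def approx_match_ends(text: str, pattern: str, k: int) -> list[int]:
    """1-based end positions in text of substrings within edit distance k of pattern.

    col[i] holds the smallest edit distance between pattern[:i] and any
    substring of the text ending at the current position.
    """
    m = len(pattern)
    col = list(range(m + 1))  # before reading any text: i deletions
    ends = []
    for j, c in enumerate(text, start=1):
        diag = col[0]  # value of the previous column at row i - 1
        col[0] = 0     # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(diag + cost,     # match or substitution
                         old + 1,         # extra symbol in the text
                         col[i - 1] + 1)  # missing pattern symbol
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    # A typical OCR confusion: "rn" read as "m".
    print(approx_match_ends("a modem era", "modern", 2))
```

With plain edit distance, the confusion of "rn" with "m" costs two operations, which is one motivation for distance measures that treat such combinations and splits as a single error.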
Improved Periodicity Mining in Time Series Databases
Time series data represents information about real-world phenomena, and periodicity mining explores the interesting periodic behavior that is inherent in the data. Periodicity mining has numerous applications, such as weather forecasting, stock market prediction and analysis, and pattern recognition. Recently, the suffix tree, a powerful data structure that efficiently solves many string-related problems, has been used to gather information about repeated substrings in the text and then perform periodicity mining. However, periodicity mining deals with large amounts of data, which makes it difficult to perform mining in main memory due to the space constraints of the suffix tree. Thus, we first propose the use of the Compressed Suffix Tree (CST) for space-efficient periodicity mining in very large datasets. Given the time-space trade-off that comes with any practical usage of the CST, we provide a comprehensive empirical analysis of the practical usage of CSTs and traditional suffix trees for periodicity mining.
Noise is an inherent part of practical time series data, and it is important to mine periods in spite of the noise. This leads to the problem of approximate periodicity mining. Existing algorithms have dealt with the noise introduced between the occurrences of the periodic pattern, but not the noise introduced in the structure of the pattern itself. We present a taxonomy for approximate periodicity and then propose an algorithm that performs periodicity mining in the presence of noise introduced simultaneously in both the structure of the pattern and between the periodic occurrences of the pattern.
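To make the notion of noisy periodicity concrete, the following brute-force sketch scores how regularly a single symbol recurs in a discretized series. It is an assumption-laden illustration: the confidence measure (fraction of expected slots that actually hold the symbol) is one common convention rather than the definition used in this work, no suffix tree or CST is involved, and the function names are hypothetical.

```python
def periodicity_confidence(series: str, symbol: str, period: int, offset: int) -> float:
    """Fraction of positions offset, offset + period, ... that hold symbol."""
    slots = range(offset, len(series), period)
    if len(slots) == 0:
        return 0.0
    hits = sum(1 for i in slots if series[i] == symbol)
    return hits / len(slots)


def mine_periods(series: str, min_conf: float, max_period: int):
    """All (symbol, period, offset, confidence) with confidence >= min_conf."""
    found = []
    for period in range(1, max_period + 1):
        for offset in range(period):
            for symbol in sorted(set(series[offset::period])):
                conf = periodicity_confidence(series, symbol, period, offset)
                if conf >= min_conf:
                    found.append((symbol, period, offset, conf))
    return found


if __name__ == "__main__":
    # The 'd' at position 5 plays the role of noise in an otherwise
    # perfectly periodic series "abcabcabc".
    for row in mine_periods("abcabdabc", min_conf=0.6, max_period=3):
        print(row)
```

Lowering `min_conf` below 1.0 is what lets a period survive a replaced symbol; tolerating inserted or deleted symbols, as discussed in the abstract, needs more than this positional check.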
Linear Algorithm for Conservative Degenerate Pattern Matching
A degenerate symbol x* over an alphabet A is a non-empty subset of A, and a
sequence of such symbols is a degenerate string. A degenerate string is said to
be conservative if its number of non-solid symbols is upper-bounded by a fixed
positive constant k. We consider here the matching problem of conservative
degenerate strings and present the first linear-time algorithm that can find,
for given degenerate strings P* and T* of total length n containing k non-solid
symbols in total, the occurrences of P* in T* in O(nk) time.
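The matching relation on degenerate strings can be sketched directly from the definition: two degenerate symbols match when their sets intersect. The code below is a naive quadratic-time illustration, not the O(nk) algorithm of this work; the representation as a list of Python sets and the function names are assumptions.

```python
from typing import List, Set

DegenerateString = List[Set[str]]


def solid(s: str) -> DegenerateString:
    """Turn an ordinary string into a degenerate string of singleton sets."""
    return [{c} for c in s]


def degenerate_occurrences(text: DegenerateString,
                           pattern: DegenerateString) -> List[int]:
    """Start positions where every pattern symbol intersects the text symbol."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if all(text[i + j] & pattern[j] for j in range(m))]


if __name__ == "__main__":
    text = solid("ACGTA")
    text[2] = {"G", "T"}                     # one non-solid symbol in T*
    pattern = [{"A"}, {"C", "G"}, {"T"}]     # one non-solid symbol in P*
    print(degenerate_occurrences(text, pattern))
```

Here the two strings together contain k = 2 non-solid symbols, so they are conservative for any bound of at least 2.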
The streaming $k$-mismatch problem
We consider the streaming complexity of a fundamental task in approximate
pattern matching: the $k$-mismatch problem. It asks to compute Hamming
distances between a pattern of length $n$ and all length-$n$ substrings of a
text for which the Hamming distance does not exceed a given threshold $k$. In
our problem formulation, we report not only the Hamming distance but also, on
demand, the full mismatch information, that is, the list of mismatched
pairs of symbols and their indices. The twin challenges of streaming pattern
matching derive from the need both to achieve small working space and also to
guarantee that every arriving input symbol is processed quickly.
We present a streaming algorithm for the $k$-mismatch problem which uses
$O(k \log n \log\frac{n}{k})$ bits of space and spends
$O(\log\frac{n}{k}(\sqrt{k \log k} + \log^3 n))$ time on
each symbol of the input stream, which consists of the pattern followed by the
text. The running time almost matches the classic offline solution and the
space usage is within a logarithmic factor of optimal.
Our new algorithm therefore effectively resolves and also extends an open
problem first posed in FOCS'09. En route to this solution, we also give a
deterministic $O(k(\log\frac{n}{k} + \log|\Sigma|))$-bit encoding of all
the alignments with Hamming distance at most $k$ of a length-$n$ pattern within
a text of length $O(n)$. This secondary result provides an optimal solution to
a natural communication complexity problem which may be of independent
interest. Comment: 27 page
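The output required by this problem formulation can be illustrated with a simple offline reference implementation. This is only a sketch of what is reported (distance and mismatch information for every alignment within the threshold); it keeps the whole text in memory and is in no way the streaming algorithm of the paper. The function name and the tuple layout are assumptions.

```python
def k_mismatch(text: str, pattern: str, k: int) -> dict[int, list[tuple[int, str, str]]]:
    """Map each alignment with Hamming distance <= k to its mismatch information.

    The mismatch information is a list of (index in pattern, pattern symbol,
    text symbol); its length is the Hamming distance of that alignment.
    """
    m = len(pattern)
    result = {}
    for start in range(len(text) - m + 1):
        mismatches = []
        for j in range(m):
            if text[start + j] != pattern[j]:
                mismatches.append((j, pattern[j], text[start + j]))
                if len(mismatches) > k:
                    break  # threshold exceeded, skip this alignment
        else:
            result[start] = mismatches
    return result


if __name__ == "__main__":
    for start, info in k_mismatch("abcabd", "abc", 1).items():
        print(start, len(info), info)
```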
Elastic-Degenerate String Matching with 1 Error
An elastic-degenerate (ED) string is a sequence of $n$ finite sets of strings of
total length $N$, introduced to represent a set of related DNA sequences, also
known as a pangenome. The ED string matching (EDSM) problem consists in
reporting all occurrences of a pattern of length $m$ in an ED text. This
problem has recently received some attention by the combinatorial pattern
matching community, culminating in an
$\tilde{O}(nm^{\omega-1}) + O(N)$-time algorithm [Bernardini
et al., SIAM J. Comput. 2022], where $\omega$ denotes the matrix multiplication
exponent and the $\tilde{O}(\cdot)$ notation suppresses polylog
factors. In the $k$-EDSM problem, the approximate version of EDSM, we are asked
to report all pattern occurrences with at most $k$ errors. $k$-EDSM can be
solved in $O(k^2 mG + kN)$ time, under edit distance, or
$O(kmG + kN)$ time, under Hamming distance, where $G$ denotes the total
number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020].
Unfortunately, $G$ is only bounded by $N$, and so even for $k=1$, the existing
algorithms run in $\Omega(mN)$ time in the worst case. In this paper we show
that $1$-EDSM can be solved in $O((nm^2 + N)\log m)$ or
$O(nm^3 + N)$ time under edit distance. For the decision version, we
present a faster $O(nm^2\sqrt{\log m} + N\log\log m)$-time algorithm.
We also show that $1$-EDSM can be solved in $O(nm^2 + N\log m)$ time
under Hamming distance. Our algorithms for edit distance rely on non-trivial
reductions from $1$-EDSM to special instances of classic computational geometry
problems (2d rectangle stabbing or 2d range emptiness), which we show how to
solve efficiently. In order to obtain an even faster algorithm for Hamming
distance, we rely on employing and adapting the $k$-errata trees for indexing
with errors [Cole et al., STOC 2004]. Comment: This is an extended version of a paper accepted at LATIN 202
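For very small inputs, the decision version of approximate ED matching under Hamming distance can be sketched by brute force: enumerate every string spelled by the ED text and scan it. This takes exponential time in general and bears no relation to the algorithms of the paper; it only illustrates the problem statement. The representation of the ED text as a list of sets of strings (with "" for the empty string) and the function names are assumptions.

```python
from itertools import product


def expand(ed_text: list[set[str]]) -> set[str]:
    """All strings obtained by picking one string from each set, in order."""
    return {"".join(choice) for choice in product(*ed_text)}


def ed_match_hamming(ed_text: list[set[str]], pattern: str, k: int = 1) -> bool:
    """True if pattern occurs with at most k mismatches in some expanded string."""
    m = len(pattern)
    for s in expand(ed_text):
        for i in range(len(s) - m + 1):
            if sum(a != b for a, b in zip(s[i:i + m], pattern)) <= k:
                return True
    return False


if __name__ == "__main__":
    # Spells ACGCA, ACTTCA and ACCA.
    ed_text = [{"AC"}, {"G", "TT", ""}, {"CA"}]
    print(ed_match_hamming(ed_text, "CGTC", k=0))
    print(ed_match_hamming(ed_text, "CGTC", k=1))
```

The number of expanded strings is the product of the set sizes, which is why practical algorithms avoid this expansion.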