61,468 research outputs found

    Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

    Full text link
    Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard allowing the search to start from anywhere within the pattern and extend in both directions. In particular, use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here for the first time, we propose a mixed integer program (MIP) capable to solve this optimization problem for Hamming distance with given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in bidirectional FM-index upon previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, the running time of search using our optimal schemes (for up to two errors) is comparable to the best state-of-the-art aligners, which benefit from combining search in index with in-text verification using dynamic programming. As a result, we anticipate a full-fledged aligner that employs an intelligent combination of search in the bidirectional FM-index using our optimal search schemes and in-text verification using dynamic programming outperforms today's best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing as our future work

    Faster Approximate Pattern Matching: {A} Unified Approach

    Get PDF
    Approximate pattern matching is a natural and well-studied problem on strings: Given a text TT, a pattern PP, and a threshold kk, find (the starting positions of) all substrings of TT that are at distance at most kk from PP. We consider the two most fundamental string metrics: the Hamming distance and the edit distance. Under the Hamming distance, we search for substrings of TT that have at most kk mismatches with PP, while under the edit distance, we search for substrings of TT that can be transformed to PP with at most kk edits. Exact occurrences of PP in TT have a very simple structure: If we assume for simplicity that T3P/2|T| \le 3|P|/2 and trim TT so that PP occurs both as a prefix and as a suffix of TT, then both PP and TT are periodic with a common period. However, an analogous characterization for the structure of occurrences with up to kk mismatches was proved only recently by Bringmann et al. [SODA'19]: Either there are O(k2)O(k^2) kk-mismatch occurrences of PP in TT, or both PP and TT are at Hamming distance O(k)O(k) from strings with a common period O(m/k)O(m/k). We tighten this characterization by showing that there are O(k)O(k) kk-mismatch occurrences in the case when the pattern is not (approximately) periodic, and we lift it to the edit distance setting, where we tightly bound the number of kk-edit occurrences by O(k2)O(k^2) in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. We showcase the generality of our framework with results for the fully-compressed setting (where TT and PP are given as a straight-line program) and for the dynamic setting (where we extend a data structure of Gawrychowski et al. [SODA'18])

    Faster Approximate Pattern Matching: A Unified Approach

    Get PDF
    Approximate pattern matching is a natural and well-studied problem on strings: Given a text TT, a pattern PP, and a threshold kk, find (the starting positions of) all substrings of TT that are at distance at most kk from PP. We consider the two most fundamental string metrics: the Hamming distance and the edit distance. Under the Hamming distance, we search for substrings of TT that have at most kk mismatches with PP, while under the edit distance, we search for substrings of TT that can be transformed to PP with at most kk edits. Exact occurrences of PP in TT have a very simple structure: If we assume for simplicity that T3P/2|T| \le 3|P|/2 and trim TT so that PP occurs both as a prefix and as a suffix of TT, then both PP and TT are periodic with a common period. However, an analogous characterization for the structure of occurrences with up to kk mismatches was proved only recently by Bringmann et al. [SODA'19]: Either there are O(k2)O(k^2) kk-mismatch occurrences of PP in TT, or both PP and TT are at Hamming distance O(k)O(k) from strings with a common period O(m/k)O(m/k). We tighten this characterization by showing that there are O(k)O(k) kk-mismatch occurrences in the case when the pattern is not (approximately) periodic, and we lift it to the edit distance setting, where we tightly bound the number of kk-edit occurrences by O(k2)O(k^2) in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. We showcase the generality of our framework with results for the fully-compressed setting (where TT and PP are given as a straight-line program) and for the dynamic setting (where we extend a data structure of Gawrychowski et al. [SODA'18]).Comment: 74 pages, 7 figures, FOCS'2

    Evaluation of approximate pattern matching algorithms for OCR texts

    Get PDF
    In recent years there has been going on a large process of digitising old books, articles and newspapers. These documents are scanned and then processed with Optical Character Recognition (OCR) software to obtain their text equivalent. However, due to the (usually) poor quality of the original papers, the OCR software produces text which is not 100% accurate. A simple search for a pattern in the resulting text would only retrieve those occurrences that were accurately interpreted, but will ignore incorrectly spelled or distorted variations. In this paper we make use of the recently devised algorithm by Christodoulakis and Brey (2008), on the edit distance with combinations and splits, to perform approximate pattern matching for OCR texts. We then compare its performance against classic generalpurpose approximate matching algorithms

    Improved Periodicity Mining in Time Series Databases

    Get PDF
    Time series data represents information about real world phenomena and periodicity mining explores the interesting periodic behavior that is inherent in the data. Periodicity mining has numerous applications such as in weather forecasting, stock market prediction and analysis, pattern recognition, etc. Recently, the suffix tree, a powerful data structure that efficiently solves many strings related problems has been used to gather information about repeated substrings in the text and then perform periodicity mining. However, periodicity mining deals with large amounts of data which makes it difficult to perform mining in main memory due to the space constraints of the suffix tree. Thus, we first propose the use of the Compressed Suffix Tree (CST) for space efficient periodicity mining in very large datasets. Given the time-space trade-off that comes with any practical usage of the CST, we provide a comprehensive empirical analysis on the practical usage of CSTs and traditional suffix trees for periodicity mining.;Noise is an inherent part of practical time series data, and it is important to mine periods in spite of the noise. This leads to the problem of approximate periodicity mining. Existing algorithms have dealt with the noise introduced between the occurrences of the periodic pattern, but not the noise introduced in the structure of the pattern itself. We present a taxonomy for approximate periodicity and then propose an algorithm that performs periodicity mining in the presence of noise introduced simultaneously in both the structure of the pattern and between the periodic occurrences of the pattern

    Linear Algorithm for Conservative Degenerate Pattern Matching

    Full text link
    A degenerate symbol x* over an alphabet A is a non-empty subset of A, and a sequence of such symbols is a degenerate string. A degenerate string is said to be conservative if its number of non-solid symbols is upper-bounded by a fixed positive constant k. We consider here the matching problem of conservative degenerate strings and present the first linear-time algorithm that can find, for given degenerate strings P* and T* of total length n containing k non-solid symbols in total, the occurrences of P* in T* in O(nk) time

    The streaming kk-mismatch problem

    Get PDF
    We consider the streaming complexity of a fundamental task in approximate pattern matching: the kk-mismatch problem. It asks to compute Hamming distances between a pattern of length nn and all length-nn substrings of a text for which the Hamming distance does not exceed a given threshold kk. In our problem formulation, we report not only the Hamming distance but also, on demand, the full \emph{mismatch information}, that is the list of mismatched pairs of symbols and their indices. The twin challenges of streaming pattern matching derive from the need both to achieve small working space and also to guarantee that every arriving input symbol is processed quickly. We present a streaming algorithm for the kk-mismatch problem which uses O(klognlognk)O(k\log{n}\log\frac{n}{k}) bits of space and spends \ourcomplexity time on each symbol of the input stream, which consists of the pattern followed by the text. The running time almost matches the classic offline solution and the space usage is within a logarithmic factor of optimal. Our new algorithm therefore effectively resolves and also extends an open problem first posed in FOCS'09. En route to this solution, we also give a deterministic O(k(lognk+logΣ))O( k (\log \frac{n}{k} + \log |\Sigma|) )-bit encoding of all the alignments with Hamming distance at most kk of a length-nn pattern within a text of length O(n)O(n). This secondary result provides an optimal solution to a natural communication complexity problem which may be of independent interest.Comment: 27 page

    Elastic-Degenerate String Matching with 1 Error

    Get PDF
    An elastic-degenerate string is a sequence of nn finite sets of strings of total length NN, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length mm in an ED text. This problem has recently received some attention by the combinatorial pattern matching community, culminating in an O~(nmω1)+O(N)\tilde{\mathcal{O}}(nm^{\omega-1})+\mathcal{O}(N)-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where ω\omega denotes the matrix multiplication exponent and the O~()\tilde{\mathcal{O}}(\cdot) notation suppresses polylog factors. In the kk-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most kk errors. kk-EDSM can be solved in O(k2mG+kN)\mathcal{O}(k^2mG+kN) time, under edit distance, or O(kmG+kN)\mathcal{O}(kmG+kN) time, under Hamming distance, where GG denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, GG is only bounded by NN, and so even for k=1k=1, the existing algorithms run in Ω(mN)\Omega(mN) time in the worst case. In this paper we show that 11-EDSM can be solved in O((nm2+N)logm)\mathcal{O}((nm^2 + N)\log m) or O(nm3+N)\mathcal{O}(nm^3 + N) time under edit distance. For the decision version, we present a faster O(nm2logm+Nloglogm)\mathcal{O}(nm^2\sqrt{\log m} + N\log\log m)-time algorithm. We also show that 11-EDSM can be solved in O(nm2+Nlogm)\mathcal{O}(nm^2 + N\log m) time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from 11-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the kk-errata trees for indexing with errors [Cole et al., STOC 2004].Comment: This is an extended version of a paper accepted at LATIN 202
    corecore