61,468 research outputs found

Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

Authors: Kianfar Kiavash, Luo Haochen, Pockrandt Christopher, Reinert Knut, Torkamandi Bahman
Publication date: 05/03/2018

Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard, allowing the search to start from anywhere within the pattern and extend in both directions. In particular, the use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here, for the first time, we propose a mixed integer program (MIP) capable of solving this optimization problem for Hamming distance with a given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in the bidirectional FM-index over previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, the running time of search using our optimal schemes (for up to two errors) is comparable to the best state-of-the-art aligners, which benefit from combining search in the index with in-text verification using dynamic programming. As a result, we anticipate that a full-fledged aligner that employs an intelligent combination of search in the bidirectional FM-index using our optimal search schemes and in-text verification using dynamic programming will outperform today's best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing as our future work.
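The idea of partitioning the pattern into pieces can be illustrated with the classic pigeonhole filter: if a pattern is split into k + 1 pieces, any occurrence with at most k mismatches must contain at least one piece exactly. The Python sketch below is only an illustration under that assumption. It scans the plain text with `str.find` instead of a bidirectional FM-index, it is not the MIP-derived search scheme of the paper, and the names `hamming` and `pigeonhole_search` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def pigeonhole_search(text: str, pattern: str, k: int) -> list[int]:
    """Start positions where pattern occurs in text with at most k mismatches.

    Assumption: len(pattern) > k, so that every piece is non-empty.
    """
    m = len(pattern)
    if m <= k:
        raise ValueError("pattern must be longer than the number of errors")
    pieces = k + 1
    # Piece boundaries: piece i covers pattern[bounds[i]:bounds[i + 1]].
    bounds = [i * m // pieces for i in range(pieces + 1)]
    hits = set()
    for lo, hi in zip(bounds, bounds[1:]):
        piece = pattern[lo:hi]
        start = text.find(piece)
        while start != -1:
            cand = start - lo  # implied start of the whole pattern
            if 0 <= cand <= len(text) - m:
                # Verify the full candidate window in the text.
                if hamming(text[cand:cand + m], pattern) <= k:
                    hits.add(cand)
            start = text.find(piece, start + 1)
    return sorted(hits)


print(pigeonhole_search("ACGTACGTTACG", "ACGA", 1))  # prints [0, 4]
```

The filter trades index-only search for verification in the text, which is the combination the abstract attributes to existing aligners.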

arXiv.org e-Print Archive

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

Faster Approximate Pattern Matching: A Unified Approach

Authors: Charalampopoulos P., Kociumaka T., Wellnitz P.
Publication date: 01/01/2020

Approximate pattern matching is a natural and well-studied problem on strings: Given a text $T$, a pattern $P$, and a threshold $k$, find (the starting positions of) all substrings of $T$ that are at distance at most $k$ from $P$. We consider the two most fundamental string metrics: the Hamming distance and the edit distance. Under the Hamming distance, we search for substrings of $T$ that have at most $k$ mismatches with $P$, while under the edit distance, we search for substrings of $T$ that can be transformed to $P$ with at most $k$ edits. Exact occurrences of $P$ in $T$ have a very simple structure: If we assume for simplicity that $|T| \le 3|P|/2$ and trim $T$ so that $P$ occurs both as a prefix and as a suffix of $T$, then both $P$ and $T$ are periodic with a common period. However, an analogous characterization for the structure of occurrences with up to $k$ mismatches was proved only recently by Bringmann et al. [SODA'19]: Either there are $O(k^2)$ $k$-mismatch occurrences of $P$ in $T$, or both $P$ and $T$ are at Hamming distance $O(k)$ from strings with a common period $O(m/k)$. We tighten this characterization by showing that there are $O(k)$ $k$-mismatch occurrences in the case when the pattern is not (approximately) periodic, and we lift it to the edit distance setting, where we tightly bound the number of $k$-edit occurrences by $O(k^2)$ in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. We showcase the generality of our framework with results for the fully-compressed setting (where $T$ and $P$ are given as a straight-line program) and for the dynamic setting (where we extend a data structure of Gawrychowski et al. [SODA'18]).
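The structure of exact occurrences stated in the abstract can be checked on a small instance. If $P$ occurs in $T$ both as a prefix and as a suffix, then the shift $d = |T| - |P|$ between the two occurrences is a period of both strings, and the assumption $|T| \le 3|P|/2$ ensures $d \le |P|/2$. The following Python sketch is illustrative only; `has_period` and `common_period` are hypothetical helper names, not taken from the paper.

```python
def has_period(s: str, p: int) -> bool:
    """True if s[i] == s[i + p] for every index i with i + p < len(s)."""
    return all(s[i] == s[i + p] for i in range(len(s) - p))


def common_period(text: str, pattern: str) -> int:
    """Return d = |T| - |P|, a common period of text and pattern.

    Requires the trimmed setting from the abstract: the pattern is both a
    prefix and a suffix of the text, and |T| <= 3|P|/2.
    """
    n, m = len(text), len(pattern)
    if not (text.startswith(pattern) and text.endswith(pattern)):
        raise ValueError("pattern must be a prefix and a suffix of text")
    if 2 * n > 3 * m:
        raise ValueError("text must satisfy |T| <= 3|P|/2")
    # The two occurrences start at 0 and at n - m, so shifting by n - m
    # maps the text onto itself wherever both positions exist.
    return n - m


pattern = "abcabcab"
text = "abcabcabcab"
d = common_period(text, pattern)
print(d, has_period(pattern, d), has_period(text, d))  # prints 3 True True
```

The returned value is a common period, not necessarily the smallest one.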

MPG.PuRe

Faster Approximate Pattern Matching: A Unified Approach

Authors: Charalampopoulos Panagiotis, Kociumaka Tomasz, Wellnitz Philip
Publication date: 01/01/2020

Approximate pattern matching is a natural and well-studied problem on strings: Given a text $T$, a pattern $P$, and a threshold $k$, find (the starting positions of) all substrings of $T$ that are at distance at most $k$ from $P$. We consider the two most fundamental string metrics: the Hamming distance and the edit distance. Under the Hamming distance, we search for substrings of $T$ that have at most $k$ mismatches with $P$, while under the edit distance, we search for substrings of $T$ that can be transformed to $P$ with at most $k$ edits. Exact occurrences of $P$ in $T$ have a very simple structure: If we assume for simplicity that $|T| \le 3|P|/2$ and trim $T$ so that $P$ occurs both as a prefix and as a suffix of $T$, then both $P$ and $T$ are periodic with a common period. However, an analogous characterization for the structure of occurrences with up to $k$ mismatches was proved only recently by Bringmann et al. [SODA'19]: Either there are $O(k^2)$ $k$-mismatch occurrences of $P$ in $T$, or both $P$ and $T$ are at Hamming distance $O(k)$ from strings with a common period $O(m/k)$. We tighten this characterization by showing that there are $O(k)$ $k$-mismatch occurrences in the case when the pattern is not (approximately) periodic, and we lift it to the edit distance setting, where we tightly bound the number of $k$-edit occurrences by $O(k^2)$ in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. We showcase the generality of our framework with results for the fully-compressed setting (where $T$ and $P$ are given as a straight-line program) and for the dynamic setting (where we extend a data structure of Gawrychowski et al. [SODA'18]). Comment: 74 pages, 7 figures, FOCS'2

arXiv.org e-Print Archive

MPG.PuRe

Evaluation of approximate pattern matching algorithms for OCR texts

Authors: Brey Gerhard, Christodoulakis Manolis, Uppal Rizwan Ahmed
Publication date: 01/01/2009

In recent years, a large effort has been under way to digitise old books, articles and newspapers. These documents are scanned and then processed with Optical Character Recognition (OCR) software to obtain their text equivalent. However, due to the (usually) poor quality of the original papers, the OCR software produces text which is not 100% accurate. A simple search for a pattern in the resulting text would retrieve only those occurrences that were accurately interpreted, and would ignore incorrectly spelled or distorted variations. In this paper we make use of the recently devised algorithm by Christodoulakis and Brey (2008), on the edit distance with combinations and splits, to perform approximate pattern matching for OCR texts. We then compare its performance against classic general-purpose approximate matching algorithms.
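As a point of reference for the "classic general-purpose" baselines mentioned above, approximate matching under plain edit distance can be sketched with the textbook dynamic program in which an occurrence may start at any text position. This is an assumption about what such a baseline looks like, not the Christodoulakis and Brey algorithm, and it does not model character combinations or splits. The function name `approx_find` is hypothetical.

```python
def approx_find(text: str, pattern: str, k: int) -> list[int]:
    """End positions (exclusive) of substrings of text within edit distance k
    of pattern, using unit-cost insertions, deletions and substitutions."""
    m = len(pattern)
    # Column for the empty text prefix: matching pattern[:i] costs i deletions.
    prev = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        # Row 0 is always 0 because an occurrence may start anywhere.
        cur = [0]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur.append(min(prev[i - 1] + cost,  # match or substitution
                           prev[i] + 1,         # extra character in the text
                           cur[i - 1] + 1))     # pattern character skipped
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


# An OCR-style misreading of "quick" as "qnick" is found with one error.
print(approx_find("the qnick brown", "quick", 1))  # prints [9]
```

The running time is proportional to the product of the text and pattern lengths, and only two columns of the table are kept in memory.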

UEL Research Repository at University of East London

Improved Periodicity Mining in Time Series Databases

Author: Uppalapati Nithin
Publication venue: The Research Repository @ WVU
Publication date: 01/01/2015

Time series data represents information about real-world phenomena, and periodicity mining explores the interesting periodic behavior that is inherent in the data. Periodicity mining has numerous applications, such as weather forecasting, stock market prediction and analysis, and pattern recognition. Recently, the suffix tree, a powerful data structure that efficiently solves many string-related problems, has been used to gather information about repeated substrings in the text and then perform periodicity mining. However, periodicity mining deals with large amounts of data, which makes it difficult to perform mining in main memory due to the space constraints of the suffix tree. Thus, we first propose the use of the Compressed Suffix Tree (CST) for space-efficient periodicity mining in very large datasets. Given the time-space trade-off that comes with any practical usage of the CST, we provide a comprehensive empirical analysis of the practical usage of CSTs and traditional suffix trees for periodicity mining. Noise is an inherent part of practical time series data, and it is important to mine periods in spite of the noise. This leads to the problem of approximate periodicity mining. Existing algorithms have dealt with the noise introduced between the occurrences of the periodic pattern, but not the noise introduced in the structure of the pattern itself. We present a taxonomy for approximate periodicity and then propose an algorithm that performs periodicity mining in the presence of noise introduced simultaneously in both the structure of the pattern and between the periodic occurrences of the pattern.

The Research Repository @ WVU (West Virginia University)

Linear Algorithm for Conservative Degenerate Pattern Matching

Authors: Crochemore Maxime, Iliopoulos Costas S., Kundu Ritu, Mohamed Manal, Vayani Fatima
Publication date: 15/06/2015

A degenerate symbol x* over an alphabet A is a non-empty subset of A, and a sequence of such symbols is a degenerate string. A degenerate string is said to be conservative if its number of non-solid symbols is upper-bounded by a fixed positive constant k. We consider here the matching problem of conservative degenerate strings and present the first linear-time algorithm that can find, for given degenerate strings P* and T* of total length n containing k non-solid symbols in total, the occurrences of P* in T* in O(nk) time
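The definitions above can be made concrete with a small sketch. It assumes the usual convention that two degenerate symbols match when their sets intersect, and it uses a naive quadratic scan rather than the linear-time algorithm of the paper. Degenerate strings are modelled as lists of Python sets; `is_conservative` and `degenerate_occurrences` are hypothetical names.

```python
def is_conservative(s: list[set[str]], k: int) -> bool:
    """True if s has at most k non-solid symbols (sets with more than one letter)."""
    return sum(len(sym) > 1 for sym in s) <= k


def degenerate_occurrences(text: list[set[str]], pattern: list[set[str]]) -> list[int]:
    """Naive scan: positions i where every pattern symbol intersects text[i + j]."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1)
            if all(text[i + j] & pattern[j] for j in range(m))]


# Text a [bc] a c with one non-solid symbol; pattern a c.
text = [{"a"}, {"b", "c"}, {"a"}, {"c"}]
pattern = [{"a"}, {"c"}]
print(is_conservative(text, 1))                # prints True
print(degenerate_occurrences(text, pattern))   # prints [0, 2]
```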

arXiv.org e-Print Archive

King's Research Portal

The streaming $k$-mismatch problem

Authors: Clifford Raphaël, Kociumaka Tomasz, Porat Ely
Publication date: 09/04/2018

We consider the streaming complexity of a fundamental task in approximate pattern matching: the $k$-mismatch problem. It asks to compute Hamming distances between a pattern of length $n$ and all length-$n$ substrings of a text for which the Hamming distance does not exceed a given threshold $k$. In our problem formulation, we report not only the Hamming distance but also, on demand, the full mismatch information, that is the list of mismatched pairs of symbols and their indices. The twin challenges of streaming pattern matching derive from the need both to achieve small working space and also to guarantee that every arriving input symbol is processed quickly. We present a streaming algorithm for the $k$-mismatch problem which uses $O(k\log{n}\log\frac{n}{k})$ bits of space and spends \ourcomplexity time on each symbol of the input stream, which consists of the pattern followed by the text. The running time almost matches the classic offline solution and the space usage is within a logarithmic factor of optimal. Our new algorithm therefore effectively resolves and also extends an open problem first posed in FOCS'09. En route to this solution, we also give a deterministic $O(k(\log\frac{n}{k} + \log|\Sigma|))$-bit encoding of all the alignments with Hamming distance at most $k$ of a length-$n$ pattern within a text of length $O(n)$. This secondary result provides an optimal solution to a natural communication complexity problem which may be of independent interest. Comment: 27 page
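The problem statement above, including the mismatch information, can be pinned down with a brute-force offline sketch. It keeps the whole text in memory and takes time proportional to the product of the pattern and text lengths, so it has none of the space or per-symbol guarantees of the streaming algorithm; it only illustrates what is being reported. The function name `k_mismatch` and the tuple layout are assumptions made for this example.

```python
def k_mismatch(text: str, pattern: str, k: int) -> dict[int, list[tuple[int, str, str]]]:
    """Map each alignment with Hamming distance <= k to its mismatch information.

    Each mismatch is (index in pattern, pattern symbol, text symbol); the
    Hamming distance of an alignment is the length of its list.
    """
    n = len(pattern)
    result = {}
    for start in range(len(text) - n + 1):
        mismatches = []
        for i in range(n):
            if text[start + i] != pattern[i]:
                mismatches.append((i, pattern[i], text[start + i]))
                if len(mismatches) > k:
                    break  # threshold exceeded, discard this alignment
        else:
            # Loop finished without exceeding k mismatches.
            result[start] = mismatches
    return result


print(k_mismatch("abcabd", "abc", 1))  # prints {0: [], 3: [(2, 'c', 'd')]}
```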

arXiv.org e-Print Archive

Crossref

Explore Bristol Research

Elastic-Degenerate String Matching with 1 Error

Authors: Bernardini Giulia, Gabory Estéban, Pissis Solon P., Stougie Leen, Sweering Michelle, Zuba Wiktor
Publication date: 01/01/2022

An elastic-degenerate string is a sequence of $n$ finite sets of strings of total length $N$, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length $m$ in an ED text. This problem has recently received some attention by the combinatorial pattern matching community, culminating in an $\tilde{\mathcal{O}}(nm^{\omega-1})+\mathcal{O}(N)$-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where $\omega$ denotes the matrix multiplication exponent and the $\tilde{\mathcal{O}}(\cdot)$ notation suppresses polylog factors. In the $k$-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most $k$ errors. $k$-EDSM can be solved in $\mathcal{O}(k^2mG+kN)$ time, under edit distance, or $\mathcal{O}(kmG+kN)$ time, under Hamming distance, where $G$ denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, $G$ is only bounded by $N$, and so even for $k=1$, the existing algorithms run in $\Omega(mN)$ time in the worst case. In this paper we show that $1$-EDSM can be solved in $\mathcal{O}((nm^2 + N)\log m)$ or $\mathcal{O}(nm^3 + N)$ time under edit distance. For the decision version, we present a faster $\mathcal{O}(nm^2\sqrt{\log m} + N\log\log m)$-time algorithm. We also show that $1$-EDSM can be solved in $\mathcal{O}(nm^2 + N\log m)$ time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from $1$-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the $k$-errata trees for indexing with errors [Cole et al., STOC 2004]. Comment: This is an extended version of a paper accepted at LATIN 202

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Trieste

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server