
4,779 research outputs found

Approximate String Matching Using a Bidirectional Index

Author: B. Langmead
D. Belazzougui
H. Li
H. Li
J.T. Simpson
L.M.S. Russo
T. Schnattinger
T.-W. Lam
W.-K. Sung
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

We study strategies of approximate pattern matching that exploit bidirectional text indexes, extending and generalizing ideas of [5]. We introduce a formalism, called search schemes, to specify search strategies of this type, then develop a probabilistic measure for the efficiency of a search scheme, prove several combinatorial results on efficient search schemes, and finally provide experimental computations supporting the superiority of our strategies.

arXiv.org e-Print Archive

CiteSeerX

Crossref

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

Author: Kianfar Kiavash
Luo Haochen
Pockrandt Christopher
Reinert Knut
Torkamandi Bahman
Publication date: 05/03/2018

Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard, allowing the search to start from anywhere within the pattern and extend in both directions. In particular, the use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here, for the first time, we propose a mixed integer program (MIP) capable of solving this optimization problem for Hamming distance with a given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in a bidirectional FM-index over previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, the running time of search using our optimal schemes (for up to two errors) is comparable to that of the best state-of-the-art aligners, which benefit from combining search in the index with in-text verification using dynamic programming. As a result, we anticipate that a full-fledged aligner that employs an intelligent combination of search in the bidirectional FM-index using our optimal search schemes and in-text verification using dynamic programming will outperform today's best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing as our future work.
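The parenthetical definition of a search scheme above can be made concrete with a small validity check. The following is a minimal sketch, not the authors' MIP: it assumes a search is written as a triple (order of pieces, lower bounds, upper bounds) on the cumulative number of errors, and it tests whether every distribution of at most k errors over the pieces is covered by some search. The names `covers`, `uncovered`, and `SCHEME_K2`, and the three-search scheme itself, are illustrative assumptions rather than values taken from this abstract.

```python
from itertools import product

# Assumed representation: a search is (order, lower, upper).
#   order    - 1-based piece indices in the order they are searched
#   lower[i] - minimum cumulative errors allowed after the (i+1)-th searched piece
#   upper[i] - maximum cumulative errors allowed after the (i+1)-th searched piece

def covers(search, errors):
    """Return True if the search admits the given per-piece error counts."""
    order, lower, upper = search
    total = 0
    for piece, lo, hi in zip(order, lower, upper):
        total += errors[piece - 1]
        if not lo <= total <= hi:
            return False
    return True

def uncovered(scheme, pieces, k):
    """List every error distribution with at most k errors that no search covers."""
    return [
        errors
        for errors in product(range(k + 1), repeat=pieces)
        if sum(errors) <= k and not any(covers(s, errors) for s in scheme)
    ]

# Illustrative scheme for k = 2 errors and 3 pieces (an assumption for this sketch).
SCHEME_K2 = [
    ((1, 2, 3), (0, 0, 0), (0, 2, 2)),  # left to right, first piece exact
    ((3, 2, 1), (0, 0, 0), (0, 1, 2)),  # right to left, last piece exact
    ((2, 3, 1), (0, 0, 1), (0, 1, 2)),  # start in the middle, extend both ways
]

if __name__ == "__main__":
    print(uncovered(SCHEME_K2, pieces=3, k=2))      # complete scheme
    print(uncovered(SCHEME_K2[:2], pieces=3, k=2))  # third search removed
```

A scheme is usable for k errors only when `uncovered` returns an empty list; an optimization such as the MIP described above would then choose, among the valid schemes, one that minimizes search cost.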

arXiv.org e-Print Archive

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

Author: Torkamandi Bahman
Publication date: 07/01/2021

The objective of the research in this dissertation is to derive optimal search schemes for approximate string matching using a bidirectional FM-index, and to use them to increase the speed of such searches. This problem arises in computer science with many applications. The approximate string matching problem is also central in bioinformatics, where biologists are interested in aligning pieces of DNA back to a genome. Given a text, the search for a given pattern can be accelerated by preprocessing the text, either by constructing a hash table or by indexing the text. Bidirectional indices have opened new possibilities by allowing a search to start from anywhere within the pattern and extend in both directions. In particular, the use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Prior work tends to use search heuristics but lacks the ability to find the best strategies for using an index to search for a pattern. In this dissertation, we will find the optimal search scheme for the approximate string matching problem for a bidirectional index, under the assumption that the number of partitions is given. Moreover, we will investigate the computational gain from applying these optimal search schemes to search in a bidirectional FM-index.
Intellectual Merit. First, we propose an MIP formulation to find the optimal search scheme for the approximate string matching problem using a bidirectional index under Hamming distance error. Second, we demonstrate that our MIP can solve the optimum search scheme problem to optimality in a reasonable amount of time for input parameters of considerable size, and enjoys very quick convergence to optimal or near-optimal solutions for input parameters of larger size. Third, we show that approximate search in a bidirectional FM-index can be performed significantly faster if the optimal schemes obtained from our MIP are used. This is demonstrated based on the number of edges in the search tries as well as the actual running time of in-index search for Illumina DNA sequencing reads (up to 35 times faster than standard backtracking for 3 errors). Although our MIP solutions are for Hamming distance, they perform equally well for edit distance. Fourth, we demonstrate that our optimal search schemes are superior to the best in-index aligners for 2 and 3 errors. To get a glimpse of the potential of combining our optimal search schemes with in-text verification, we combine an optimal search scheme and in-text verification for Hamming distance. This experiment halved the running time for reads of size 101 and 125. Furthermore, we showcase the power of our optimal search schemes by demonstrating that, for 1 to 3 errors, approximate string matching of reads of size 40, 101, and 125 performed completely in the index competes in running time with the best full-fledged aligners, which benefit from combining search in the index with in-text verification for edit distance. Moreover, we will relax the assumption of equal-size partitions in our MIP and address the more general form of the approximate string matching problem, where the only assumption is the prespecified number of partitions. We will present an MIP formulation for edit distance and provide an alternative formulation for Hamming distance.
Broader Impacts. The results of this research promise a significant increase in the speed of finding approximate occurrences of a pattern in a text. This is an important problem with many applications in bioinformatics and computer science, such as recovering text in signal processing and information retrieval [23]. Approximate string matching plays an indisputable role in bioinformatics, where any downstream analysis of genomic data starts with aligning sequenced DNA or RNA reads back to a reference genome. Technologies such as next-generation sequencing have produced a considerable amount of data, leading to increasing demand for fast read aligners that map DNA pieces to a genome. To solve this central problem, one can treat the genome of any species of interest as the "text" and the sequenced pieces of DNA as the "patterns", and then search for approximate occurrences of a pattern in a text using a full-text index. Some tolerance for errors is required because of mutations in the genome of each individual organism, such as single nucleotide variants (SNVs), as well as errors in sequencing technologies. This broad spectrum of applications indicates the significant impact of this research on many areas of health and life sciences and practice, where discovery, diagnosis, and treatment all depend on genome sequencing.

Texas A&M Repository

Optimal-Time Text Indexing in BWT-runs Bounded Space

Author: Gagie Travis
Navarro Gonzalo
Prezza Nicola
Publication date: 11/07/2017

Indexing highly repetitive texts (such as genomic databases, software repositories and versioned text collections) has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. Since then, a number of other indexes with space bounded by other measures of repetitiveness (the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors) have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits. Within $O(r\log(n/r))$ space, our index can also count in optimal time $O(m)$. Raising the space to $O(r w\log_\sigma(n/r))$, we support count and locate in $O(m\log(\sigma)/w)$ and $O(m\log(\sigma)/w+occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r\log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r)+\ell\log(\sigma)/w)$. (...continues...
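To make the measure r concrete, here is a minimal sketch that builds the BWT of a short string by sorting its rotations and counts the runs of equal symbols. It is a quadratic-space teaching aid, not the index described in the abstract; the function names `bwt` and `count_runs` and the `$` sentinel (assumed to sort before every text symbol) are assumptions of this sketch.

```python
from itertools import groupby

def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (naive, for illustration only).

    Assumes the sentinel does not occur in the text and sorts before all of its symbols.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def count_runs(s):
    """Number of maximal runs of equal consecutive symbols, i.e. r when s is a BWT."""
    return sum(1 for _ in groupby(s))

if __name__ == "__main__":
    for text in ("banana", "abababab"):
        transformed = bwt(text)
        print(text, transformed, count_runs(transformed))
```

For the repetitive input `abababab` the transform is `bbbb$aaaa`, so r = 3 even though the string with its sentinel has 9 symbols; this gap between r and n is what space bounds expressed in terms of r exploit.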

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Online Research Database In Technology

Lossless seeds for searching short patterns with high error rates

Author: Salson Mikaël
Touzet Hélène
Vroland Christophe
Publication venue: HAL CCSD
Publication date: 01/10/2014

We address the problem of approximate pattern matching using the Levenshtein distance. Given a text T and a pattern P, find all locations in T that differ by at most k errors from P. For that purpose, we propose a filtration algorithm that is based on a novel type of seeds, combining exact parts and parts with a fixed number of errors. Experimental tests show that the method is particularly well suited to short patterns with a large number of errors.
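The problem statement above (all locations in T within k Levenshtein errors of P) can be illustrated with the classic column-wise dynamic program often attributed to Sellers. This is a baseline sketch only, not the seed-based filtration method the abstract proposes; the function name `approx_end_positions` and the choice to report 0-based end positions are assumptions of this sketch.

```python
def approx_end_positions(text, pattern, k):
    """Return 0-based positions j where some substring of text ending at j
    is within Levenshtein distance k of pattern."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any substring
    # of text ending at the current text position.
    col = list(range(m + 1))
    hits = []
    for j, symbol in enumerate(text):
        prev_diag = col[0]  # value of col[i - 1] in the previous column
        col[0] = 0          # a match may start at any text position for free
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == symbol else 1
            best = min(
                prev_diag + cost,  # match or substitution
                col[i] + 1,        # extra text symbol (insertion)
                col[i - 1] + 1,    # missing pattern symbol (deletion)
            )
            prev_diag = col[i]
            col[i] = best
        if col[m] <= k:
            hits.append(j)
    return hits

if __name__ == "__main__":
    print(approx_end_positions("abcdefg", "cde", 0))
    print(approx_end_positions("xxabdxx", "abcd", 1))
```

This scan costs time proportional to the product of the text and pattern lengths, which is the kind of cost a filtration approach tries to avoid by first discarding text regions that cannot contain a match.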

HAL - Lille 3

INRIA a CCSD electronic archive server