
    Exact Online String Matching Bibliography

    In this short note we present a comprehensive bibliography for the online exact string matching problem. The problem consists in finding all occurrences of a given pattern in a text. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Since 1970 more than 120 string matching algorithms have been proposed. In this note we present a comprehensive list of (almost) all string matching algorithms. The list is updated to May 2016.
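As a baseline for the problem all of these algorithms improve on, a naive scanner that reports every (possibly overlapping) occurrence of the pattern can be sketched in a few lines of Python:

```python
def find_occurrences(pattern: str, text: str) -> list[int]:
    """Return the starting indices of all occurrences of pattern in text,
    including overlapping ones. O(n*m) worst case; the surveyed algorithms
    improve on this bound."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    return [i for i in range(m - n + 1) if text[i:i + n] == pattern]

print(find_occurrences("ana", "bananas"))  # overlapping matches: [1, 3]
```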

    Fast Algorithms for Exact String Matching

    Given a pattern string P of length n and a query string T of length m, where the characters of P and T are drawn from an alphabet of size Δ, the exact string matching problem consists of finding all occurrences of P in T. For this problem, we present algorithms that in O(nΔ²) time pre-process P to essentially identify sparse(P), a rarely occurring substring of P, and then use it to find occurrences of P in T efficiently. Our algorithms require a worst case search time of O(m), and expected search time of O(m / min(|sparse(P)|, Δ)), where |sparse(P)| is at least δ (the number of distinct characters in P), and for most pattern strings it is observed to be Ω(n^(1/2)).
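The paper's sparse(P) construction is more involved, but the underlying idea, anchoring the scan on a rarely occurring part of the pattern and verifying full matches only around it, can be sketched as follows (using the pattern's least frequent character as a crude stand-in for sparse(P); this simplification is an assumption, not the paper's algorithm):

```python
from collections import Counter

def rarest_char(pattern: str) -> str:
    # Pick the pattern character expected to occur least often, using the
    # pattern's own character frequencies as a stand-in for sparse(P).
    counts = Counter(pattern)
    return min(counts, key=counts.get)

def anchored_search(pattern: str, text: str) -> list[int]:
    """Scan for the rare anchor character only, then verify full matches
    around each anchor hit."""
    c = rarest_char(pattern)
    offset = pattern.index(c)
    hits = []
    i = text.find(c)
    while i != -1:
        start = i - offset
        if start >= 0 and text[start:start + len(pattern)] == pattern:
            hits.append(start)
        i = text.find(c, i + 1)
    return hits
```

The expected speedup comes from skipping verification at every position where the anchor character does not occur.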

    A Fast Heuristic for Exact String Matching

    Given a pattern string P of length n consisting of δ distinct characters and a query string T of length m, where the characters of P and T are drawn from an alphabet Σ of size Δ, the exact string matching problem consists of finding all occurrences of P in T. For this problem, we present a randomized heuristic that in O(nδ) time preprocesses P to identify sparse(P), a rarely occurring substring of P, and then uses it to find all occurrences of P in T efficiently. This heuristic has an expected search time of O(m / min(|sparse(P)|, Δ)), where |sparse(P)| is at least δ. We also show that for a pattern string P whose characters are chosen uniformly at random from an alphabet of size Δ, E[|sparse(P)|] is Ω(Δ log(2Δ / (2Δ − δ))).

    On the Average-case Complexity of Pattern Matching with Wildcards

    Pattern matching with wildcards is the problem of finding all factors of a text t of length n that match a pattern x of length m, where wildcards (characters that match everything) may be present. In this paper we present a number of fast average-case algorithms for pattern matching where wildcards are restricted to either the pattern or the text; however, the results are easily adapted to the case where wildcards are allowed in both. We analyse the average-case complexity of these algorithms and show the first non-trivial time bounds. These are the first results on the average-case complexity of pattern matching with wildcards which, as a by-product, provide the first provable separation in complexity between exact pattern matching and pattern matching with wildcards in the word RAM model.
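A straightforward quadratic matcher for the restricted case where wildcards appear only in the pattern looks like this (the single-character wildcard symbol `?` is an assumption for illustration; the paper's algorithms beat this O(nm) bound on average):

```python
def wildcard_matches(pattern: str, text: str, wildcard: str = "?") -> list[int]:
    """Find all factors of text matching pattern, where the wildcard
    character in the pattern matches any single text character.
    Naive O(n*m) verification at every alignment."""
    m, n = len(pattern), len(text)
    return [
        i for i in range(n - m + 1)
        if all(p == wildcard or p == text[i + j]
               for j, p in enumerate(pattern))
    ]

print(wildcard_matches("a?a", "abacada"))  # [0, 2, 4]
```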

    New algorithms for binary jumbled pattern matching

    Given a pattern P and a text T, both strings over a binary alphabet, the binary jumbled string matching problem consists in telling whether any permutation of P occurs in T. The indexed version of this problem, i.e., preprocessing a string to efficiently answer such permutation queries, is hard and has been studied in the last few years. Currently the best bounds for this problem are O(n²/log² n) (with O(n) space and O(1) query time) and O(r² log r) (with O(|L|) space and O(log |L|) query time), where r is the length of the run-length encoding of T and |L| = O(n) is the size of the index. In this paper we present new results for this problem. Our first result is an alternative construction of the index by Badkobeh et al. that obtains a trade-off between the space and the time complexity. It has O(r² log k + n/k) complexity to build the index, O(log k) query time, and uses O(n/k + |L|) space, where k is a parameter. The second result is an O(n² log² w / w) algorithm (with O(n) space and O(1) query time), based on word-level parallelism, where w is the word size in bits.
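The non-indexed version of binary jumbled matching reduces to a sliding-window count: some permutation of P occurs in T if and only if some window of T of length |P| contains exactly as many 1s as P does. A linear-time sketch of that reduction:

```python
def jumbled_match(pattern: str, text: str) -> bool:
    """Binary jumbled matching without an index: slide a window of
    length |pattern| over text, maintaining its count of 1s, and
    compare against the pattern's count of 1s. O(n) time."""
    m = len(pattern)
    if m > len(text):
        return False
    target = pattern.count("1")
    ones = text[:m].count("1")
    if ones == target:
        return True
    for i in range(m, len(text)):
        # Update the count incrementally: add the entering character,
        # drop the leaving one.
        ones += (text[i] == "1") - (text[i - m] == "1")
        if ones == target:
            return True
    return False
```

The indexed problem studied in the paper is harder precisely because it must answer such queries for every pattern length without rescanning T.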

    A Hybrid Parallel Implementation of the Aho-Corasick and Wu-Manber Algorithms Using NVIDIA CUDA and MPI Evaluated on a Biological Sequence Database

    Multiple matching algorithms are used to locate the occurrences of patterns from a finite pattern set in a large input string. Aho-Corasick and Wu-Manber, two of the most well-known algorithms for multiple matching, require increased computing power, particularly in cases where large-size datasets must be processed, as is common in computational biology applications. Over the past years, Graphics Processing Units (GPUs) have evolved into powerful parallel processors, outperforming Central Processing Units (CPUs) in scientific calculations. Moreover, multiple GPUs can be used in parallel, forming hybrid computer cluster configurations to achieve an even higher processing throughput. This paper evaluates the speedup of the parallel implementation of the Aho-Corasick and Wu-Manber algorithms on a hybrid GPU cluster when used to process a snapshot of the Expressed Sequence Tags of the human genome, for different problem parameters.
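For reference, a compact sequential sketch of the Aho-Corasick automaton (a trie over the patterns plus BFS-computed failure links); the paper's contribution is parallelising such matching across GPUs and cluster nodes, not the automaton itself:

```python
from collections import deque

def build_automaton(patterns):
    """Build the Aho-Corasick automaton: goto is a trie as a list of
    dicts, fail holds failure links, out holds patterns ending at
    each state (including those inherited via failure links)."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(p)
    # BFS from the root's children to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
    return goto, fail, out

def ac_search(patterns, text):
    """Report (position, pattern) for every occurrence of every pattern,
    in a single pass over the text."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)
```

The single-pass structure is what makes the algorithm attractive for GPU work partitioning: the text can be split into overlapping chunks scanned independently.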

    New Error Tolerant Method to Search Long Repeats in Symbol Sequences

    A new method to identify all sufficiently long repeating substrings in one or several symbol sequences is proposed. The method is based on a specific gauge applied to symbol sequences that guarantees identification of the repeating substrings. It allows matched substrings to contain a given level of errors. The gauge is based on the development of a heavily sparse dictionary of repeats, thus drastically accelerating the search procedure. Some genomic applications illustrate the method. This paper is the extended and detailed version of the presentation at the Third International Conference on Algorithms for Computational Biology, held in Trujillo, Spain, June 21-22, 2016.

    Multiple pattern matching revisited

    We consider the classical exact multiple string matching problem. Our solution is based on q-grams combined with pattern superimposition, bit-parallelism and alphabet size reduction. We discuss the pros and cons of the various alternatives for how to achieve the best combination. Our method is closely related to previous work by Salmela et al. (2006). The experimental results show that our method performs well on different alphabet sizes and that it scales to large pattern sets.
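The q-gram filtering idea can be illustrated without the bit-parallelism and superimposition refinements: index each pattern by its leading q-gram, then verify full matches only at text positions whose q-gram is indexed (this simplified scheme is an illustration, not the paper's method, and assumes every pattern has length at least q):

```python
def qgram_filter_search(patterns, text, q=3):
    """Multiple exact matching via q-gram filtering: a dict maps each
    pattern's first q characters to the patterns starting with them;
    the scan verifies candidates only where the text q-gram is indexed."""
    index = {}
    for p in patterns:
        index.setdefault(p[:q], []).append(p)
    hits = []
    for i in range(len(text) - q + 1):
        for p in index.get(text[i:i + q], []):
            if text[i:i + len(p)] == p:
                hits.append((i, p))
    return hits
```

Most text positions hit an empty dict bucket and are skipped, which is where the filtering speedup comes from on larger alphabets.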

    End-to-End Entity Resolution for Big Data: A Survey

    One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods, in order to cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with a detailed presentation of open research directions.
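The indexing (blocking) step of an ER workflow can be illustrated with a minimal sketch: group entity descriptions under a cheap key so that only records in the same block are ever compared. The blocking key used here is a toy assumption; real systems use far more robust keys and learned matchers:

```python
def block_by_key(records, key):
    """Blocking: partition entity descriptions by a cheap key function,
    so the quadratic comparison step runs only within each block."""
    blocks = {}
    for r in records:
        blocks.setdefault(key(r), []).append(r)
    return blocks

def candidate_pairs(blocks):
    """Yield every within-block pair; these are the only pairs the
    matching step needs to examine."""
    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                yield group[i], group[j]
```

With a good key, the number of candidate pairs drops from quadratic in the dataset size to roughly linear, which is what makes ER feasible at Big Data scale.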

    Parallel decompression of gzip-compressed files and random access to DNA sequences

    Decompressing a file made by the gzip program at an arbitrary location is in principle impossible, due to the nature of the DEFLATE compression algorithm. Consequently, no existing program can take advantage of parallelism to rapidly decompress large gzip-compressed files. This is an unsatisfactory bottleneck, especially for the analysis of large sequencing data experiments. Here we propose a parallel algorithm and an implementation, pugz, that performs fast and exact decompression of any text file. We show that pugz is an order of magnitude faster than gunzip, and 5x faster than a highly-optimized sequential implementation (libdeflate). We also study the related problem of random access to compressed data. We give simple models and experimental results that shed light on the structure of gzip-compressed files containing DNA sequences. Preliminary results show that random access to sequences within a gzip-compressed FASTQ file is almost always feasible at low compression levels, yet is approximate at higher compression levels.