Exact Online String Matching Bibliography
In this short note we present a comprehensive bibliography for the online
exact string matching problem. The problem consists in finding all occurrences
of a given pattern in a text. It is an extensively studied problem in computer
science, mainly due to its direct applications to such diverse areas as text,
image and signal processing, speech analysis and recognition, data compression,
information retrieval, computational biology and chemistry. Since 1970 more
than 120 string matching algorithms have been proposed. In this note we present
a comprehensive list of (almost) all string matching algorithms. The list is
updated to May 2016. Comment: 23 pages
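As a point of reference for the problem stated in this entry, the following is a minimal Python sketch of the naive approach: try every alignment of the pattern against the text. It is not one of the algorithms catalogued in the bibliography, and the function name find_all is hypothetical.

```python
def find_all(pattern: str, text: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text.

    Overlapping occurrences are reported. This is the naive baseline
    (worst case proportional to len(pattern) * len(text)), shown only
    to make the problem statement concrete.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    hits = []
    for i in range(n - m + 1):
        # Compare the pattern against the text window starting at i.
        if text[i:i + m] == pattern:
            hits.append(i)
    return hits
```

For example, find_all("aba", "ababa") reports both overlapping occurrences, at positions 0 and 2.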
Fast Algorithms for Exact String Matching
Given a pattern string of length and a query string of length
, where the characters of and are drawn from an alphabet of size
, the exact string matching problem consists of finding all
occurrences of in . For this problem, we present algorithms that in
time pre-process to essentially identify , a
rarely occurring substring of , and then use it to find occurrences of
in efficiently. Our algorithms require a worst case search time of ,
and expected search time of , where
is at least (i.e. the number of distinct characters in
), and for most pattern strings it is observed to be
A Fast Heuristic for Exact String Matching
Given a pattern string of length consisting of distinct
characters and a query string of length , where the characters of
and are drawn from an alphabet of size , the exact
string matching problem consists of finding all occurrences of in . For
this problem, we present a randomized heuristic that in time
preprocesses to identify , a rarely occurring substring of ,
and then uses it to find all occurrences of in efficiently. This
heuristic has an expected search time of , where is at least . We also show that for a
pattern string whose characters are chosen uniformly at random from an
alphabet of size , is . Comment: arXiv admin note: substantial text overlap with arXiv:1509.0922
On the Average-case Complexity of Pattern Matching with Wildcards
Pattern matching with wildcards is the problem of finding all factors of a
text of length that match a pattern of length , where wildcards
(characters that match everything) may be present. In this paper we present a
number of fast average-case algorithms for pattern matching where wildcards are
restricted to either the pattern or the text; however, the results are easily
adapted to the case where wildcards are allowed in both. We analyse the
average-case complexity of these algorithms and show the first
non-trivial time bounds. These are the first results on the average-case
complexity of pattern matching with wildcards which, as a by-product, provide
the first provable separation in complexity between exact pattern matching and
pattern matching with wildcards in the word RAM model.
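To make the problem definition in this entry concrete, here is a minimal Python sketch of naive matching with wildcards allowed in both the pattern and the text. It is a quadratic baseline, not one of the average-case algorithms the paper presents. The wildcard symbol "?" and the function name match_with_wildcards are assumptions made for illustration.

```python
WILDCARD = "?"  # assumed wildcard symbol; the abstract does not fix one


def match_with_wildcards(pattern: str, text: str) -> list[int]:
    """Return every start position where pattern matches a factor of text.

    A wildcard on either side matches any single character.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    hits = []
    for i in range(n - m + 1):
        window = text[i:i + m]
        # Every aligned pair must agree or involve a wildcard.
        if all(p == WILDCARD or t == WILDCARD or p == t
               for p, t in zip(pattern, window)):
            hits.append(i)
    return hits
```

Restricting wildcards to the pattern only (or the text only) corresponds to passing input in which the other string contains no "?" characters.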
New algorithms for binary jumbled pattern matching
Given a pattern and a text , both strings over a binary alphabet, the
binary jumbled string matching problem consists in telling whether any
permutation of occurs in . The indexed version of this problem, i.e.,
preprocessing a string to efficiently answer such permutation queries, is hard
and has been studied in the last few years. Currently the best bounds for this
problem are (with O(n) space and O(1) query time) and
(with O(|L|) space and query time), where is
the length of the run-length encoding of and is the size of
the index. In this paper we present new results for this problem. Our first
result is an alternative construction of the index by Badkobeh et al. that
obtains a trade-off between the space and the time complexity. It has
complexity to build the index, query time, and
uses space, where is a parameter. The second result is an
algorithm (with O(n) space and O(1) query time), based on
word-level parallelism where is the word size in bits
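The O(n)-space, O(1)-query index mentioned in this entry rests on a standard property of binary strings: sliding a window of fixed length by one position changes its number of 1s by at most one, so the achievable counts for each window length form an interval. The Python sketch below stores the minimum and maximum count per length. It uses the naive quadratic construction, not the faster constructions described in the abstract, and the names build_index and jumbled_occurs are hypothetical.

```python
def build_index(text: str) -> tuple[list[int], list[int]]:
    """For each window length m, record the min and max number of 1s.

    Naive O(n^2) construction over a string of '0' and '1' characters;
    the index itself takes O(n) space.
    """
    n = len(text)
    prefix = [0] * (n + 1)  # prefix[i] = number of 1s in text[:i]
    for i, ch in enumerate(text):
        prefix[i + 1] = prefix[i] + (ch == "1")
    min_ones = [0] * (n + 1)
    max_ones = [0] * (n + 1)
    for m in range(1, n + 1):
        counts = [prefix[i + m] - prefix[i] for i in range(n - m + 1)]
        min_ones[m] = min(counts)
        max_ones[m] = max(counts)
    return min_ones, max_ones


def jumbled_occurs(index: tuple[list[int], list[int]],
                   zeros: int, ones: int) -> bool:
    """O(1) query: does some permutation of 0^zeros 1^ones occur?"""
    min_ones, max_ones = index
    m = zeros + ones
    if m == 0:
        return True
    if m >= len(min_ones):  # pattern longer than the indexed text
        return False
    # Interval property: every count between min and max is achieved.
    return min_ones[m] <= ones <= max_ones[m]
```

For the text "0110", a pattern with one 0 and one 1 occurs (as "01" or "10"), while a pattern with two 0s does not.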
A Hybrid Parallel Implementation of the Aho-Corasick and Wu-Manber Algorithms Using NVIDIA CUDA and MPI Evaluated on a Biological Sequence Database
Multiple matching algorithms are used to locate the occurrences of patterns
from a finite pattern set in a large input string. Aho-Corasick and Wu-Manber,
two of the best-known algorithms for multiple matching, require
increased computing power, particularly in cases where large-size datasets must
be processed, as is common in computational biology applications. Over the past
years, Graphics Processing Units (GPUs) have evolved into powerful parallel
processors outperforming Central Processing Units (CPUs) in scientific
calculations. Moreover, multiple GPUs can be used in parallel, forming hybrid
computer cluster configurations to achieve an even higher processing
throughput. This paper evaluates the speedup of the parallel implementation of
the Aho-Corasick and Wu-Manber algorithms on a hybrid GPU cluster, when used to
process a snapshot of the Expressed Sequence Tags of the human genome and for
different problem parameters
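For readers unfamiliar with multiple matching, the following is a compact, sequential Python sketch of the Aho-Corasick automaton. It is a single-threaded CPU illustration only and does not reflect the CUDA/MPI implementation evaluated in the paper. The function names build_automaton and search are hypothetical.

```python
from collections import deque


def build_automaton(patterns: list[str]):
    """Build trie transitions, failure links and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # indices of patterns recognised at each state
    for idx, pat in enumerate(patterns):
        if not pat:
            continue  # empty patterns are ignored in this sketch
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    # Breadth-first traversal so shallower failure links are ready first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(patterns: list[str], text: str) -> list[tuple[int, str]]:
    """Return (start position, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    hits = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits
```

The text is scanned once regardless of how many patterns are in the set, which is the property that makes this family of algorithms attractive for large pattern sets.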
New Error Tolerant Method to Search Long Repeats in Symbol Sequences
A new method to identify all sufficiently long repeating substrings in one or
several symbol sequences is proposed. The method is based on a specific gauge
applied to symbol sequences that guarantees identification of the repeating
substrings. It allows matched substrings to contain a given level of
errors. The gauge is based on the development of a highly sparse dictionary of
repeats, thus drastically accelerating the search procedure. Some genomic
applications illustrate the method.
This paper is the extended and detailed version of the presentation at the
third International Conference on Algorithms for Computational Biology to be
held at Trujillo, Spain, June 21-22, 2016. Comment: 13 pages, 4 figures
Multiple pattern matching revisited
We consider the classical exact multiple string matching problem. Our
solution is based on q-grams combined with pattern superimposition,
bit-parallelism and alphabet size reduction. We discuss the pros and cons of
the various alternatives for achieving the best combination. Our method is
closely related to previous work by Salmela et al. (2006). The experimental
results show that our method performs well on different alphabet sizes and that
it scales to large pattern sets.
End-to-End Entity Resolution for Big Data: A Survey
One of the most important tasks for improving data quality and the
reliability of data analytics results is Entity Resolution (ER). ER aims to
identify different descriptions that refer to the same real-world entity, and
remains a challenging problem. While previous works have studied specific
aspects of ER (and mostly in traditional settings), in this survey, we provide
for the first time an end-to-end view of modern ER workflows, and of the novel
aspects of entity indexing and matching methods in order to cope with more than
one of the Big Data characteristics simultaneously. We present the basic
concepts, processing steps and execution strategies that have been proposed by
different communities, i.e., database, semantic Web and machine learning, in
order to cope with the loose structuredness, extreme diversity, high speed and
large scale of entity descriptions used by real-world applications. Finally, we
provide a synthetic discussion of the existing approaches, and conclude with a
detailed presentation of open research directions
Parallel decompression of gzip-compressed files and random access to DNA sequences
Decompressing a file made by the gzip program at an arbitrary location is in
principle impossible, due to the nature of the DEFLATE compression algorithm.
Consequently, no existing program can take advantage of parallelism to rapidly
decompress large gzip-compressed files. This is an unsatisfactory bottleneck,
especially for the analysis of large sequencing data experiments. Here we
propose a parallel algorithm and an implementation, pugz, that performs fast
and exact decompression of any text file. We show that pugz is an order of
magnitude faster than gunzip, and 5x faster than a highly-optimized sequential
implementation (libdeflate). We also study the related problem of random access
to compressed data. We give simple models and experimental results that shed
light on the structure of gzip-compressed files containing DNA sequences.
Preliminary results show that random access to sequences within a
gzip-compressed FASTQ file is almost always feasible at low compression levels,
yet is approximate at higher compression levels. Comment: HiCOMB'1