Improved algorithms for string searching problems
We present improved practically efficient algorithms for several string searching problems, where we search for a short string called the pattern in a longer string called the text. We are mainly interested in the online problem, where the text is not preprocessed, but we also present a light indexing approach to speed up exact searching of a single pattern. The new algorithms can be applied e.g. to many problems in bioinformatics and other content scanning and filtering problems.
In addition to exact string matching, we develop algorithms for several other variations of the string matching problem. We study algorithms for approximate string matching, where a limited number of errors is allowed in the occurrences of the pattern, and parameterized string matching, where a substring of the text matches the pattern if the characters of the substring can be renamed in such a way that the renamed substring matches the pattern exactly. We also consider searching multiple patterns simultaneously and searching weighted patterns, where the weight of a character at a given position reflects the probability of that character occurring at that position.
Many of the new algorithms use the backward matching principle, where the characters of the text that are aligned with the pattern are read backward, i.e. from right to left. Another common characteristic of the new algorithms is the use of q-grams, i.e. q consecutive characters are handled as a single character. Many of the new algorithms are also bit parallel, i.e. they pack several variables into a single computer word and update all of them with a single instruction.
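To illustrate the backward matching principle, here is a minimal sketch of the classic Horspool algorithm, which compares each alignment from right to left and shifts by the last character of the window. It is not one of the algorithms from the abstract: it uses single characters (q = 1) rather than q-grams, and the function name `horspool` is chosen here for illustration only.

```python
def horspool(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift table: distance from the rightmost occurrence of each character
    # (excluding the last pattern position) to the end of the pattern.
    shift = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    matches = []
    i = 0
    while i <= n - m:
        # Read the aligned text characters backward, from right to left.
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(i)
        # Shift by the character under the last position of the window;
        # characters absent from the table allow a full shift of m.
        i += shift.get(text[i + m - 1], m)
    return matches
```

For example, `horspool("abracadabra", "abra")` inspects only three alignments and reports the occurrences at positions 0 and 7.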
We show that the q-gram backward string matching algorithms that solve the exact, approximate, or multiple string matching problems are optimal on average. We also show that the q-gram backward string matching algorithm for the parameterized string matching problem is sublinear on average for a class of moderately repetitive patterns. All the presented algorithms are also shown to be fast in practice when compared to earlier algorithms.
We also propose an alphabet sampling technique to speed up exact string matching. We choose a subset of the alphabet and select the corresponding subsequence of the text. String matching is then performed on this reduced subsequence, and the matches found are verified in the original text. We show how to choose the sampled alphabet optimally and show that the technique speeds up string matching, especially for moderate to long patterns.
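The alphabet sampling idea can be sketched as follows. This is a simplified illustration, not the authors' implementation: the sampled alphabet is passed in by the caller rather than chosen optimally, the sampled text is rebuilt on every call instead of being stored as a light index, and the name `sampled_search` is hypothetical.

```python
def sampled_search(text, pattern, sampled):
    """Find pattern in text by searching only the characters in `sampled`."""
    # Positions of sampled characters in the text, and the reduced text itself.
    positions = [i for i, c in enumerate(text) if c in sampled]
    sampled_text = "".join(text[i] for i in positions)
    sampled_pattern = "".join(c for c in pattern if c in sampled)
    if not sampled_pattern:
        # No sampled character in the pattern: fall back to a plain scan.
        return [i for i in range(len(text) - len(pattern) + 1)
                if text.startswith(pattern, i)]
    # Offset of the first sampled character inside the pattern.
    first = next(j for j, c in enumerate(pattern) if c in sampled)
    matches = []
    start = sampled_text.find(sampled_pattern)
    while start != -1:
        # Map the candidate back to the original text and verify it there.
        candidate = positions[start] - first
        if candidate >= 0 and text.startswith(pattern, candidate):
            matches.append(candidate)
        start = sampled_text.find(sampled_pattern, start + 1)
    return matches
```

With `text = "abracadabra"`, `pattern = "abra"` and `sampled = {"b", "r"}`, the search runs over the reduced text `"brbr"` and verification confirms the occurrences at positions 0 and 7.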
String Matching: Communication, Circuits, and Learning
String matching is the problem of deciding whether a given n-bit string contains a given k-bit pattern. We study the complexity of this problem in three settings.
- Communication complexity. For small k, we provide near-optimal upper and lower bounds on the communication complexity of string matching. For large k, our bounds leave open an exponential gap; we exhibit some evidence for the existence of a better protocol.
- Circuit complexity. We present several upper and lower bounds on the size of circuits with threshold and DeMorgan gates solving the string matching problem. Similarly to the above, our bounds are near-optimal for small k.
- Learning. We consider the problem of learning a hidden pattern of length at most k relative to the classifier that assigns 1 to every string that contains the pattern. We prove optimal bounds on the VC dimension and sample complexity of this problem.
Approximate Hamming distance in a stream
We consider the problem of computing a -approximation of the
Hamming distance between a pattern of length and successive substrings of a
stream. We first look at the one-way randomised communication complexity of
this problem, giving Alice the first half of the stream and Bob the second
half. We show the following: (1) If Alice and Bob both share the pattern then
there is an bit randomised one-way communication
protocol. (2) If only Alice has the pattern then there is an
bit randomised one-way communication protocol.
We then go on to develop small space streaming algorithms for
-approximate Hamming distance which give worst case running time
guarantees per arriving symbol. (1) For binary input alphabets there is an
space and
time streaming -approximate Hamming distance algorithm. (2) For
general input alphabets there is an
space and time streaming
-approximate Hamming distance algorithm. Comment: Submitted to ICALP' 201
Parameterized Matching in the Streaming Model
We study the problem of parameterized matching in a stream where we want to
output matches between a pattern of length m and the last m symbols of the
stream before the next symbol arrives. Parameterized matching is a natural
generalisation of exact matching where an arbitrary one-to-one relabelling of
pattern symbols is allowed. We show how this problem can be solved in constant
time per arriving stream symbol and sublinear, near optimal space with high
probability. Our results are surprising and important: it has been shown that
almost no streaming pattern matching problems can be solved (not even
randomised) in less than Theta(m) space, with exact matching as the only known
problem to have a sublinear, near optimal space solution. Here we demonstrate
that a similar sublinear, near optimal space solution is achievable for an even
more challenging problem. The proof is considerably more complex than that for
exact matching. Comment: 19 pages, 3 figures
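To make the definition of parameterized matching concrete, the following is a naive offline sketch that checks every window for a one-to-one relabelling of the pattern symbols. It assumes all symbols may be relabelled, it uses time proportional to the product of the text and pattern lengths and stores the whole text, so it is not the streaming algorithm described above. The names `p_match` and `parameterized_search` are illustrative.

```python
def p_match(pattern, window):
    """True if some one-to-one relabelling of pattern symbols yields window."""
    if len(pattern) != len(window):
        return False
    forward, backward = {}, {}
    for x, y in zip(pattern, window):
        # Each pattern symbol must map to exactly one text symbol, and
        # each text symbol must come from exactly one pattern symbol.
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return True


def parameterized_search(text, pattern):
    """Start positions of all parameterized matches of pattern in text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if p_match(pattern, text[i:i + m])]
```

For instance, the pattern `"aba"` parameterized-matches `"xyx"` via a to x and b to y, but not `"zzy"`, since both a and b would have to map to z.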
COMPARATIVE ANALYSIS OF BIT-PARALLEL STRING PATTERN MATCHING ALGORITHMS FOR BIOLOGICAL SEQUENCES
Bit parallelism is the inherent parallelism of bit operations such as AND and OR inside a computer word. It plays an important role in string pattern matching and is well suited to the analysis of biological data. Recently developed bit-parallel string matching algorithms also help improve the efficiency of other string pattern matching algorithms. This paper discusses how some of these bit-parallel algorithms work and how they apply to biological sequences. It also shows, with examples, how bit parallelism can be used efficiently to address various matching problems in bioinformatics, such as the analysis of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and protein sequences. It can also serve as a useful guide for researchers looking for an appropriate method to use on biological sequences.
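As an illustration of bit parallelism, here is a minimal sketch of the classic Shift-And algorithm. The abstract does not say which algorithms the paper compares, so this choice is an assumption made for illustration. Python integers are unbounded, so the sketch does not model the word-size limit that a C implementation would face.

```python
def shift_and(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit j set when pattern[j] == c.
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << j)
    accept = 1 << (m - 1)
    state = 0  # bit j set: pattern[0..j] matches the text ending here
    matches = []
    for i, c in enumerate(text):
        # One shift, one OR and one AND update all m prefix states at once.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            matches.append(i - m + 1)
    return matches
```

On a DNA-like input, `shift_and("ACGTACGT", "CGT")` returns `[1, 5]`.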
Fitness sharing and niching methods revisited
Interest in multimodal function optimization is expanding rapidly, since real-world optimization problems often require the location of multiple optima in the search space. In this context, fitness sharing has been used widely to maintain population diversity and permit the investigation of many peaks in the feasible domain. This paper reviews various strategies of sharing and proposes new recombination schemes to improve its efficiency. Some empirical results are presented for both a high and a limited number of fitness function evaluations. Finally, the study compares the sharing method with other niching techniques.
The k-mismatch problem revisited
We revisit the complexity of one of the most basic problems in pattern
matching. In the k-mismatch problem we must compute the Hamming distance
between a pattern of length m and every m-length substring of a text of length
n, as long as that Hamming distance is at most k. Where the Hamming distance is
greater than k at some alignment of the pattern and text, we simply output
"No".
We study this problem in both the standard offline setting and also as a
streaming problem. In the streaming k-mismatch problem the text arrives one
symbol at a time and we must give an output before processing any future
symbols. Our main results are as follows:
1) Our first result is a deterministic time offline algorithm for k-mismatch on a text of length n. This is a
factor of k improvement over the fastest previous result of this form from SODA
2000 by Amihood Amir et al.
2) We then give a randomised and online algorithm which runs in the same time
complexity but requires only space in total.
3) Next we give a randomised -approximation algorithm for the
streaming k-mismatch problem which uses
space and runs in worst-case time per
arriving symbol.
4) Finally we combine our new results to derive a randomised
space algorithm for the streaming k-mismatch problem
which runs in worst-case time per
arriving symbol. This improves the best previous space complexity for streaming
k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We
also improve the time complexity of this previous result by an even greater
factor to match the fastest known offline algorithm (up to logarithmic
factors).
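The input and output of the k-mismatch problem can be made concrete with a naive baseline. The sketch below takes time proportional to the product of the text and pattern lengths and is not any of the algorithms from the abstract; the function name `k_mismatch` is illustrative.

```python
def k_mismatch(text, pattern, k):
    """For each alignment, the Hamming distance if it is at most k, else "No"."""
    m = len(pattern)
    results = []
    for i in range(len(text) - m + 1):
        distance = 0
        for a, b in zip(pattern, text[i:i + m]):
            if a != b:
                distance += 1
                if distance > k:
                    break  # no need to count beyond k mismatches
        results.append(distance if distance <= k else "No")
    return results
```

For example, `k_mismatch("abcabd", "abd", 1)` yields `[1, "No", "No", 0]`: the first and last alignments are within one mismatch, the middle two are not.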
LIPIcs, Volume 274, ESA 2023, Complete Volume
Image and interpretation using artificial intelligence to read ancient Roman texts
The ink and stylus tablets discovered at the Roman Fort of Vindolanda are a unique resource for scholars of ancient history. However, the stylus tablets have proved particularly difficult to read. This paper describes a system that assists expert papyrologists in the interpretation of the Vindolanda writing tablets. A model-based approach is taken that relies on models of the written form of characters, and statistical modelling of language, to produce plausible interpretations of the documents. The contributions from the language, character, and image feature models are fused using the GRAVA agent architecture, which uses Minimum Description Length as the basis for information fusion across semantic levels. A system is developed that reads in image data and outputs plausible interpretations of the Vindolanda tablets.