33,834 research outputs found
Average-Case Optimal Approximate Circular String Matching
Approximate string matching is the problem of finding all factors of a text t
of length n that are at a distance at most k from a pattern x of length m.
Approximate circular string matching is the problem of finding all factors of t
that are at a distance at most k from x or from any of its rotations. In this
article, we present a new algorithm for approximate circular string matching
under the edit distance model with optimal average-case search time O(n(k + log
m)/m). Optimal average-case search time can also be achieved by the algorithms
for multiple approximate string matching (Fredriksson and Navarro, 2004) using
x and its rotations as the set of multiple patterns. Here we reduce the
preprocessing time and space requirements compared to that approach
Prospects and limitations of full-text index structures in genome analysis
The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared
A new problem in string searching
We describe a substring search problem that arises in group presentation
simplification processes. We suggest a two-level searching model: skip and
match levels. We give two timestamp algorithms which skip searching parts of
the text where there are no matches at all and prove their correctness. At the
match level, we consider Harrison signature, Karp-Rabin fingerprint, Bloom
filter and automata based matching algorithms and present experimental
performance figures.Comment: To appear in Proceedings Fifth Annual International Symposium on
Algorithms and Computation (ISAAC'94), Lecture Notes in Computer Scienc
Increasing the power efficiency of Bloom filters for network string matching
Although software based techniques are widely accepted in computer security systems, there is a growing interest to utilize hardware opportunities in order to compensate for the network bandwidth increases. Recently, hardware based virus protection systems have started to emerge. These type of hardware systems work by identifying the malicious content and removing it from the network streams. In principle, they make use of string matching. Bit by bit, they compare the virus signatures with the bit strings in the network. The Bloom filters are ideal data structures for string matching. Nonetheless, they consume large power when many of them used in parallel to match different virus signatures. In this paper, we propose a new type of Bloom filter architecture which exploits well-known pipelining technique. © 2006 IEEE
A Bloom filter based semi-index on -grams
We present a simple -gram based semi-index, which allows to look for a
pattern typically only in a small fraction of text blocks. Several space-time
tradeoffs are presented. Experiments on Pizza & Chili datasets show that our
solution is up to three orders of magnitude faster than the Claude et al.
\cite{CNPSTjda10} semi-index at a comparable space usage
Lightweight Lempel-Ziv Parsing
We introduce a new approach to LZ77 factorization that uses O(n/d) words of
working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet
sizes). We also describe carefully engineered implementations of alternative
approaches to lightweight LZ77 factorization. Extensive experiments show that
the new algorithm is superior in most cases, particularly at the lowest memory
levels and for highly repetitive data. As a part of the algorithm, we describe
new methods for computing matching statistics which may be of independent
interest.Comment: 12 page
Dictionary Matching with One Gap
The dictionary matching with gaps problem is to preprocess a dictionary
of gapped patterns over alphabet , where each
gapped pattern is a sequence of subpatterns separated by bounded
sequences of don't cares. Then, given a query text of length over
alphabet , the goal is to output all locations in in which a
pattern , , ends. There is a renewed current interest
in the gapped matching problem stemming from cyber security. In this paper we
solve the problem where all patterns in the dictionary have one gap with at
least and at most don't cares, where and are
given parameters. Specifically, we show that the dictionary matching with a
single gap problem can be solved in either time and
space, and query time , where is the number
of patterns found, or preprocessing time and space: , and query
time , where is the number of patterns found.
As far as we know, this is the best solution for this setting of the problem,
where many overlaps may exist in the dictionary.Comment: A preliminary version was published at CPM 201
Recommended from our members
Fast Computation of the Fitness Function for Protein Folding Prediction in a 2D Hydrophobic-Hydrophilic Model
Protein Folding Prediction (PFP) is essentially an energy minimization problem formalised by the definition of a fitness function. Several PFP models have been proposed including the Hydrophobic-Hydrophilic (HP) model, which is widely used as a test-bed for evaluating new algorithms. The calculation of the fitness is the major computational task in determining the native conformation of a protein in the HP model and this paper presents a new efficient search algorithm (ESA) for deriving the fitness value requiring only O(n) complexity in contrast to the full search approach, which takes O(n2). The improved efficiency of ESA is achieved by exploiting some intrinsic properties of the HP model, with a resulting reduction of more than 50% in the overall time complexity when compared with the previously reported Caching Approach, with the added benefit that the additional space complexity is linear instead of quadratic
String Indexing with Compressed Patterns
Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern
- …