On-line construction of position heaps
We propose a simple linear-time on-line algorithm for constructing a position
heap for a string [Ehrenfeucht et al, 2011]. Our definition of position heap
differs slightly from the one proposed in [Ehrenfeucht et al, 2011] in that it
considers the suffixes ordered from left to right. Our construction is based on
classic suffix pointers and resembles Ukkonen's algorithm for suffix trees
[Ukkonen, 1995]. Using suffix pointers, the position heap can be extended into
the augmented position heap that allows for a linear-time string matching
algorithm [Ehrenfeucht et al, 2011]. Comment: to appear in Journal of Discrete Algorithms
Dictionary Matching with One Gap
The dictionary matching with gaps problem is to preprocess a dictionary
of gapped patterns over alphabet , where each
gapped pattern is a sequence of subpatterns separated by bounded
sequences of don't cares. Then, given a query text of length over
alphabet , the goal is to output all locations in in which a
pattern , , ends. There is renewed interest
in the gapped matching problem stemming from cyber security. In this paper we
solve the problem where all patterns in the dictionary have one gap with at
least and at most don't cares, where and are
given parameters. Specifically, we show that the dictionary matching with a
single gap problem can be solved in either time and
space, and query time , where is the number
of patterns found, or preprocessing time and space: , and query
time , where is the number of patterns found.
As far as we know, this is the best solution for this setting of the problem,
where many overlaps may exist in the dictionary. Comment: A preliminary version was published at CPM 201
Faster Approximate String Matching for Short Patterns
We study the classical approximate string matching problem, that is, given
strings and and an error threshold , find all ending positions of
substrings of whose edit distance to is at most . Let and
have lengths and , respectively. On a standard unit-cost word RAM with
word size we present an algorithm using time When is
short, namely, or this
improves the previously best known time bounds for the problem. The result is
achieved using a novel implementation of the Landau-Vishkin algorithm based on
tabulation and word-level parallelism. Comment: To appear in Theory of Computing Systems
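For orientation, the problem statement above can be illustrated with the textbook dynamic-programming baseline (often attributed to Sellers). This is not the tabulated Landau-Vishkin variant the abstract describes; it is only a minimal sketch of what "all ending positions within edit distance k" means. The function name `approx_end_positions` and the 0-based indexing are assumptions made for this illustration.

```python
def approx_end_positions(pattern, text, k):
    """Return 0-based indices j such that some substring of `text`
    ending at j has edit distance at most k to `pattern`.

    Classic O(len(pattern) * len(text)) column-by-column DP; a baseline,
    not the word-RAM algorithm from the abstract.
    """
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any substring
    # of the text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]   # old value of col[i - 1]
        col[0] = 0           # a match may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra text character
                      col[i - 1] + 1)    # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_end_positions("abc", "xxabcxx", 1))
```

With `k = 0` the function reduces to exact matching and reports only the position where `abc` ends.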
Efficient LZ78 factorization of grammar compressed text
We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight line program (SLP), which is
a context free grammar in the Chomsky normal form that generates a single
string. Given an SLP of size representing a text of length , our
algorithm computes the LZ78 factorization of in time
and space, where is the number of resulting LZ78 factors.
We also show how to improve the algorithm so that the term in the
time and space complexities becomes either , where is the length of the
longest LZ78 factor, or where is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of of a certain length. Since where
is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when is
constant, and can be more efficient when the text is compressible, i.e. when
and are small. Comment: SPIRE 201
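As a point of reference, the LZ78 factorization itself can be sketched on an uncompressed string with a dictionary-backed trie. This is the plain linear-scan baseline the abstract compares against, not the SLP-based algorithm; the function name `lz78_factorize` and the trie encoding are assumptions made for this illustration.

```python
def lz78_factorize(text):
    """Return the list of LZ78 factors of `text`.

    Each factor is the longest previously seen factor extended by one
    character; the final factor may be a repeat if the text runs out.
    """
    trie = {}        # (parent_node_id, char) -> child_node_id; root is 0
    factors = []
    node = 0         # current trie node while scanning a factor
    start = 0        # start index of the factor being scanned
    for i, c in enumerate(text):
        child = trie.get((node, c))
        if child is not None:
            node = child                         # keep extending the match
        else:
            trie[(node, c)] = len(factors) + 1   # new node = new factor id
            factors.append(text[start:i + 1])
            node = 0
            start = i + 1
    if start < len(text):
        factors.append(text[start:])             # leftover, already in trie
    return factors


print(lz78_factorize("abababab"))
```

Concatenating the returned factors always reproduces the input, which is a convenient sanity check.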
On the suitability of suffix arrays for Lempel-Ziv data compression
Lossless compression algorithms of the Lempel-Ziv (LZ) family are widely used nowadays. Regarding time and memory requirements, LZ encoding is much more demanding than decoding. In order to speed up the encoding process, efficient data structures, like suffix trees, have been used. In this paper, we explore the use of suffix arrays to hold the dictionary of the LZ encoder, and propose an algorithm to search over it. We show that the resulting encoder attains roughly the same compression ratios as those based on suffix trees. However, the amount of memory required by the suffix array is fixed, and much lower than the variable amount of memory used by encoders based on suffix trees (which depends on the text to encode). We conclude that suffix arrays, when compared to suffix trees in terms of the trade-off among time, memory, and compression ratio, may be preferable in scenarios (e.g., embedded systems) where memory is at a premium and high speed is not critical
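The abstract does not spell out its search algorithm, so the following is only a hedged sketch of one way a suffix array over the dictionary window could answer the encoder's longest-match query. It relies on the fact that the longest common prefix with any suffix is attained at a lexicographic neighbour of the query. The names `build_suffix_array` and `longest_match` are hypothetical, and the naive construction is chosen for clarity, not speed.

```python
import bisect


def build_suffix_array(window):
    """Naive suffix array: start positions sorted by suffix."""
    return sorted(range(len(window)), key=lambda i: window[i:])


def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def longest_match(window, sa, lookahead):
    """Return (length, position) of the longest prefix of `lookahead`
    that occurs in `window`; (0, 0) if no character matches."""
    # Suffixes are materialized for readability; a real encoder would
    # compare in place to keep memory fixed.
    suffixes = [window[i:] for i in sa]
    pos = bisect.bisect_left(suffixes, lookahead)
    best = (0, 0)
    for idx in (pos - 1, pos):   # only the two neighbours can be optimal
        if 0 <= idx < len(sa):
            length = common_prefix_len(suffixes[idx], lookahead)
            if length > best[0]:
                best = (length, sa[idx])
    return best


window = "abracadabra"
sa = build_suffix_array(window)
print(longest_match(window, sa, "abrax"))
```

The suffix array itself is one integer per window position, which matches the fixed memory footprint the abstract emphasizes.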
Vertically Recurrent Neural Networks for Sub‐Grid Parameterization
Machine learning has the potential to improve the physical realism and/or computational efficiency of parameterizations. A typical approach has been to feed concatenated vertical profiles to a dense neural network. However, feed‐forward networks lack the connections to propagate information sequentially through the vertical column. Here we examine if predictions can be improved by instead traversing the column with recurrent neural networks (RNNs) such as Long Short‐Term Memory networks (LSTMs). This method encodes physical priors (locality) and uses parameters more efficiently. Firstly, we test RNN‐based radiation emulators in the Integrated Forecasting System. We achieve near‐perfect offline accuracy, and the forecast skill of a suite of global weather simulations using the emulator is for the most part statistically indistinguishable from that of reference runs. But can radiation emulators provide both high accuracy and a speed‐up? We find optimized, state‐of‐the‐art radiation code on CPU generally faster than RNN‐based emulators on GPU, although the latter can be more energy efficient. To test the method more broadly, and explore recent challenges in parameterization, we also adapt it to data sets from other studies. RNNs outperform reference feed‐forward networks in emulating gravity waves, and when combined with horizontal convolutions, for non‐local unified parameterization. In emulation of moist physics with memory, the RNNs have similar offline accuracy to ResNets, the previous state‐of‐the‐art. However, the RNNs are more efficient, and more stable in autoregressive semi‐prognostic tests. Multi‐step autoregressive training improves performance in these tests and enables a latent representation of convective memory. Recently proposed linearly recurrent models achieve similar performance to LSTMs
Compact q-gram Profiling of Compressed Strings
We consider the problem of computing the q-gram profile of a string \str of
size compressed by a context-free grammar with production rules. We
present an algorithm that runs in expected time and uses
O(n+q+\kq) space, where is the exact number of characters
decompressed by the algorithm and \kq\leq N-\alpha is the number of distinct
q-grams in \str. This simultaneously matches the current best known time
bound and improves the best known space bound. Our space bound is
asymptotically optimal in the sense that any algorithm storing the grammar and
the q-gram profile must use \Omega(n+q+\kq) space. To achieve this we
introduce the q-gram graph that space-efficiently captures the structure of a
string with respect to its q-grams, and show how to construct it from a
grammar
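To make the objects in this abstract concrete, the sketch below computes a q-gram profile the naive way: fully decompress a small grammar, then count every length-q window. This is the baseline that the q-gram graph is meant to avoid, not the paper's algorithm. The rule encoding (a dict mapping each nonterminal to a pair of symbols) and the names `expand` and `qgram_profile` are assumptions made for this illustration.

```python
from collections import Counter


def expand(rules, symbol):
    """Decompress `symbol` under `rules`.

    `rules` maps a nonterminal to a (left, right) pair; any symbol that
    is not a key of `rules` is treated as a terminal character.
    """
    out = []
    stack = [symbol]
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)   # pushed first so that left is expanded first
            stack.append(left)
        else:
            out.append(s)
    return "".join(out)


def qgram_profile(s, q):
    """Count occurrences of every substring of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


# Hypothetical three-rule grammar generating "ababab".
rules = {"X1": ("a", "b"), "X2": ("X1", "X1"), "X3": ("X2", "X1")}
text = expand(rules, "X3")
print(text, dict(qgram_profile(text, 2)))
```

Full decompression touches every character of the string, which is exactly the cost the abstract's algorithm reduces by decompressing only part of it.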
A suffix tree or not a suffix tree?
In this paper we study the structure of suffix trees. Given an unlabeled tree τ on n nodes and suffix links of its internal nodes, we ask the question "Is τ a suffix tree?", i.e., is there a string S whose suffix tree has the same topological structure as τ? We place no restrictions on S, in particular we do not require that S ends with a unique symbol. This corresponds to considering the more general definition of implicit or extended suffix trees. Such general suffix trees have many applications and are for example needed to allow efficient updates when suffix trees are built online. Deciding if τ is a suffix tree is not an easy task, because, with no restrictions on the final symbol, we cannot guess the length of a string that realizes τ from the number of leaves. And without an upper bound on the length of such a string, it is not even clear how to solve the problem by an exhaustive search. In this paper, we prove that τ is a suffix tree if and only if it is realized by a string S of length n−1, and we give a linear-time algorithm for inferring S when the first letter on each edge is known. This generalizes the work of I et al. [Discrete Appl. Math. 163, 2014]
