Search CORE

585 research outputs found

On-line construction of position heaps

Author: A. Blumer
A. Ehrenfeucht
D. Gusfield
E. Coffman
E. Fredkin
E. Ukkonen
J.I. Munro
M. Crochemore
M. Crochemore
M. Crochemore
T. Cormen
Publication venue
Publication date: 01/01/2011
Field of study

We propose a simple linear-time on-line algorithm for constructing a position heap for a string [Ehrenfeucht et al, 2011]. Our definition of position heap differs slightly from the one proposed in [Ehrenfeucht et al, 2011] in that it considers the suffixes ordered from left to right. Our construction is based on classic suffix pointers and resembles the Ukkonen's algorithm for suffix trees [Ukkonen, 1995]. Using suffix pointers, the position heap can be extended into the augmented position heap that allows for a linear-time string matching algorithm [Ehrenfeucht et al, 2011].Comment: to appear in Journal of Discrete Algorithm

arXiv.org e-Print Archive

Crossref

HAL Portal de Univ. Gustave Eiffel

HAL: Hyper Article en Ligne

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Approximate string matching with reduced alphabet

Author: B. Ďurian
E. Ukkonen
E. Ukkonen
E. Ukkonen
E. Ukkonen
E. Ukkonen
J. Kärkkäinen
J. Kärkkäinen
J. Tarhio
J. Tarhio
K. Fredriksson
K. Fredriksson
K. Fredriksson
L. Salmela
M. Fontaine
M.R. Garey
P. Jokinen
P. Jokinen
R. Baeza-Yates
R. Muth
R. Zhu
R.M. Karp
R.N. Horspool
R.S. Boyer
T. Berry
T. Lecroq
V. Mäkinen
V.L. Arlazarov
W.J. Masek
Z. Liu
Publication venue: Heidelberg, Berlin, Springer Verlag,
Publication date: 01/01/2010
Field of study

Peer reviewe

Crossref

Helsingin yliopiston digitaalinen arkisto

Dictionary Matching with One Gap

Author: A. Amir
A. Amir
A. Amir
A. Amir
A.V. Aho
E. Ukkonen
E.M. McCreight
G. Kucherov
G. Myers
G. Myers
G. Navarro
G. Navarro
G.S. Brodal
J.C. Naa
K. Fredriksson
M. Morgante
M. Zhang
M.S. Rahman
P. Bille
T. Haapasalo
Publication venue
Publication date: 01/01/2014
Field of study

The dictionary matching with gaps problem is to preprocess a dictionary

D

d

gapped patterns

P_1,\ldots,P_d

over alphabet

\Sigma

, where each gapped pattern

P_i

is a sequence of subpatterns separated by bounded sequences of don't cares. Then, given a query text

T

of length

n

over alphabet

\Sigma

, the goal is to output all locations in

T

in which a pattern

P_i\in D

1\leq i\leq d

, ends. There is a renewed current interest in the gapped matching problem stemming from cyber security. In this paper we solve the problem where all patterns in the dictionary have one gap with at least

\alpha

and at most

\beta

don't cares, where

\alpha

and

\beta

are given parameters. Specifically, we show that the dictionary matching with a single gap problem can be solved in either

O(d\log d + |D|)

time and

O(d\log^{\varepsilon} d + |D|)

space, and query time

O(n(\beta -\alpha )\log\log d \log ^2 \min \{ d, \log |D| \} + occ)

, where

occ

is the number of patterns found, or preprocessing time and space:

O(d^2 + |D|)

, and query time

O(n(\beta -\alpha ) + occ)

, where

occ

is the number of patterns found. As far as we know, this is the best solution for this setting of the problem, where many overlaps may exist in the dictionary.Comment: A preliminary version was published at CPM 201

arXiv.org e-Print Archive

Crossref

Faster Approximate String Matching for Short Patterns

Author: A. Andersson
A.H. Wright
D. Gusfield
D. Harel
D.E. Knuth
E. Ukkonen
E. Ukkonen
E.W. Myers
F.T. Leighton
G. Myers
G. Navarro
G.M. Landau
H. Hyyrö
K.E. Batcher
M. Farach-Colton
M.A. Bender
P. Bille
P. Sellers
Philip Bille
R. Baeza-Yates
R. Cole
R.A. Baeza-Yates
R.A. Wagner
S. Albers
S. Alstrup
S. Wu
S.C. Sahinalp
T. Hagerup
T.H. Cormen
V.L. Arlazarov
W. Masek
Z. Galil
Z. Galil
Publication venue
Publication date: 17/03/2011
Field of study

We study the classical approximate string matching problem, that is, given strings

P

and

Q

and an error threshold

k

, find all ending positions of substrings of

Q

whose edit distance to

P

is at most

k

. Let

P

and

Q

have lengths

m

and

n

, respectively. On a standard unit-cost word RAM with word size

w \geq \log n

we present an algorithm using time

O(nk \cdot \min(\frac{\log^2 m}{\log n},\frac{\log^2 m\log w}{w}) + n)

When

P

is short, namely,

m = 2^{o(\sqrt{\log n})}

m = 2^{o(\sqrt{w/\log w})}

this improves the previously best known time bounds for the problem. The result is achieved using a novel implementation of the Landau-Vishkin algorithm based on tabulation and word-level parallelism.Comment: To appear in Theory of Computing System

arXiv.org e-Print Archive

Crossref

Online Research Database In Technology

Efficient LZ78 factorization of grammar compressed text

Author: A. Amir
A. Jeż
E. Ukkonen
E.M. McCreight
J. Jansson
J. Westbrook
J. Ziv
J. Ziv
K. Goto
K. Goto
M. Crochemore
M. Li
M. Li
M.A. Bender
O. Berkman
P. Weiner
R. Cilibrasi
T. Kida
V. Freschi
W. Rytter
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012
Field of study

We present an efficient algorithm for computing the LZ78 factorization of a text, where the text is represented as a straight line program (SLP), which is a context free grammar in the Chomsky normal form that generates a single string. Given an SLP of size

n

representing a text

S

of length

N

, our algorithm computes the LZ78 factorization of

T

O(n\sqrt{N}+m\log N)

time and

O(n\sqrt{N}+m)

space, where

m

is the number of resulting LZ78 factors. We also show how to improve the algorithm so that the

n\sqrt{N}

term in the time and space complexities becomes either

nL

, where

L

is the length of the longest LZ78 factor, or

(N - \alpha)

where

\alpha \geq 0

is a quantity which depends on the amount of redundancy that the SLP captures with respect to substrings of

S

of a certain length. Since

m = O(N/\log_\sigma N)

where

\sigma

is the alphabet size, the latter is asymptotically at least as fast as a linear time algorithm which runs on the uncompressed string when

\sigma

is constant, and can be more efficient when the text is compressible, i.e. when

m

and

n

are small.Comment: SPIRE 201

arXiv.org e-Print Archive

Crossref

On the suitability of suffix arrays for lempel-ziv data compression

Author: D. Gusfield
D. Salomon
E. McCreight
E. Ukkonen
J. Karkainen
J. Storer
J. Ziv
K. Sadakane
M. Abouelhoda
U. Manber
Publication venue: Springer-Verlag Berlin
Publication date: 01/01/2009
Field of study

Lossless compression algorithms of the Lempel-Ziv (LZ) family are widely used nowadays. Regarding time and memory requirements, LZ encoding is much more demanding than decoding. In order to speed up the encoding process, efficient data structures, like suffix trees, have been used. In this paper, we explore the use of suffix arrays to hold the dictionary of the LZ encoder, and propose an algorithm to search over it. We show that the resulting encoder attains roughly the same compression ratios as those based on suffix trees. However, the amount of memory required by the suffix array is fixed, and much lower than the variable amount of memory used by encoders based on suffix trees (which depends on the text to encode). We conclude that suffix arrays, when compared to suffix trees in terms of the trade-off among time, memory, and compression ratio, may be preferable in scenarios (e.g., embedded systems) where memory is at a premium and high speed is not critical

RCIPL Repositório Científico do Instituto Politécnico de Lisboa

Crossref

Vertically Recurrent Neural Networks for Sub‐Grid Parameterization

Author: Chantry M.
Ukkonen P.
Publication venue: Wiley
Publication date: 10/06/2025
Field of study

Machine learning has the potential to improve the physical realism and/or computational efficiency of parameterizations. A typical approach has been to feed concatenated vertical profiles to a dense neural network. However, feed‐forward networks lack the connections to propagate information sequentially through the vertical column. Here we examine if predictions can be improved by instead traversing the column with recurrent neural networks (RNNs) such as Long Short‐Term Memory (LSTMs). This method encodes physical priors (locality) and uses parameters more efficiently. Firstly, we test RNN‐based radiation emulators in the Integrated Forecasting System. We achieve near‐perfect offline accuracy, and the forecast skill of a suite of global weather simulations using the emulator are for the most part statistically indistinguishable from reference runs. But can radiation emulators provide both high accuracy and a speed‐up? We find optimized, state‐of‐the‐art radiation code on CPU generally faster than RNN‐based emulators on GPU, although the latter can be more energy efficient. To test the method more broadly, and explore recent challenges in parameterization, we also adapt it to data sets from other studies. RNNs outperform reference feed‐forward networks in emulating gravity waves, and when combined with horizontal convolutions, for non‐local unified parameterization. In emulation of moist physics with memory, the RNNs have similar offline accuracy as ResNets, the previous state‐of‐the‐art. However, the RNNs are more efficient, and more stable in autoregressive semi‐prognostic tests. Multi‐step autoregressive training improves performance in these tests and enables a latent representation of convective memory. Recently proposed linearly recurrent models achieve similar performance to LSTMs

Oxford University Research Archive

Compact q-gram Profiling of Compressed Strings

Author: E. Ukkonen
G. Paaß
J. Kärkkäinen
J. Ziv
J. Ziv
K. Goto
K. Goto
M. Charikar
R.M. Karp
T. Gärtner
T. Shibuya
W. Matsubara
W. Rytter
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

We consider the problem of computing the q-gram profile of a string \str of size

N

compressed by a context-free grammar with

n

production rules. We present an algorithm that runs in

O(N-\alpha)

expected time and uses O(n+q+\kq) space, where

N-\alpha\leq qn

is the exact number of characters decompressed by the algorithm and \kq\leq N-\alpha is the number of distinct q-grams in \str. This simultaneously matches the current best known time bound and improves the best known space bound. Our space bound is asymptotically optimal in the sense that any algorithm storing the grammar and the q-gram profile must use \Omega(n+q+\kq) space. To achieve this we introduce the q-gram graph that space-efficiently captures the structure of a string with respect to its q-grams, and show how to construct it from a grammar

arXiv.org e-Print Archive

CiteSeerX

Crossref

Online Research Database In Technology

A suffix tree or not a suffix tree?

Author: A Apostolico
B Cazaux
D Breslauer
D Breslauer
D Gusfield
E Ukkonen
G Kucherov
H Bannai
JP Duval
JP Duval
JP Duval
M Crochemore
P Gawrychowski
T I
T I
T I
Tomohiro I.
W Lu
Publication venue: 'Elsevier BV'
Publication date: 01/09/2014
Field of study

In this paper we study the structure of suffix trees. Given an unlabeled tree τ on n nodes and suffix links of its internal nodes, we ask the question ”Is τ a suffix tree?”, i.e., is there a string S whose suffix tree has the same topological structure as τ? We place no restrictions on S, in particular we do not require that S ends with a unique symbol. This corresponds to considering the more general definition of implicit or extended suffix trees. Such general suffix trees have many applications and are for example needed to allow efficient updates when suffix trees are built online. Deciding if τ is a suffix tree is not an easy task, because, with no restrictions on the final symbol, we cannot guess the length of a string that realizes τ from the number of leaves. And without an upper bound on the length of such a string, it is not even clear how to solve the problem by an exhaustive search. In this paper, we prove that τ is a suffix tree if and only if it is realized by a string S of length n−1, and we give a linear-time algorithm for inferring S when the first letter on each edge is known. This generalizes the work of I et al. [Discrete Appl. Math. 163, 2014]

arXiv.org e-Print Archive

CiteSeerX

Crossref

Online Research Database In Technology

Explore Bristol Research