
424 research outputs found

On-line construction of position heaps

Author: A. Blumer
A. Ehrenfeucht
D. Gusfield
E. Coffman
E. Fredkin
E. Ukkonen
J.I. Munro
M. Crochemore
M. Crochemore
M. Crochemore
T. Cormen
Publication date: 01/01/2011

We propose a simple linear-time on-line algorithm for constructing a position heap for a string [Ehrenfeucht et al, 2011]. Our definition of position heap differs slightly from the one proposed in [Ehrenfeucht et al, 2011] in that it considers the suffixes ordered from left to right. Our construction is based on classic suffix pointers and resembles Ukkonen's algorithm for suffix trees [Ukkonen, 1995]. Using suffix pointers, the position heap can be extended into the augmented position heap that allows for a linear-time string matching algorithm [Ehrenfeucht et al, 2011].

Comment: to appear in Journal of Discrete Algorithms

arXiv.org e-Print Archive

Crossref

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Approximate string matching with reduced alphabet

Author: B. Ďurian
E. Ukkonen
E. Ukkonen
E. Ukkonen
E. Ukkonen
E. Ukkonen
J. Kärkkäinen
J. Kärkkäinen
J. Tarhio
J. Tarhio
K. Fredriksson
K. Fredriksson
K. Fredriksson
L. Salmela
M. Fontaine
M.R. Garey
P. Jokinen
P. Jokinen
R. Baeza-Yates
R. Muth
R. Zhu
R.M. Karp
R.N. Horspool
R.S. Boyer
T. Berry
T. Lecroq
V. Mäkinen
V.L. Arlazarov
W.J. Masek
Z. Liu
Publication venue: Heidelberg, Berlin, Springer Verlag
Publication date: 01/01/2010

Peer reviewed

Crossref

Helsingin yliopiston digitaalinen arkisto

Web search queries can predict stock market volumes

Author: Battiston S.
Bordino I.
Caldarelli G.
Cristelli M.
Ukkonen A.
Weber I.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2012

We live in a computerized and networked society where many of our actions leave a digital trace and affect other people's actions. This has led to the emergence of a new data-driven research field: mathematical methods of computer science, statistical physics and sociometry provide insights on a wide range of disciplines ranging from social science to human mobility. A recent important discovery is that search engine traffic (i.e., the number of requests submitted by users to search engines on the www) can be used to track and, in some cases, to anticipate the dynamics of social phenomena. Successful examples include unemployment levels, car and home sales, and the spreading of epidemics. A few recent works have applied this approach to stock prices and market sentiment. However, it remains unclear whether trends in financial markets can be anticipated by the collective wisdom of on-line users on the web. Here we show that daily trading volumes of stocks traded in NASDAQ-100 are correlated with daily volumes of queries related to the same stocks. In particular, query volumes anticipate in many cases peaks of trading by one day or more. Our analysis is carried out on a unique dataset of queries, submitted to an important web search engine, which enables us to also investigate user behavior. We show that the query volume dynamics emerges from the collective but seemingly uncoordinated activity of many users. These findings contribute to the debate on the identification of early warnings of financial systemic risk, based on the activity of users of the www. © 2012 Bordino et al.

Public Library of Science (PLOS)

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Web Search Queries Can Predict Stock Market Volumes

Author: Battiston S
Bordino I
Caldarelli G
Cristelli M
Ukkonen A
Weber I
Publication date: 01/01/2012

Archivio della ricerca della Scuola IMT Alti Studi Lucca

Fast Algorithm for Partial Covers in Words

Author: A. Apostolico
A. Apostolico
A. Apostolico
A.S. Fraenkel
D. Breslauer
D. Gusfield
D. Moore
E. Ukkonen
G.S. Brodal
G.S. Brodal
J.S. Sim
M. Crochemore
M.R. Brown
Y. Li
Publication date: 01/01/2013

A factor $u$ of a word $w$ is a cover of $w$ if every position in $w$ lies within some occurrence of $u$ in $w$. A word $w$ covered by $u$ thus generalizes the idea of a repetition, that is, a word composed of exact concatenations of $u$. In this article we introduce a new notion of $\alpha$-partial cover, which can be viewed as a relaxed variant of cover, that is, a factor covering at least $\alpha$ positions in $w$. We develop a data structure of $O(n)$ size (where $n=|w|$) that can be constructed in $O(n\log n)$ time which we apply to compute all shortest $\alpha$-partial covers for a given $\alpha$. We also employ it for an $O(n\log n)$-time algorithm computing a shortest $\alpha$-partial cover for each $\alpha=1,2,\ldots,n$.
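To make the definitions of cover and partial cover concrete, here is a minimal brute-force sketch in Python. It is not the O(n log n) data structure from the abstract; the function names `covered_positions`, `is_cover`, and `shortest_partial_covers` are hypothetical and chosen only for illustration.

```python
def covered_positions(w: str, u: str) -> int:
    """Count positions of w that lie inside at least one occurrence of u."""
    if not u:
        return 0
    covered = [False] * len(w)
    start = w.find(u)
    while start != -1:
        for i in range(start, start + len(u)):
            covered[i] = True
        # Advance by one so that overlapping occurrences are found too.
        start = w.find(u, start + 1)
    return sum(covered)


def is_cover(w: str, u: str) -> bool:
    """u is a cover of w if every position of w is covered."""
    return covered_positions(w, u) == len(w)


def shortest_partial_covers(w: str, alpha: int) -> list[str]:
    """All shortest factors of w covering at least alpha positions.

    Naive enumeration over all factors, shortest length first.
    """
    n = len(w)
    for length in range(1, n + 1):
        factors = {w[i:i + length] for i in range(n - length + 1)}
        hits = sorted(u for u in factors if covered_positions(w, u) >= alpha)
        if hits:
            return hits
    return []
```

With `w = "abaababa"`, the factor `"aba"` covers every position, while `"ab"` covers six of the eight positions, so it is a 6-partial cover but not a cover.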

arXiv.org e-Print Archive

Crossref

Springer - Publisher Connector

King's Research Portal

Efficient LZ78 factorization of grammar compressed text

Author: A. Amir
A. Jeż
E. Ukkonen
E.M. McCreight
J. Jansson
J. Westbrook
J. Ziv
J. Ziv
K. Goto
K. Goto
M. Crochemore
M. Li
M. Li
M.A. Bender
O. Berkman
P. Weiner
R. Cilibrasi
T. Kida
V. Freschi
W. Rytter
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

We present an efficient algorithm for computing the LZ78 factorization of a text, where the text is represented as a straight line program (SLP), which is a context-free grammar in Chomsky normal form that generates a single string. Given an SLP of size $n$ representing a text $S$ of length $N$, our algorithm computes the LZ78 factorization of $S$ in $O(n\sqrt{N}+m\log N)$ time and $O(n\sqrt{N}+m)$ space, where $m$ is the number of resulting LZ78 factors. We also show how to improve the algorithm so that the $n\sqrt{N}$ term in the time and space complexities becomes either $nL$, where $L$ is the length of the longest LZ78 factor, or $(N - \alpha)$ where $\alpha \geq 0$ is a quantity which depends on the amount of redundancy that the SLP captures with respect to substrings of $S$ of a certain length. Since $m = O(N/\log_\sigma N)$ where $\sigma$ is the alphabet size, the latter is asymptotically at least as fast as a linear time algorithm which runs on the uncompressed string when $\sigma$ is constant, and can be more efficient when the text is compressible, i.e. when $m$ and $n$ are small.

Comment: SPIRE 201
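For readers unfamiliar with the LZ78 factorization itself, the following minimal Python sketch computes it on an uncompressed string using a dictionary-based trie. It illustrates only the factorization being computed, not the SLP-based algorithm of the abstract; the name `lz78_factorize` is hypothetical.

```python
def lz78_factorize(text: str) -> list[str]:
    """Return the LZ78 factors of text, in order.

    Each factor is the longest previously seen factor extended by one
    new character; the final factor may equal an earlier factor.
    """
    trie = {}        # (node_id, char) -> child node_id; node 0 is the root
    next_id = 1
    factors = []
    node = 0         # current trie node while scanning the current factor
    start = 0        # start index of the current factor in text
    for i, ch in enumerate(text):
        child = trie.get((node, ch))
        if child is not None:
            node = child                 # keep extending the current factor
        else:
            trie[(node, ch)] = next_id   # register the new factor in the trie
            next_id += 1
            factors.append(text[start:i + 1])
            node = 0
            start = i + 1
    if start < len(text):
        factors.append(text[start:])     # leftover factor at end of text
    return factors
```

On the input `"aaabaabab"` this yields the factors `a`, `aa`, `b`, `aab`, `ab`, so $m = 5$ in the notation of the abstract.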

arXiv.org e-Print Archive

Crossref

Dictionary Matching with One Gap

Author: A. Amir
A. Amir
A. Amir
A. Amir
A.V. Aho
E. Ukkonen
E.M. McCreight
G. Kucherov
G. Myers
G. Myers
G. Navarro
G. Navarro
G.S. Brodal
J.C. Naa
K. Fredriksson
M. Morgante
M. Zhang
M.S. Rahman
P. Bille
T. Haapasalo
Publication date: 01/01/2014

The dictionary matching with gaps problem is to preprocess a dictionary $D$ of $d$ gapped patterns $P_1,\ldots,P_d$ over alphabet $\Sigma$, where each gapped pattern $P_i$ is a sequence of subpatterns separated by bounded sequences of don't cares. Then, given a query text $T$ of length $n$ over alphabet $\Sigma$, the goal is to output all locations in $T$ in which a pattern $P_i\in D$, $1\leq i\leq d$, ends. There is a renewed current interest in the gapped matching problem stemming from cyber security. In this paper we solve the problem where all patterns in the dictionary have one gap with at least $\alpha$ and at most $\beta$ don't cares, where $\alpha$ and $\beta$ are given parameters. Specifically, we show that the dictionary matching with a single gap problem can be solved in either $O(d\log d + |D|)$ time and $O(d\log^{\varepsilon} d + |D|)$ space, and query time $O(n(\beta -\alpha )\log\log d \log ^2 \min \{ d, \log |D| \} + occ)$, where $occ$ is the number of patterns found, or preprocessing time and space: $O(d^2 + |D|)$, and query time $O(n(\beta -\alpha ) + occ)$, where $occ$ is the number of patterns found. As far as we know, this is the best solution for this setting of the problem, where many overlaps may exist in the dictionary.

Comment: A preliminary version was published at CPM 201
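The problem statement can be illustrated with a naive Python sketch that reports, for each one-gap pattern, the text positions where an occurrence ends. It assumes each pattern is given as a hypothetical `(left, right)` pair of non-empty subpatterns separated by between `alpha` and `beta` don't cares; it makes no attempt to match the time bounds quoted in the abstract.

```python
def one_gap_matches(text, dictionary, alpha, beta):
    """Return sorted (end_position, pattern_index) pairs.

    dictionary is a list of (left, right) subpattern pairs; a pattern
    occurs when left is followed by alpha..beta arbitrary characters
    and then right. end_position is the index of the last character
    of right in text.
    """
    results = set()
    for idx, (left, right) in enumerate(dictionary):
        start = text.find(left)
        while start != -1:
            after_left = start + len(left)
            for gap in range(alpha, beta + 1):
                r = after_left + gap
                # startswith with an offset checks text[r:r+len(right)].
                if text.startswith(right, r):
                    results.add((r + len(right) - 1, idx))
            start = text.find(left, start + 1)
    return sorted(results)
```

For example, with text `"abxxcdabxcd"`, the single pattern `("ab", "cd")`, and gap bounds 1 and 2, the pattern ends at positions 5 and 10.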

arXiv.org e-Print Archive

Crossref

Suffix Tree of Alignment: An Efficient Index for Similar Data

Author: A. Amir
D. Gusfield
E. Ukkonen
E.M. McCreight
G. Navarro
H.H. Do
J. Ziv
K. Sadakane
M. Crochemore
M. Farach-Colton
P. Bille
R. Grossi
R.A. Baeza-Yates
S. Huang
S. Karlin
S. Kuruppu
V. Levenshtein
V. Mäkinen
V. Mäkinen
Publication date: 01/01/2013

We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings $A$ and $B$ is a compacted trie representing all suffixes in $A$ and $B$. It has $|A|+|B|$ leaves and can be constructed in $O(|A|+|B|)$ time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of $A$ and $B$. In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of $A$ and $B$ has $|A| + l_d + l_1$ leaves where $l_d$ is the sum of the lengths of all parts of $B$ different from $A$ and $l_1$ is the sum of the lengths of some common parts of $A$ and $B$. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern $P$ in $O(|P|+occ)$ time where $occ$ is the number of occurrences of $P$ in $A$ and $B$. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires $O(|A| + l_d + l_1 + l_2)$ time where $l_2$ is the sum of the lengths of other common substrings of $A$ and $B$. When the suffix tree of $A$ is already given, it requires $O(l_d + l_1 + l_2)$ time.

Comment: 12 pages

arXiv.org e-Print Archive

CiteSeerX

Crossref

King's Research Portal

Faster Approximate String Matching for Short Patterns

Author: A. Andersson
A.H. Wright
D. Gusfield
D. Harel
D.E. Knuth
E. Ukkonen
E. Ukkonen
E.W. Myers
F.T. Leighton
G. Myers
G. Navarro
G.M. Landau
H. Hyyrö
K.E. Batcher
M. Farach-Colton
M.A. Bender
P. Bille
P. Sellers
Philip Bille
R. Baeza-Yates
R. Cole
R.A. Baeza-Yates
R.A. Wagner
S. Albers
S. Alstrup
S. Wu
S.C. Sahinalp
T. Hagerup
T.H. Cormen
V.L. Arlazarov
W. Masek
Z. Galil
Z. Galil
Publication date: 17/03/2011

We study the classical approximate string matching problem, that is, given strings $P$ and $Q$ and an error threshold $k$, find all ending positions of substrings of $Q$ whose edit distance to $P$ is at most $k$. Let $P$ and $Q$ have lengths $m$ and $n$, respectively. On a standard unit-cost word RAM with word size $w \geq \log n$ we present an algorithm using time $O(nk \cdot \min(\frac{\log^2 m}{\log n},\frac{\log^2 m\log w}{w}) + n)$. When $P$ is short, namely, $m = 2^{o(\sqrt{\log n})}$ or $m = 2^{o(\sqrt{w/\log w})}$, this improves the previously best known time bounds for the problem. The result is achieved using a novel implementation of the Landau-Vishkin algorithm based on tabulation and word-level parallelism.

Comment: To appear in Theory of Computing Systems

arXiv.org e-Print Archive

Crossref

Online Research Database In Technology