Lempel-Ziv Parsing in External Memory

Author: Kempa Dominik
Kärkkäinen Juha
Puglisi Simon J.
Publication date: 04/07/2013

For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed the size of internal memory. In this paper we describe the first algorithm for computing the LZ77 parsing in external memory. Our algorithm is fast in practice and will allow the next generation of text indexes to be realised for massive strings and string collections.

Comment: 10 pages
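As a point of reference for what an LZ77 parsing is, the following is a minimal in-memory sketch. It is a naive quadratic-time baseline, not the external-memory algorithm described in the abstract; the function names `lz77_factorize` and `lz77_decode` and the factor encoding (a literal character, or a `(position, length)` pair that may overlap the current position) are assumptions made for illustration.

```python
def lz77_factorize(text):
    """Greedy LZ77 parsing: each factor is either a fresh literal character
    or the longest match (position, length) starting at an earlier position."""
    factors = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_pos = 0, -1
        for j in range(i):  # try every earlier starting position
            k = 0
            # the match may run past position i (self-overlapping copy)
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            factors.append(text[i])  # character not seen before
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def lz77_decode(factors):
    """Rebuild the text from the factors produced above."""
    out = []
    for f in factors:
        if isinstance(f, str):
            out.append(f)
        else:
            pos, length = f
            for k in range(length):
                out.append(out[pos + k])  # copy one symbol at a time
    return "".join(out)
```

Decoding copies one symbol at a time so that self-overlapping factors, such as the `(0, 5)` factor of `"abababa"`, are reproduced correctly.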

arXiv.org e-Print Archive

Crossref

Lightweight Lempel-Ziv Parsing

Author: D. Okanohara
E. Ohlebusch
G. Chen
G. Navarro
J. Barbay
J. Fischer
J. Kärkkäinen
J. Ziv
M. Crochemore
M.I. Abouelhoda
P. Ferragina
R. Cánovas
S. Kreft
S. Kuruppu
T. Gagie
T. Kasai
T. Starikovskaya
U. Manber
W.I. Chang
Publication date: 01/01/2013

We introduce a new approach to LZ77 factorization that uses $O(n/d)$ words of working space and $O(dn)$ time for any $d \ge 1$ (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.

Comment: 12 pages

arXiv.org e-Print Archive

Crossref

EERTREE: An Efficient Data Structure for Processing Palindromes in Strings

Author: Rubinchik Mikhail
Shur Arseny M.
Publication date: 17/08/2015

We propose a new linear-size data structure which provides fast access to all palindromic substrings of a string or a set of strings. This structure inherits some ideas from the construction of both the suffix trie and suffix tree. Using this structure, we present simple and efficient solutions for a number of problems involving palindromes.

Comment: 21 pages, 2 figures. Accepted to IWOCA 201
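The structure named in the title can be illustrated with a short sketch. The code below follows the commonly described layout of a palindromic tree (two roots, one node per distinct palindrome, suffix links to the longest proper palindromic suffix); the class name `Eertree`, its field names, and the helper `count_distinct_palindromes` are assumptions for illustration and are not taken from the paper.

```python
class Eertree:
    """Minimal palindromic tree over a single string, built online."""

    def __init__(self):
        # node 0: imaginary root of length -1; node 1: empty palindrome
        self.length = [-1, 0]
        self.link = [0, 0]      # suffix links
        self.edges = [{}, {}]   # edges[v][c] = node for the palindrome c + v + c
        self.text = []
        self.last = 1           # longest palindromic suffix of the text so far

    def _find(self, node, i):
        """Follow suffix links until the palindrome can be extended by text[i]."""
        while True:
            j = i - self.length[node] - 1
            if j >= 0 and self.text[j] == self.text[i]:
                return node
            node = self.link[node]

    def add(self, ch):
        """Append ch; return True if a new distinct palindrome appeared."""
        self.text.append(ch)
        i = len(self.text) - 1
        cur = self._find(self.last, i)
        if ch in self.edges[cur]:
            self.last = self.edges[cur][ch]
            return False
        new = len(self.length)
        self.length.append(self.length[cur] + 2)
        self.edges.append({})
        if self.length[new] == 1:
            self.link.append(1)  # single characters link to the empty palindrome
        else:
            parent = self._find(self.link[cur], i)
            self.link.append(self.edges[parent][ch])
        self.edges[cur][ch] = new
        self.last = new
        return True


def count_distinct_palindromes(s):
    """Number of distinct non-empty palindromic substrings of s."""
    tree = Eertree()
    for ch in s:
        tree.add(ch)
    return len(tree.length) - 2  # exclude the two roots
```

For the string `"eertree"` the distinct palindromes are e, ee, r, t, rtr, ertre and eertree, so the sketch creates seven non-root nodes.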

arXiv.org e-Print Archive

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Measuring and Understanding Throughput of Network Topologies

Author: Godfrey P. Brighten
Jyothi Sangeetha Abdu
Kolla Alexandra
Singla Ankit
Publication date: 14/11/2016

High throughput is of particular interest in data center and HPC networks. Although myriad network topologies have been proposed, a broad head-to-head comparison across topologies and across traffic patterns is absent, and the right way to compare worst-case throughput performance is a subtle problem. In this paper, we develop a framework to benchmark the throughput of network topologies, using a two-pronged approach. First, we study performance on a variety of synthetic and experimentally-measured traffic matrices (TMs). Second, we show how to measure worst-case throughput by generating a near-worst-case TM for any given topology. We apply the framework to study the performance of these TMs in a wide range of network topologies, revealing insights into the performance of topologies with scaling, robustness of performance across TMs, and the effect of scattered workload placement. Our evaluation code is freely available.

arXiv.org e-Print Archive

CiteSeerX

Crossref

On Maximal Unbordered Factors

Author: A Ehrenfeucht
D Moore
F Franĕk
J-P Duval
L Ilie
P Gawrychowski
P Nielsen
R Assous
S Holub
T Kociumaka
Publication date: 28/04/2015

Given a string $S$ of length $n$, its maximal unbordered factor is the longest factor which does not have a border. In this work we investigate the relationship between $n$ and the length of the maximal unbordered factor of $S$. We prove that for the alphabet of size $\sigma \ge 5$ the expected length of the maximal unbordered factor of a string of length $n$ is at least $0.99n$ (for sufficiently large values of $n$). As an application of this result, we propose a new algorithm for computing the maximal unbordered factor of a string.

Comment: Accepted to the 26th Annual Symposium on Combinatorial Pattern Matching (CPM 2015)
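The definition above can be made concrete with a brute-force sketch. It is a quadratic-time baseline built on the classic prefix function (border array), not the algorithm proposed in the paper; the names `prefix_function` and `max_unbordered_factor` are assumptions for illustration. A factor is unbordered exactly when its longest proper border has length 0.

```python
def prefix_function(s):
    """pi[i] = length of the longest proper border of s[:i + 1]."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def max_unbordered_factor(s):
    """Return a longest unbordered factor of s (the leftmost one found)."""
    best = ""
    for start in range(len(s)):
        # borders of every prefix of the suffix s[start:]
        pi = prefix_function(s[start:])
        for offset, border in enumerate(pi):
            length = offset + 1
            if border == 0 and length > len(best):
                best = s[start:start + length]
    return best
```

For example, every factor of `"abab"` longer than two characters has a border, so the sketch returns `"ab"`, while `"aab"` is itself unbordered.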

arXiv.org e-Print Archive

Crossref

HAL Descartes

Hal-Diderot

HAL-Ecole des Ponts ParisTech

Explore Bristol Research

HAL - UPEC / UPEM

Internal Pattern Matching Queries in a Text and Applications

Author: Kociumaka Tomasz
Radoszewski Jakub
Rytter Wojciech
Waleń Tomasz
Publication date: 13/10/2014

We consider several types of internal queries: questions about subwords of a text. As the main tool we develop an optimal data structure for the problem called here internal pattern matching. This data structure provides constant-time answers to queries about occurrences of one subword $x$ in another subword $y$ of a given text, assuming that $|y|=\mathcal{O}(|x|)$, which allows for a constant-space representation of all occurrences. This problem can be viewed as a natural extension of the well-studied pattern matching problem. The data structure has linear size and admits a linear-time construction algorithm. Using the solution to the internal pattern matching problem, we obtain very efficient data structures answering queries about: primitivity of subwords, periods of subwords, general substring compression, and cyclic equivalence of two subwords. All these results improve upon the best previously known counterparts. The linear construction time of our data structure also allows us to improve the algorithm for finding $\delta$-subrepetitions in a text (a more general version of maximal repetitions, also called runs). For any fixed $\delta$ we obtain the first linear-time algorithm, which matches the linear time complexity of the algorithm computing runs. Our data structure has already been used as a part of the efficient solutions for subword suffix rank & selection, as well as substring compression using Burrows-Wheeler transform composed with run-length encoding.

Comment: 31 pages, 9 figures; accepted to SODA 201
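The "constant-space representation of all occurrences" can be illustrated with a small sketch. It relies on a standard periodicity fact, stated here as an assumption rather than taken from the abstract: when $y$ is less than twice as long as $x$, the starting positions of $x$ in $y$ form a single arithmetic progression, so three integers describe them all. The code uses a naive scan, not the paper's constant-time data structure, and the names `occurrences` and `as_progression` are hypothetical.

```python
def occurrences(x, y):
    """Start positions of x in y, found by a naive scan (overlaps allowed)."""
    return [i for i in range(len(y) - len(x) + 1) if y.startswith(x, i)]


def as_progression(positions):
    """Compress sorted positions into (first, count, step).

    Returns None when the positions are not an arithmetic progression,
    which can happen if y is much longer than x.
    """
    if not positions:
        return (0, 0, 0)
    if len(positions) == 1:
        return (positions[0], 1, 0)
    step = positions[1] - positions[0]
    for a, b in zip(positions, positions[1:]):
        if b - a != step:
            return None
    return (positions[0], len(positions), step)
```

With `x = "aba"` and `y = "ababa"`, the occurrences start at positions 0 and 2, which compress to the triple `(0, 2, 2)`.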

arXiv.org e-Print Archive

Crossref

Faster Compact On-Line Lempel-Ziv Factorization

Author: Bannai Hideo
I Tomohiro
Inenaga Shunsuke
Takeda Masayuki
Yamamoto Jun'ichi
Publication date: 26/05/2013

We present a new on-line algorithm for computing the Lempel-Ziv factorization of a string that runs in $O(N\log N)$ time and uses only $O(N\log\sigma)$ bits of working space, where $N$ is the length of the string and $\sigma$ is the size of the alphabet. This is a notable improvement compared to the performance of previous on-line algorithms using the same order of working space but running in either $O(N\log^3 N)$ time (Okanohara & Sadakane 2009) or $O(N\log^2 N)$ time (Starikovskaya 2012). The key to our new algorithm is in the utilization of an elegant but less popular index structure called Directed Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an opportunistic variant of our algorithm, which, given the run length encoding of size $m$ of a string of length $N$, computes the Lempel-Ziv factorization on-line, in $O\left(m \cdot \min \left\{\frac{(\log\log m)(\log \log N)}{\log\log\log N}, \sqrt{\frac{\log m}{\log \log m}} \right\}\right)$ time and $O(m\log N)$ bits of space, which is faster and more space efficient when the string is run-length compressible.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Efficient LZ78 factorization of grammar compressed text

Author: A. Amir
A. Jeż
E. Ukkonen
E.M. McCreight
J. Jansson
J. Westbrook
J. Ziv
K. Goto
M. Crochemore
M. Li
M.A. Bender
O. Berkman
P. Weiner
R. Cilibrasi
T. Kida
V. Freschi
W. Rytter
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

We present an efficient algorithm for computing the LZ78 factorization of a text, where the text is represented as a straight line program (SLP), which is a context free grammar in the Chomsky normal form that generates a single string. Given an SLP of size $n$ representing a text $S$ of length $N$, our algorithm computes the LZ78 factorization of $S$ in $O(n\sqrt{N}+m\log N)$ time and $O(n\sqrt{N}+m)$ space, where $m$ is the number of resulting LZ78 factors. We also show how to improve the algorithm so that the $n\sqrt{N}$ term in the time and space complexities becomes either $nL$, where $L$ is the length of the longest LZ78 factor, or $(N - \alpha)$ where $\alpha \geq 0$ is a quantity which depends on the amount of redundancy that the SLP captures with respect to substrings of $S$ of a certain length. Since $m = O(N/\log_\sigma N)$ where $\sigma$ is the alphabet size, the latter is asymptotically at least as fast as a linear time algorithm which runs on the uncompressed string when $\sigma$ is constant, and can be more efficient when the text is compressible, i.e. when $m$ and $n$ are small.

Comment: SPIRE 201
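For readers unfamiliar with LZ78, the sketch below computes the factorization of an uncompressed string with a dictionary-backed trie. It corresponds to the plain uncompressed-input setting that the abstract compares against, not to the SLP-based algorithm itself; the names `lz78_factorize` and `lz78_decode` and the `(previous factor index, next character)` encoding are assumptions for illustration.

```python
def lz78_factorize(text):
    """LZ78 parsing: each factor is (index of an earlier factor, new char).

    Index 0 denotes the empty string; factors are numbered from 1.
    """
    trie = {}     # (parent factor id, character) -> factor id
    factors = []
    node = 0      # current trie node; 0 is the root (empty string)
    for ch in text:
        key = (node, ch)
        if key in trie:
            node = trie[key]  # keep extending the current match
        else:
            trie[key] = len(factors) + 1
            factors.append((node, ch))
            node = 0
    if node != 0:
        # the text ended inside a match: last factor repeats an earlier one
        factors.append((node, ""))
    return factors


def lz78_decode(factors):
    """Rebuild the text from (index, char) factors."""
    phrases = [""]
    for idx, ch in factors:
        phrases.append(phrases[idx] + ch)
    return "".join(phrases[1:])
```

On `"abababa"` the sketch yields the phrases a, b, ab, aba, so the number of factors $m$ is 4 for a text of length $N = 7$.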

arXiv.org e-Print Archive

Crossref