Longest Common Extensions in Sublinear Space

Author: A Amir
D Gusfield
D Harel
EW Myers
G Manacher
GM Landau
MG Main
NJ Fine
P Bille
R Cole
R Kolpakov
RM Karp
Publication date: 01/01/2015

The longest common extension problem (LCE problem) is to construct a data structure for an input string $T$ of length $n$ that supports LCE$(i,j)$ queries. Such a query returns the length of the longest common prefix of the suffixes starting at positions $i$ and $j$ in $T$. This classic problem has a well-known solution that uses $O(n)$ space and $O(1)$ query time. In this paper we show that for any trade-off parameter $1 \leq \tau \leq n$, the problem can be solved in $O(\frac{n}{\tau})$ space and $O(\tau)$ query time. This significantly improves the previously best known time-space trade-offs, and almost matches the best known time-space product lower bound. Comment: An extended abstract of this paper has been accepted to CPM 201
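To make the query in the abstract above concrete, here is a minimal sketch of the naive baseline: it stores nothing besides the string and answers a query by direct character comparison, so it sits at the "no extra space, up to linear query time" end of the trade-off. It is not the data structure from the paper. The function name `lce` and the 0-based indexing are assumptions made for illustration.

```python
def lce(t: str, i: int, j: int) -> int:
    """Length of the longest common prefix of t[i:] and t[j:] (0-based positions).

    Naive scan: no preprocessing, O(1) extra space, O(n) worst-case time per query.
    """
    n = len(t)
    length = 0
    # Extend the match while both positions stay inside the string and agree.
    while i + length < n and j + length < n and t[i + length] == t[j + length]:
        length += 1
    return length
```

For example, in `"banana"` the suffixes starting at positions 1 and 3 are `"anana"` and `"ana"`, so the query returns 3.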

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

Online Research Database In Technology

Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes

Author: Pagh Rasmus
Rao S. Srinivasa
Publication date: 18/11/2008

Let S be a finite, ordered alphabet, and let x = x_1 x_2 ... x_n be a string over S. A "secondary index" for x answers alphabet range queries of the form: Given a range [a_l;a_r] over S, return the set I_{[a_l;a_r]} = {i | x_i \in [a_l;a_r]}. Secondary indexes are heavily used in relational databases and scientific data analysis. It is well known that the obvious solution, storing a dictionary for the position set associated with each character, does not always give optimal query time. In this paper we give the first theoretically optimal data structure for the secondary indexing problem. In the I/O model, the amount of data read when answering a query is within a constant factor of the minimum space needed to represent I_{[a_l;a_r]}, assuming that the size of internal memory is (|S| log n)^{delta} blocks, for some constant delta > 0. The space usage of the data structure is O(n log |S|) bits in the worst case, and we further show how to bound the size of the data structure in terms of the 0-th order entropy of x. We show how to support updates achieving various time-space trade-offs. We also consider an approximate version of the basic secondary indexing problem where a query reports a superset of I_{[a_l;a_r]} containing each element not in I_{[a_l;a_r]} with probability at most epsilon, where epsilon > 0 is the false positive probability. For this problem the amount of data that needs to be read by the query algorithm is reduced to O(|I_{[a_l;a_r]}| log(1/epsilon)) bits. Comment: 16 page
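The "obvious solution" mentioned in the abstract above can be sketched as follows: keep one sorted position list per character and answer a range query by merging the lists of all characters in the range. This only illustrates the baseline that the paper improves on. The names `build_index` and `range_query` are hypothetical, and positions are 1-based to match x_1 ... x_n.

```python
from collections import defaultdict


def build_index(x: str) -> dict:
    """Map each character of x to the sorted list of its 1-based positions."""
    positions = defaultdict(list)
    for i, c in enumerate(x, start=1):
        positions[c].append(i)
    return dict(positions)


def range_query(index: dict, a_l: str, a_r: str) -> list:
    """Return the sorted positions i with a_l <= x_i <= a_r."""
    result = []
    for c, pos in index.items():
        if a_l <= c <= a_r:
            result.extend(pos)
    # The merged lists come from different characters, so sort once at the end.
    return sorted(result)
```

A query over a wide range touches one list per character in the range, which hints at why this layout does not always give optimal query time.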

arXiv.org e-Print Archive

The IT University of Copenhagen's Repository

Exploring Superpage Promotion Policies for Efficient Address Translation

Author: Zhu Weixi
Publication date: 16/05/2019

Address translation performance for modern applications depends heavily upon the number of translation entries cached in the hardware TLB (translation look-aside buffer). Therefore, the efficiency of address translation relies directly on the TLB hit rate. The number of TLB entries continues to fall further behind the growth of memory consumption for modern applications. Superpages, which are pages with larger sizes, can increase the efficiency of the TLB by enabling each translation entry to cover a larger memory region. Without requiring more TLB entries, using superpages can increase the TLB hit rate and benefit address translation. However, using superpages can bring overhead. The TLB uses a single dirty bit to mark a page as dirty during address translation before modifying the page, so the granularity of the dirty bit corresponds to the coverage of the translation entry. As a result, the OS (operating system) will pay extra I/O effort when it allocates or writes an underutilized superpage back to disk. Such extra overhead can easily surpass the address translation benefits of superpages. This thesis discusses the performance trade-offs of superpages by exploring the design space of superpage promotion policies in the OS. A data collection infrastructure is built based on QEMU with kernel instrumentation on FreeBSD to collaboratively collect both memory accesses and kernel events. Then, the TLB behavior of Intel Skylake x86 family processors is simulated. The simulation has been validated to be faithful and consistent with the real-world performance. Last, this thesis evaluates and compares both TLB performance benefits and I/O overheads among the superpage promotion policies to discuss the trade-offs in the design space
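The trade-off described in the abstract above can be illustrated with a small back-of-the-envelope calculation. The sketch assumes x86-64 page sizes of 4 KiB (base page) and 2 MiB (superpage). The TLB entry count is a hypothetical figure chosen for the example, not a value taken from the thesis.

```python
KIB = 1024
MIB = 1024 * KIB

BASE_PAGE = 4 * KIB    # x86-64 base page size
SUPERPAGE = 2 * MIB    # x86-64 2 MiB superpage
TLB_ENTRIES = 1536     # hypothetical entry count, for illustration only


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory that can be translated without a TLB miss."""
    return entries * page_size


def writeback_bytes(dirty_pages: int, page_size: int) -> int:
    """Bytes written to disk when dirtiness is tracked per translation entry."""
    return dirty_pages * page_size


base_reach = tlb_reach(TLB_ENTRIES, BASE_PAGE)    # 6 MiB
super_reach = tlb_reach(TLB_ENTRIES, SUPERPAGE)   # 3 GiB

# A single modified byte marks the whole mapping dirty, so the write-back
# cost grows by the same factor as the TLB reach.
amplification = writeback_bytes(1, SUPERPAGE) // writeback_bytes(1, BASE_PAGE)
```

Under these assumptions the superpage mapping covers 512 times more memory per entry, and one dirty byte costs 512 times more write-back I/O, which is the tension a promotion policy has to balance.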

DSpace at Rice University

Time-Space Trade-Offs for Lempel-Ziv Compressed Indexing

Author: Bille Philip
Ettienne Mikko Berggren
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)
Publication date: 01/01/2017

Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. Let n and z denote the sizes of the input string and the compressed LZ77 string, respectively. We obtain the following time-space trade-offs. Given a pattern string P of length m, we can solve the problem in (i) O(m + occ lglg n) time using O(z lg(n/z) lglg z) space, or (ii) O(m(1 + lg^e z / lg(n/z)) + occ(lglg n + lg^e z)) time using O(z lg(n/z)) space, for any 0 < e < 1. In particular, (i) improves the leading term in the query time of the previous best solution from O(m lg m) to O(m) at the cost of increasing the space by a factor lglg z. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1 + lg^e z / lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n^{1-d}), for constant d > 0, this becomes O(m). Our index also supports extraction of any substring of length l in O(l + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.
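The bounds above are stated in terms of z, the size of the LZ77 parse. As a rough illustration of that quantity, the following sketch computes a greedy LZ77-style parse by brute force in quadratic time. It assumes one common variant of the scheme: each phrase copies the longest earlier match, with self-overlap allowed, and then adds one fresh character. The paper may use a slightly different phrase definition, and the function names are hypothetical.

```python
def lz77_phrases(s: str) -> list:
    """Greedy LZ77-style parse as a list of (source, length, next_char) triples.

    source is the start of an earlier occurrence (-1 if length is 0); the copy
    may overlap the phrase itself. Brute force, O(n^2) time or worse.
    """
    phrases = []
    n = len(s)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):
            length = 0
            # Stop one character early so a fresh character always remains.
            while i + length < n - 1 and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
        phrases.append((best_src, best_len, s[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases: list) -> str:
    """Rebuild the string from the triples produced by lz77_phrases."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])  # may read characters appended in this phrase
        out.append(ch)
    return "".join(out)
```

With this variant, `"abababab"` parses into three phrases, so z = 3 although n = 8. Highly repetitive inputs are the case where an index whose space depends on z rather than n pays off.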

Dagstuhl Research Online Publication Server

String Indexing for Top-$k$ Close Consecutive Occurrences

Author: Bille Philip
Gørtz Inge Li
Pedersen Max Rishøj
Rotenberg Eva
Steiner Teresa Anna
Publication date: 29/09/2020

The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i,j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$, and their distance is defined as $j-i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give two time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $\epsilon\in(0,1]$. Our first result achieves $O(n\log n)$ space and optimal query time of $O(m+k)$, and our second result achieves linear space and query time $O(m+k^{1+\epsilon})$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees. Comment: Fixed typos, minor change
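The query defined in the abstract above can be pinned down with a brute-force sketch that scans the whole string for every query. It spends time proportional to the length of the string rather than the O(m+k) of the paper's index, so it serves only as a specification of the expected output. The function names, the 0-based positions, and the tie-breaking by smaller starting position are assumptions.

```python
def occurrences(s: str, p: str) -> list:
    """All 0-based starting positions of p in s, in increasing order."""
    positions = []
    start = s.find(p)
    while start != -1:
        positions.append(start)
        start = s.find(p, start + 1)  # allow overlapping occurrences
    return positions


def top_k_close(s: str, p: str, k: int) -> list:
    """The k consecutive occurrences (i, j) of p in s with smallest distance j - i."""
    occ = occurrences(s, p)
    # Neighbours in the sorted occurrence list have no occurrence between them.
    pairs = list(zip(occ, occ[1:]))
    pairs.sort(key=lambda ij: (ij[1] - ij[0], ij[0]))
    return pairs[:k]
```

For `s = "abaababaab"` and `p = "ab"`, the occurrences are at 0, 3, 5 and 8, so the closest consecutive pair is (3, 5) with distance 2.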

arXiv.org e-Print Archive

Online Research Database In Technology

String Indexing for Top-k Close Consecutive Occurrences

Author: Bille Philip
Rotenberg Eva
Steiner Teresa Anna
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020)
Publication date: 01/01/2020

The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string P, report all occurrences of P within S. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-k close consecutive occurrences problem (Sitcco). Here, a consecutive occurrence is a pair (i,j), i < j, such that P occurs at positions i and j in S and there is no occurrence of P between i and j, and their distance is defined as j-i. Given a pattern P and a parameter k, the goal is to report the top-k consecutive occurrences of P in S of minimal distance. The challenge is to compactly represent S while supporting queries in time close to the length of P and k. We give two time-space trade-offs for the problem. Let n be the length of S, m the length of P, and ε ∈ (0,1]. Our first result achieves O(n log n) space and optimal query time of O(m+k), and our second result achieves linear space and query time O(m+k^{1+ε}). Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.

Dagstuhl Research Online Publication Server

Time-Space Trade-Offs for Lempel-Ziv Compressed Indexing

Author: Bille Philip
Ettienne Mikko Berggren
Gørtz Inge Li
Vildhøj Hjalte Wedel
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2017

Given a string $S$, the compressed indexing problem is to preprocess $S$ into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of $S$ while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. We obtain the following time-space trade-offs. For constant-sized alphabets: (i) $O(m + occ \lg\lg n)$ time using $O(z\lg(n/z)\lg\lg z)$ space, or (ii) $O(m(1 + \frac{\lg^\epsilon z}{\lg(n/z)}) + occ(\lg\lg n + \lg^\epsilon z))$ time using $O(z\lg(n/z))$ space. For integer alphabets polynomially bounded by $n$: (iii) $O(m(1 + \frac{\lg^\epsilon z}{\lg(n/z)}) + occ(\lg\lg n + \lg^\epsilon z))$ time using $O(z(\lg(n/z) + \lg\lg z))$ space, or (iv) $O(m + occ(\lg\lg n + \lg^{\epsilon} z))$ time using $O(z(\lg(n/z) + \lg^{\epsilon} z))$ space, where $n$ and $m$ are the lengths of the input string and query string respectively, $z$ is the number of phrases in the LZ77 parse of the input string, $occ$ is the number of occurrences of the query in the input and $\epsilon > 0$ is an arbitrarily small constant. In particular, (i) improves the leading term in the query time of the previous best solution from $O(m\lg m)$ to $O(m)$ at the cost of increasing the space by a factor $\lg \lg z$. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of $O(m(1+\frac{\lg^{\epsilon} z}{\lg (n/z)}))$. However, for any polynomial compression ratio, i.e., $z = O(n^{1-\delta})$, for constant $\delta > 0$, this becomes $O(m)$. Our index also supports extraction of any substring of length $\ell$ in $O(\ell + \lg(n/z))$ time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.

arXiv.org e-Print Archive

Crossref

Online Research Database In Technology

Understanding Space in Proof Complexity: Separations and Trade-offs via Substitutions

Author: Ben-Sasson Eli
Nordström Jakob
Publication date: 01/01/2010

For current state-of-the-art DPLL SAT-solvers the two main bottlenecks are the amounts of time and memory used. In proof complexity, these resources correspond to the length and space of resolution proofs. There has been a long line of research investigating these proof complexity measures, but while strong results have been established for length, our understanding of space and how it relates to length has remained quite poor. In particular, the question whether resolution proofs can be optimized for length and space simultaneously, or whether there are trade-offs between these two measures, has remained essentially open. In this paper, we remedy this situation by proving a host of length-space trade-off results for resolution. Our collection of trade-offs covers almost the whole range of values for the space complexity of formulas, and most of the trade-offs are superpolynomial or even exponential and essentially tight. Using similar techniques, we show that these trade-offs in fact extend to the exponentially stronger k-DNF resolution proof systems, which operate with formulas in disjunctive normal form with terms of bounded arity k. We also answer the open question whether the k-DNF resolution systems form a strict hierarchy with respect to space in the affirmative. Our key technical contribution is the following, somewhat surprising, theorem: Any CNF formula F can be transformed by simple variable substitution into a new formula F' such that if F has the right properties, F' can be proven in essentially the same length as F, whereas on the other hand the minimal number of lines one needs to keep in memory simultaneously in any proof of F' is lower-bounded by the minimal number of variables needed simultaneously in any proof of F. Applying this theorem to so-called pebbling formulas defined in terms of pebble games on directed acyclic graphs, we obtain our results. Comment: This paper is a merged and updated version of the two ECCC technical reports TR09-034 and TR09-047, and it hence subsumes these two report
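For readers unfamiliar with the pebbling formulas mentioned above, the following sketch builds the standard pebbling contradiction for a small directed acyclic graph: every source is true, truth propagates from all predecessors of a vertex to that vertex, and the sink is false. The encoding of literals as (vertex, polarity) pairs and the function names are assumptions for illustration. The variable substitution step from the paper is not shown.

```python
from itertools import product


def pebbling_cnf(preds: dict, sink: str) -> list:
    """Clauses of the pebbling formula for a DAG.

    preds maps each vertex to the list of its predecessors (empty for sources).
    A literal is a (vertex, polarity) pair; a clause is a list of literals.
    """
    clauses = []
    for v, ps in preds.items():
        # (not p_1) or ... or (not p_k) or v; for a source this is the unit clause v.
        clauses.append([(p, False) for p in ps] + [(v, True)])
    clauses.append([(sink, False)])  # the sink is asserted false
    return clauses


def is_satisfiable(clauses: list) -> bool:
    """Brute-force satisfiability check, only suitable for tiny formulas."""
    variables = sorted({v for clause in clauses for v, _ in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(any(assignment[v] == pol for v, pol in clause) for clause in clauses):
            return True
    return False


# Smallest pyramid: sources a and b, both feeding the sink c.
pyramid = {"a": [], "b": [], "c": ["a", "b"]}
cnf = pebbling_cnf(pyramid, "c")
```

The resulting four clauses are unsatisfiable together, while dropping the final clause that negates the sink makes the formula satisfiable.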

arXiv.org e-Print Archive

CiteSeerX

Copenhagen University Research Information System