185,074 research outputs found
Longest Common Extensions in Sublinear Space
The longest common extension problem (LCE problem) is to construct a data
structure for an input string of length that supports LCE
queries. Such a query returns the length of the longest common prefix of the
suffixes starting at positions and in . This classic problem has a
well-known solution that uses space and query time. In this paper
we show that for any trade-off parameter , the problem can
be solved in space and query time. This
significantly improves the previously best known time-space trade-offs, and
almost matches the best known time-space product lower bound.Comment: An extended abstract of this paper has been accepted to CPM 201
Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes
Let S be a finite, ordered alphabet, and let x = x_1 x_2 ... x_n be a string
over S. A "secondary index" for x answers alphabet range queries of the form:
Given a range [a_l,a_r] over S, return the set I_{[a_l;a_r]} = {i |x_i \in
[a_l; a_r]}. Secondary indexes are heavily used in relational databases and
scientific data analysis. It is well-known that the obvious solution, storing a
dictionary for the position set associated with each character, does not always
give optimal query time. In this paper we give the first theoretically optimal
data structure for the secondary indexing problem. In the I/O model, the amount
of data read when answering a query is within a constant factor of the minimum
space needed to represent I_{[a_l;a_r]}, assuming that the size of internal
memory is (|S| log n)^{delta} blocks, for some constant delta > 0. The space
usage of the data structure is O(n log |S|) bits in the worst case, and we
further show how to bound the size of the data structure in terms of the 0-th
order entropy of x. We show how to support updates achieving various time-space
trade-offs.
We also consider an approximate version of the basic secondary indexing
problem where a query reports a superset of I_{[a_l;a_r]} containing each
element not in I_{[a_l;a_r]} with probability at most epsilon, where epsilon >
0 is the false positive probability. For this problem the amount of data that
needs to be read by the query algorithm is reduced to O(|I_{[a_l;a_r]}|
log(1/epsilon)) bits.Comment: 16 page
Exploring Superpage Promotion Policies for Efficient Address Translation
Address translation performance for modern applications depends heavily upon the number of translation entries cached in the hardware TLB (translation look-aside buffer). Therefore, the efficiency of address translation relies directly on the TLB hit rate. The number of TLB entries continues to fall further behind the growth of memory consumption for modern applications. Superpages, which are pages with larger sizes, can increase the efficiency of the TLB by enabling each translation entry to cover a larger memory region. Without requiring more TLB entries, using superpages can increase the TLB hit rate and benefit address translation. However, using superpages can bring overhead. The TLB uses a single dirty bit to mark a page as dirty during address translation before modifying the page, so the granularity of the dirty bit corresponds to the coverage of the translation entry. As a result, the OS (operating system) will pay extra I/O effort when it allocates or writes an underutilized superpage back to disk. Such extra overhead can easily surpass the address translation benefits of superpages. This thesis discusses the performance trade-offs of superpages by exploring the design space of superpage promotion policies in the OS. A data collection infrastructure is built based on QEMU with kernel instrumentation on FreeBSD to collaboratively collect both memory accesses and kernel events. Then, the TLB behavior of Intel Skylake x86 family processors is simulated. The simulation has been validated to be faithful and consistent with the real-world performance. Last, this thesis evaluates and compares both TLB performance benefits and I/O overheads among the superpage promotion policies to discuss the trade-offs in the design space
Time-Space Trade-Offs for Lempel-Ziv Compressed Indexing
Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. Let n, and z denote the size of the input string, and the compressed LZ77 string, respectively. We obtain the following time-space trade-offs. Given a pattern string P of length m, we can solve the problem in
(i) O(m + occ lglg n) time using O(z lg(n/z) lglg z) space, or
(ii) O(m(1 + lg^e z / lg(n/z)) + occ(lglg n + lg^e z)) time using O(z lg(n/z)) space, for any 0 < e < 1
In particular, (i) improves the leading term in the query time of the previous best solution from O(m lg m) to O(m) at the cost of increasing the space by a factor lglg z. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1+lg^e z / lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n^{1-d}), for constant d > 0, this becomes O(m). Our index also supports extraction of any substring of length l in O(l + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search
String Indexing for Top- Close Consecutive Occurrences
The classic string indexing problem is to preprocess a string into a
compact data structure that supports efficient subsequent pattern matching
queries, that is, given a pattern string , report all occurrences of
within . In this paper, we study a basic and natural extension of string
indexing called the string indexing for top- close consecutive occurrences
problem (SITCCO). Here, a consecutive occurrence is a pair , ,
such that occurs at positions and in and there is no occurrence
of between and , and their distance is defined as . Given a
pattern and a parameter , the goal is to report the top- consecutive
occurrences of in of minimal distance. The challenge is to compactly
represent while supporting queries in time close to length of and .
We give two time-space trade-offs for the problem. Let be the length of
, the length of , and . Our first result achieves
space and optimal query time of , and our second result
achieves linear space and query time . Along the way, we
develop several techniques of independent interest, including a new translation
of the problem into a line segment intersection problem and a new recursive
clustering technique for trees.Comment: Fixed typos, minor change
String Indexing for Top-k Close Consecutive Occurrences
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string P, report all occurrences of P within S. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-k close consecutive occurrences problem (Sitcco). Here, a consecutive occurrence is a pair (i,j), i < j, such that P occurs at positions i and j in S and there is no occurrence of P between i and j, and their distance is defined as j-i. Given a pattern P and a parameter k, the goal is to report the top-k consecutive occurrences of P in S of minimal distance. The challenge is to compactly represent S while supporting queries in time close to the length of P and k. We give two time-space trade-offs for the problem. Let n be the length of S, m the length of P, and ? ? (0,1]. Our first result achieves O(nlog n) space and optimal query time of O(m+k), and our second result achieves linear space and query time O(m+k^{1+?}). Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees
Time-space trade-offs for lempel-ziv compressed indexing
Given a string , the \emph{compressed indexing problem} is to preprocess
into a compressed representation that supports fast \emph{substring
queries}. The goal is to use little space relative to the compressed size of
while supporting fast queries. We present a compressed index based on the
Lempel--Ziv 1977 compression scheme. We obtain the following time-space
trade-offs: For constant-sized alphabets; (i) time using
space, or (ii) time using space. For integer
alphabets polynomially bounded by ; (iii) time using space, or (iv) time using
space, where and are the length of
the input string and query string respectively, is the number of phrases in
the LZ77 parse of the input string, is the number of occurrences of the
query in the input and is an arbitrarily small constant. In
particular, (i) improves the leading term in the query time of the previous
best solution from to at the cost of increasing the space by
a factor . Alternatively, (ii) matches the previous best space
bound, but has a leading term in the query time of . However, for any polynomial compression ratio, i.e., , for constant , this becomes . Our index
also supports extraction of any substring of length in time. Technically, our results are obtained by novel extensions and
combinations of existing data structures of independent interest, including a
new batched variant of weak prefix search
Understanding Space in Proof Complexity: Separations and Trade-offs via Substitutions
For current state-of-the-art DPLL SAT-solvers the two main bottlenecks are
the amounts of time and memory used. In proof complexity, these resources
correspond to the length and space of resolution proofs. There has been a long
line of research investigating these proof complexity measures, but while
strong results have been established for length, our understanding of space and
how it relates to length has remained quite poor. In particular, the question
whether resolution proofs can be optimized for length and space simultaneously,
or whether there are trade-offs between these two measures, has remained
essentially open.
In this paper, we remedy this situation by proving a host of length-space
trade-off results for resolution. Our collection of trade-offs cover almost the
whole range of values for the space complexity of formulas, and most of the
trade-offs are superpolynomial or even exponential and essentially tight. Using
similar techniques, we show that these trade-offs in fact extend to the
exponentially stronger k-DNF resolution proof systems, which operate with
formulas in disjunctive normal form with terms of bounded arity k. We also
answer the open question whether the k-DNF resolution systems form a strict
hierarchy with respect to space in the affirmative.
Our key technical contribution is the following, somewhat surprising,
theorem: Any CNF formula F can be transformed by simple variable substitution
into a new formula F' such that if F has the right properties, F' can be proven
in essentially the same length as F, whereas on the other hand the minimal
number of lines one needs to keep in memory simultaneously in any proof of F'
is lower-bounded by the minimal number of variables needed simultaneously in
any proof of F. Applying this theorem to so-called pebbling formulas defined in
terms of pebble games on directed acyclic graphs, we obtain our results.Comment: This paper is a merged and updated version of the two ECCC technical
reports TR09-034 and TR09-047, and it hence subsumes these two report
- …