
    Longest Common Extensions in Sublinear Space

    The longest common extension problem (LCE problem) is to construct a data structure for an input string $T$ of length $n$ that supports LCE$(i,j)$ queries. Such a query returns the length of the longest common prefix of the suffixes starting at positions $i$ and $j$ in $T$. This classic problem has a well-known solution that uses $O(n)$ space and $O(1)$ query time. In this paper we show that for any trade-off parameter $1 \leq \tau \leq n$, the problem can be solved in $O(\frac{n}{\tau})$ space and $O(\tau)$ query time. This significantly improves the previously best known time-space trade-offs, and almost matches the best known time-space product lower bound. Comment: An extended abstract of this paper has been accepted to CPM 201
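
    As a point of reference for the query defined above, here is a minimal sketch of the naive approach: it stores nothing beyond the string itself and scans both suffixes character by character. It is not the data structure from the paper; the function name `lce` and the 0-based positions are assumptions made for illustration.

```python
def lce(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:].

    Naive baseline: no preprocessing, worst-case time linear in len(text).
    Positions are 0-based (an assumption of this sketch).
    """
    n = len(text)
    length = 0
    # Extend the match while both positions stay inside the string
    # and the characters agree.
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length
```

    For example, `lce("abracadabra", 0, 7)` compares the suffixes `abracadabra` and `abra` and returns 4.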

    Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes

    Let S be a finite, ordered alphabet, and let x = x_1 x_2 ... x_n be a string over S. A "secondary index" for x answers alphabet range queries of the form: Given a range [a_l;a_r] over S, return the set I_{[a_l;a_r]} = {i | x_i \in [a_l;a_r]}. Secondary indexes are heavily used in relational databases and scientific data analysis. It is well-known that the obvious solution, storing a dictionary for the position set associated with each character, does not always give optimal query time. In this paper we give the first theoretically optimal data structure for the secondary indexing problem. In the I/O model, the amount of data read when answering a query is within a constant factor of the minimum space needed to represent I_{[a_l;a_r]}, assuming that the size of internal memory is (|S| log n)^{delta} blocks, for some constant delta > 0. The space usage of the data structure is O(n log |S|) bits in the worst case, and we further show how to bound the size of the data structure in terms of the 0-th order entropy of x. We show how to support updates achieving various time-space trade-offs. We also consider an approximate version of the basic secondary indexing problem where a query reports a superset of I_{[a_l;a_r]} containing each element not in I_{[a_l;a_r]} with probability at most epsilon, where epsilon > 0 is the false positive probability. For this problem the amount of data that needs to be read by the query algorithm is reduced to O(|I_{[a_l;a_r]}| log(1/epsilon)) bits. Comment: 16 pages
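
    The "obvious solution" mentioned in the abstract can be sketched as follows: one position list per character, with a range query concatenating the lists of all characters in the range. This is only an illustration of that baseline, not the optimal structure from the paper; the names `build_index` and `range_query` are hypothetical, and positions are 1-based to match x_1 ... x_n.

```python
from collections import defaultdict


def build_index(x: str) -> dict[str, list[int]]:
    """Map each character to the sorted list of 1-based positions where it occurs."""
    positions = defaultdict(list)
    for i, ch in enumerate(x, start=1):
        positions[ch].append(i)
    return dict(positions)


def range_query(index: dict[str, list[int]], a_l: str, a_r: str) -> list[int]:
    """Return all positions i with a_l <= x_i <= a_r, in increasing order."""
    result = []
    # Visit every distinct character; a real index would avoid this full scan.
    for ch in sorted(index):
        if a_l <= ch <= a_r:
            result.extend(index[ch])
    return sorted(result)
```

    With `index = build_index("banana")`, the call `range_query(index, "a", "b")` returns `[1, 2, 4, 6]`. The query cost here depends on the number of distinct characters visited and on the final sort, which is one way to see why this baseline is not always optimal.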

    Exploring Superpage Promotion Policies for Efficient Address Translation

    Address translation performance for modern applications depends heavily upon the number of translation entries cached in the hardware TLB (translation look-aside buffer). Therefore, the efficiency of address translation relies directly on the TLB hit rate. The number of TLB entries continues to fall further behind the growth of memory consumption for modern applications. Superpages, which are pages with larger sizes, can increase the efficiency of the TLB by enabling each translation entry to cover a larger memory region. Without requiring more TLB entries, using superpages can increase the TLB hit rate and benefit address translation. However, using superpages can bring overhead. The TLB uses a single dirty bit to mark a page as dirty during address translation before modifying the page, so the granularity of the dirty bit corresponds to the coverage of the translation entry. As a result, the OS (operating system) will pay extra I/O effort when it allocates or writes an underutilized superpage back to disk. Such extra overhead can easily surpass the address translation benefits of superpages. This thesis discusses the performance trade-offs of superpages by exploring the design space of superpage promotion policies in the OS. A data collection infrastructure is built based on QEMU with kernel instrumentation on FreeBSD to collaboratively collect both memory accesses and kernel events. Then, the TLB behavior of Intel Skylake x86 family processors is simulated. The simulation has been validated to be faithful and consistent with the real-world performance. Last, this thesis evaluates and compares both TLB performance benefits and I/O overheads among the superpage promotion policies to discuss the trade-offs in the design space.
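
    The trade-off described above can be made concrete with a little arithmetic. The sketch below assumes 4 KiB base pages and 2 MiB superpages (common x86-64 sizes, not stated in the abstract) and a hypothetical TLB with 64 entries; the function names are illustrative only.

```python
BASE_PAGE = 4 * 1024          # assumed base page size: 4 KiB
SUPERPAGE = 2 * 1024 * 1024   # assumed superpage size: 2 MiB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory covered when every TLB entry maps one page of page_size."""
    return entries * page_size


def writeback_bytes(dirty_base_pages: int, use_superpage: bool) -> int:
    """Bytes written back when dirty_base_pages base-page-sized chunks were modified.

    With one dirty bit per translation entry, a dirty superpage must be written
    back in full, regardless of how little of it was actually modified.
    """
    if dirty_base_pages == 0:
        return 0
    if use_superpage:
        return SUPERPAGE
    return dirty_base_pages * BASE_PAGE
```

    Under these assumptions, 64 entries cover 256 KiB with base pages and 128 MiB with superpages, while modifying a single 4 KiB chunk costs a 2 MiB write-back under a superpage mapping, a 512-fold increase in I/O.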

    Time-Space Trade-Offs for Lempel-Ziv Compressed Indexing

    Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. Let n and z denote the size of the input string and of the compressed LZ77 string, respectively. We obtain the following time-space trade-offs. Given a pattern string P of length m, we can solve the problem in (i) O(m + occ lglg n) time using O(z lg(n/z) lglg z) space, or (ii) O(m(1 + lg^e z / lg(n/z)) + occ(lglg n + lg^e z)) time using O(z lg(n/z)) space, for any 0 < e < 1. In particular, (i) improves the leading term in the query time of the previous best solution from O(m lg m) to O(m) at the cost of increasing the space by a factor lglg z. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1 + lg^e z / lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n^{1-d}), for constant d > 0, this becomes O(m). Our index also supports extraction of any substring of length l in O(l + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.
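
    To give a feel for the parameter z, the following sketch computes a greedy LZ77-style parse by brute force. It assumes one common variant of the definition: each phrase is either a single new character or the longest prefix of the remaining text that also starts at an earlier position, with self-overlap allowed. Definitions of LZ77 differ in detail, so the exact variant used in the paper may not match, and the function name `lz77_phrases` is hypothetical.

```python
def lz77_phrases(s: str) -> list[str]:
    """Greedy LZ77-style parse of s by brute force (quadratic or worse).

    Each phrase is the longest prefix of s[i:] that also occurs starting at
    some earlier position (overlap with the phrase itself is allowed), or a
    single character if no earlier occurrence exists.
    """
    phrases = []
    n = len(s)
    i = 0
    while i < n:
        best = 0
        for start in range(i):
            length = 0
            # start < i, so start + length < i + length < n stays in bounds.
            while i + length < n and s[start + length] == s[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)  # a fresh character forms a phrase of length 1
        phrases.append(s[i:i + step])
        i += step
    return phrases
```

    Here `len(lz77_phrases(s))` plays the role of z. For the highly repetitive string `"abcabcabc"` the parse is `["a", "b", "c", "abcabc"]`, so z = 4 while n = 9.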

    String Indexing for Top-$k$ Close Consecutive Occurrences

    The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i,j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$, and their distance is defined as $j-i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give two time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $\epsilon \in (0,1]$. Our first result achieves $O(n \log n)$ space and optimal query time of $O(m+k)$, and our second result achieves linear space and query time $O(m+k^{1+\epsilon})$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees. Comment: Fixed typos, minor changes
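
    The query semantics can be illustrated with a brute-force sketch that scans the text instead of using an index. It is not the data structure from the paper; the names `find_occurrences` and `top_k_close` are hypothetical, positions are 0-based, and a non-empty pattern is assumed.

```python
import heapq


def find_occurrences(s: str, p: str) -> list[int]:
    """All starting positions of p in s (overlapping matches included), in increasing order."""
    occ = []
    pos = s.find(p)
    while pos != -1:
        occ.append(pos)
        pos = s.find(p, pos + 1)
    return occ


def top_k_close(s: str, p: str, k: int) -> list[tuple[int, int]]:
    """The k consecutive occurrences (i, j) of p in s with the smallest distance j - i."""
    occ = find_occurrences(s, p)
    # Neighbouring entries of the sorted occurrence list are exactly the
    # consecutive occurrences: no occurrence of p lies strictly between them.
    pairs = list(zip(occ, occ[1:]))
    return heapq.nsmallest(k, pairs, key=lambda ij: ij[1] - ij[0])
```

    For `s = "aXaaXXXa"` and `p = "a"`, the occurrences are at positions 0, 2, 3 and 7, so `top_k_close(s, p, 2)` returns `[(2, 3), (0, 2)]`. This scan takes time proportional to the text length per query, which is what an index with query time close to the length of the pattern plus k avoids.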

    String Indexing for Top-k Close Consecutive Occurrences

    The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string P, report all occurrences of P within S. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-k close consecutive occurrences problem (Sitcco). Here, a consecutive occurrence is a pair (i,j), i < j, such that P occurs at positions i and j in S and there is no occurrence of P between i and j, and their distance is defined as j-i. Given a pattern P and a parameter k, the goal is to report the top-k consecutive occurrences of P in S of minimal distance. The challenge is to compactly represent S while supporting queries in time close to the length of P and k. We give two time-space trade-offs for the problem. Let n be the length of S, m the length of P, and ε ∈ (0,1]. Our first result achieves O(n log n) space and optimal query time of O(m+k), and our second result achieves linear space and query time O(m+k^{1+ε}). Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.

    Time-space trade-offs for Lempel-Ziv compressed indexing

    Given a string $S$, the compressed indexing problem is to preprocess $S$ into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of $S$ while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. We obtain the following time-space trade-offs. For constant-sized alphabets: (i) $O(m + occ \lg\lg n)$ time using $O(z\lg(n/z)\lg\lg z)$ space, or (ii) $O(m(1 + \frac{\lg^\epsilon z}{\lg(n/z)}) + occ(\lg\lg n + \lg^\epsilon z))$ time using $O(z\lg(n/z))$ space. For integer alphabets polynomially bounded by $n$: (iii) $O(m(1 + \frac{\lg^\epsilon z}{\lg(n/z)}) + occ(\lg\lg n + \lg^\epsilon z))$ time using $O(z(\lg(n/z) + \lg\lg z))$ space, or (iv) $O(m + occ(\lg\lg n + \lg^{\epsilon} z))$ time using $O(z(\lg(n/z) + \lg^{\epsilon} z))$ space, where $n$ and $m$ are the lengths of the input string and query string respectively, $z$ is the number of phrases in the LZ77 parse of the input string, $occ$ is the number of occurrences of the query in the input, and $\epsilon > 0$ is an arbitrarily small constant. In particular, (i) improves the leading term in the query time of the previous best solution from $O(m\lg m)$ to $O(m)$ at the cost of increasing the space by a factor $\lg\lg z$. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of $O(m(1+\frac{\lg^{\epsilon} z}{\lg(n/z)}))$. However, for any polynomial compression ratio, i.e., $z = O(n^{1-\delta})$, for constant $\delta > 0$, this becomes $O(m)$. Our index also supports extraction of any substring of length $\ell$ in $O(\ell + \lg(n/z))$ time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.

    Understanding Space in Proof Complexity: Separations and Trade-offs via Substitutions

    For current state-of-the-art DPLL SAT-solvers the two main bottlenecks are the amounts of time and memory used. In proof complexity, these resources correspond to the length and space of resolution proofs. There has been a long line of research investigating these proof complexity measures, but while strong results have been established for length, our understanding of space and how it relates to length has remained quite poor. In particular, the question whether resolution proofs can be optimized for length and space simultaneously, or whether there are trade-offs between these two measures, has remained essentially open. In this paper, we remedy this situation by proving a host of length-space trade-off results for resolution. Our collection of trade-offs covers almost the whole range of values for the space complexity of formulas, and most of the trade-offs are superpolynomial or even exponential and essentially tight. Using similar techniques, we show that these trade-offs in fact extend to the exponentially stronger k-DNF resolution proof systems, which operate with formulas in disjunctive normal form with terms of bounded arity k. We also answer in the affirmative the open question whether the k-DNF resolution systems form a strict hierarchy with respect to space. Our key technical contribution is the following, somewhat surprising, theorem: Any CNF formula F can be transformed by simple variable substitution into a new formula F' such that if F has the right properties, F' can be proven in essentially the same length as F, whereas on the other hand the minimal number of lines one needs to keep in memory simultaneously in any proof of F' is lower-bounded by the minimal number of variables needed simultaneously in any proof of F. Applying this theorem to so-called pebbling formulas defined in terms of pebble games on directed acyclic graphs, we obtain our results. Comment: This paper is a merged and updated version of the two ECCC technical reports TR09-034 and TR09-047, and it hence subsumes these two reports