4,077 research outputs found

    Streaming Partitioning of Sequences and Trees

    Get PDF
    We study streaming algorithms for partitioning integer sequences and trees. In the case of trees, we suppose that the input tree is provided by a stream consisting of a depth-first-traversal of the input tree. This captures the problem of partitioning XML streams, among other problems. We show that both problems admit deterministic (1+epsilon)-approximation streaming algorithms, where a single pass is sufficient for integer sequences and two passes are required for trees. The space complexity for partitioning integer sequences is O((1/epsilon) * p * log(nm)) and for partitioning trees is O((1/epsilon) * p^2 * log(nm)), where n is the length of the input stream, m is the maximal weight of an element in the stream, and p is the number of partitions to be created. Furthermore, for the problem of partitioning integer sequences, we show that computing an optimal solution in one pass requires Omega(n) space, and computing a (1+epsilon)-approximation in one pass requires Omega((1/epsilon) * log(n)) space, rendering our algorithm tight for instances with p,m in O(1)

    Handling Massive N-Gram Datasets Efficiently

    Get PDF
    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, that have emerged as the de-facto choice for language modeling in both academia and industry, thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step thanks to exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach.Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No: 2

    The Streaming k-Mismatch Problem: Tradeoffs Between Space and Total Time

    Get PDF
    We revisit the kk-mismatch problem in the streaming model on a pattern of length mm and a streaming text of length nn, both over a size-σ\sigma alphabet. The current state-of-the-art algorithm for the streaming kk-mismatch problem, by Clifford et al. [SODA 2019], uses O~(k)\tilde O(k) space and O~(k)\tilde O\big(\sqrt k\big) worst-case time per character. The space complexity is known to be (unconditionally) optimal, and the worst-case time per character matches a conditional lower bound. However, there is a gap between the total time cost of the algorithm, which is O~(nk)\tilde O(n\sqrt k), and the fastest known offline algorithm, which costs O~(n+min(nkm,σn))\tilde O\big(n + \min\big(\frac{nk}{\sqrt m},\sigma n\big)\big) time. Moreover, it is not known whether improvements over the O~(nk)\tilde O(n\sqrt k) total time are possible when using more than O(k)O(k) space. We address these gaps by designing a randomized streaming algorithm for the kk-mismatch problem that, given an integer parameter ksmk\le s \le m, uses O~(s)\tilde O(s) space and costs O~(n+min(nk2m,nks,σnms))\tilde O\big(n+\min\big(\frac {nk^2}m,\frac{nk}{\sqrt s},\frac{\sigma nm}s\big)\big) total time. For s=ms=m, the total runtime becomes O~(n+min(nkm,σn))\tilde O\big(n + \min\big(\frac{nk}{\sqrt m},\sigma n\big)\big), which matches the time cost of the fastest offline algorithm. Moreover, the worst-case time cost per character is still O~(k)\tilde O\big(\sqrt k\big).Comment: Extended abstract to appear in CPM 202

    MonetDB/XQuery: a fast XQuery processor powered by a relational engine

    Get PDF
    Relational XQuery systems try to re-use mature relational data management infrastructures to create fast and scalable XML database technology. This paper describes the main features, key contributions, and lessons learned while implementing such a system. Its architecture consists of (i) a range-based encoding of XML documents into relational tables, (ii) a compilation technique that translates XQuery into a basic relational algebra, (iii) a restricted (order) property-aware peephole relational query optimization strategy, and (iv) a mapping from XML update statements into relational updates. Thus, this system implements all essential XML database functionalities (rather than a single feature) such that we can learn from the full consequences of our architectural decisions. While implementing this system, we had to extend the state-of-the-art with a number of new technical contributions, such as loop-lifted staircase join and efficient relational query evaluation strategies for XQuery theta-joins with existential semantics. These contributions as well as the architectural lessons learned are also deemed valuable for other relational back-end engines. The performance and scalability of the resulting system is evaluated on the XMark benchmark up to data sizes of 11GB. The performance section also provides an extensive benchmark comparison of all major XMark results published previously, which confirm that the goal of purely relational XQuery processing, namely speed and scalability, was met
    corecore