Search CORE

4,077 research outputs found

Streaming Partitioning of Sequences and Trees

Author: Konrad Christian
Publication venue
Publication date: 01/01/2016
Field of study

We study streaming algorithms for partitioning integer sequences and trees. In the case of trees, we suppose that the input tree is provided by a stream consisting of a depth-first-traversal of the input tree. This captures the problem of partitioning XML streams, among other problems. We show that both problems admit deterministic (1+epsilon)-approximation streaming algorithms, where a single pass is sufficient for integer sequences and two passes are required for trees. The space complexity for partitioning integer sequences is O((1/epsilon) * p * log(nm)) and for partitioning trees is O((1/epsilon) * p^2 * log(nm)), where n is the length of the input stream, m is the maximal weight of an element in the stream, and p is the number of partitions to be created. Furthermore, for the problem of partitioning integer sequences, we show that computing an optimal solution in one pass requires Omega(n) space, and computing a (1+epsilon)-approximation in one pass requires Omega((1/epsilon) * log(n)) space, rendering our algorithm tight for instances with p,m in O(1)

Dagstuhl Research Online Publication Server

Explore Bristol Research

Handling Massive N-Gram Datasets Efficiently

Author: Pibiri Giulio Ermanno
Venturini Rossano
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 25/06/2018
Field of study

This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, that have emerged as the de-facto choice for language modeling in both academia and industry, thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step thanks to exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach.Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No: 2

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

The Streaming k-Mismatch Problem: Tradeoffs Between Space and Total Time

Author: Golan Shay
Kociumaka Tomasz
Kopelowitz Tsvi
Porat Ely
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)
Publication date: 01/01/2020
Field of study

We revisit the

k

-mismatch problem in the streaming model on a pattern of length

m

and a streaming text of length

n

, both over a size-

\sigma

alphabet. The current state-of-the-art algorithm for the streaming

k

-mismatch problem, by Clifford et al. [SODA 2019], uses

\tilde O(k)

space and

\tilde O\big(\sqrt k\big)

worst-case time per character. The space complexity is known to be (unconditionally) optimal, and the worst-case time per character matches a conditional lower bound. However, there is a gap between the total time cost of the algorithm, which is

\tilde O(n\sqrt k)

, and the fastest known offline algorithm, which costs

\tilde O\big(n + \min\big(\frac{nk}{\sqrt m},\sigma n\big)\big)

time. Moreover, it is not known whether improvements over the

\tilde O(n\sqrt k)

total time are possible when using more than

O(k)

space. We address these gaps by designing a randomized streaming algorithm for the

k

-mismatch problem that, given an integer parameter

k\le s \le m

, uses

\tilde O(s)

space and costs

\tilde O\big(n+\min\big(\frac {nk^2}m,\frac{nk}{\sqrt s},\frac{\sigma nm}s\big)\big)

total time. For

s=m

, the total runtime becomes

\tilde O\big(n + \min\big(\frac{nk}{\sqrt m},\sigma n\big)\big)

, which matches the time cost of the fastest offline algorithm. Moreover, the worst-case time cost per character is still

\tilde O\big(\sqrt k\big)

.Comment: Extended abstract to appear in CPM 202

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

MonetDB/XQuery: a fast XQuery processor powered by a relational engine

Author: Boncz P.
Grust T.
Keulen M. van
Manegold S.
Rittinger J.
Teubner J.
Publication venue: ACM Press
Publication date: 01/01/2006
Field of study

Relational XQuery systems try to re-use mature relational data management infrastructures to create fast and scalable XML database technology. This paper describes the main features, key contributions, and lessons learned while implementing such a system. Its architecture consists of (i) a range-based encoding of XML documents into relational tables, (ii) a compilation technique that translates XQuery into a basic relational algebra, (iii) a restricted (order) property-aware peephole relational query optimization strategy, and (iv) a mapping from XML update statements into relational updates. Thus, this system implements all essential XML database functionalities (rather than a single feature) such that we can learn from the full consequences of our architectural decisions. While implementing this system, we had to extend the state-of-the-art with a number of new technical contributions, such as loop-lifted staircase join and efficient relational query evaluation strategies for XQuery theta-joins with existential semantics. These contributions as well as the architectural lessons learned are also deemed valuable for other relational back-end engines. The performance and scalability of the resulting system is evaluated on the XMark benchmark up to data sizes of 11GB. The performance section also provides an extensive benchmark comparison of all major XMark results published previously, which confirm that the goal of purely relational XQuery processing, namely speed and scalability, was met

CiteSeerX

Crossref

CWI's Institutional Repository

University of Twente Research Information