3,210 research outputs found
Dynamic Data Structures for Document Collections and Graphs
In the dynamic indexing problem, we must maintain a changing collection of
text documents so that we can efficiently support insertions, deletions, and
pattern matching queries. We are especially interested in developing efficient
data structures that store and query the documents in compressed form. All
previous compressed solutions to this problem rely on answering rank and select
queries on a dynamic sequence of symbols. Because of the lower bound in
[Fredman and Saks, 1989], answering rank queries presents a bottleneck in
compressed dynamic indexing. In this paper we show how this lower bound can be
circumvented using our new framework. We demonstrate that the gap between
static and dynamic variants of the indexing problem can be almost closed. Our
method is based on a novel framework for adding dynamism to static compressed
data structures. Our framework also applies more generally to dynamizing other
problems. We show, for example, how our framework can be applied to develop
compressed representations of dynamic graphs and binary relations
Consistency of Feature Markov Processes
We are studying long term sequence prediction (forecasting). We approach this
by investigating criteria for choosing a compact useful state representation.
The state is supposed to summarize useful information from the history. We want
a method that is asymptotically consistent in the sense it will provably
eventually only choose between alternatives that satisfy an optimality property
related to the used criterion. We extend our work to the case where there is
side information that one can take advantage of and, furthermore, we briefly
discuss the active setting where an agent takes actions to achieve desirable
outcomes.Comment: 16 LaTeX page
Handling Massive N-Gram Datasets Efficiently
This paper deals with the two fundamental problems concerning the handling of
large n-gram language models: indexing, that is compressing the n-gram strings
and associated satellite data without compromising their retrieval speed; and
estimation, that is computing the probability distribution of the strings from
a large textual source. Regarding the problem of indexing, we describe
compressed, exact and lossless data structures that achieve, at the same time,
high space reductions and no time degradation with respect to state-of-the-art
solutions and related software packages. In particular, we present a compressed
trie data structure in which each word following a context of fixed length k,
i.e., its preceding k words, is encoded as an integer whose value is
proportional to the number of words that follow such context. Since the number
of words following a given context is typically very small in natural
languages, we lower the space of representation to compression levels that were
never achieved before. Despite the significant savings in space, our technique
introduces a negligible penalty at query time. Regarding the problem of
estimation, we present a novel algorithm for estimating modified Kneser-Ney
language models, that have emerged as the de-facto choice for language modeling
in both academia and industry, thanks to their relatively low perplexity
performance. Estimating such models from large textual sources poses the
challenge of devising algorithms that make a parsimonious use of the disk. The
state-of-the-art algorithm uses three sorting steps in external memory: we show
an improved construction that requires only one sorting step thanks to
exploiting the properties of the extracted n-gram strings. With an extensive
experimental analysis performed on billions of n-grams, we show an average
improvement of 4.5X on the total running time of the state-of-the-art approach.Comment: Published in ACM Transactions on Information Systems (TOIS), February
2019, Article No: 2
Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees
Efficient methods for storing and querying are critical for scaling
high-order n-gram language models to large corpora. We propose a language model
based on compressed suffix trees, a representation that is highly compact and
can be easily held in memory, while supporting queries needed in computing
language model probabilities on-the-fly. We present several optimisations which
improve query runtimes up to 2500x, despite only incurring a modest increase in
construction time and memory usage. For large corpora and high Markov orders,
our method is highly competitive with the state-of-the-art KenLM package. It
imposes much lower memory requirements, often by orders of magnitude, and has
runtimes that are either similar (for training) or comparable (for querying).Comment: 14 pages in Transactions of the Association for Computational
Linguistics (TACL) 201
Linear Compressed Pattern Matching for Polynomial Rewriting (Extended Abstract)
This paper is an extended abstract of an analysis of term rewriting where the
terms in the rewrite rules as well as the term to be rewritten are compressed
by a singleton tree grammar (STG). This form of compression is more general
than node sharing or representing terms as dags since also partial trees
(contexts) can be shared in the compression. In the first part efficient but
complex algorithms for detecting applicability of a rewrite rule under
STG-compression are constructed and analyzed. The second part applies these
results to term rewriting sequences.
The main result for submatching is that finding a redex of a left-linear rule
can be performed in polynomial time under STG-compression.
The main implications for rewriting and (single-position or parallel)
rewriting steps are: (i) under STG-compression, n rewriting steps can be
performed in nondeterministic polynomial time. (ii) under STG-compression and
for left-linear rewrite rules a sequence of n rewriting steps can be performed
in polynomial time, and (iii) for compressed rewrite rules where the left hand
sides are either DAG-compressed or ground and STG-compressed, and an
STG-compressed target term, n rewriting steps can be performed in polynomial
time.Comment: In Proceedings TERMGRAPH 2013, arXiv:1302.599
- …