Consistency of Feature Markov Processes
We study long-term sequence prediction (forecasting). We approach this
by investigating criteria for choosing a compact, useful state representation.
The state is supposed to summarize useful information from the history. We want
a method that is asymptotically consistent in the sense that it will provably,
eventually, choose only among alternatives that satisfy an optimality property
related to the criterion used. We extend our work to the case where there is
side information that one can take advantage of and, furthermore, we briefly
discuss the active setting where an agent takes actions to achieve desirable
outcomes. Comment: 16 LaTeX pages
Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences
We present a new method for nonlinear prediction of discrete random sequences
under minimal structural assumptions. We give a mathematical construction for
optimal predictors of such processes, in the form of hidden Markov models. We
then describe an algorithm, CSSR (Causal-State Splitting Reconstruction), which
approximates the ideal predictor from data. We discuss the reliability of CSSR,
its data requirements, and its performance in simulations. Finally, we compare
our approach to existing methods using variable-length Markov models and
cross-validated hidden Markov models, and show theoretically and experimentally
that our method delivers results superior to the former and at least comparable
to the latter. Comment: 8 pages, 4 figures
Malleable coding for updatable cloud caching
In software-as-a-service applications provisioned through cloud computing, locally cached data are often modified with updates from new versions. In some cases, with each edit, one may want to preserve both the original and new versions. In this paper, we focus on cases in which only the latest version must be preserved. Furthermore, it is desirable for the data not only to be compressed but also to be easily modified during updates, since representing information and modifying the representation both incur cost. We examine whether it is possible to have both compression efficiency and ease of alteration, in order to promote codeword reuse. In other words, we study the feasibility of a malleable and efficient coding scheme. The tradeoff between compression efficiency and malleability cost (the difficulty of synchronizing compressed versions) is measured as the length of a reused prefix portion. The region of achievable rates and malleability is found. Drawing from prior work on common information problems, we show that efficient data compression may not be the best engineering design principle when storing software-as-a-service data. In the general case, goals of efficiency and malleability are fundamentally in conflict. This work was supported in part by an NSF Graduate Research Fellowship (LRV), Grant CCR-0325774, and Grant CCF-0729069. This work was presented at the 2011 IEEE International Symposium on Information Theory [1] and the 2014 IEEE International Conference on Cloud Engineering [2]. The associate editor coordinating the review of this paper and approving it for publication was R. Thobaben. Accepted manuscript
Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction
We present a Markov part-of-speech tagger for which the P(w|t) emission probabilities of word w given tag t are replaced by a linear interpolation of tag emission probabilities given a list of representations of w. As word representations, string suffixes of w are cut off at the local maxima of the Normalized Backward Successor Variety. This procedure allows for the derivation of linguistically meaningful string suffixes that may relate to certain POS labels. Since no linguistic knowledge is needed, the procedure is language independent. Basic Markov model part-of-speech taggers are significantly outperformed by our model.
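The interpolation idea in this abstract can be illustrated with a minimal Python sketch. It assumes the cut positions are already known and does not reproduce the Normalized Backward Successor Variety computation. The function names, the "-" marker for suffix representations, the fixed weights and the toy corpus are all hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def suffixes_at(word, cuts):
    """Return the full word plus one suffix per valid cut index.

    Suffix representations are prefixed with "-" (an assumption of this
    sketch) so that the suffix "-ing" is not confused with a word "ing".
    """
    reps = [word]
    reps += ["-" + word[i:] for i in cuts if 0 < i < len(word)]
    return reps


def train_emissions(tagged_words, cut_fn):
    """Count, per tag, how often each representation occurs."""
    counts = defaultdict(Counter)   # tag -> representation -> count
    tag_totals = Counter()          # tag -> number of tagged tokens
    for word, tag in tagged_words:
        tag_totals[tag] += 1
        for rep in cut_fn(word):
            counts[tag][rep] += 1
    return counts, tag_totals


def interpolated_emission(word, tag, counts, tag_totals, cut_fn, weights):
    """Linear interpolation of relative frequencies over representations.

    `weights` must be non-empty and positive. If the word yields fewer
    representations than there are weights, the weights actually used
    are renormalised so they sum to one.
    """
    total = tag_totals[tag]
    if total == 0:
        return 0.0
    pairs = list(zip(weights, cut_fn(word)))
    norm = sum(lam for lam, _ in pairs)
    return sum(lam * counts[tag][rep] / total for lam, rep in pairs) / norm


def cut_last3(word):
    # Hypothetical cut rule: always split off the last three characters.
    return suffixes_at(word, [len(word) - 3])


corpus = [("walking", "VBG"), ("talking", "VBG"), ("king", "NN")]
counts, tag_totals = train_emissions(corpus, cut_last3)
p_unseen = interpolated_emission(
    "singing", "VBG", counts, tag_totals, cut_last3, [0.5, 0.5]
)
```

In this toy setup the unseen word "singing" still receives a non-zero score for the tag VBG, because it shares the "-ing" representation with the training words.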
Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees
Efficient methods for storing and querying are critical for scaling
high-order n-gram language models to large corpora. We propose a language model
based on compressed suffix trees, a representation that is highly compact and
can be easily held in memory, while supporting queries needed in computing
language model probabilities on-the-fly. We present several optimisations which
improve query runtimes up to 2500x, despite only incurring a modest increase in
construction time and memory usage. For large corpora and high Markov orders,
our method is highly competitive with the state-of-the-art KenLM package. It
imposes much lower memory requirements, often by orders of magnitude, and has
runtimes that are either similar (for training) or comparable (for querying). Comment: 14 pages in Transactions of the Association for Computational
Linguistics (TACL) 201
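To make the idea of computing language model probabilities on the fly from a suffix index concrete, here is a small Python sketch. It swaps in a plain, uncompressed suffix array with binary search, not the compressed suffix tree of the paper, and it computes unsmoothed relative frequencies only. All function names and the toy token sequence are hypothetical.

```python
def build_suffix_array(tokens):
    """Sort suffix start positions lexicographically (naive construction)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, pattern, upper):
    """Binary search for the first suffix whose length-m prefix is
    >= pattern (upper=False) or > pattern (upper=True)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count(tokens, sa, pattern):
    """Number of occurrences of `pattern` (a list of tokens) in `tokens`."""
    return _bound(tokens, sa, pattern, True) - _bound(tokens, sa, pattern, False)


def mle_probability(tokens, sa, context, word):
    """Unsmoothed estimate count(context + word) / count(context).

    The context may be arbitrarily long, which is what makes an
    unbounded-order query possible without precomputed n-gram tables.
    """
    denom = count(tokens, sa, context)
    if denom == 0:
        return 0.0
    return count(tokens, sa, context + [word]) / denom


tokens = "a b a b a c".split()
sa = build_suffix_array(tokens)
p_c_given_a = mle_probability(tokens, sa, ["a"], "c")
```

A context that occurs only at the very end of the text is counted in the denominator even though nothing follows it; a real language model would also need smoothing, which this sketch omits.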
Variable length Markov chains and dynamical sources
Infinite random sequences of letters can be viewed as stochastic chains or as
strings produced by a source, in the sense of information theory. The
relationship between Variable Length Markov Chains (VLMC) and probabilistic
dynamical sources is studied. We establish a probabilistic frame for context
trees and VLMC and we prove that any VLMC is a dynamical source for which we
explicitly build the mapping. On two examples, the "comb" and the "bamboo
blossom", we find a necessary and sufficient condition for the existence and
the uniqueness of a stationary probability measure for the VLMC. These two
examples are detailed in order to provide the associated Dirichlet series as
well as the generating functions of word occurrences. Comment: 45 pages, 15 figures
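As a rough illustration of what a variable length Markov chain is, the following Python sketch draws the next letter from a distribution selected by the longest suffix of the past that appears in a context set. The small finite context set, its probabilities and the function names are hypothetical; the sketch does not reproduce the paper's "comb" or "bamboo blossom" examples or its dynamical-source construction.

```python
import random

# Hypothetical context set over the alphabet {"0", "1"}. Keys are written
# in reading order and matched against the END of the history. The empty
# context is a fallback for histories too short to match anything else.
CONTEXTS = {
    "":   {"0": 0.5, "1": 0.5},
    "1":  {"0": 0.7, "1": 0.3},
    "10": {"0": 0.2, "1": 0.8},
    "00": {"0": 0.9, "1": 0.1},
}


def longest_context(history, contexts):
    """Return the longest suffix of `history` that is a known context."""
    max_len = max(len(c) for c in contexts)
    for k in range(min(max_len, len(history)), -1, -1):
        suffix = history[len(history) - k:]
        if suffix in contexts:
            return suffix
    raise KeyError("no matching context; include the empty context as fallback")


def sample(n, contexts, seed=0):
    """Generate n letters, each conditioned on a variable-length context."""
    rng = random.Random(seed)
    history = ""
    for _ in range(n):
        dist = contexts[longest_context(history, contexts)]
        symbols = list(dist)
        weights = [dist[s] for s in symbols]
        history += rng.choices(symbols, weights=weights)[0]
    return history


sequence = sample(30, CONTEXTS)
```

Here a history ending in "1" needs only one letter of memory, while a history ending in "0" needs two; that uneven memory length is what distinguishes this from a fixed-order chain.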