4,725 research outputs found
Universal Compression of Power-Law Distributions
English words and the outputs of many other natural processes are well-known
to follow a Zipf distribution. Yet this thoroughly-established property has
never been shown to help compress or predict these important processes. We show
that the expected redundancy of Zipf distributions of order is
roughly the power of the expected redundancy of unrestricted
distributions. Hence for these orders, Zipf distributions can be better
compressed and predicted than was previously known. Unlike the expected case,
we show that worst-case redundancy is roughly the same for Zipf and for
unrestricted distributions. Hence Zipf distributions have significantly
different worst-case and expected redundancies, making them the first natural
distribution class shown to have such a difference.Comment: 20 page
Structural alphabets derived from attractors in conformational space
Background: The hierarchical and partially redundant nature of protein structures justifies the definition of frequently occurring conformations of short fragments as 'states'. Collections of selected representatives for these states define Structural Alphabets, describing the most typical local conformations within protein structures. These alphabets form a bridge between the string-oriented methods of sequence analysis and the coordinate-oriented methods of protein structure analysis.Results: A Structural Alphabet has been derived by clustering all four-residue fragments of a high-resolution subset of the protein data bank and extracting the high-density states as representative conformational states. Each fragment is uniquely defined by a set of three independent angles corresponding to its degrees of freedom, capturing in simple and intuitive terms the properties of the conformational space. The fragments of the Structural Alphabet are equivalent to the conformational attractors and therefore yield a most informative encoding of proteins. Proteins can be reconstructed within the experimental uncertainty in structure determination and ensembles of structures can be encoded with accuracy and robustness.Conclusions: The density-based Structural Alphabet provides a novel tool to describe local conformations and it is specifically suitable for application in studies of protein dynamics. © 2010 Pandini et al; licensee BioMed Central Ltd
Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet
Various optimality properties of universal sequence predictors based on
Bayes-mixtures in general, and Solomonoff's prediction scheme in particular,
will be studied. The probability of observing at time , given past
observations can be computed with the chain rule if the true
generating distribution of the sequences is known. If
is unknown, but known to belong to a countable or continuous class \M
one can base ones prediction on the Bayes-mixture defined as a
-weighted sum or integral of distributions \nu\in\M. The cumulative
expected loss of the Bayes-optimal universal prediction scheme based on
is shown to be close to the loss of the Bayes-optimal, but infeasible
prediction scheme based on . We show that the bounds are tight and that no
other predictor can lead to significantly smaller bounds. Furthermore, for
various performance measures, we show Pareto-optimality of and give an
Occam's razor argument that the choice for the weights
is optimal, where is the length of the shortest program describing
. The results are applied to games of chance, defined as a sequence of
bets, observations, and rewards. The prediction schemes (and bounds) are
compared to the popular predictors based on expert advice. Extensions to
infinite alphabets, partial, delayed and probabilistic prediction,
classification, and more active systems are briefly discussed.Comment: 34 page
About Adaptive Coding on Countable Alphabets: Max-Stable Envelope Classes
In this paper, we study the problem of lossless universal source coding for
stationary memoryless sources on countably infinite alphabets. This task is
generally not achievable without restricting the class of sources over which
universality is desired. Building on our prior work, we propose natural
families of sources characterized by a common dominating envelope. We
particularly emphasize the notion of adaptivity, which is the ability to
perform as well as an oracle knowing the envelope, without actually knowing it.
This is closely related to the notion of hierarchical universal source coding,
but with the important difference that families of envelope classes are not
discretely indexed and not necessarily nested.
Our contribution is to extend the classes of envelopes over which adaptive
universal source coding is possible, namely by including max-stable
(heavy-tailed) envelopes which are excellent models in many applications, such
as natural language modeling. We derive a minimax lower bound on the redundancy
of any code on such envelope classes, including an oracle that knows the
envelope. We then propose a constructive code that does not use knowledge of
the envelope. The code is computationally efficient and is structured to use an
{E}xpanding {T}hreshold for {A}uto-{C}ensoring, and we therefore dub it the
\textsc{ETAC}-code. We prove that the \textsc{ETAC}-code achieves the lower
bound on the minimax redundancy within a factor logarithmic in the sequence
length, and can be therefore qualified as a near-adaptive code over families of
heavy-tailed envelopes. For finite and light-tailed envelopes the penalty is
even less, and the same code follows closely previous results that explicitly
made the light-tailed assumption. Our technical results are founded on methods
from regular variation theory and concentration of measure
About adaptive coding on countable alphabets
This paper sheds light on universal coding with respect to classes of
memoryless sources over a countable alphabet defined by an envelope function
with finite and non-decreasing hazard rate. We prove that the auto-censuring AC
code introduced by Bontemps (2011) is adaptive with respect to the collection
of such classes. The analysis builds on the tight characterization of universal
redundancy rate in terms of metric entropy % of small source classes by Opper
and Haussler (1997) and on a careful analysis of the performance of the
AC-coding algorithm. The latter relies on non-asymptotic bounds for maxima of
samples from discrete distributions with finite and non-decreasing hazard rate
Topological transition in disordered planar matching: combinatorial arcs expansion
In this paper, we investigate analytically the properties of the disordered
Bernoulli model of planar matching. This model is characterized by a
topological phase transition, yielding complete planar matching solutions only
above a critical density threshold. We develop a combinatorial procedure of
arcs expansion that explicitly takes into account the contribution of short
arcs, and allows to obtain an accurate analytical estimation of the critical
value by reducing the global constrained problem to a set of local ones. As an
application to a toy representation of the RNA secondary structures, we suggest
generalized models that incorporate a one-to-one correspondence between the
contact matrix and the RNA-type sequence, thus giving sense to the notion of
effective non-integer alphabets.Comment: 28 pages, 6 figures, published versio
- âŠ