Empirical Lossless Compression Bound of a Data Sequence
We consider the lossless compression bound of any individual data sequence. If we fit the data by a parametric model, the entropy quantity $n\hat{H}$ obtained by plugging in the maximum likelihood estimate is an underestimate of the bound, where $n$ is the number of words. Shtarkov showed that the normalized maximum likelihood (NML) distribution or code length is optimal in a minimax sense for any parametric family. We show by local asymptotic normality that the NML code length for exponential families is $n\hat{H} + \frac{d}{2}\log\frac{n}{2\pi} + \log\int_{\Theta}\sqrt{\det I(\theta)}\,d\theta + o(1)$, where $d$ is the model dimension or dictionary size and $\det I(\theta)$ is the determinant of the Fisher information matrix. We also demonstrate that sequentially predicting the optimal code length for the next word via a Bayesian mechanism leads to the mixture code, whose pathwise length is given by $n\hat{H} + \frac{d}{2}\log\frac{n}{2\pi} + \log\frac{\sqrt{\det I(\hat{\theta})}}{w(\hat{\theta})} + o(1)$, where $w$ is a prior. The asymptotics apply not only to discrete symbols but also to continuous data if the code length for the former is replaced by the description length for the latter. The analytical result is exemplified by calculating compression bounds of protein-encoding DNA sequences under different parsing models. Typically, the highest compression is achieved when the parsing is in phase with the amino acid codons. On the other hand, the compression rates of pseudo-random sequences are larger than 1 regardless of the parsing model. These model-based results are consistent with the assertion of Kolmogorov complexity theory that random sequences are incompressible. The empirical lossless compression bound is more accurate when the dictionary size is relatively large.
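To make the reconstructed asymptotics concrete, here is a minimal Python sketch for the simplest case, an i.i.d. multinomial (dictionary) model, where the Fisher-information integral has the closed form $\pi^{m/2}/\Gamma(m/2)$. The function name and the restriction to the multinomial family are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def nml_bound_multinomial(symbols):
    """Asymptotic NML code length, in bits, of a sequence under an
    i.i.d. multinomial model over its observed dictionary (a sketch,
    not the paper's code)."""
    counts = Counter(symbols)
    n = sum(counts.values())   # number of words
    m = len(counts)            # dictionary size
    d = m - 1                  # model dimension (free parameters)
    # Plug-in entropy term n*H_hat: the underestimate noted above.
    n_entropy = -sum(c * math.log2(c / n) for c in counts.values())
    # Parametric complexity: (d/2) log2(n / 2pi) plus log2 of the
    # Fisher integral, which for the multinomial equals the
    # Dirichlet(1/2, ..., 1/2) normalizing constant pi^(m/2) / Gamma(m/2).
    log_fisher = (m / 2) * math.log(math.pi) - math.lgamma(m / 2)
    return n_entropy + (d / 2) * math.log2(n / (2 * math.pi)) \
           + log_fisher / math.log(2)
```

For example, `nml_bound_multinomial("ATGGCTATGGCT")` bounds the sequence parsed into single letters; reparsing the same string into codon-length words changes the dictionary size and hence the bound, which is how the parsing comparison above can be carried out.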
Computable Bayesian Compression for Uniformly Discretizable Statistical Models
Supplementing Vovk and V'yugin's 'if' statement, we show that Bayesian compression provides the best enumerable compression for parameter-typical data if and only if the parameter is Martin-Löf random with respect to the prior. The result is derived for uniformly discretizable statistical models, introduced here. They feature the crucial property that, given a discretized parameter, we can compute how much data is needed to learn its value with little uncertainty. Exponential families and certain nonparametric models are shown to be uniformly discretizable.
A model-based approach to selection of tag SNPs
BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphism found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. In Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides machinery for predicting tagged SNPs, and thereby for assessing the performance of tag sets through their ability to predict larger SNP sets.

RESULTS: Here, we compute the description code-lengths of SNP data for an array of models, and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several aspects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection.

CONCLUSION: Our study provides strong evidence that the tag sets selected by our best method, based on the Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is more sensitive for assessing the quality of a tag set than the correct prediction rate of tagged SNPs. We also show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. Software implementing our approach is available.
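As an illustration of the entropy-maximization strategy, the following Python sketch greedily grows a tag set by adding, at each step, the SNP that most increases the joint entropy of the selected columns. It scores candidates with the raw empirical haplotype distribution rather than a fitted model such as the Li and Stephens HMM used in the paper, and all names are illustrative.

```python
import math
from collections import Counter

def joint_entropy(haplotypes, idx):
    """Empirical joint entropy (bits) of the SNP columns in idx,
    where haplotypes is a list of equal-length allele strings."""
    patterns = Counter(tuple(h[i] for i in idx) for h in haplotypes)
    n = len(haplotypes)
    return -sum(c / n * math.log2(c / n) for c in patterns.values())

def greedy_tag_snps(haplotypes, k):
    """Pick k tag SNPs by greedy forward selection under the
    entropy-maximization criterion (a sketch of the strategy,
    not the paper's model-based method)."""
    n_snps = len(haplotypes[0])
    chosen = []
    for _ in range(k):
        candidates = (i for i in range(n_snps) if i not in chosen)
        best = max(candidates,
                   key=lambda i: joint_entropy(haplotypes, chosen + [i]))
        chosen.append(best)
    return chosen
```

Because joint entropy is a monotone submodular set function, greedy forward selection is a standard near-optimal heuristic for this cardinality-constrained maximization; replacing the empirical distribution with model-based code lengths recovers the spirit of the paper's method.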
Catching Up Faster by Switching Sooner: A Prequential Solution to the AIC-BIC Dilemma
Bayesian model averaging, model selection and its approximations such as BIC
are generally statistically consistent, but sometimes achieve slower rates of
convergence than other methods such as AIC and leave-one-out cross-validation.
On the other hand, these other methods can be inconsistent. We identify the
"catch-up phenomenon" as a novel explanation for the slow convergence of
Bayesian methods. Based on this analysis we define the switch distribution, a
modification of the Bayesian marginal distribution. We show that, under broad
conditions, model selection and prediction based on the switch distribution
are both consistent and achieve optimal convergence rates, thereby resolving
the AIC-BIC dilemma. The method is practical; we give an efficient
implementation. The switch distribution has a data compression interpretation,
and can thus be viewed as a "prequential" or MDL method; yet it is different
from the MDL methods that are usually considered in the literature. We compare
the switch distribution to Bayes factor model selection and leave-one-out
cross-validation. A preliminary version of part of this paper appeared at the
NIPS 2007 conference.
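To show the prequential, data-compression reading of the idea, here is a minimal Python sketch of a two-model distribution that is allowed to switch once from a simple to a complex predictor, with a uniform prior over switch times. The paper's switch distribution allows repeated switches and has an efficient dedicated implementation; the one-switch restriction and the input convention (each list holds the predictive probability a model assigned to the observed outcome at each time step) are my assumptions.

```python
import math

def one_switch_code_length(p_simple, p_complex):
    """Code length (bits) of a sequence under a mixture over the
    time at which prediction switches from the simple model to the
    complex one (t = n means 'never switch')."""
    n = len(p_simple)
    log_terms = []
    for t in range(n + 1):
        # Log-likelihood when the simple model predicts the first t
        # outcomes and the complex model predicts the rest, plus the
        # uniform prior weight on this switch point.
        ll = sum(math.log(p) for p in p_simple[:t]) + \
             sum(math.log(p) for p in p_complex[t:])
        log_terms.append(ll - math.log(n + 1))
    m = max(log_terms)  # log-sum-exp for the mixture's marginal
    log_marginal = m + math.log(sum(math.exp(x - m) for x in log_terms))
    return -log_marginal / math.log(2)
```

When the complex model predicts poorly early on but catches up later, this mixture tracks the simple model first and the complex one afterwards, paying only about $\log_2(n+1)$ bits for encoding the switch point; that is the catch-up phenomenon expressed in code-length terms.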
Minimum Description Length Model Selection - Problems and Extensions
The thesis treats a number of open problems in Minimum Description Length model selection, especially prediction problems. It is shown how techniques from the "Prediction with Expert Advice" literature can be used to improve model selection performance, which is particularly useful in nonparametric settings.