Empirical Lossless Compression Bound of a Data Sequence
We consider the lossless compression bound of any individual data sequence. If we fit the data by a parametric model, the entropy quantity $n\hat{H}$ obtained by plugging in the maximum likelihood estimate is an underestimate of the bound, where $n$ is the number of words. Shtarkov showed that the normalized maximum likelihood (NML) distribution or code length is optimal in a minimax sense for any parametric family. We show by local asymptotic normality that the NML code length for exponential families is $n\hat{H} + \frac{d}{2}\log\frac{n}{2\pi} + \log\int_{\Theta}\sqrt{\det I(\theta)}\,d\theta + o(1)$, where $d$ is the model dimension or dictionary size and $\det I(\theta)$ is the determinant of the Fisher information matrix. We also demonstrate that sequentially predicting the optimal code length for the next word via a Bayesian mechanism leads to the mixture code, whose pathwise length is given by $n\hat{H} + \frac{d}{2}\log\frac{n}{2\pi} + \log\frac{\sqrt{\det I(\hat{\theta})}}{w(\hat{\theta})} + o(1)$, where $w$ is a prior. The asymptotics apply not only to discrete symbols but also to continuous data if the code length for the former is replaced by the description length for the latter. The analytical result is exemplified by calculating compression bounds of protein-encoding DNA sequences under different parsing models. Typically, the highest compression is achieved when the parsing is in phase with the amino acid codons. On the other hand, the compression rates of pseudo-random sequences are larger than 1 regardless of the parsing model. These model-based results are consistent with the assertion of Kolmogorov complexity theory that random sequences are incompressible. The empirical lossless compression bound is more accurate when the dictionary size is relatively large.
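To make the reconstructed asymptotics concrete, here is a minimal Python sketch for the simplest case, an i.i.d. multinomial (dictionary) model, where the Fisher-information integral has the closed form $\pi^{m/2}/\Gamma(m/2)$. The function name and the restriction to the multinomial family are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def nml_bound_multinomial(symbols):
    """Asymptotic NML code length, in bits, of a sequence under an
    i.i.d. multinomial model over its observed dictionary (a sketch,
    not the paper's code)."""
    counts = Counter(symbols)
    n = sum(counts.values())   # number of words
    m = len(counts)            # dictionary size
    d = m - 1                  # model dimension (free parameters)
    # Plug-in entropy term n*H_hat: the underestimate noted above.
    n_entropy = -sum(c * math.log2(c / n) for c in counts.values())
    # Parametric complexity: (d/2) log2(n / 2pi) plus log2 of the
    # Fisher integral, which for the multinomial equals the
    # Dirichlet(1/2, ..., 1/2) normalizing constant pi^(m/2) / Gamma(m/2).
    log_fisher = (m / 2) * math.log(math.pi) - math.lgamma(m / 2)
    return n_entropy + (d / 2) * math.log2(n / (2 * math.pi)) \
           + log_fisher / math.log(2)
```

For example, `nml_bound_multinomial("ATGGCTATGGCT")` bounds the sequence parsed into single letters; reparsing the same string into codon-length words changes the dictionary size and hence the bound, which is how the parsing comparison above can be carried out.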
Computable Bayesian Compression for Uniformly Discretizable Statistical Models
Supplementing Vovk and V'yugin's 'if' statement, we show that Bayesian compression provides the best enumerable compression for parameter-typical data if and only if the parameter is Martin-Löf random with respect to the prior. The result is derived for uniformly discretizable statistical models, introduced here. They feature the crucial property that, given a discretized parameter, we can compute how much data is needed to learn its value with little uncertainty. Exponential families and certain nonparametric models are shown to be uniformly discretizable.
A model-based approach to selection of tag SNPs
BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphism found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. In Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides machinery for predicting tagged SNPs, and thereby for assessing the performance of tag sets through their ability to predict larger SNP sets.

RESULTS: Here, we compute the description code-lengths of SNP data for an array of models, and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several aspects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection.

CONCLUSION: Our study provides strong evidence that the tag sets selected by our best method, based on the Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is more sensitive for assessing the quality of a tag set than the correct prediction rate of tagged SNPs. We also show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. Software implementing our approach is available.
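As an illustration of the entropy-maximization strategy, the following Python sketch greedily grows a tag set by adding, at each step, the SNP that most increases the joint entropy of the selected columns. It scores candidates with the raw empirical haplotype distribution rather than a fitted model such as the Li and Stephens HMM used in the paper, and all names are illustrative.

```python
import math
from collections import Counter

def joint_entropy(haplotypes, idx):
    """Empirical joint entropy (bits) of the SNP columns in idx,
    where haplotypes is a list of equal-length allele strings."""
    patterns = Counter(tuple(h[i] for i in idx) for h in haplotypes)
    n = len(haplotypes)
    return -sum(c / n * math.log2(c / n) for c in patterns.values())

def greedy_tag_snps(haplotypes, k):
    """Pick k tag SNPs by greedy forward selection under the
    entropy-maximization criterion (a sketch of the strategy,
    not the paper's model-based method)."""
    n_snps = len(haplotypes[0])
    chosen = []
    for _ in range(k):
        candidates = (i for i in range(n_snps) if i not in chosen)
        best = max(candidates,
                   key=lambda i: joint_entropy(haplotypes, chosen + [i]))
        chosen.append(best)
    return chosen
```

Because joint entropy is a monotone submodular set function, greedy forward selection is a standard near-optimal heuristic for this cardinality-constrained maximization; replacing the empirical distribution with model-based code lengths recovers the spirit of the paper's method.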
Catching Up Faster by Switching Sooner: A Prequential Solution to the AIC-BIC Dilemma
Bayesian model averaging, model selection and its approximations such as BIC
are generally statistically consistent, but sometimes achieve slower rates of
convergence than other methods such as AIC and leave-one-out cross-validation.
On the other hand, these other methods can be inconsistent. We identify the
"catch-up phenomenon" as a novel explanation for the slow convergence of
Bayesian methods. Based on this analysis we define the switch distribution, a
modification of the Bayesian marginal distribution. We show that, under broad
conditions, model selection and prediction based on the switch distribution
are both consistent and achieve optimal convergence rates, thereby resolving
the AIC-BIC dilemma. The method is practical; we give an efficient
implementation. The switch distribution has a data compression interpretation,
and can thus be viewed as a "prequential" or MDL method; yet it is different
from the MDL methods that are usually considered in the literature. We compare
the switch distribution to Bayes factor model selection and leave-one-out
cross-validation. A preliminary version of part of this paper appeared at the
NIPS 2007 conference.
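To show the prequential, data-compression reading of the idea, here is a minimal Python sketch of a two-model distribution that is allowed to switch once from a simple to a complex predictor, with a uniform prior over switch times. The paper's switch distribution allows repeated switches and has an efficient dedicated implementation; the one-switch restriction and the input convention (each list holds the predictive probability a model assigned to the observed outcome at each time step) are my assumptions.

```python
import math

def one_switch_code_length(p_simple, p_complex):
    """Code length (bits) of a sequence under a mixture over the
    time at which prediction switches from the simple model to the
    complex one (t = n means 'never switch')."""
    n = len(p_simple)
    log_terms = []
    for t in range(n + 1):
        # Log-likelihood when the simple model predicts the first t
        # outcomes and the complex model predicts the rest, plus the
        # uniform prior weight on this switch point.
        ll = sum(math.log(p) for p in p_simple[:t]) + \
             sum(math.log(p) for p in p_complex[t:])
        log_terms.append(ll - math.log(n + 1))
    m = max(log_terms)  # log-sum-exp for the mixture's marginal
    log_marginal = m + math.log(sum(math.exp(x - m) for x in log_terms))
    return -log_marginal / math.log(2)
```

When the complex model predicts poorly early on but catches up later, this mixture tracks the simple model first and the complex one afterwards, paying only about $\log_2(n+1)$ bits for encoding the switch point; that is the catch-up phenomenon expressed in code-length terms.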
Minimum Description Length Model Selection - Problems and Extensions
The thesis treats a number of open problems in Minimum Description Length model selection, especially prediction problems. It is shown how techniques from the "Prediction with Expert Advice" literature can be used to improve model selection performance, which is particularly useful in nonparametric settings.