Missing Mass of Rank-2 Markov Chains
Estimation of missing mass with the popular Good-Turing (GT) estimator is
well-understood in the case where samples are independent and identically
distributed (iid). In this article, we consider the same problem when the
samples come from a stationary Markov chain with a rank-2 transition matrix,
which is one of the simplest extensions of the iid case. We develop an upper
bound on the absolute bias of the GT estimator in terms of the spectral gap of
the chain and a tail bound on the occupancy of states. Borrowing tail bounds
from known concentration results for Markov chains, we evaluate the bound using
other parameters of the chain. The analysis, supported by simulations, suggests
that, for rank-2 irreducible chains, the GT estimator has bias and mean-squared
error falling with the number of samples at a rate that depends loosely on the
connectivity of the states in the chain.
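For reference, the Good-Turing estimator of missing mass discussed above is simply the fraction of the sample consisting of symbols seen exactly once. A minimal sketch (the function name is ours; the estimator itself is standard):

```python
from collections import Counter

def good_turing_missing_mass(samples):
    """Good-Turing estimate of the missing mass: N1 / n, where N1 is the
    number of distinct symbols appearing exactly once in the sample and
    n is the sample size."""
    counts = Counter(samples)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(samples)
```

For the sample `a, b, a, c`, the symbols `b` and `c` are singletons, so the estimate is 2/4 = 0.5. The article's contribution is a bias analysis of this estimator when the samples form a rank-2 Markov chain rather than an iid sequence.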
Optimal prediction of Markov chains with and without spectral gap
We study the following learning problem with dependent data: observing a
trajectory of length n from a stationary Markov chain with k states, the
goal is to predict the next state. For 3 <= k <= O(sqrt(n)), using
techniques from universal compression, the optimal prediction risk in
Kullback-Leibler divergence is shown to be Theta((k^2/n) log(n/k^2)), in
contrast to the optimal rate of Theta((log log n)/n) for k = 2 previously
shown in Falahatgar et al., 2016. These rates, slower than the parametric
rate of O(k^2/n), can be attributed to the memory in the data, as the
spectral gap of the Markov chain can be arbitrarily small. To quantify the
memory effect, we study irreducible reversible chains with a prescribed
spectral gap. In addition to characterizing the optimal prediction risk for
two states, we show that, as long as the spectral gap is not excessively
small, the prediction risk in the Markov model is O(k^2/n), which coincides
with that of an iid model with the same number of parameters.
Comment: 52 pages
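A natural baseline for this prediction problem is the add-one (Laplace) smoothed empirical transition matrix, which attains the parametric rate in the iid-like regime. The sketch below is only this baseline, not the universal-compression-based optimal estimator analyzed in the paper; the function name and the smoothing parameter `alpha` are illustrative:

```python
import numpy as np

def predict_next_state(trajectory, k, alpha=1.0):
    """Add-alpha smoothed estimate of P(next state | current state).

    Counts transitions along the observed trajectory (states labeled
    0..k-1), adds alpha to every cell, and returns the normalized row
    of the last observed state as the predictive distribution."""
    counts = np.full((k, k), alpha)
    for s, t in zip(trajectory[:-1], trajectory[1:]):
        counts[s, t] += 1.0
    row = counts[trajectory[-1]]
    return row / row.sum()
```

On the alternating trajectory 0, 1, 0, 1, 0 with k = 2, the two observed 0 -> 1 transitions plus smoothing give the predictive distribution (0.25, 0.75) for the next state. The paper's point is that, without a spectral-gap assumption, no estimator of this kind can beat the slower Theta((k^2/n) log(n/k^2)) risk.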
Learning Non-Parametric and High-Dimensional Distributions via Information-Theoretic Methods
Learning the distributions that govern data generation, and estimating related functionals, are the foundations of many classical statistical problems. In this dissertation we investigate such topics when either the hypothesized model is non-parametric or the number of free parameters in the model grows with the sample size. In particular, we study these scenarios for the following classes of problems, with the goal of obtaining minimax rate-optimal methods for learning the target distributions from a finite sample. Our techniques are based on information-theoretic divergences and related mutual-information-based methods.

(i) Estimation in compound decision and empirical Bayes settings: To estimate the data-generating distribution, one often takes the following two-step approach. In the first step the statistician estimates the distribution of the parameters (either the empirical distribution or the postulated prior), and in the second step plugs in the estimate to approximate the target of interest. In the literature, estimation of the empirical distribution is known as the compound decision problem, and estimation of the prior is known as the empirical Bayes problem. In our work we use minimum-distance estimation to approximate these distributions. For certain discrete data setups, we show that the minimum-distance method provides theoretically and practically sound choices for estimation. The computational and algorithmic aspects of the estimators are also analyzed.

(ii) Prediction with Markov chains: Given observations from an unknown Markov chain, we study the problem of predicting the next entry in the trajectory. Existing analyses for such a dependent setup usually center around concentration inequalities that rely on various extraneous conditions on the mixing properties, which makes it difficult to obtain results free of such restrictions. We introduce information-theoretic techniques that bypass these issues and obtain fundamental limits for the related minimax problems. We also analyze conditions on the mixing properties that yield a parametric rate of prediction error.
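The minimum-distance idea behind (i) can be illustrated in a deliberately tiny setting: fitting the weight of a two-atom prior over Poisson means by minimizing the squared distance between the fitted mixture pmf and the empirical pmf. This is a toy sketch under our own simplifying assumptions (fixed atoms, grid search, squared distance), not the dissertation's general estimator:

```python
import numpy as np
from math import exp, factorial

def poisson_pmf(x, lam):
    """Poisson probability mass function at integer x with mean lam."""
    return exp(-lam) * lam ** x / factorial(x)

def min_distance_weight(data, lam0, lam1, support, n_grid=1001):
    """Fit the weight w of a two-atom prior w*delta_{lam1} + (1-w)*delta_{lam0}
    by grid search, minimizing the squared distance between the fitted
    Poisson-mixture pmf and the empirical pmf of the data on `support`."""
    data = np.asarray(data)
    emp = np.array([np.mean(data == x) for x in support])
    p0 = np.array([poisson_pmf(x, lam0) for x in support])
    p1 = np.array([poisson_pmf(x, lam1) for x in support])
    ws = np.linspace(0.0, 1.0, n_grid)
    dists = [np.sum((w * p1 + (1 - w) * p0 - emp) ** 2) for w in ws]
    return ws[int(np.argmin(dists))]
```

With data concentrated at 0 and atoms lam0 = 0.01, lam1 = 10, the fitted weight on the large atom is driven to 0, as expected. The dissertation replaces this toy construction with minimum-distance estimators over general priors and studies their statistical and computational properties.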