2,634 research outputs found
Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks
We present a procedure for effective estimation of entropy and mutual
information from small-sample data, and apply it to the problem of inferring
high-dimensional gene association networks. Specifically, we develop a
James-Stein-type shrinkage estimator, resulting in a procedure that is highly
efficient statistically as well as computationally. Despite its simplicity, we
show that it outperforms eight other entropy estimation procedures across a
diverse range of sampling scenarios and data-generating models, even in cases
of severe undersampling. We illustrate the approach by analyzing E. coli gene
expression data and computing an entropy-based gene-association network from
gene expression data. A computer program is available that implements the
proposed shrinkage estimator.Comment: 18 pages, 3 figures, 1 tabl
Structured parameter estimation for LFG-DOP using Backoff
Despite its state-of-the-art performance, the Data Oriented
Parsing (DOP) model has been shown to suffer from biased parameter estimation, and the good performance seems more the result of ad hoc adjustments than correct probabilistic generalization over the data. In recent work, we developed a new estimation procedure, called Backoff Estimation, for
DOP models that are based on Phrase-Structure annotations
(so called Tree-DOP models). Backoff Estimation deviates from earlier methods in that it treats the model parameters as a highly structured space of correlated events (backoffs), rather than a set of disjoint events. In this paper we show that the problem of biased estimates also holds for DOP models that are based on Lexical-Functional Grammar annotations (i.e. LFG-DOP), and that the LFG-DOP parameters also constitute a hierarchically structured space. Subsequently, we adapt the Backoff Estimation algorithm from Tree-DOP to LFG-DOP models. Backoff
Estimation turns out to be a natural solution to some
of the specific problems of robust parsing under LFGDOP
Rare Probability Estimation under Regularly Varying Heavy Tails
This paper studies the problem of estimating the probability of symbols that have occurred very rarely, in samples drawn independently from an unknown, possibly infinite, discrete distribution. In particular, we study the multiplicative consistency of estimators, defined as the ratio of the estimate to the true quantity converging to one. We first show that the classical Good-Turing estimator is not universally consistent in this sense, despite enjoying favorable additive properties. We then use Karamata's theory of regular variation to prove that regularly varying heavy tails are sufficient for consistency. At the core of this result is a multiplicative concentration that we establish both by extending the McAllester-Ortiz additive concentration for the missing mass to all rare probabilities and by exploiting regular variation. We also derive a family of estimators which, in addition to being consistent, address some of the shortcomings of the Good-Turing estimator. For example, they perform smoothing implicitly and have the absolute discounting structure of many heuristic algorithms. This also establishes a discrete parallel to extreme value theory, and many of the techniques therein can be adapted to the framework that we set forth.National Science Foundation (U.S.) (Grant 6922470)United States. Office of Naval Research (Grant 6918937
- âŠ