A Bayesian Nonparametric Method for Prediction in EST Analysis
In this work we propose a Bayesian nonparametric approach for tackling statistical problems related to EST (Expressed Sequence Tag) surveys. In particular, we provide estimates for: a) the coverage, defined as the proportion of unique genes in the library represented in the given sample of reads; b) the number of new unique genes to be observed in a future sample; c) the discovery rate of new genes as a function of the future sample size. The Bayesian nonparametric model we adopt incorporates the available information into prediction in a statistically rigorous way. Our proposal has appealing properties over frequentist nonparametric methods, which become unstable when prediction is required for large future samples. The EST libraries studied with frequentist methods in Susko and Roger (2004) are reanalyzed in detail.
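As a concrete illustration of quantity a), the following minimal Python sketch computes the classical Good-Turing coverage estimate alongside the closed-form coverage estimate available under a two-parameter Poisson-Dirichlet prior. The parameter values sigma and theta are placeholders (in practice they would be fitted to the library, e.g. by empirical Bayes), so this shows the flavour of the approach rather than the paper's exact procedure.

from collections import Counter

def coverage_estimates(reads, sigma=0.5, theta=100.0):
    """Estimate sample coverage: the probability that the next read
    comes from a gene already represented in the sample.

    reads        : list of gene labels, one per EST read
    sigma, theta : assumed Poisson-Dirichlet parameters (placeholders)
    """
    n = len(reads)                                   # sample size
    freqs = Counter(reads)                           # gene -> frequency
    k = len(freqs)                                   # distinct genes seen
    m1 = sum(1 for f in freqs.values() if f == 1)    # singleton genes

    gt = 1.0 - m1 / n            # Good-Turing (frequentist) coverage
    # Under PD(sigma, theta), given k distinct genes in n reads, the
    # probability that read n+1 is new is (theta + sigma*k)/(theta + n);
    # the coverage estimate is its complement.
    bnp = 1.0 - (theta + sigma * k) / (theta + n)
    return gt, bnp

For a toy library such as reads = ['g1', 'g1', 'g2', 'g3'], both numbers answer the same question: how much of the underlying gene population the reads have already covered.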
Affine equivariant rank-weighted L-estimation of multivariate location
In the multivariate one-sample location model, we propose a class of flexible, robust, affine-equivariant L-estimators of location for distributions invoking affine invariance of the Mahalanobis distances of individual observations. The iteration process involved in their computation is numerically illustrated. Comment: 16 pages, 4 figures, 6 tables
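The abstract does not spell out the iteration, but a generic rank-weighted scheme of this kind can be sketched in Python as follows: alternate between computing Mahalanobis distances under the current location and scatter estimates and re-estimating the location as a weighted mean whose weights decrease with the rank of each distance. The specific weight function and stopping rule below are illustrative assumptions, not the authors' algorithm.

import numpy as np

def rank_weighted_location(X, n_iter=50, tol=1e-8):
    """Sketch of an affine-equivariant, rank-weighted L-estimator of
    multivariate location (illustrative weight scheme, not the paper's)."""
    n, p = X.shape
    mu = X.mean(axis=0)                  # initial location
    S = np.cov(X, rowvar=False)          # initial scatter
    for _ in range(n_iter):
        diff = X - mu
        # Squared Mahalanobis distances under the current (mu, S);
        # their ranks are affine-invariant, so the weighted mean below
        # is affine-equivariant.
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
        ranks = np.argsort(np.argsort(d2)) + 1       # 1 = innermost point
        w = 1.0 - (ranks - 1) / n                    # downweight outliers
        w /= w.sum()
        mu_new = w @ X                               # rank-weighted mean
        S = (diff * w[:, None]).T @ diff             # reweighted scatter
        if np.linalg.norm(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu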
Bayesian nonparametric inference for discovery probabilities: credible intervals and large sample asymptotics
Given a sample of size $n$ from a population of individuals belonging to different species with unknown proportions, a popular problem of practical interest consists in making inference on the probability $D_n(l)$ that the $(n+1)$-th draw coincides with a species with frequency $l$ in the sample, for any $l = 0, 1, \ldots, n$. This paper contributes to the methodology of Bayesian nonparametric inference for $D_n(l)$. Specifically, under the general framework of Gibbs-type priors we show how to derive credible intervals for the Bayesian nonparametric estimator of $D_n(l)$, and we investigate the large $n$ asymptotic behaviour of such an estimator. Of particular interest are special cases of our results obtained under the specification of the two parameter Poisson--Dirichlet prior and the normalized generalized Gamma prior, which are two of the most commonly used Gibbs-type priors. With respect to these two prior specifications, the proposed results are illustrated through a simulation study and a benchmark Expressed Sequence Tags dataset. To the best of our knowledge, this illustration provides the first comparative study between the two parameter Poisson--Dirichlet prior and the normalized generalized Gamma prior in the context of Bayesian nonparametric inference for $D_n(l)$.
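Under the two-parameter Poisson--Dirichlet prior the Bayesian nonparametric point estimator of $D_n(l)$ has a well-known closed form, which the following Python sketch computes from a sample's frequency counts; the prior parameters are taken as given here, whereas the paper also quantifies the estimator's uncertainty through credible intervals.

from collections import Counter

def discovery_prob_pd(sample, l, sigma=0.5, theta=1.0):
    """Closed-form estimator of D_n(l) under a PD(sigma, theta) prior:
      l = 0  : (theta + sigma*k) / (theta + n)   -- a new species
      l >= 1 : (l - sigma) * m_l / (theta + n)   -- a species seen l times
    where n is the sample size, k the number of distinct species and
    m_l the number of species with frequency l. The values of sigma
    and theta are assumed known for this sketch."""
    n = len(sample)
    freqs = Counter(sample)
    k = len(freqs)
    if l == 0:
        return (theta + sigma * k) / (theta + n)
    m_l = sum(1 for f in freqs.values() if f == l)
    return (l - sigma) * m_l / (theta + n)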
Bayesian nonparametric estimators derived from conditional Gibbs structures
We consider discrete nonparametric priors which induce Gibbs-type exchangeable random partitions and investigate their posterior behavior in detail. In particular, we deduce conditional distributions and the corresponding Bayesian nonparametric estimators, which can be readily exploited for predicting various features of additional samples. The results provide useful tools for genomic applications where prediction of future outcomes is required.
Keywords: Bayesian nonparametric inference; exchangeable random partitions; generalized factorial coefficients; generalized gamma process; Poisson-Dirichlet process; population genetics.
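The conditional structure underlying such estimators can be made explicit. For a Gibbs-type prior of order $\sigma$ with weights $V_{n,k}$, given a sample of size $n$ exhibiting $k$ distinct species $X^{*}_{1},\ldots,X^{*}_{k}$ with frequencies $n_{1},\ldots,n_{k}$, the predictive distribution of the next observation is the standard characterization below, stated here for orientation:

\[
\Pr\bigl(X_{n+1}=\text{new}\mid X_{1},\ldots,X_{n}\bigr)=\frac{V_{n+1,k+1}}{V_{n,k}},
\qquad
\Pr\bigl(X_{n+1}=X^{*}_{j}\mid X_{1},\ldots,X_{n}\bigr)=(n_{j}-\sigma)\,\frac{V_{n+1,k}}{V_{n,k}},
\quad j=1,\ldots,k.
\]

The two-parameter Poisson-Dirichlet and normalized generalized gamma processes correspond to particular choices of the weights $V_{n,k}$.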
Rediscovery of Good-Turing estimators via Bayesian nonparametrics
The problem of estimating discovery probabilities originated in the context
of statistical ecology, and in recent years it has become popular due to its
frequent appearance in challenging applications arising in genetics,
bioinformatics, linguistics, design of experiments, machine learning, etc. A
full range of statistical approaches, parametric and nonparametric as well as
frequentist and Bayesian, has been proposed for estimating discovery
probabilities. In this paper we investigate the relationships between the
celebrated Good-Turing approach, which is a frequentist nonparametric approach
developed in the 1940s, and a Bayesian nonparametric approach recently
introduced in the literature. Specifically, under the assumption of a two
parameter Poisson-Dirichlet prior, we show that Bayesian nonparametric
estimators of discovery probabilities are asymptotically equivalent, for a
large sample size, to suitably smoothed Good-Turing estimators. As a by-product
of this result, we introduce and investigate a methodology for deriving exact
and asymptotic credible intervals to be associated with the Bayesian
nonparametric estimators of discovery probabilities. The proposed methodology
is illustrated through a comprehensive simulation study and the analysis of
Expressed Sequence Tags data generated by sequencing a benchmark complementary
DNA library.
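The Good-Turing estimator in question admits a one-line implementation; a minimal Python sketch is below. Read together with discovery_prob_pd above, it makes the paper's asymptotic equivalence concrete: for large $n$ the Poisson-Dirichlet estimator $(l-\sigma)\,m_l/(\theta+n)$ behaves like this estimator with $m_{l+1}$ replaced by a suitably smoothed value, in the sense made precise in the paper.

from collections import Counter

def good_turing(sample, l):
    """Good-Turing estimator of the probability that the (n+1)-th draw
    is a species with frequency l in the sample: (l+1) * m_{l+1} / n,
    where m_j is the number of species observed exactly j times."""
    n = len(sample)
    m = Counter(Counter(sample).values())   # j -> m_j
    return (l + 1) * m.get(l + 1, 0) / n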
A probabilistic study of neural complexity
G. Edelman, O. Sporns, and G. Tononi have introduced the neural complexity of
a family of random variables, defining it as a specific average of mutual
information over subfamilies. We show that their choice of weights satisfies
two natural properties, namely exchangeability and additivity, and we call any
functional satisfying these two properties an intricacy. We classify all
intricacies in terms of probability laws on the unit interval and study the
growth rate of maximal intricacies when the size of the system goes to
infinity. For systems of a fixed size, we show that maximizers have small
support and exchangeable systems have small intricacy. In particular,
maximizing intricacy leads to spontaneous symmetry breaking and failure of
uniqueness. Comment: minor edit
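In the paper's terms an intricacy has the form $I_c(X)=\sum_{S} c_{|S|}\,\operatorname{MI}(X_S;X_{S^c})$, a weighted average of mutual informations over nonempty proper subfamilies $S$, with exchangeability meaning the weights depend on $S$ only through its size. The Python sketch below computes such a functional for a small discrete system given its joint probability table; the uniform weighting used by default is just one admissible choice, not necessarily the Edelman-Sporns-Tononi weights.

import itertools
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a pmf stored as an ndarray."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def intricacy(joint, weights=None):
    """Weighted average of MI(X_S; X_{S^c}) over nonempty proper
    subsets S of the variables of a joint pmf (one axis per variable).

    weights : optional dict mapping subset size k to a weight; by
              default all subsets get equal weight (an illustrative
              choice only)."""
    n = joint.ndim
    H_total = entropy(joint)
    total = wsum = 0.0
    for k in range(1, n):
        for S in itertools.combinations(range(n), k):
            comp = tuple(i for i in range(n) if i not in S)
            H_S = entropy(joint.sum(axis=comp))    # marginal of X_S
            H_Sc = entropy(joint.sum(axis=S))      # marginal of X_{S^c}
            w = 1.0 if weights is None else weights[k]
            total += w * (H_S + H_Sc - H_total)    # MI(X_S; X_{S^c})
            wsum += w
    return total / wsum

For a product pmf (independent coordinates) every mutual information vanishes and the intricacy is zero, matching the intuition that complexity requires dependence between subsystems.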
Sparse adaptive Dirichlet-multinomial-like processes
Online estimation and modelling of i.i.d. data for short sequences over large or complex "alphabets" is a ubiquitous (sub)problem in machine learning, information theory, data compression, statistical language processing, and document analysis. The Dirichlet-multinomial distribution (also called the Pólya urn scheme) and extensions thereof are widely applied for online i.i.d. estimation. However, good a priori choices for the parameters in this regime are difficult to obtain. I derive an optimal adaptive choice for the main parameter via tight, data-dependent redundancy bounds for a related model. The one-line recommendation is to set the 'total mass' = 'precision' = 'concentration' parameter to $m/2\ln[(n+1)/m]$, where $n$ is the (past) sample size and $m$ is the number of different symbols observed (so far). The resulting estimator is simple, online, fast, and its experimental performance is superb.
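A minimal sketch of a sequential predictor built on this recommendation is below. The grouping of the abstract's expression is read here as m / (2 ln((n+1)/m)), and the escape mass is spread uniformly over unseen symbols; both are assumptions of this sketch rather than details taken from the paper.

import math
from collections import Counter

def sad_predict(past, alphabet):
    """Dirichlet-multinomial-like sequential predictor with the adaptive
    total-mass choice beta = m / (2*ln((n+1)/m)) (one reading of the
    abstract's formula; an assumption here). Returns a dict mapping
    each symbol to its predictive probability of appearing next."""
    n = len(past)
    counts = Counter(past)
    m = len(counts)                      # distinct symbols seen so far
    if m == 0:                           # nothing observed yet: uniform
        return {a: 1.0 / len(alphabet) for a in alphabet}
    beta = m / (2.0 * math.log((n + 1) / m))
    unseen = [a for a in alphabet if a not in counts]
    if unseen:
        # Seen symbols keep mass proportional to their counts; the
        # escape mass beta/(n+beta) is spread over unseen symbols.
        return {a: counts[a] / (n + beta) if a in counts
                else beta / (n + beta) / len(unseen) for a in alphabet}
    # Whole alphabet already observed: plain Dirichlet-multinomial smoothing.
    return {a: (counts[a] + beta / m) / (n + beta) for a in alphabet}

Under this reading, beta grows with the observed alphabet size m but shrinks as the sample lengthens relative to m, which is what lets the predictor adapt to sparse, large-alphabet regimes.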
A Bernstein-Von Mises Theorem for discrete probability distributions
We investigate the asymptotic normality of the posterior distribution in the discrete setting, when model dimension increases with sample size. We consider a probability mass function $\theta_0$ on $\mathbbm{N}\setminus \{0\}$ and a sequence of truncation levels $(k_n)_n$ satisfying $k_n^3 \leq n \inf_{i \leq k_n} \theta_0(i)$. Let $\hat{\theta}_n$ denote the maximum likelihood estimate of $(\theta_0(i))_{i \leq k_n}$ and let $\Delta_n(\theta_0)$ denote the $k_n$-dimensional vector whose $i$-th coordinate is defined by $\sqrt{n}(\hat{\theta}_n(i)-\theta_0(i))$ for $1 \leq i \leq k_n$. We check that, under mild conditions on $\theta_0$ and on the sequence of prior probabilities on the $k_n$-dimensional simplices, after centering and rescaling, the variation distance between the posterior distribution recentered around $\theta_0$ and rescaled by $\sqrt{n}$ and the $k_n$-dimensional Gaussian distribution $\mathcal{N}(\Delta_n(\theta_0), I^{-1}(\theta_0))$ converges in probability to $0$.
This theorem can be used to prove the asymptotic normality of Bayesian estimators of Shannon and R\'{e}nyi entropies. The proofs are based on concentration inequalities for centered and non-centered chi-square (Pearson) statistics. The latter make it possible to establish posterior concentration rates with respect to the Fisher distance rather than with respect to the Hellinger distance, as is commonplace in nonparametric Bayesian statistics. Comment: Published at http://dx.doi.org/10.1214/08-EJS262 in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org)
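To see how a theorem of this type feeds into entropy estimation, a heuristic delta-method step (a gloss added here, not a claim quoted from the paper) goes as follows: because the posterior concentrates at rate $\sqrt{n}$, a smooth functional of $\theta$ such as the Shannon entropy inherits asymptotic normality,

\[
H(\theta) = -\sum_{i} \theta(i)\,\ln\theta(i), \qquad
\sqrt{n}\,\bigl(H(\theta) - H(\hat{\theta}_n)\bigr) \;\approx\; \mathcal{N}\bigl(0,\ \operatorname{Var}_{\theta_0}[\ln\theta_0(X)]\bigr),
\]

where the asymptotic variance is the usual delta-method quantity $\nabla H(\theta_0)^{\top} I^{-1}(\theta_0)\,\nabla H(\theta_0)$; the analogous computation applies to R\'{e}nyi entropies.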