55 research outputs found
Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means
We consider the classical problem of estimating a vector
$\boldsymbol{\mu}=(\mu_1,\ldots,\mu_n)$ based on independent observations
$Y_i\sim N(\mu_i,1)$, $i=1,\ldots,n$. Suppose $\mu_i$, $i=1,\ldots,n$, are independent
realizations from a completely unknown distribution $G$. We suggest an easily computed
estimator $\hat{\boldsymbol{\mu}}$ such that the ratio of its risk
$E(\hat{\boldsymbol{\mu}}-\boldsymbol{\mu})^2$ to that of the Bayes procedure
approaches 1. A related compound decision result is also obtained. Our
asymptotics are for a triangular array; that is, we allow the distribution $G$ to
depend on $n$. Thus, our theoretical asymptotic results are also meaningful in
situations where the vector $\boldsymbol{\mu}$ is sparse and the proportion of zero
coordinates approaches 1. We demonstrate the performance of our estimator in
simulations, emphasizing sparse setups. In "moderately sparse" situations,
our procedure performs very well compared to known procedures tailored for
sparse setups. It also adapts well to nonsparse situations.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the
Institute of Mathematical Statistics (http://www.imstat.org) at
http://dx.doi.org/10.1214/08-AOS630
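To make the setting of this abstract concrete, here is a minimal sketch of one standard nonparametric empirical Bayes construction: Tweedie's formula, E[mu | y] = y + g'(y)/g(y) for unit-variance normal observations, with the unknown marginal density g replaced by a Gaussian kernel estimate. This is an illustration under stated assumptions, not necessarily the estimator proposed in the paper; the function name `tweedie_estimate`, the bandwidth `h`, and the simulation settings are all hypothetical choices.

```python
import math
import random


def tweedie_estimate(ys, h):
    """Plug-in Tweedie estimates y + g'(y)/g(y) for each observation.

    g is a Gaussian kernel density estimate built from ys with bandwidth h
    (an assumed tuning parameter). Normalizing constants cancel in the ratio,
    so they are omitted.
    """
    estimates = []
    for y in ys:
        g = 0.0        # unnormalized kernel density estimate at y
        g_prime = 0.0  # its derivative at y, with the same normalization
        for yj in ys:
            w = math.exp(-0.5 * ((y - yj) / h) ** 2)
            g += w
            g_prime += w * (yj - y) / h ** 2
        # g >= 1 because each point contributes weight 1 to itself
        estimates.append(y + g_prime / g)
    return estimates


# Small sparse simulation (assumed setup): 5% of the means equal 4, the rest are 0.
random.seed(0)
n = 400
mus = [4.0 if i < 20 else 0.0 for i in range(n)]
ys = [mu + random.gauss(0.0, 1.0) for mu in mus]
est = tweedie_estimate(ys, h=0.5)

sse_raw = sum((y - mu) ** 2 for y, mu in zip(ys, mus))
sse_eb = sum((e - mu) ** 2 for e, mu in zip(est, mus))
print(f"raw observations: {sse_raw:.1f}, empirical Bayes sketch: {sse_eb:.1f}")
```

The printed sums of squared errors compare the raw observations with the shrunken estimates. The bandwidth trades off smoothness of the density estimate against bias in the shrinkage.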
Re-calibration of sample means
We consider the problem of calibration and the GREG method as suggested and
studied in Deville and Sarndal (1992). We show that a GREG-type estimator is
typically not a minimum-variance unbiased estimator, even asymptotically. We
suggest a similar estimator that is unbiased and has asymptotically
minimal variance
Two-Sided Sequential Tests
Let X_i be i.i.d. with X_i ∼ F_θ. For some parametric families {F_θ}, we describe a monotonicity property of Bayes sequential procedures for the decision problem H_0: θ = 0 versus H_1: θ ≠ 0. A surprising counterexample is given in the case where F_θ is N(θ, 1)
Active site prediction using evolutionary and structural information
Motivation: The identification of catalytic residues is a key step in understanding the function of enzymes. While a variety of computational methods have been developed for this task, accuracies have remained fairly low. The best existing method exploits information from sequence and structure to achieve a precision (the fraction of predicted catalytic residues that are catalytic) of 18.5% at a corresponding recall (the fraction of catalytic residues identified) of 57% on a standard benchmark. Here we present a new method, Discern, which provides a significant improvement over the state-of-the-art through the use of statistical techniques to derive a model with a small set of features that are jointly predictive of enzyme active sites
Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity
We study the problem of aggregation under the squared loss in the model of
regression with deterministic design. We obtain sharp PAC-Bayesian risk bounds
for aggregates defined via exponential weights, under general assumptions on
the distribution of errors and on the functions to aggregate. We then apply
these results to derive sparsity oracle inequalities
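As a rough illustration of aggregation by exponential weighting under squared loss, the sketch below weights each candidate function by its prior mass times exp(-RSS / beta), where RSS is its residual sum of squares at the design points. The names `exponential_weights` and `aggregate`, the temperature parameter `beta`, and the uniform default prior are assumptions made for this example; the paper's exact conditions on the temperature and prior are not reproduced here.

```python
import math


def exponential_weights(preds, y, beta, prior=None):
    """Return normalized weights proportional to prior_j * exp(-RSS_j / beta).

    preds: list of M prediction vectors, one per candidate function,
           each evaluated at the same n design points.
    y:     list of n observed responses.
    beta:  temperature parameter (assumed > 0); larger values give flatter weights.
    """
    m = len(preds)
    if prior is None:
        prior = [1.0 / m] * m  # uniform prior, an assumption of this sketch
    rss = [sum((yi - fi) ** 2 for yi, fi in zip(y, f)) for f in preds]
    base = min(rss)  # shift by the smallest RSS for numerical stability
    raw = [p * math.exp(-(r - base) / beta) for p, r in zip(prior, rss)]
    total = sum(raw)
    return [w / total for w in raw]


def aggregate(preds, weights):
    """Convex combination of the candidate predictions at each design point."""
    n = len(preds[0])
    return [sum(w * f[i] for w, f in zip(weights, preds)) for i in range(n)]


# Usage with two hypothetical candidates: one fits exactly, one predicts zero.
y = [1.0, 2.0, 3.0]
preds = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
weights = exponential_weights(preds, y, beta=1.0)
fitted = aggregate(preds, weights)
print(weights, fitted)
```

With a small temperature the weights concentrate on the best-fitting candidate; with a very large one they approach the prior.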
Variable selection for large p small n regression models with incomplete data: Mapping QTL with epistases
Background: Identifying quantitative trait loci (QTL) for both additive and epistatic effects raises the statistical issue of selecting variables from a large number of candidates using a small number of observations. Missing trait and/or marker values prevent one from directly applying the classical model selection criteria such as Akaike's information criterion (AIC) and Bayesian information criterion (BIC). Results: We propose a two-step Bayesian variable selection method which deals with the sparse parameter space and the small sample size issues. The regression coefficient priors are flexible enough to incorporate the characteristic of "large p small n" data. Specifically, sparseness and possible asymmetry of the significant coefficients are dealt with by developing a Gibbs sampling algorithm to stochastically search through low-dimensional subspaces for significant variables. The superior performance of the approach is demonstrated via simulation study. We also applied it to real QTL mapping datasets. Conclusion: The two-step procedure coupled with Bayesian classification offers flexibility in modeling "large p small n" data, especially for the sparse and asymmetric parameter space. This approach can be extended to other settings characterized by high dimension and low sample size.