Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It
We empirically show that Bayesian inference can be inconsistent under
misspecification in simple linear regression problems, both in a model
averaging/selection and in a Bayesian ridge regression setting. We use the
standard linear model, which assumes homoskedasticity, whereas the data are
heteroskedastic, and observe that the posterior puts its mass on ever more
high-dimensional models as the sample size increases. To remedy the problem, we
equip the likelihood in Bayes' theorem with an exponent called the learning
rate, and we propose the Safe Bayesian method to learn the learning rate from
the data. SafeBayes tends to select small learning rates as soon as the standard
posterior is not `cumulatively concentrated', and its results on our data are
quite encouraging.
Comment: 70 pages, 20 figures
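The learning-rate idea in the abstract above amounts to replacing the likelihood p(data | b) in Bayes' theorem with p(data | b)^eta. A minimal sketch, assuming a one-parameter model y = b*x + noise with known noise variance and a zero-mean Gaussian prior on b (these modelling choices and the function name `tempered_posterior` are illustrative assumptions, not taken from the paper). It only shows the effect of a fixed eta; it does not implement the SafeBayes procedure for learning eta from the data.

```python
def tempered_posterior(xs, ys, eta, noise_var=1.0, prior_var=1.0):
    """Posterior mean and variance of the slope b in y = b*x + noise.

    Assumes prior b ~ N(0, prior_var) and a Gaussian likelihood with known
    variance noise_var, raised to the power eta (the learning rate).
    eta = 1 gives the standard posterior; eta = 0 returns the prior.
    """
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Raising a Gaussian likelihood to the power eta scales its precision by eta.
    precision = eta * sxx / noise_var + 1.0 / prior_var
    mean = (eta * sxy / noise_var) / precision
    return mean, 1.0 / precision


if __name__ == "__main__":
    xs, ys = [1.0, 2.0], [2.0, 4.0]
    for eta in (0.0, 0.5, 1.0):
        print(eta, tempered_posterior(xs, ys, eta))
```

In this conjugate setting a smaller eta makes the data count for less, so the posterior stays closer to the prior and is wider.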
Survival ensembles by the sum of pairwise differences with application to lung cancer microarray studies
Lung cancer is among the most common cancers in the United States, in terms
of incidence and mortality. In 2009, it is estimated that more than 150,000
deaths will result from lung cancer alone. Genetic information is an extremely
valuable data source in characterizing the personal nature of cancer. Over the
past several years, investigators have conducted numerous association studies
where intensive genetic data is collected on relatively few patients compared
to the numbers of gene predictors, with one scientific goal being to identify
genetic features associated with cancer recurrence or survival. In this note,
we propose high-dimensional survival analysis through a new application of
boosting, a powerful tool in machine learning. Our approach is based on an
accelerated lifetime model and minimizing the sum of pairwise differences in
residuals. We apply our method to a recent microarray study of lung
adenocarcinoma and find that our ensemble is composed of 19 genes, while a
proportional hazards (PH) ensemble is composed of nine genes, a proper subset
of the 19-gene panel. In one of our simulation scenarios, we demonstrate that
PH boosting in a misspecified model tends to underfit and ignore
moderately-sized covariate effects, on average. Diagnostic analyses suggest
that the PH assumption is not satisfied in the microarray data and may explain,
in part, the discrepancy in the sets of active coefficients. Our simulation
studies and comparative data analyses demonstrate how statistical learning by
PH models alone is insufficient.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS426 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
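The "sum of pairwise differences in residuals" mentioned in the abstract above can be illustrated with a Gehan-type loss for an accelerated lifetime model. A minimal sketch, assuming residuals e_i = log(t_i) - f(x_i) and the loss (1/n^2) * sum_i sum_j delta_i * max(e_j - e_i, 0), where delta_i = 1 for an observed event and 0 for a censored one. The exact loss and normalization used in the paper may differ, and the function name `gehan_loss` is hypothetical.

```python
def gehan_loss(log_times, events, predictions):
    """Gehan-type pairwise loss on accelerated-lifetime residuals.

    log_times   : log of observed (possibly censored) survival times
    events      : 1 if the event was observed, 0 if censored
    predictions : model predictions f(x_i) on the log-time scale
    """
    n = len(log_times)
    residuals = [t - p for t, p in zip(log_times, predictions)]
    total = 0.0
    for i in range(n):
        if not events[i]:
            continue  # censored subjects contribute only as the "j" partner
        for j in range(n):
            # Penalize pairs where an uncensored residual is smaller than another residual.
            total += max(residuals[j] - residuals[i], 0.0)
    return total / (n * n)


if __name__ == "__main__":
    print(gehan_loss([1.0, 2.0], [1, 0], [0.0, 0.0]))
```

A boosting procedure of the kind the abstract describes would repeatedly fit base learners to the negative gradient of such a loss; that step is not shown here.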