Bayesian Approximate Kernel Regression with Variable Selection
Nonlinear kernel regression models are often used in statistics and machine
learning because they are more accurate than linear models. Variable selection
for kernel regression models is a challenge partly because, unlike the linear
regression setting, there is no clear concept of an effect size for regression
coefficients. In this paper, we propose a novel framework that provides an
effect size analog of each explanatory variable for Bayesian kernel regression
models when the kernel is shift-invariant (for example, the Gaussian kernel).
We use function analytic properties of shift-invariant reproducing kernel
Hilbert spaces (RKHS) to define a linear vector space that: (i) captures
nonlinear structure, and (ii) can be projected onto the original explanatory
variables. The projection onto the original explanatory variables serves as an
analog of effect sizes. The specific function analytic property we use is that
shift-invariant kernel functions can be approximated via random Fourier bases.
Based on the random Fourier expansion we propose a computationally efficient
class of Bayesian approximate kernel regression (BAKR) models for both
nonlinear regression and binary classification for which one can compute an
analog of effect sizes. We illustrate the utility of BAKR by examining two
important problems in statistical genetics: genomic selection (i.e. phenotypic
prediction) and association mapping (i.e. inference of significant variants or
loci). State-of-the-art methods for genomic selection and association mapping
are based on kernel regression and linear models, respectively. BAKR is the
first method that is competitive in both settings.Comment: 22 pages, 3 figures, 3 tables; theory added; new simulations
presented; references adde
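The random Fourier expansion that BAKR builds on is the standard construction for shift-invariant kernels: frequencies are sampled from the kernel's spectral density (Gaussian for the Gaussian kernel), giving an explicit random basis in which the nonlinear model becomes linear. A minimal NumPy sketch of that expansion (function and parameter names are illustrative, not taken from the paper):

```python
import numpy as np

def random_fourier_features(X, D=2000, sigma=1.0, seed=0):
    """Random Fourier feature map z(x) with z(x) @ z(y) ~= exp(-||x-y||^2 / (2*sigma**2)).

    For a shift-invariant kernel, frequencies are sampled from its spectral
    density; for the Gaussian kernel that density is itself Gaussian.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(p, D))   # spectral frequency samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)        # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# The approximate kernel matrix converges to the exact one as D grows.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, D=5000)
K_approx = Z @ Z.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-sq_dists / 2.0)
print(np.abs(K_approx - K_exact).max())  # approximation error shrinks as O(1/sqrt(D))
```

Because the model is linear in the random basis, ordinary regression coefficients on these features exist, which is what permits the projection back onto the original explanatory variables described above.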
Covariate dimension reduction for survival data via the Gaussian process latent variable model
The analysis of high dimensional survival data is challenging, primarily due
to the problem of overfitting which occurs when spurious relationships are
inferred from data that subsequently fail to exist in test data. Here we
propose a novel method of extracting a low dimensional representation of
covariates in survival data by combining the popular Gaussian Process Latent
Variable Model (GPLVM) with a Weibull Proportional Hazards Model (WPHM). The
combined model offers a flexible non-linear probabilistic method of detecting
and extracting any intrinsic low dimensional structure from high dimensional
data. By reducing the covariate dimension we aim to diminish the risk of
overfitting and increase the robustness and accuracy with which we infer
relationships between covariates and survival outcomes. In addition, we can
simultaneously combine information from multiple data sources by expressing
multiple datasets in terms of the same low dimensional space. We present
results from several simulation studies that illustrate a reduction in
overfitting and an increase in predictive performance, as well as successful
detection of intrinsic dimensionality. We provide evidence that it is
advantageous to combine dimensionality reduction with survival outcomes rather
than performing unsupervised dimensionality reduction on its own. Finally, we
use our model to analyse experimental gene expression data and detect and
extract a low dimensional representation that allows us to distinguish high and
low risk groups with superior accuracy compared to doing regression on the
original high dimensional data.
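The WPHM component scores a candidate low-dimensional representation through a censored Weibull proportional-hazards likelihood. A minimal sketch of that log-likelihood (all names are illustrative; in the combined model the low-dimensional covariates Z would be latent GPLVM coordinates rather than observed inputs):

```python
import numpy as np

def weibull_ph_loglik(t, delta, Z, beta, lam, nu):
    """Log-likelihood of a Weibull proportional hazards model with right-censoring.

    hazard: h(t | z) = lam * nu * t**(nu - 1) * exp(z @ beta)
    t     : event/censoring times, shape (n,)
    delta : 1 = event observed, 0 = right-censored, shape (n,)
    Z     : low-dimensional covariates, shape (n, q)
    """
    eta = Z @ beta
    log_h = np.log(lam) + np.log(nu) + (nu - 1.0) * np.log(t) + eta
    cum_h = lam * t**nu * np.exp(eta)   # cumulative hazard H(t | z)
    # Observed events contribute log h(t); everyone contributes -H(t).
    return np.sum(delta * log_h - cum_h)
```

Maximizing this quantity jointly with the GPLVM prior over Z is what makes the dimension reduction supervised by survival outcomes, rather than purely unsupervised.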
Statistical methods of SNP data analysis with applications
Various statistical methods important for genetic analysis are considered and
developed. Namely, we concentrate on multifactor dimensionality reduction
(MDR), logic regression, random forests and stochastic gradient boosting. These
methods and their new modifications, e.g., the MDR method with an "independent
rule", are used to study the risk of complex diseases such as cardiovascular
disease. The roles of certain combinations of single nucleotide polymorphisms
and external risk factors are examined. The data analysis concerning ischemic
heart disease and myocardial infarction was performed on the supercomputer SKIF
"Chebyshev" of Lomonosov Moscow State University.
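Of the methods listed, multifactor dimensionality reduction is the most self-contained to sketch: each multilocus genotype cell is labelled high- or low-risk by its case:control ratio, collapsing a pair of SNPs into a one-dimensional classifier. A minimal illustration of the exhaustive pairwise search, without the cross-validation and "independent rule" refinements developed in the paper (all names are illustrative):

```python
import numpy as np
from itertools import combinations

def mdr_best_pair(G, y, threshold=1.0):
    """Minimal multifactor dimensionality reduction (MDR) over SNP pairs.

    G : (n, p) genotype matrix coded 0/1/2; y : binary phenotype (1 = case).
    A two-SNP genotype cell is labelled high-risk when its case:control ratio
    exceeds `threshold`; the pair with the highest training accuracy is kept.
    """
    n, p = G.shape
    best_pair, best_acc = None, -1.0
    for i, j in combinations(range(p), 2):
        cell = G[:, i] * 3 + G[:, j]            # 9 possible two-locus cells
        pred = np.zeros(n, dtype=int)
        for c in np.unique(cell):
            mask = cell == c
            cases = y[mask].sum()
            controls = mask.sum() - cases
            ratio = cases / max(controls, 1e-9)  # avoid division by zero
            pred[mask] = int(ratio > threshold)
        acc = (pred == y).mean()
        if acc > best_acc:
            best_pair, best_acc = (i, j), acc
    return best_pair, best_acc
```

In practice the winning pair would be chosen by cross-validation consistency and prediction error on held-out folds rather than training accuracy, which overfits exactly as the survival abstract above warns.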
Ranking relations using analogies in biological and information networks
Analogical reasoning depends fundamentally on the ability to learn and
generalize about relations between objects. We develop an approach to
relational learning which, given a set of pairs of objects
S = {A(1):B(1), ..., A(N):B(N)}, measures how well other pairs A:B fit in with
the set S. Our work addresses the following question: is the relation between
objects A and B analogous to those relations found in S? Such questions are
particularly relevant in information retrieval, where an investigator might
want to search for analogous pairs of objects that match the query set of
interest. There are many ways in which objects can be related, making the task
of measuring analogies very challenging. Our approach combines a similarity
measure on function spaces with Bayesian analysis to produce a ranking. It
requires data containing features of the objects of interest and a link matrix
specifying which relationships exist; no further attributes of such
relationships are necessary. We illustrate the potential of our method on text
analysis and information networks. An application on discovering functional
interactions between pairs of proteins is discussed in detail, where we show
that our approach can work in practice even if a small set of protein pairs is
provided.
Comment: Published at http://dx.doi.org/10.1214/09-AOAS321 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).