Implementing a Class of Permutation Tests: The coin Package
The R package coin implements a unified approach to permutation tests, providing a broad class of independence tests for nominal, ordered, numeric, and censored data, as well as multivariate data at mixed scales. Based on a rich and flexible conceptual framework that embeds different permutation test procedures into a common theory, the computational framework established in coin likewise embeds the corresponding R functionality in a common S4 class structure with associated generic functions. As a consequence, the computational tools in coin inherit the flexibility of the underlying theory, and conditional inference functions for important special cases can be set up easily. Conditional versions of classical tests, such as tests for location and scale problems in two or more samples, independence in two- or three-way contingency tables, or association problems for censored, ordered categorical, or multivariate data, can easily be implemented as special cases using this computational toolbox by choosing appropriate transformations of the observations. The paper gives a detailed exposition of both the internal structure of the package and the user interfaces it provides, along with examples of how to extend the implemented functionality.
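As an illustration of the interface the paper describes, the sketch below runs a conditional two-sample location test with coin, first through the general `independence_test()` entry point and then through the classical special case `wilcox_test()`; both functions are part of coin's documented interface, while the data set and variable names are made up for the example.

```r
## Illustrative sketch: conditional two-sample location test with coin.
## The data are simulated; only the coin calls reflect the package's API.
library(coin)

set.seed(1)
d <- data.frame(
  y = c(rnorm(20), rnorm(20, mean = 0.8)),            # numeric response
  g = factor(rep(c("control", "treated"), each = 20)) # two samples
)

## General framework: test independence of y and g, with the permutation
## null distribution approximated by Monte Carlo resampling.
independence_test(y ~ g, data = d, distribution = "approximate")

## The same problem as a classical special case: a permutation
## Wilcoxon-Mann-Whitney test with exact null distribution.
wilcox_test(y ~ g, data = d, distribution = "exact")
```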
Modeling association between DNA copy number and gene expression with constrained piecewise linear regression splines
DNA copy number and mRNA expression are widely used data types in cancer
studies; combined, they provide more insight than either alone. Whereas the
existing literature fixes the form of the relationship between these two types
of markers a priori, in this paper we model their association. We employ
piecewise linear regression splines (PLRS), which combine good interpretation
with sufficient flexibility to identify any plausible type of relationship. The
specification of the model leads to estimation and model selection in a
constrained, nonstandard setting. We provide methodology for testing the effect
of DNA on mRNA and choosing the appropriate model. Furthermore, we present a
novel approach to obtain reliable confidence bands for constrained PLRS, which
incorporates model uncertainty. The procedures are applied to colorectal and
breast cancer data. Common assumptions are found to be potentially misleading
for biologically relevant genes. More flexible models may bring more insight
into the interaction between the two markers.
Comment: Published at http://dx.doi.org/10.1214/12-AOAS605 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
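The paper's constrained PLRS methodology (shape constraints, model selection, and confidence bands) is not reproduced here; the base-R sketch below only shows the piecewise linear spline form itself, fitted by ordinary least squares with a truncated power basis. The knot location and the simulated data are assumptions for illustration.

```r
## Minimal sketch of a piecewise linear spline fit in base R.
## This ignores the paper's shape constraints and model selection;
## the knot location (tau) and the simulated data are assumptions.
set.seed(2)
cn   <- runif(200, 0, 6)                        # DNA copy number
expr <- 1 + 0.2 * cn + 0.9 * pmax(cn - 2, 0) +  # kink at cn = 2
        rnorm(200, sd = 0.5)                    # mRNA expression

tau <- 2  # assumed knot, e.g. at the normal copy number state
fit <- lm(expr ~ cn + pmax(cn - tau, 0))
summary(fit)$coefficients  # the pmax() coefficient is the slope change at tau
```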
Fast marginal likelihood estimation of penalties for group-adaptive elastic net
Nowadays, clinical research routinely uses omics data, such as gene
expression, for predicting clinical outcomes or selecting markers.
Additionally, so-called co-data are often available, providing complementary
information on the covariates, like p-values from previously published studies
or groups of genes corresponding to pathways. Elastic net penalisation is
widely used for prediction and covariate selection. Group-adaptive elastic net
penalisation learns from co-data to improve the prediction and covariate
selection, by penalising important groups of covariates less than other groups.
Existing methods are, however, computationally expensive. Here we present a
fast method for marginal likelihood estimation of group-adaptive elastic net
penalties for generalised linear models. We first derive a low-dimensional
representation of the Taylor approximation of the marginal likelihood and its
first derivative for group-adaptive ridge penalties, to efficiently estimate
these penalties. Then we show by using asymptotic normality of the linear
predictors that the marginal likelihood for elastic net models may be
approximated well by the marginal likelihood for ridge models. The ridge group
penalties are then transformed to elastic net group penalties by using the
variance function. The method allows for overlapping groups and unpenalised
variables. We demonstrate the method in a model-based simulation study and an
application to cancer genomics. The method substantially decreases computation
time and outperforms or matches other methods by learning from co-data.
Comment: 16 pages, 6 figures, 1 table
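To make the end product concrete: once group-specific penalty multipliers have been estimated, they can be applied through a standard elastic net fitter. The sketch below uses glmnet's documented `penalty.factor` argument for this; the group multipliers are placeholders, not estimates produced by the paper's marginal likelihood method.

```r
## Sketch of group-differentiated elastic net via glmnet's penalty.factor.
## The multipliers below are placeholders; the paper's contribution is to
## *estimate* such group penalties quickly by marginal likelihood.
library(glmnet)

set.seed(3)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)
beta <- c(rnorm(20, sd = 0.5), rep(0, p - 20))  # signal sits in group 1
y <- rbinom(n, 1, plogis(X %*% beta))

## Co-data: covariates 1-100 in group 1 (e.g. a known pathway), rest in group 2.
group <- rep(1:2, each = 100)
mult  <- c(0.5, 2)[group]   # assumed multipliers: penalise group 1 less

fit <- glmnet(X, y, family = "binomial", alpha = 0.5,  # alpha = 0.5: elastic net
              penalty.factor = mult)
```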
Linked shrinkage to improve estimation of interaction effects in regression models
We address a classical problem in statistics: adding two-way interaction
terms to a regression model. Because the number of interaction terms grows
quadratically with the number of covariates, we develop an estimator that
adapts well to this growth, while providing accurate estimates and
appropriate inference. Existing strategies
overcome the dimensionality problem by only allowing interactions between
relevant main effects. Building on this philosophy, we implement a softer link
between the two types of effects using a local shrinkage model. We empirically
show that borrowing strength between the amount of shrinkage for main effects
and their interactions can strongly improve estimation of the regression
coefficients. Moreover, we evaluate the potential of the model for inference,
which is notoriously hard for selection strategies. Large-scale cohort data are
used to provide realistic illustrations and evaluations. Comparisons with other
methods are provided. The evaluation of variable importance is not trivial in
regression models with many interaction terms. Therefore, we derive a new
analytical formula for the Shapley value, which enables rapid assessment of
individual-specific variable importance scores and their uncertainties.
Finally, while not targeting prediction, we show that our models can be very
competitive with a more advanced machine learner, such as random forest, even
for fairly large sample sizes. The implementation of our method in RStan is
fairly straightforward, allowing for adjustments to specific needs.
Comment: 28 pages, 18 figures
Symbolic computation and exact distributions of nonparametric test statistics
We show how to use computer algebra for computing exact distributions of nonparametric statistics. We give several examples of nonparametric statistics with explicit probability generating functions that can be handled this way. In particular, we give a new table of critical values of the Jonckheere-Terpstra test that extends the tables known in the literature.
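The paper works symbolically and treats the Jonckheere-Terpstra test; as a simpler numeric illustration of the same probability-generating-function idea, the base-R sketch below expands the pgf of the Wilcoxon signed-rank statistic by polynomial multiplication and checks the result against base R's dsignrank.

```r
## The pgf of the Wilcoxon signed-rank statistic W_n is
## prod_{i=1}^{n} (1 + q^i) / 2; expanding it as a polynomial in q
## yields the exact null distribution. Checked against dsignrank.
exact_signrank <- function(n) {
  p <- 1                                # pgf coefficients, starting at exponent 0
  for (i in seq_len(n)) {
    p <- (c(p, numeric(i)) + c(numeric(i), p)) / 2  # multiply by (1 + q^i)/2
  }
  p                                     # p[k + 1] = P(W_n = k)
}

all.equal(exact_signrank(12), dsignrank(0:78, 12))  # TRUE
```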
Fast cross-validation for multi-penalty ridge regression
High-dimensional prediction with multiple data types needs to account for
potentially strong differences in predictive signal. Ridge regression is a
simple model for high-dimensional data that has challenged the predictive
performance of many more complex models and learners, and that allows inclusion
of data type specific penalties. The largest challenge for multi-penalty ridge
is to optimize these penalties efficiently in a cross-validation (CV) setting,
in particular for GLM and Cox ridge regression, which require an additional
estimation loop by iterative weighted least squares (IWLS). Our main
contribution is a computationally very efficient formula for the multi-penalty,
sample-weighted hat-matrix, as used in the IWLS algorithm. As a result, nearly
all computations are in low-dimensional space, yielding a speed-up of several
orders of magnitude. We developed a flexible framework that facilitates
multiple types of response, unpenalized covariates, several performance
criteria and repeated CV. Extensions to paired and preferential data types are
included and illustrated on several cancer genomics survival prediction
problems. Moreover, we present similar computational shortcuts for maximum
marginal likelihood and Bayesian probit regression. The corresponding
R package, multiridge, serves as a versatile standalone tool, but also as a
fast benchmark for other, more complex models and multi-view learners.
Normalized, Segmented or Called aCGH Data?
Array comparative genomic hybridization (aCGH) is a high-throughput lab technique for measuring genome-wide chromosomal copy numbers. Data from aCGH experiments require extensive pre-processing, which consists of three steps: normalization, segmentation, and calling. Each of these pre-processing steps yields a different data set: normalized data, segmented data, and called data. Publications using aCGH base their findings on data from all stages of the pre-processing, so there is no consensus on which should be used for further downstream analysis. Such consensus is, however, important for correct reporting of findings and for comparison of results across studies. We discuss several issues that should be taken into account when deciding which data to use. We express the belief that called data are best used, but would welcome opposing views.
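To make the three data types concrete, the toy base-R sketch below simulates one chromosome's log2 ratios and derives the normalized, segmented, and called versions of the same profile. The segment boundaries and calling thresholds are assumptions; real pipelines estimate breakpoints with a segmentation algorithm such as CBS.

```r
## Toy illustration of the three aCGH data types on one simulated
## chromosome; breakpoints and calling thresholds are assumptions,
## not the actual normalization/segmentation/calling algorithms.
set.seed(6)
truth      <- rep(c(0, 0.58, 0, -1), times = c(60, 30, 80, 30))  # log2 ratios
normalized <- truth + rnorm(length(truth), sd = 0.3)

## "Segmented" data: per-segment means (true breakpoints assumed known here).
seg_id    <- rep(1:4, times = c(60, 30, 80, 30))
segmented <- ave(normalized, seg_id)

## "Called" data: discretize segment means into loss / normal / gain.
called <- cut(segmented, breaks = c(-Inf, -0.3, 0.3, Inf),
              labels = c("loss", "normal", "gain"))
table(called)
```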