2,685 research outputs found
Use of pre-transformation to cope with outlying values in important candidate genes
Outlying values in predictors often strongly affect the results of statistical analyses in high-dimensional settings. Although they frequently occur with most high-throughput techniques, the problem is often ignored in the literature. We suggest to use a very simple transformation, proposed before in a different context by Royston and Sauerbrei, as an intermediary step between array normalization and high-level statistical analysis. This straightforward univariate transformation identifies extreme values and reduces the influence of outlying values considerably in all further steps of statistical analysis without eliminating the incriminated observation or feature. The use of the transformation and its effects are demonstrated for diverse univariate and multivariate statistical analyses using nine publicly available microarray data sets
A Regression-Based Differential Expression Detection Algorithm for Microarray Studies with Ultra-Low Sample Size
Global gene expression analysis using microarrays and, more recently, RNA-seq, has allowed investigators to understand biological processes at a system level. However, the identification of differentially expressed genes in experiments with small sample size, high dimensionality, and high variance remains challenging, limiting the usability of these tens of thousands of publicly available, and possibly many more unpublished, gene expression datasets. We propose a novel variable selection algorithm for ultra-low-n microarray studies using generalized linear model-based variable selection with a penalized binomial regression algorithm called penalized Euclidean distance (PED). Our method uses PED to build a classifier on the experimental data to rank genes by importance. In place of cross-validation, which is required by most similar methods but not reliable for experiments with small sample size, we use a simulation-based approach to additively build a list of differentially expressed genes from the rank-ordered list. Our simulation-based approach maintains a low false discovery rate while maximizing the number of differentially expressed genes identified, a feature critical for downstream pathway analysis. We apply our method to microarray data from an experiment perturbing the Notch signaling pathway in Xenopus laevis embryos. This dataset was chosen because it showed very little differential expression according to limma, a powerful and widely-used method for microarray analysis. Our method was able to detect a significant number of differentially expressed genes in this dataset and suggest future directions for investigation. Our method is easily adaptable for analysis of data from RNA-seq and other global expression experiments with low sample size and high dimensionality
Cancer gene prioritization by integrative analysis of mRNA expression and DNA copy number data: a comparative review
A variety of genome-wide profiling techniques are available to probe
complementary aspects of genome structure and function. Integrative analysis of
heterogeneous data sources can reveal higher-level interactions that cannot be
detected based on individual observations. A standard integration task in
cancer studies is to identify altered genomic regions that induce changes in
the expression of the associated genes based on joint analysis of genome-wide
gene expression and copy number profiling measurements. In this review, we
provide a comparison among various modeling procedures for integrating
genome-wide profiling data of gene copy number and transcriptional alterations
and highlight common approaches to genomic data integration. A transparent
benchmarking procedure is introduced to quantitatively compare the cancer gene
prioritization performance of the alternative methods. The benchmarking
algorithms and data sets are available at http://intcomp.r-forge.r-project.orgComment: PDF file including supplementary material. 9 pages. Preprin
APPLE: Approximate Path for Penalized Likelihood Estimators
In high-dimensional data analysis, penalized likelihood estimators are shown
to provide superior results in both variable selection and parameter
estimation. A new algorithm, APPLE, is proposed for calculating the Approximate
Path for Penalized Likelihood Estimators. Both the convex penalty (such as
LASSO) and the nonconvex penalty (such as SCAD and MCP) cases are considered.
The APPLE efficiently computes the solution path for the penalized likelihood
estimator using a hybrid of the modified predictor-corrector method and the
coordinate-descent algorithm. APPLE is compared with several well-known
packages via simulation and analysis of two gene expression data sets.Comment: 24 pages, 9 figure
Classification of clinical outcomes using high-throughput and clinical informatics.
It is widely recognized that many cancer therapies are effective only for a subset of patients. However clinical studies are most often powered to detect an overall treatment effect. To address this issue, classification methods are increasingly being used to predict a subset of patients which respond differently to treatment. This study begins with a brief history of classification methods with an emphasis on applications involving melanoma. Nonparametric methods suitable for predicting subsets of patients responding differently to treatment are then reviewed. Each method has different ways of incorporating continuous, categorical, clinical and high-throughput covariates. For nonparametric and parametric methods, distance measures specific to the method are used to make classification decisions. Approaches are outlined which employ these distances to measure treatment interactions and predict patients more sensitive to treatment. Simulations are also carried out to examine empirical power of some of these classification methods in an adaptive signature design. Results were compared with logistic regression models. It was found that parametric and nonparametric methods performed reasonably well. Relative performance of the methods depends on the simulation scenario. Finally a method was developed to evaluate power and sample size needed for an adaptive signature design in order to predict the subset of patients sensitive to treatment. It is hoped that this study will stimulate more development of nonparametric and parametric methods to predict subsets of patients responding differently to treatment
Significance Analysis for Pairwise Variable Selection in Classification
The goal of this article is to select important variables that can
distinguish one class of data from another. A marginal variable selection
method ranks the marginal effects for classification of individual variables,
and is a useful and efficient approach for variable selection. Our focus here
is to consider the bivariate effect, in addition to the marginal effect. In
particular, we are interested in those pairs of variables that can lead to
accurate classification predictions when they are viewed jointly. To accomplish
this, we propose a permutation test called Significance test of Joint Effect
(SigJEff). In the absence of joint effect in the data, SigJEff is similar or
equivalent to many marginal methods. However, when joint effects exist, our
method can significantly boost the performance of variable selection. Such
joint effects can help to provide additional, and sometimes dominating,
advantage for classification. We illustrate and validate our approach using
both simulated example and a real glioblastoma multiforme data set, which
provide promising results.Comment: 28 pages, 7 figure
- …