23 research outputs found

    Finding Biomarker Signatures in Pooled Sample Designs: A Simulation Framework for Methodological Comparisons

    Detection of discriminating patterns in gene expression data can be accomplished with various methods of statistical learning. It has been proposed that sample pooling in this context has negative effects; however, pooling cannot always be avoided. We propose a simulation framework to explicitly investigate the parameters of patterns, experimental design, noise, and choice of method, in order to determine which effects on classification performance are to be expected. We use a two-group classification task and simulated gene expression data with independent differentially expressed genes, bivariate linear patterns, and the combination of both. Our results show a clear increase of prediction error with pool size. For pooled training sets, powered partial least squares discriminant analysis outperforms discriminant analysis, random forests, and support vector machines with linear or radial kernel in two of three simulated scenarios. The proposed simulation approach can be implemented to systematically investigate a number of additional scenarios of practical interest.
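The pooling effect described above can be sketched in a few lines. This is a minimal illustration, not the paper's simulation framework: `simulate_groups`, `pool_samples`, and all parameter values are illustrative assumptions, and a linear-kernel SVM stands in for the full set of classifiers compared in the study.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def simulate_groups(n_per_group=60, n_genes=200, n_de=20, effect=1.0):
    """Two-group expression data with independent differentially expressed genes."""
    X0 = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
    X1 = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
    X1[:, :n_de] += effect  # mean shift in the first n_de genes of group 1
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_group + [1] * n_per_group)
    return X, y

def pool_samples(X, y, pool_size):
    """Average non-overlapping groups of `pool_size` samples within each class."""
    Xp, yp = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        for i in range(0, len(Xc) - pool_size + 1, pool_size):
            Xp.append(Xc[i:i + pool_size].mean(axis=0))
            yp.append(label)
    return np.array(Xp), np.array(yp)

X, y = simulate_groups()
for pool_size in (1, 2, 4):
    Xp, yp = pool_samples(X, y, pool_size)
    acc = cross_val_score(SVC(kernel="linear"), Xp, yp, cv=5).mean()
    print(f"pool size {pool_size}: CV error {1 - acc:.2f}")
```

Pooling here reduces the number of training samples available to the classifier, which is one mechanism behind the increase of prediction error with pool size reported in the abstract.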

    Biomarker discovery in heterogeneous tissue samples -taking the in-silico deconfounding approach

    Background: For heterogeneous tissues, such as blood, measurements of gene expression are confounded by the relative proportions of the cell types involved. Conclusions have to rely on estimation of gene expression signals for homogeneous cell populations, e.g. by applying micro-dissection, fluorescence-activated cell sorting, or in-silico deconfounding. We studied the feasibility and validity of a non-negative matrix decomposition algorithm using experimental gene expression data for blood and sorted cells from the same donor samples. Our objective was to optimize the algorithm with regard to the detection of differentially expressed genes and to enable its use for classification in the difficult scenario of reversely regulated genes. This would be of importance for the identification of candidate biomarkers in heterogeneous tissues.
    Results: Experimental data and simulation studies involving noise parameters estimated from these data revealed that, for valid detection of differential gene expression, quantile normalization and the use of non-log data are optimal. We demonstrate the feasibility of predicting the proportions of constituent cell types from the gene expression data of single samples, as a prerequisite for a deconfounding-based classification approach. Classification cross-validation errors with and without deconfounding are reported, as well as sample-size dependencies. An implementation of the algorithm and the simulation and analysis scripts are available.
    Conclusions: The deconfounding algorithm without decorrelation, using quantile normalization on non-log data, is proposed for biomarkers that are difficult to detect, and for cases where confounding by varying proportions of cell types is the suspected reason. In this case, a deconfounding ranking approach can be used as a powerful alternative to, or complement of, other statistical learning approaches to define candidate biomarkers for molecular diagnosis and prediction in biomedicine, under realistically noisy conditions and with moderate sample sizes.
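The proportion-prediction step mentioned in the Results can be sketched with non-negative least squares. This is a simplified stand-in, not the paper's algorithm: the study decomposes the full expression matrix without known signatures, whereas this sketch assumes the cell-type signature matrix `S` is given and only recovers the mixing proportions of a single heterogeneous sample. All names and parameter values are illustrative.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
n_genes, n_types = 500, 3

# Hypothetical cell-type-specific expression signatures (genes x cell types).
S = rng.lognormal(mean=2.0, sigma=0.5, size=(n_genes, n_types))

# A heterogeneous "blood" sample: a noisy non-negative mixture of the types.
true_props = np.array([0.6, 0.3, 0.1])
b = S @ true_props + rng.normal(0.0, 0.5, size=n_genes)

# Estimate mixing proportions by non-negative least squares, then renormalize
# so the estimated proportions sum to one.
coef, _ = nnls(S, b)
est_props = coef / coef.sum()
print(np.round(est_props, 2))
```

With many genes and moderate noise, the renormalized NNLS coefficients closely recover the true proportions, which is the prerequisite the abstract states for deconfounding-based classification.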

    Phalaena sp.

    Classification studies are widely applied, e.g. in biomedical research, to classify objects/patients into predefined groups. The goal is to find a classification function/rule which assigns each object/patient to a unique group with the greatest possible accuracy (lowest classification error). In gene expression experiments in particular, many variables (genes) are often measured for only a few objects/patients. A suitable approach is the well-known method PLS-DA, which searches for a transformation to a lower-dimensional space; the resulting new components are linear combinations of the original variables. An advancement of PLS-DA leads to PPLS-DA, introducing a so-called 'power parameter' which is optimized to maximize the correlation between the components and the group membership. We introduce an extension of PPLS-DA that optimizes this power parameter towards the final aim, namely a minimal classification error. We compare this new extension with the original PPLS-DA and also with the ordinary PLS-DA using simulated and experimental datasets. For the investigated data sets with weak linear dependency between features/variables, no improvement is shown for PPLS-DA or for the extensions compared to PLS-DA. A very weak linear dependency and a low proportion of differentially expressed genes in the simulated data do not lead to an improvement of PPLS-DA over PLS-DA, but our extension shows a lower prediction error. By contrast, for the data set with strong between-feature collinearity, a low proportion of differentially expressed genes, and a large total number of genes, the prediction error of PPLS-DA and the extensions is clearly lower than that of PLS-DA. Moreover, we compare these prediction results with those of support vector machines with linear kernel and of linear discriminant analysis.

    The mean number of components used for simulated data for and  = 0.1.


    Rough overview of the proposed extension of PPLS-DA.


    Mean PE of PPLS-DA using , and , PLS-DA, t-LDA and SVM for the five cases of the simulated data.


    Condition index for the first five eigenvalues.

    The condition index ( number of features) is used as a measure of variable dependence, with the eigenvalue of . It can be assumed that . The increase of the first five condition indexes () reflects the collinearity of the features: a rapid increase means the features are strongly linearly dependent, while a weak increase implies a weak dependence.
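The caption's missing symbols suggest the standard definition of the condition index, where the k-th index is the square root of the ratio of the largest eigenvalue to the k-th largest eigenvalue of the (centered) cross-product matrix. The sketch below assumes that standard definition; the function name and test data are illustrative, not from the paper.

```python
import numpy as np

def condition_indexes(X, k=5):
    """First k condition indexes sqrt(lambda_1 / lambda_j) of the centered
    cross-product matrix X^T X, eigenvalues sorted in decreasing order."""
    Xc = X - X.mean(axis=0)                     # center the features
    lam = np.linalg.eigvalsh(Xc.T @ Xc)[::-1]   # decreasing eigenvalues
    lam = lam[:k]
    return np.sqrt(lam[0] / lam)

rng = np.random.default_rng(3)
# Nearly independent features: condition indexes grow slowly.
X_indep = rng.normal(size=(100, 20))
# Strongly collinear features (a shared factor plus small noise):
# later condition indexes blow up quickly.
base = rng.normal(size=(100, 1))
X_coll = base + 0.05 * rng.normal(size=(100, 20))

print(np.round(condition_indexes(X_indep), 2))
print(np.round(condition_indexes(X_coll), 2))
```

A rapid increase of the indexes for `X_coll` mirrors the caption's criterion for strong linear dependence among features.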

    Extension of PPLS-DA - for stepsize and .

    The power parameter is denoted by , and the prediction error (the number of wrongly classified samples of the inner test set) is abbreviated PE.  is varied in 11 steps (). Cj, j = 1, …, 5 is short for the jth component. The function min(f) takes the minimum of function . The cross-validation procedure consists of random splits of the outer training set into proportions of 0.7 (training set) and 0.3 (test set); the cross-validation steps correspond to sampling with replacement. The optimal -value and the optimal number of components are determined after 50 repeats.

    Overview of the experimental data sets.

    a For the determination of the number of differentially expressed genes () we use a t-test (from the R package stats) and an FDR correction [15] (R package qvalue). We count all genes with a q-value below 0.05.
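The counting procedure in this footnote, gene-wise t-tests followed by an FDR correction and a q < 0.05 cutoff, can be sketched as follows. Note the swap: the paper uses Storey's q-value (R package qvalue), while this sketch uses the Benjamini-Hochberg adjustment from statsmodels as a stand-in; the simulated data and effect size are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(5)
n_per_group, n_genes, n_de = 30, 1000, 50

A = rng.normal(size=(n_per_group, n_genes))
B = rng.normal(size=(n_per_group, n_genes))
B[:, :n_de] += 1.5  # true differential expression in the first n_de genes

# Gene-wise two-sample t-tests, then FDR adjustment; count adjusted p < 0.05.
_, pvals = ttest_ind(A, B, axis=0)
reject, qvals, _, _ = multipletests(pvals, method="fdr_bh")
n_sig = int((qvals < 0.05).sum())
print(n_sig)
```

With a strong effect, the count lands close to the number of truly differentially expressed genes, with the FDR correction keeping false discoveries among the 950 null genes to roughly 5% of the calls.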