
    A powerful method for detecting differentially expressed genes from GeneChip arrays that does not require replicates

    BACKGROUND: Studies of differential expression that use Affymetrix GeneChip arrays are often carried out with a limited number of replicates. Reasons for this include financial considerations and limits on the available amount of RNA for sample preparation. In addition, failed hybridizations are not uncommon, leading to a further reduction in the number of replicates available for analysis. Most existing methods for studying differential expression rely on the availability of replicates, and the demand for alternative methods that require few or no replicates is high. RESULTS: We describe a statistical procedure for performing differential expression analysis without replicates. The procedure relies on a Bayesian integrated approach (BGX) to the analysis of Affymetrix GeneChips. The BGX method estimates a posterior distribution of expression for each gene and condition, from a simultaneous consideration of the available probe intensities representing the gene in a condition. Importantly, posterior distributions of expression are obtained regardless of the number of replicates available. We exploit these posterior distributions to create ranked gene lists that take into account the estimated expression difference as well as its associated uncertainty. We estimate the proportion of non-differentially expressed genes empirically, allowing an informed choice of cut-off for the ranked gene list, adapting an approach proposed by Efron. We assess the performance of the method, and compare it to those of other methods, on publicly available spike-in data sets, as well as in a proper biological setting. CONCLUSION: The method presented is a powerful tool for extracting information on differential expression from GeneChip expression studies with limited or no replicates.
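    The ranking idea described above can be illustrated with a small sketch: given posterior draws of log expression for each gene under two conditions (for example, from MCMC output), genes can be ranked by the posterior evidence that their expression differs. This is only an illustration of the general principle, not the BGX implementation; the array shapes and the two-sided scoring rule are assumptions.

```python
import numpy as np

def rank_by_posterior_evidence(post_a, post_b):
    """Rank genes by posterior evidence of differential expression.

    post_a, post_b: (n_genes, n_draws) posterior draws of log expression
    under conditions A and B (hypothetical inputs from an MCMC run).
    """
    diff = post_b - post_a                     # draws of the expression difference
    p_up = (diff > 0.0).mean(axis=1)           # posterior probability of up-regulation
    evidence = np.maximum(p_up, 1.0 - p_up)    # two-sided evidence of a change
    return np.argsort(-evidence)               # gene indices, strongest evidence first
```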

    A close examination of double filtering with fold change and t test in microarray analysis

    Background: Many researchers use the double filtering procedure with fold change and t test to identify differentially expressed genes, in the hope that the double filtering will provide extra confidence in the results. Due to its simplicity, the double filtering procedure has been popular with applied researchers despite the development of more sophisticated methods. Results: This paper, for the first time to our knowledge, provides theoretical insight on the drawback of the double filtering procedure. We show that fold change assumes all genes to have a common variance while the t statistic assumes gene-specific variances. The two statistics are based on contradicting assumptions. Under the assumption that gene variances arise from a mixture of a common variance and gene-specific variances, we develop the theoretically most powerful likelihood ratio test statistic. We further demonstrate that the posterior inference based on a Bayesian mixture model and the widely used significance analysis of microarrays (SAM) statistic are better approximations to the likelihood ratio test than the double filtering procedure. Conclusion: We demonstrate through hypothesis testing theory, simulation studies and real data examples that well constructed shrinkage testing methods, which can be united under the mixture gene variance assumption, can considerably outperform the double filtering procedure.
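    For concreteness, the double filtering procedure critiqued above can be sketched in a few lines: a gene is reported only if it passes both an absolute fold-change cutoff and a t-test p-value cutoff. The function below is a minimal illustration with hypothetical inputs (log2 expression matrices with replicates in columns), not code from the paper.

```python
import numpy as np
from scipy import stats

def double_filter(expr_a, expr_b, fc_cutoff=2.0, p_cutoff=0.05):
    """Flag genes passing both a fold-change and a t-test filter.

    expr_a, expr_b: (n_genes, n_replicates) log2 intensities for the two conditions.
    """
    log_fc = expr_b.mean(axis=1) - expr_a.mean(axis=1)        # per-gene log2 fold change
    _, p_val = stats.ttest_ind(expr_b, expr_a, axis=1)        # per-gene two-sample t-test
    return (np.abs(log_fc) >= np.log2(fc_cutoff)) & (p_val <= p_cutoff)
```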

    Assessing and selecting gene expression signals based upon the quality of the measured dynamics

    Background: One of the challenges with modeling the temporal progression of biological signals is dealing with the effect of noise and the limited number of replicates at each time point. Given the rising interest in utilizing predictive mathematical models to describe the biological response of an organism, or analyses such as clustering and gene ontology enrichment, it is important to determine whether the dynamic progression of the data has been accurately captured despite the limited number of replicates, so that one can have confidence that the results of the analysis capture the important salient dynamic features. Results: By pre-selecting genes based upon quality before identifying differential expression via algorithms such as EDGE, it was found that the percentage of statistically enriched ontologies (p < .05) was improved. Furthermore, a majority of the genes found via the proposed technique were also selected by EDGE, though the reverse was not necessarily true. It was also found that the improvements offered by the proposed algorithm are anti-correlated with improvements in the various microarray platforms and with the number of replicates: newer arrays and experiments with more replicates show less improvement when the filtering for quality is run before the selection of differentially expressed genes. This suggests that the increase in the number of replicates, as well as improvements in array technologies, increases the confidence one has in the dynamics obtained from the experiment. Conclusion: We have developed an algorithm that quantifies the quality of a temporal biological signal rather than whether the signal shows a significant change over the experimental time course. Because the use of these temporal signals, whether in mathematical modeling or clustering, focuses upon the entire time series, it is necessary to develop a method to quantify and select for signals which conform to this ideal. By doing this, we have demonstrated a marked and consistent improvement in the results of a clustering exercise over multiple experiments, microarray platforms, and experimental designs.

    Probe-level linear model fitting and mixture modeling results in high accuracy detection of differential gene expression

    BACKGROUND: The identification of differentially expressed genes (DEGs) from Affymetrix GeneChip arrays is currently done by first computing expression levels from the low-level probe intensities, then deriving significance by comparing these expression levels between conditions. The proposed PL-LM (Probe-Level Linear Model) method implements a linear model applied to the probe-level data to directly estimate the treatment effect. A finite mixture of Gaussian components is then used to identify DEGs using the coefficients estimated by the linear model. This approach can readily be applied to experimental designs with or without replication. RESULTS: On a wholly defined dataset, the PL-LM method was able to identify 75% of the differentially expressed genes within 10% of false positives. This accuracy was achieved both using the three replicates per condition available in the dataset and using only one replicate per condition. CONCLUSION: The method achieves, on this dataset, a higher accuracy than the best set of tools identified by the authors of the dataset, and does so using only one replicate per condition.
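    As a rough sketch of the general idea (not the published PL-LM code), one can fit an ordinary least-squares model with a treatment term to the probes of each probe set and then feed the estimated treatment coefficients into a Gaussian mixture to separate changed from unchanged genes. The input shapes and the simple per-probe averaging below are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def treatment_effect(probe_intensities, condition):
    """Average least-squares treatment coefficient over the probes of one probe set.

    probe_intensities: (n_probes, n_arrays) log intensities;
    condition: (n_arrays,) 0/1 treatment indicator.
    """
    X = np.column_stack([np.ones(condition.size), condition.astype(float)])
    effects = [np.linalg.lstsq(X, y, rcond=None)[0][1] for y in probe_intensities]
    return float(np.mean(effects))

def mixture_labels(effects, n_components=2):
    """Fit a Gaussian mixture to per-gene effects; one component gathers candidate DEGs."""
    e = np.asarray(effects).reshape(-1, 1)
    return GaussianMixture(n_components=n_components, random_state=0).fit(e).predict(e)
```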

    A comprehensive re-analysis of the Golden Spike data: Towards a benchmark for differential expression methods

    Background: The Golden Spike data set has been used to validate a number of methods for summarizing Affymetrix data sets, sometimes with seemingly contradictory results. Much less use has been made of this data set to evaluate differential expression methods. It has been suggested that this data set should not be used for method comparison due to a number of inherent flaws. Results: We have used this data set in a comparison of methods which is far more extensive than any previous study. We outline six stages in the analysis pipeline where decisions need to be made, and show how the results of these decisions can lead to the apparently contradictory results previously found. We also show that, while flawed, this data set is still a useful tool for method comparison, particularly for identifying combinations of summarization and differential expression methods that are unlikely to perform well on real data sets. We describe a new benchmark, AffyDEComp, that can be used for such a comparison. Conclusion: We conclude with recommendations for preferred Affymetrix analysis tools, and for the development of future spike-in data sets.

    Identification of deleterious non-synonymous single nucleotide polymorphisms using sequence-derived information

    Background: As the number of non-synonymous single nucleotide polymorphisms (nsSNPs), also known as single amino acid polymorphisms (SAPs), increases rapidly, computational methods that can distinguish disease-causing SAPs from neutral SAPs are needed. Many methods have been developed to distinguish disease-causing SAPs based on both structural and sequence features of the mutation point. One limitation of these methods is that they are not applicable when protein structures are not available. In this study, we explore the feasibility of classifying SAPs into disease-causing and neutral mutations using only information derived from the protein sequence. Results: We compiled a set of 686 features derived from protein sequence. For each feature, the distance between the wild-type residue and the mutant-type residue was computed. A greedy approach was then used to select the features useful for the classification of SAPs, and ten features were selected. Using the selected features, a decision tree method can achieve 82.6% overall accuracy with a 0.607 Matthews correlation coefficient (MCC) in cross-validation. When tested on an independent set not seen by the method during training and feature selection, the decision tree method achieves 82.6% overall accuracy with 0.604 MCC. We also evaluated the proposed method on all SAPs obtained from Swiss-Prot; it achieves 0.42 MCC with 73.2% overall accuracy. This method allows users to make reliable predictions when protein structures are not available. Unlike previous studies, in which only a small set of features was arbitrarily chosen and considered, here we used an automated method to systematically discover useful features from a large set of features well annotated in public databases. Conclusion: The proposed method is a useful tool for the classification of SAPs, especially when the structure of the protein is not available.
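    A minimal sketch of the evaluation step described above (not the authors' pipeline): train a decision tree on sequence-derived feature distances and report overall accuracy and MCC from cross-validated predictions. The feature matrix, labels, and tree depth below are placeholders.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, matthews_corrcoef

def evaluate_sap_tree(X, y, folds=10):
    """X: (n_saps, n_features) wild-type vs mutant feature distances; y: 1 = disease-causing, 0 = neutral."""
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    pred = cross_val_predict(tree, X, y, cv=folds)       # out-of-fold predictions
    return accuracy_score(y, pred), matthews_corrcoef(y, pred)
```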

    Empirical Bayes models for multiple probe type microarrays at the probe level

    Background: When analyzing microarray data, a primary objective is often to find differentially expressed genes. With empirical Bayes and penalized t-tests, the sample variances are adjusted towards a global estimate, producing more stable results compared to ordinary t-tests. However, for Affymetrix-type data a clear dependency between variability and intensity level generally exists, even for logged intensities, most clearly for data at the probe level but also for probe-set summaries such as the MAS5 expression index. As a consequence, adjustment towards a global estimate results in an intensity-level dependent false positive rate. Results: We propose two new methods for finding differentially expressed genes, Probe level Locally moderated Weighted median-t (PLW) and Locally Moderated Weighted-t (LMW). Both methods use an empirical Bayes model taking the dependency between variability and intensity level into account. A global covariance matrix is also used, allowing for differing variances between arrays as well as array-to-array correlations. PLW is specially designed for Affymetrix-type arrays (or other multiple-probe arrays). Instead of making inference on probe-set summaries, comparisons are made separately for each perfect-match probe and are then summarized into one score for the probe-set. Conclusion: The proposed methods are compared to 14 existing methods using five spike-in data sets. For RMA and GCRMA processed data, PLW has the most accurate ranking of regulated genes in four out of the five data sets, and LMW consistently performs better than all examined moderated t-tests when used on RMA, GCRMA, and MAS5 expression indexes.
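    The variance-shrinkage idea referred to above (adjusting per-gene sample variances towards a common estimate) can be illustrated with a toy moderated t-statistic. This sketch shrinks towards a single global variance and does not reproduce the intensity-dependent, local moderation of PLW/LMW; the prior degrees of freedom d0 and the crude global estimate are assumptions.

```python
import numpy as np

def moderated_t(expr_a, expr_b, d0=4.0):
    """Toy moderated t: per-gene variances shrunk towards a global estimate.

    expr_a, expr_b: (n_genes, n_replicates) log intensities for two conditions.
    """
    na, nb = expr_a.shape[1], expr_b.shape[1]
    diff = expr_b.mean(axis=1) - expr_a.mean(axis=1)
    # pooled per-gene sample variance
    pooled = ((na - 1) * expr_a.var(axis=1, ddof=1)
              + (nb - 1) * expr_b.var(axis=1, ddof=1)) / (na + nb - 2)
    s0_sq = pooled.mean()                                 # crude global variance estimate
    d = na + nb - 2
    s_tilde_sq = (d0 * s0_sq + d * pooled) / (d0 + d)     # shrunken per-gene variance
    return diff / np.sqrt(s_tilde_sq * (1.0 / na + 1.0 / nb))
```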

    Ranking differentially expressed genes from Affymetrix gene expression data: methods with reproducibility, sensitivity, and specificity

    Background: To identify differentially expressed genes (DEGs) from microarray data, users of the Affymetrix GeneChip system need to select both a preprocessing algorithm to obtain expression-level measurements and a way of ranking genes to obtain the most plausible candidates. We recently recommended suitable combinations of a preprocessing algorithm and gene ranking method that can be used to identify DEGs with a higher level of sensitivity and specificity. However, in addition to these recommendations, researchers also want to know which combinations enhance reproducibility. Results: We compared eight conventional methods for ranking genes: weighted average difference (WAD), average difference (AD), fold change (FC), rank products (RP), moderated t statistic (modT), significance analysis of microarrays (samT), shrinkage t statistic (shrinkT), and intensity-based moderated t statistic (ibmT) with six preprocessing algorithms (PLIER, VSN, FARMS, multi-mgMOS (mmgMOS), MBEI, and GCRMA). A total of 36 real experimental datasets was evaluated on the basis of the area under the receiver operating characteristic curve (AUC) as a measure for both sensitivity and specificity. We found that the RP method performed well for VSN-, FARMS-, MBEI-, and GCRMA-preprocessed data, and the WAD method performed well for mmgMOS-preprocessed data. Our analysis of the MicroArray Quality Control (MAQC) project's datasets showed that the FC-based gene ranking methods (WAD, AD, FC, and RP) had a higher level of reproducibility: the percentages of overlapping genes (POGs) across different sites for the FC-based methods were higher overall than those for the t-statistic-based methods (modT, samT, shrinkT, and ibmT). In particular, POG values for WAD were the highest overall among the FC-based methods irrespective of the choice of preprocessing algorithm. Conclusion: Our results demonstrate that to increase sensitivity, specificity, and reproducibility in microarray analyses, we need to select suitable combinations of preprocessing algorithms and gene ranking methods. We recommend the use of FC-based methods, in particular RP or WAD.
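    The AUC-based comparison of gene ranking methods can be sketched as follows: compute a ranking statistic (here, plain absolute log fold change as the simplest FC-based example) and score it against known spike-in labels. This is an illustration of the evaluation only, not a reimplementation of WAD, RP, or the other methods compared in the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fold_change_score(expr_a, expr_b):
    """Absolute log fold change as a simple FC-based ranking statistic."""
    return np.abs(expr_b.mean(axis=1) - expr_a.mean(axis=1))

def ranking_auc(score, is_true_deg):
    """Area under the ROC curve of a ranking statistic against known DEG labels."""
    return roc_auc_score(is_true_deg, score)
```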

    Improving the prediction of disease-related variants using protein three-dimensional structure

    Background: Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability. Non-synonymous SNPs occurring in coding regions result in single amino acid polymorphisms (SAPs) that may affect protein function and lead to pathology. Several methods attempt to estimate the impact of SAPs using different sources of information. Although sequence-based predictors have shown good performance, the quality of these predictions can be further improved by introducing new features derived from three-dimensional protein structures. Results: In this paper, we present a structure-based machine learning approach for predicting disease-related SAPs. We have trained a Support Vector Machine (SVM) on a set of 3,342 disease-related mutations and 1,644 neutral polymorphisms from 784 protein chains. We use SVM input features derived from the protein's sequence, structure, and function. After dataset balancing, the structure-based method (SVM-3D) reaches an overall accuracy of 85%, a correlation coefficient of 0.70, and an area under the receiver operating characteristic curve (AUC) of 0.92. When compared with a similar sequence-based predictor, SVM-3D results in an increase of the overall accuracy and AUC by 3%, and of the correlation coefficient by 0.06. The robustness of this improvement has been tested on different datasets, and in all cases SVM-3D performs better than previously developed methods, even when compared with PolyPhen2, which explicitly considers protein structure information as input. Conclusion: This work demonstrates that structural information can increase the accuracy of disease-related SAP identification. Our results also quantify the magnitude of improvement on a large dataset. This improvement is in agreement with previously observed results, where structure information enhanced the prediction of protein stability changes upon mutation. Although the structural information contained in the Protein Data Bank limits the application and performance of our structure-based method, we expect that SVM-3D will achieve higher accuracy as more structural data become available. © 2011 Capriotti; licensee BioMed Central Ltd.
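    A hedged sketch of the kind of evaluation reported above (not the SVM-3D code itself): train an RBF-kernel SVM on combined sequence and structure features and report overall accuracy, MCC, and AUC from cross-validated predictions. Feature construction, labels, and hyperparameters below are placeholders.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, matthews_corrcoef, roc_auc_score

def evaluate_structure_svm(X, y, folds=10):
    """X: (n_mutations, n_features) sequence + structure features; y: 1 = disease-related, 0 = neutral."""
    svm = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True, random_state=0)
    prob = cross_val_predict(svm, X, y, cv=folds, method="predict_proba")[:, 1]
    pred = (prob >= 0.5).astype(int)
    return accuracy_score(y, pred), matthews_corrcoef(y, pred), roc_auc_score(y, prob)
```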