
    A comprehensive evaluation of SAM, the SAM R-package and a simple modification to improve its performance

    BACKGROUND: The Significance Analysis of Microarrays (SAM) is a popular method for detecting significantly expressed genes and controlling the false discovery rate (FDR). Recently, it has been reported in the literature that the FDR is not well controlled by SAM. Given the wide application of SAM in microarray data analysis, an extensive evaluation of SAM and its associated R-package (sam2.20) is of great importance. RESULTS: Our study has identified several discrepancies between SAM and sam2.20. One major difference is that SAM and sam2.20 use different methods for estimating FDR. Such discrepancies may cause confusion among researchers who are using SAM or developing SAM-like methods. We have also shown that SAM provides no meaningful estimates of FDR; this problem has been corrected in sam2.20 by using a different formula for estimating FDR. However, we have found that, even with the improvement sam2.20 has made over SAM, it may still produce erroneous and even conflicting results under certain situations. Using an example, we show that the problem with sam2.20 is caused by its use of asymmetric cutoffs, which arise from the large variability of null scores at both ends of the order statistics. An obvious approach without the complication of the order statistics is the conventional symmetric cutoff method. For this reason, we have carried out extensive simulations to compare the performance of sam2.20 and the symmetric cutoff method. Finally, a simple modification is proposed to improve the FDR estimation of sam2.20 and the symmetric cutoff method. CONCLUSION: Our study shows that the most serious drawback of SAM is its poor estimation of FDR. Although this drawback has been corrected in sam2.20, the control of FDR by sam2.20 is still not satisfactory. The comparison between sam2.20 and the symmetric cutoff method reveals that the relative performance of sam2.20 and the symmetric cutoff method depends on the ratio of induced to repressed genes in microarray data, and is also affected by the ratio of differentially expressed (DE) to equally expressed (EE) genes and by the distributions of induced and repressed genes. Numerical simulations show that the symmetric cutoff method has the biggest advantage over sam2.20 when there are equal numbers of induced and repressed genes (i.e., the ratio of induced to repressed genes is 1). As the ratio of induced to repressed genes moves away from 1, the advantage of the symmetric cutoff method over sam2.20 gradually diminishes, until eventually sam2.20 becomes significantly better than the symmetric cutoff method when the DE genes are either all induced or all repressed. Simulation results also show that our proposed simple modification provides improved control of FDR for both sam2.20 and the symmetric cutoff method.
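
    As a sketch of the symmetric cutoff idea discussed above, the following Python example calls genes whose absolute score exceeds a single cutoff and estimates the FDR from permutation null scores. This is a minimal illustration, not the SAM or sam2.20 implementation; the simulated scores, the cutoff value and the use of the median false-call count are all assumptions.

```python
import numpy as np

def symmetric_cutoff_fdr(scores, null_scores, cutoff):
    """Call genes with |score| > cutoff and estimate the FDR from
    permutation null scores (shape: n_permutations x n_genes).
    Illustrative sketch only, not the sam2.20 implementation."""
    called = np.abs(scores) > cutoff
    n_called = called.sum()
    # Number of null scores beyond the cutoff in each permutation;
    # the median serves as the expected number of false calls.
    false_per_perm = (np.abs(null_scores) > cutoff).sum(axis=1)
    expected_false = np.median(false_per_perm)
    fdr = min(expected_false / max(n_called, 1), 1.0)
    return n_called, fdr

# Simulated scores: 50 induced genes among 1000 (assumed setup)
rng = np.random.default_rng(0)
n_genes, n_de = 1000, 50
scores = rng.normal(size=n_genes)
scores[:n_de] += 4.0
null_scores = rng.normal(size=(100, n_genes))
n_called, fdr = symmetric_cutoff_fdr(scores, null_scores, cutoff=3.0)
```

    With a symmetric cutoff the same threshold is applied to both tails, which is what gives the method its advantage when induced and repressed genes are balanced.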

    Climate warming, marine protected areas and the ocean-scale integrity of coral reef ecosystems

    Coral reefs have emerged as one of the ecosystems most vulnerable to climate variation and change. While the contribution of a warming climate to the loss of live coral cover has been well documented across large spatial and temporal scales, the associated effects on fish have not. Here, we respond to recent and repeated calls to assess the importance of local management in conserving coral reefs in the context of global climate change. Such information is important, as coral reef fish assemblages are the most species-dense vertebrate communities on earth, contributing critical ecosystem functions and providing crucial ecosystem services to human societies in tropical countries. Our assessment of the impacts of the 1998 mass bleaching event on coral cover, reef structural complexity, and reef-associated fishes spans 7 countries, 66 sites and 26 degrees of latitude in the Indian Ocean. Using Bayesian meta-analysis, we show that changes in the size structure, diversity and trophic composition of the reef fish community have followed coral declines. Although the ocean-scale integrity of these coral reef ecosystems has been lost, it is positive to see that the effects are spatially variable at multiple scales, with impacts and vulnerability affected by geography but not management regime. Existing no-take marine protected areas still support high fish biomass; however, they had no positive effect on the ecosystem response to large-scale disturbance. This suggests a need for future conservation and management efforts to identify and protect regional refugia, which should be integrated into existing management frameworks and combined with policies to improve system-wide resilience to climate variation and change.

    Empirical evaluation of prediction intervals for cancer incidence

    BACKGROUND: Prediction intervals can be calculated for predicting cancer incidence on the basis of a statistical model. These intervals include the uncertainty of the parameter estimates and variations in future rates but do not include the uncertainty of assumptions, such as the continuation of current trends. In this study we evaluated whether prediction intervals are useful in practice. METHODS: Rates for the period 1993–97 were predicted from cancer incidence rates in the five Nordic countries for the period 1958–87. In a Poisson regression model, 95% prediction intervals were constructed for 200 combinations of 20 cancer types for males and females in the five countries. The coverage level was calculated as the proportion of the prediction intervals that covered the observed number of cases in 1993–97. RESULTS: Overall, 52% (104/200) of the prediction intervals covered the observed numbers. When the prediction intervals were divided into quartiles according to the number of cases in the last observed period, the coverage level was inversely proportional to the frequency (84%, 52%, 46% and 26%). The coverage level varied widely among the five countries, but the difference declined after adjustment for the number of cases in each country. CONCLUSION: The coverage level of prediction intervals strongly depended on the number of cases on which the predictions were based. As the sample size increased, uncertainty about the adequacy of the model dominated, and the coverage level fell far below 95%. Prediction intervals for cancer incidence must therefore be interpreted with caution.
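
    The coverage calculation described in the Methods can be sketched directly. The interval construction below is deliberately naive (Poisson variation only, via a normal approximation) and the 10% rate drift is an assumed stand-in for trend misspecification, not the Nordic data; it illustrates how coverage falls when the model is wrong and counts are large.

```python
import numpy as np

def coverage_level(lower, upper, observed):
    """Proportion of prediction intervals that cover the observed counts."""
    lower, upper, observed = map(np.asarray, (lower, upper, observed))
    return ((lower <= observed) & (observed <= upper)).mean()

rng = np.random.default_rng(1)
mu_pred = rng.uniform(50, 5000, size=200)        # predicted case counts
# Approximate 95% intervals from Poisson variation only (normal approx.)
half = 1.96 * np.sqrt(mu_pred)
lower, upper = mu_pred - half, mu_pred + half
# Observed counts drawn from rates that drifted 10% away from the model
observed = rng.poisson(mu_pred * 1.10)
cov = coverage_level(lower, upper, observed)
```

    Because the interval width grows like the square root of the count while the drift grows linearly, coverage is worst for the largest counts, mirroring the inverse relationship between coverage and case numbers reported above.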

    Genomic breeding value estimation using nonparametric additive regression models

    Genomic selection refers to the use of genomewide dense markers for breeding value estimation and subsequently for selection. The main challenge of genomic breeding value estimation is the estimation of many effects from a limited number of observations. Bayesian methods have been proposed to cope successfully with these challenges. As an alternative class of models, non- and semiparametric models were recently introduced. The present study investigated the ability of nonparametric additive regression models to predict genomic breeding values. The genotypes were modelled for each marker or pair of flanking markers (i.e. the predictors) separately. The nonparametric functions for the predictors were estimated simultaneously using additive model theory, applying a binomial kernel. The optimal degree of smoothing was determined by bootstrapping. A mutation-drift-balance simulation was carried out. The breeding values of the last generation (genotyped) were predicted using data from the preceding generation (genotyped and phenotyped). The results show moderate to high accuracies of the predicted breeding values. Determining a predictor-specific degree of smoothing increased the accuracy.
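
    The additive-model idea can be sketched with plain backfitting over markers. The geometric-decay kernel below is an assumed stand-in for the paper's binomial kernel, and the fixed smoothing parameter replaces the bootstrap selection; the simulation is likewise a toy illustration, not the mutation-drift-balance design.

```python
import numpy as np

def genotype_kernel(g1, g2, lam=0.3):
    """Discrete kernel over genotype codes 0..2; an assumed simple
    geometric-decay form, not the paper's exact binomial kernel."""
    return lam ** np.abs(g1 - g2)

def backfit_additive(G, y, lam=0.3, n_iter=20):
    """Backfitting for y = mu + sum_j f_j(G[:, j]) with one
    nonparametric smooth per marker (illustrative sketch)."""
    n, p = G.shape
    f = np.zeros((n, p))
    mu = y.mean()
    for _ in range(n_iter):
        for j in range(p):
            r = y - mu - f.sum(axis=1) + f[:, j]       # partial residuals
            W = genotype_kernel(G[:, [j]], G[:, j][None, :], lam)
            f[:, j] = (W @ r) / W.sum(axis=1)          # kernel smooth
            f[:, j] -= f[:, j].mean()                  # identifiability
    return mu, f

# Assumed toy simulation: 5 markers with additive effects
rng = np.random.default_rng(2)
n, p = 200, 5
G = rng.integers(0, 3, size=(n, p)).astype(float)
true_bv = G @ np.array([1.0, -0.5, 0.8, 0.0, 0.3])     # true breeding values
y = true_bv + rng.normal(scale=0.5, size=n)
mu, f = backfit_additive(G, y)
acc = np.corrcoef(mu + f.sum(axis=1), true_bv)[0, 1]   # prediction accuracy
```

    Accuracy here is the correlation between predicted and true breeding values, as in the study.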

    Phylogenetic and environmental context of a Tournaisian tetrapod fauna

    The end-Devonian to mid-Mississippian time interval has long been known for its depauperate palaeontological record, especially for tetrapods. This interval encapsulates the time of increasing terrestriality among tetrapods, but only two Tournaisian localities previously produced tetrapod fossils. Here we describe five new Tournaisian tetrapods (Perittodus apsconditus, Koilops herma, Ossirarus kierani, Diploradus austiumensis and Aytonerpeton microps) from two localities in their environmental context. A phylogenetic analysis retrieved three taxa as stem tetrapods, interspersed among Devonian and Carboniferous forms, and two as stem amphibians, suggesting a deep split among crown tetrapods. We also illustrate new tetrapod specimens from these and additional localities in the Scottish Borders region. The new taxa and specimens suggest that tetrapod diversification was well established by the Tournaisian. Sedimentary evidence indicates that the tetrapod fossils are usually associated with sandy siltstones overlying wetland palaeosols. Tetrapods were probably living on vegetated surfaces that were subsequently flooded. We show that atmospheric oxygen levels were stable across the Devonian/Carboniferous boundary, and did not inhibit the evolution of terrestriality. This wealth of tetrapods from Tournaisian localities highlights the potential for discoveries elsewhere. NERC consortium grants NE/J022713/1 (Cambridge), NE/J020729/1 (Leicester), NE/J021067/1 (BGS), NE/J020621/1 (NMS) and NE/J021091/1 (Southampton).

    Kernel-imbedded Gaussian processes for disease classification using microarray gene expression data

    BACKGROUND: Designing appropriate machine learning methods for identifying genes that have significant discriminating power for disease outcomes has become increasingly important for our understanding of diseases at the genomic level. Although many machine learning methods have been developed and applied to the area of microarray gene expression data analysis, the majority of them are based on linear models, which, however, are not necessarily appropriate for the underlying connection between the target disease and its associated explanatory genes. Linear-model-based methods also tend to admit false positive significant features more easily. Furthermore, linear-model-based algorithms often involve calculating the inverse of a matrix that is possibly singular when the number of potentially important genes is relatively large, which leads to numerical instability. To overcome these limitations, a few non-linear methods have recently been introduced to the area. Many of the existing non-linear methods have two critical problems, model selection and model parameter tuning, that remain unsolved or even untouched. In general, a unified framework that allows the model parameters of both linear and non-linear models to be easily tuned is preferred in real-world applications. Kernel-induced learning methods form a class of approaches that show promising potential for achieving this goal. RESULTS: A hierarchical statistical model named kernel-imbedded Gaussian process (KIGP) is developed under a unified Bayesian framework for binary disease classification problems using microarray gene expression data. In particular, based on a probit regression setting, an adaptive algorithm with a cascading structure is designed to find the appropriate kernel, to discover the potentially significant genes, and to make the optimal class prediction accordingly. A Gibbs sampler is built as the core of the algorithm to make Bayesian inferences. Simulation studies showed that, even without any knowledge of the underlying generative model, the KIGP performed very close to the theoretical Bayesian bound not only in the case of a linear Bayesian classifier but also in the case of a very non-linear Bayesian classifier. This sheds light on its broader usability for microarray data analysis problems, especially those for which linear methods work poorly. The KIGP was also applied to four published microarray datasets, and the results showed that the KIGP performed at least as well as, and often better than, the referenced state-of-the-art methods in all of these cases. CONCLUSION: Mathematically built on the kernel-induced feature space concept under a Bayesian framework, the KIGP method presented in this paper provides a unified machine learning approach to explore both the linear and the possibly non-linear underlying relationships between the target features of a given binary disease classification problem and the related explanatory gene expression data. More importantly, it incorporates model parameter tuning into the framework. The model selection problem is addressed in the form of selecting a proper kernel type. The KIGP method also gives Bayesian probabilistic predictions for disease classification. These properties and features are beneficial to most real-world applications. The algorithm is naturally robust in numerical computation. The simulation studies and the published data studies demonstrated that the proposed KIGP performs satisfactorily and consistently.
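
    The probit-plus-Gibbs core that the abstract describes can be illustrated with the classic Albert-Chib data-augmentation sampler for Bayesian probit regression. This is only a minimal stand-in: the kernel-induced feature space, kernel selection and gene discovery steps of the actual KIGP algorithm are omitted, and the prior variance and simulated data are assumptions.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=500, tau2=100.0, seed=0):
    """Albert-Chib Gibbs sampler for Bayesian probit regression:
    alternate between latent utilities z (truncated normals) and
    coefficients beta (multivariate normal)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = np.linalg.inv(X.T @ X + np.eye(p) / tau2)   # posterior covariance
    L = np.linalg.cholesky(V)
    beta = np.zeros(p)
    draws = []
    for _ in range(n_iter):
        m = X @ beta
        # Latent utilities, sign-constrained by the observed labels
        lo = np.where(y == 1, -m, -np.inf)
        hi = np.where(y == 1, np.inf, -m)
        z = m + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # Draw beta from its conditional normal posterior
        beta = V @ X.T @ z + L @ rng.normal(size=p)
        draws.append(beta)
    return np.array(draws)

# Assumed toy data: 3 covariates, one of them inactive
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 3))
beta_true = np.array([2.0, -1.5, 0.0])
y = (X @ beta_true + rng.normal(size=150) > 0).astype(int)
draws = probit_gibbs(X, y)
beta_hat = draws[100:].mean(axis=0)   # posterior mean after burn-in
```

    The data augmentation is what makes every conditional distribution standard, so the sampler needs no tuning; the full KIGP additionally samples kernel parameters and gene-inclusion indicators around this core.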

    Optimally splitting cases for training and testing high dimensional classifiers

    BACKGROUND: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion impacts the mean squared error (MSE) of the prediction accuracy estimate. RESULTS: We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study for the purpose of better understanding the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. CONCLUSIONS: By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in a larger proportion assigned to the training set. The commonly used strategy of allocating 2/3 of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split.
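
    The resampling idea can be sketched as follows: for each candidate split proportion, repeatedly split at random and measure how far the hold-out accuracy estimate falls from a reference accuracy. Here the reference is the full-data classifier evaluated on a large independent pool (available in simulation), the classifier is a nearest-centroid rule, and the data are simulated; all of these are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def nearest_centroid_acc(Xtr, ytr, Xte, yte):
    """Accuracy of a nearest-centroid classifier (two classes, 0/1)."""
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) <
            np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return (pred == yte).mean()

def holdout_mse(X, y, X_pool, y_pool, train_frac, n_rep=50, seed=0):
    """MSE of the hold-out accuracy estimate for one split proportion,
    estimated by repeated random splitting (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(train_frac * n)
    # Reference: classifier trained on all cases, scored on a large pool
    target = nearest_centroid_acc(X, y, X_pool, y_pool)
    estimates = []
    for _ in range(n_rep):
        idx = rng.permutation(n)
        tr, te = idx[:n_train], idx[n_train:]
        estimates.append(nearest_centroid_acc(X[tr], y[tr], X[te], y[te]))
    return np.mean((np.array(estimates) - target) ** 2)

# Assumed toy data with a class-shifted signal
rng = np.random.default_rng(4)
n, p = 100, 20
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p)) + 0.8 * y[:, None]
y_pool = rng.integers(0, 2, size=2000)
X_pool = rng.normal(size=(2000, p)) + 0.8 * y_pool[:, None]
fracs = [0.3, 0.5, 2 / 3, 0.8]
mses = [holdout_mse(X, y, X_pool, y_pool, f) for f in fracs]
best = fracs[int(np.argmin(mses))]
```

    A small training fraction inflates the bias of the estimate (weak classifier), while a small test fraction inflates its variance; the MSE curve over `fracs` trades these off.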

    Semi-supervised discovery of differential genes

    BACKGROUND: Various statistical scores have been proposed for evaluating the significance of genes that may exhibit differential expression between two or more controlled conditions. However, in many clinical studies, for example those aimed at detecting clinical marker genes, the conditions have not necessarily been well controlled, and condition labels are sometimes hard to obtain due to physical, financial, and time costs. In such a situation, we can consider an unsupervised case, where labels are not available, or a semi-supervised case, where labels are available for part of the sample set, rather than the well-studied supervised case, where all samples have labels. RESULTS: We assume a latent variable model for the expression of active genes and apply the optimal discovery procedure (ODP) proposed by Storey (2005) to the model. Our latent variable model allows gene significance scores to be applied to unsupervised and semi-supervised cases. The ODP framework improves detectability by sharing the estimated parameters of the null and alternative models of multiple tests over multiple genes. A theoretical consideration leads to two different interpretations of the latent variable: either it only implicitly affects the alternative model through the model parameters, or it is explicitly included in the alternative model, so that the interpretations correspond to two different implementations of ODP. By comparing the two implementations through experiments with simulation data, we have found that sharing the latent variable estimation is effective for increasing the detectability of truly active genes. We also show that the unsupervised and semi-supervised rating of genes, which takes into account the samples without condition labels, can improve the detection of active genes in real gene discovery problems. CONCLUSION: The experimental results indicate that the ODP framework is effective for hypotheses including latent variables and is further improved by sharing the estimations of hidden variables over multiple tests.
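
    The ODP statistic that underlies the framework can be sketched as a ratio of pooled likelihoods: each gene's statistic is scored by the average density under every estimated alternative model divided by its density under the null, so parameter estimates are shared across tests. The normal densities, the plug-in choice of alternative means and the simulated statistics below are assumptions; Storey's (2005) estimator differs in detail.

```python
import numpy as np
from scipy.stats import norm

def odp_statistic(stats, alt_means, alt_sds, null_sd=1.0):
    """ODP-style score: average likelihood under the estimated
    alternative models divided by the null likelihood, sharing
    the alternative parameters across all tests (sketch)."""
    s = np.asarray(stats)[:, None]                        # (genes, 1)
    alt = norm.pdf(s, loc=alt_means[None, :], scale=alt_sds[None, :])
    null = norm.pdf(np.asarray(stats), loc=0.0, scale=null_sd)
    return alt.mean(axis=1) / null

# Simulated gene statistics: 20 active genes among 200 (assumed)
rng = np.random.default_rng(5)
stats = np.concatenate([rng.normal(3.0, 1.0, 20), rng.normal(0.0, 1.0, 180)])
# Plug-in alternative means: the 20 largest |statistics|
alt_means = stats[np.argsort(-np.abs(stats))[:20]]
scores = odp_statistic(stats, alt_means, np.ones(20))
```

    Because every gene is scored against the same pooled set of alternative models, information is borrowed across tests, which is the source of the improved detectability described above.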

    Sensitivity of methods for estimating breeding values using genetic markers to the number of QTL and distribution of QTL variance

    The objective of this simulation study was to compare the effect of the number of QTL and the distribution of QTL variance on the accuracy of breeding values estimated with genomewide markers (MEBV). Three distinct methods were used to calculate MEBV: a Bayesian method (BM), Least Angle Regression (LARS) and Partial Least Squares Regression (PLSR). The accuracy of MEBV calculated with BM and LARS decreased when the number of simulated QTL increased. The accuracy decreased more when QTL had different variance values than when all QTL had an equal variance. The accuracy of MEBV calculated with PLSR was affected neither by the number of QTL nor by the distribution of QTL variance. Additional simulations and analyses showed that these conclusions were not affected by the number of individuals in the training population, the number of markers, or the heritability of the trait. Results of this study show that the effect of the number of QTL and distribution of QTL variance on the accuracy of MEBV depends on the method that is used to calculate MEBV.
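
    To make the comparison concrete, the sketch below measures accuracy (the correlation between predicted and true breeding values) for different numbers of QTL, using ridge regression as an assumed stand-in for the marker-based methods; the BM, LARS and PLSR implementations and the study's simulation design are not reproduced here.

```python
import numpy as np

def ridge_accuracy(n_qtl, n=300, p=500, lam=10.0, seed=0):
    """Accuracy of ridge-regression MEBV in a toy simulation with
    n_qtl causal loci among p markers (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    M = rng.integers(0, 3, size=(n, p)).astype(float)  # marker genotypes
    M -= M.mean(axis=0)                                # centre columns
    effects = np.zeros(p)
    qtl = rng.choice(p, n_qtl, replace=False)
    effects[qtl] = rng.normal(size=n_qtl)
    tbv = M @ effects                                  # true breeding values
    y = tbv + rng.normal(scale=tbv.std(), size=n)      # heritability ~ 0.5
    Xtr, Xte = M[:200], M[200:]
    beta = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ y[:200])
    return np.corrcoef(Xte @ beta, tbv[200:])[0, 1]

acc_few = ridge_accuracy(n_qtl=10)
acc_many = ridge_accuracy(n_qtl=200)
```

    Running the same simulation with the different estimation methods in place of the ridge step is how a sensitivity comparison of this kind is carried out.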