    Application of Volcano Plots in Analyses of mRNA Differential Expressions with Microarrays

    A volcano plot displays an unstandardized signal (e.g. log-fold-change) against a noise-adjusted/standardized signal (e.g. the t-statistic or -log10(p-value) from the t-test). We review the basic and interactive uses of the volcano plot and its crucial role in understanding the regularized t-statistic. The joint filtering gene selection criterion based on regularized statistics has a curved discriminant line in the volcano plot, as compared to the two perpendicular lines for the "double filtering" criterion. This review attempts to provide a unifying framework for discussions on alternative measures of differential expression, improved methods for estimating variance, and visual display of microarray analysis results. We also discuss the possibility of applying volcano plots to fields beyond microarrays. Comment: 8 figures
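
The two axes and the two selection criteria described in this abstract can be sketched as follows. This is a minimal illustration, not the review's own method: it assumes log2 expression values, a pooled-variance two-sample t-statistic, and a regularized statistic formed by adding a constant `s0` to the standard error. The function names, `s0`, and all cutoffs are hypothetical.

```python
import math
import statistics


def volcano_point(group_a, group_b, s0=0.0):
    """Return (log_fold_change, statistic) for one gene.

    Inputs are log2 expression values for two groups. With s0 = 0 the
    statistic is the ordinary pooled-variance two-sample t-statistic.
    With s0 > 0 a constant is added to the standard error, which is one
    assumed form of a regularized t-statistic.
    """
    n_a, n_b = len(group_a), len(group_b)
    # Difference of mean log2 values = log fold change (unstandardized signal).
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return diff, diff / (std_err + s0)


def double_filter(lfc, t, lfc_cut, t_cut):
    """Two perpendicular cutoffs: one on fold change, one on the t-statistic."""
    return abs(lfc) >= lfc_cut and abs(t) >= t_cut


def joint_filter(d, d_cut):
    """A single cutoff on the regularized statistic d."""
    return abs(d) >= d_cut


# Hypothetical genes: one with a large fold change, one with a tiny fold
# change but very low variance (and therefore a large ordinary t).
strong = ([8.0, 8.2, 7.9, 8.1], [6.0, 6.1, 5.9, 6.2])
low_var = ([5.10, 5.11, 5.10, 5.11], [5.00, 5.01, 5.00, 5.01])

for name, (a, b) in {"strong": strong, "low_var": low_var}.items():
    lfc, t = volcano_point(a, b)
    _, d = volcano_point(a, b, s0=0.5)
    print(name, round(lfc, 3), round(t, 2), round(d, 3),
          double_filter(lfc, t, 1.0, 4.0), joint_filter(d, 3.0))
```

Under these assumptions, with s the unregularized standard error, the boundary |d| = c maps to t = c·lfc / (lfc − c·s0) in the (lfc, t) plane for lfc > c·s0, which is a curve and not a pair of straight lines.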

    Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes

    Motivation: According to current consistency metrics such as the percentage of overlapping genes (POG), lists of differentially expressed genes (DEGs) detected from different microarray studies for a complex disease are often highly inconsistent. This irreproducibility problem also exists in other high-throughput post-genomic areas such as proteomics and metabolism. A complex disease is often characterized by many coordinated molecular changes, which should be considered when evaluating the reproducibility of discovery lists from different studies.
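
As a rough illustration of the POG metric named above, the sketch below computes the share of one gene list that reappears in another. It assumes the simplest definition (shared genes divided by the length of the first list); published variants may also require the same direction of regulation. The function name and gene lists are hypothetical.

```python
def pog(list_a, list_b):
    """Percentage of genes in list_a that also appear in list_b.

    Note that the metric is asymmetric: pog(a, b) and pog(b, a) differ
    whenever the two lists have different lengths.
    """
    set_a, set_b = set(list_a), set(list_b)
    if not set_a:
        raise ValueError("list_a must contain at least one gene")
    return 100.0 * len(set_a & set_b) / len(set_a)


# Hypothetical DEG lists from two studies of the same disease.
study_1 = ["TP53", "BRCA1", "MYC", "EGFR"]
study_2 = ["MYC", "EGFR", "KRAS", "PTEN", "CDK4"]

print(pog(study_1, study_2))  # 50.0
print(pog(study_2, study_1))  # 40.0
```

A plain overlap count like this ignores genes that differ between lists but are correlated with one another, which is the limitation the abstract points to.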

    Stable Feature Selection for Biomarker Discovery

    Feature selection techniques have long been the workhorse of biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered, and only recently has this issue received increasing attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) providing an overview of this new yet fast-growing topic as a convenient reference; (2) categorizing existing methods under an expandable framework for future research and development.
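
One way to make "stability with respect to sampling variations" concrete is to compare the feature sets selected on different resamples of the same data. The sketch below uses the mean pairwise Jaccard index, a common stability measure that is assumed here for illustration and is not necessarily the one the review adopts. The function names and feature labels are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(selected_sets):
    """Mean pairwise Jaccard index over the feature sets from each resample."""
    pairs = list(combinations(selected_sets, 2))
    if not pairs:
        raise ValueError("need at least two selected feature sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature sets chosen by the same selector on three resamples.
runs = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g2", "g3"}]
print(selection_stability(runs))
```

A value near 1 means the selector returns nearly the same features regardless of which samples it sees; a value near 0 means the selection is driven mostly by sampling noise.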

    Reproducible Cancer Biomarker Discovery in SELDI-TOF MS Using Different Pre-Processing Algorithms

    BACKGROUND: There has been much interest in differentiating diseased and normal samples using biomarkers derived from mass spectrometry (MS) studies. However, biomarker identification for specific diseases has been hindered by irreproducibility. Specifically, a peak profile extracted from a dataset for biomarker identification depends on a data pre-processing algorithm. Until now, no widely accepted agreement has been reached. RESULTS: In this paper, we investigated the consistency of biomarker identification using differentially expressed (DE) peaks from peak profiles produced by three widely used average spectrum-dependent pre-processing algorithms based on SELDI-TOF MS data for prostate and breast cancers. Our results revealed two important factors that affect the consistency of DE peak identification using different algorithms. One factor is that some DE peaks selected from one peak profile were not detected as peaks in other profiles, and the second factor is that the statistical power of identifying DE peaks in large peak profiles with many peaks may be low due to the large scale of the tests and small number of samples. Furthermore, we demonstrated that the DE peak detection power in large profiles could be improved by the stratified false discovery rate (FDR) control approach and that the reproducibility of DE peak detection could thereby be increased. CONCLUSIONS: Comparing and evaluating pre-processing algorithms in terms of reproducibility can elucidate the relationship among different algorithms and also help in selecting a pre-processing algorithm. The DE peaks selected from small peak profiles with few peaks for a dataset tend to be reproducibly detected in large peak profiles, which suggests that a suitable pre-processing algorithm should be able to produce peaks sufficient for identifying useful and reproducible biomarkers.
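
The idea behind stratified FDR control can be sketched by running a standard FDR procedure separately within each stratum of tests instead of once over the pooled set. The example below assumes the Benjamini-Hochberg step-up procedure as the per-stratum method; the paper's exact procedure and its choice of strata are not given in this abstract. The p-values, stratum labels, and function names are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the set of indices rejected by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    last = 0
    # Find the largest rank k whose p-value satisfies p_(k) <= alpha * k / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            last = rank
    return set(order[:last])


def stratified_bh(p_values, strata, alpha=0.05):
    """Apply BH separately within each stratum and pool the rejections."""
    groups = {}
    for i, label in enumerate(strata):
        groups.setdefault(label, []).append(i)
    rejected = set()
    for members in groups.values():
        local = benjamini_hochberg([p_values[i] for i in members], alpha)
        # Map stratum-local positions back to the original test indices.
        rejected.update(members[j] for j in local)
    return rejected


# Hypothetical peaks: stratum "A" holds a few promising peaks,
# stratum "B" holds many peaks that look like noise.
p = [0.001, 0.02, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
strata = ["A", "A", "B", "B", "B", "B", "B", "B"]

print(sorted(benjamini_hochberg(p)))     # [0]
print(sorted(stratified_bh(p, strata)))  # [0, 1]
```

In this toy case the pooled procedure rejects one test while the stratified one rejects two, because the small stratum is no longer penalized for the many uninformative tests in the large one.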

    Reproducibility and Concordance of Differential DNA Methylation and Gene Expression in Cancer

    Background: Hundreds of genes with differential DNA methylation of promoters have been identified for various cancers. However, the reproducibility of differential DNA methylation discoveries for cancer and the relationship between DNA methylation and aberrant gene expression have not been systematically analysed. Methodology/Principal Findings: Using array data for seven types of cancers, we first evaluated the effects of experimental batches on differential DNA methylation detection. Second, we compared the directions of DNA methylation changes detected from different datasets for the same cancer. Third, we evaluated the concordance between methylation and gene expression changes. Finally, we compared DNA methylation changes in different cancers. For a given cancer, the directions of methylation and expression changes detected from different datasets, excluding potential batch effects, were highly consistent. In different cancers, DNA hypermethylation was highly inversely correlated with the down-regulation of gene expression, whereas hypomethylation was only weakly correlated with the up-regulation of genes. Finally, we found that genes commonly hypomethylated in different cancers primarily performed functions associated with chronic inflammation, such as ‘keratinization’, ‘chemotaxis’ and ‘immune response’. Conclusions: Batch effects could greatly affect the discovery of DNA methylation biomarkers. For a particular cancer, both differential DNA methylation and gene expression can be reproducibly detected from different studies with no batc

    Multi-level reproducibility of signature hubs in human interactome for breast cancer metastasis

    Background: It has been suggested that, in the human protein-protein interaction network, changes of co-expression between highly connected proteins ("hubs") and their interaction neighbours might have important roles in cancer metastasis and be predictive disease signatures for patient outcome. However, for a cancer, such disease signatures identified from different studies have little overlap. Results: Here, we propose a systemic approach to evaluate the reproducibility of disease signatures at multiple levels, on the basis of some statistically testable biological models. Using two datasets for breast cancer metastasis, we showed that different signature hubs identified from different studies were highly consistent in terms of significantly sharing interaction neighbours and displaying consistent co-expression changes with their overlapping neighbours, whereas the shared interaction neighbours were significantly over-represented with known cancer genes and enriched in pathways deregulated in breast cancer pathogenesis. Then, we showed that the signature hubs identified from the two datasets were highly reproducible at the protein interaction and pathway levels in three other independent datasets. Conclusions: Our results provide a possible biological model that different signature hubs altered in different patient cohorts could disturb the same pathways associated with cancer metastasis through their interaction neighbours.

    A statistical framework for integrating two microarray data sets in differential expression analysis

    Background: Different microarray data sets can be collected for studying the same or similar diseases. We expect to achieve a more efficient analysis of differential expression if an efficient statistical method can be developed for integrating different microarray data sets. Although many statistical methods have been proposed for data integration, the genome-wide concordance of different data sets has not been well considered in the analysis. Results: Before considering data integration, it is necessary to evaluate the genome-wide concordance so that misleading results can be avoided. Based on the test results, different subsequent actions are suggested. The evaluation of genome-wide concordance and the data integration can be achieved based on the normal distribution based mixture models. Conclusion: The results from our simulation study suggest that misleading results can be generated if the genome-wide concordance issue is not appropriately considered. Our method provides a rigorous parametric solution. The results also show that our method is robust to certain model misspecification and is practically useful for the integrative analysis of differential expression.

    GMCM: Unsupervised Clustering and Meta-Analysis Using Gaussian Mixture Copula Models

    Methods for clustering in unsupervised learning are an important part of the statistical toolbox in numerous scientific disciplines. Tewari, Giering, and Raghunathan (2011) proposed to use so-called Gaussian mixture copula models (GMCM) for general unsupervised learning based on clustering. Li, Brown, Huang, and Bickel (2011) independently discussed a special case of these GMCMs as a novel approach to meta-analysis in high-dimensional settings. GMCMs have attractive properties which make them highly flexible and therefore interesting alternatives to other well-established methods. However, parameter estimation is hard because of intrinsic identifiability issues and intractable likelihood functions. Both aforementioned papers discuss similar expectation-maximization-like algorithms as their pseudo maximum likelihood estimation procedure. We present and discuss an improved implementation in R of both classes of GMCMs along with various alternative optimization routines to the EM algorithm. The software is freely available in the R package GMCM. The implementation is fast, general, and optimized for very large numbers of observations. We demonstrate the use of package GMCM through different applications.