
    Robustness of Random Forest-based gene selection methods

    Gene selection is an important part of microarray data analysis because it provides information that can lead to a better mechanistic understanding of an investigated phenomenon. At the same time, gene selection is very difficult because of the noisy nature of microarray data. As a consequence, gene selection is often performed with machine learning methods. The Random Forest method is particularly well suited for this purpose. In this work, four state-of-the-art Random Forest-based feature selection methods were compared in a gene selection context. The analysis focused on the stability of selection because, although it is necessary for determining the significance of results, it is often ignored in similar studies. The comparison of post-selection accuracy in the validation of Random Forest classifiers revealed that all investigated methods were equivalent in this context. However, the methods differed substantially with respect to the number of selected genes and the stability of selection. Of the analysed methods, the Boruta algorithm identified the most genes as potentially important. The post-selection classifier error rate, which is a frequently used measure, was found to be a potentially deceptive measure of gene selection quality. When the number of consistently selected genes was considered, the Boruta algorithm was clearly the best. Although it was also the most computationally intensive method, the Boruta algorithm's computational demands could be reduced to levels comparable to those of other algorithms by replacing the Random Forest importance with a comparable measure from Random Ferns (a similar but simplified classifier). Despite their design assumptions, the minimal optimal selection methods were found to select a high fraction of false positives.
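The notion of "consistently selected genes" in the abstract above can be illustrated with a minimal sketch. It assumes that a selection method has already been run several times (for example on resampled data) and that each run returned a list of gene names. The function name `consistently_selected` and the `min_fraction` threshold are hypothetical and are not taken from the paper.

```python
from collections import Counter


def consistently_selected(selections, min_fraction=1.0):
    """Return genes selected in at least `min_fraction` of the runs.

    `selections` is a list of runs; each run is an iterable of gene names.
    The threshold is an illustrative assumption, not the paper's criterion.
    """
    if not selections:
        raise ValueError("at least one selection run is required")
    # Count each gene at most once per run.
    counts = Counter(gene for run in selections for gene in set(run))
    n_runs = len(selections)
    return sorted(g for g, c in counts.items() if c / n_runs >= min_fraction)


# Hypothetical gene names, used only to show the call.
runs = [["geneA", "geneB", "geneC"], ["geneA", "geneC"], ["geneA", "geneB"]]
print(consistently_selected(runs))        # genes present in every run
print(consistently_selected(runs, 0.6))   # genes present in at least 60% of runs
```

A stricter threshold yields a shorter but more stable list, which is the trade-off the abstract describes when it contrasts the number of selected genes with the stability of selection.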

    Asterias: a parallelized web-based suite for the analysis of expression and aCGH data

    Asterias (http://www.asterias.info) is an integrated collection of freely-accessible web tools for the analysis of gene expression and aCGH data. Most of the tools use parallel computing (via MPI). Most of our applications allow the user to obtain additional information for user-selected genes by using clickable links in tables and/or figures. Our tools include: normalization of expression and aCGH data; converting between different types of gene/clone and protein identifiers; filtering and imputation; finding differentially expressed genes related to patient class and survival data; searching for models of class prediction; using random forests to search for minimal models for class prediction or for large subsets of genes with predictive capacity; searching for molecular signatures and predictive genes with survival data; detecting regions of genomic DNA gain or loss. The capability to send results between different applications, access to additional functional information, and parallelized computation make our suite unique and exploit features only available to web-based applications. Comment: web-based application; 3 figures

    Assessment of SVM Reliability for Microarray Data Analysis

    The goal of our research is to provide techniques that can assess and validate the results of SVM-based analysis of microarray data. We present preliminary results on the effect of mislabeled training samples. We conducted several systematic experiments on artificial and real medical data using SVMs, in which we flipped the labels of a fraction of the training data. We show that a relatively small number of mislabeled examples can dramatically decrease performance, as visualized on the ROC graphs. This phenomenon persists even if the dimensionality of the input space is drastically decreased, for example by feature selection. Moreover, we show that for SVM recursive feature elimination, even a small fraction of mislabeled samples can completely change the resulting set of genes. This work is an extended version of the previous paper [MBN04].
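The label-flipping step described in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: labels are binary and coded as 0/1, the flipped samples are chosen uniformly at random, and the function name `flip_labels` and its `seed` parameter are hypothetical. The abstract does not say how the authors chose which samples to flip.

```python
import random


def flip_labels(labels, fraction, seed=0):
    """Return (noisy_labels, flipped_indices) for binary 0/1 labels.

    A `fraction` of the samples, chosen uniformly at random, has its
    label inverted. Uniform sampling is an assumption of this sketch.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    n_flip = round(fraction * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    noisy = [1 - y if i in flipped else y for i, y in enumerate(labels)]
    return noisy, sorted(flipped)


clean = [0, 1] * 10                      # 20 toy training labels
noisy, flipped = flip_labels(clean, 0.1)
print(len(flipped), "labels flipped at positions", flipped)
```

Training one classifier on `clean` and another on `noisy`, then comparing their ROC curves or selected gene sets, would mirror the kind of experiment the abstract reports.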

    Profound effect of profiling platform and normalization strategy on detection of differentially expressed microRNAs

    Adequate normalization minimizes the effects of systematic technical variations and is a prerequisite for detecting meaningful biological changes. However, reports on the performance of miRNA normalization methods, and the resulting recommendations, are inconsistent. Thus, we investigated the impact of seven different normalization methods (reference gene index (RGI), global geometric mean, quantile, invariant selection, loess, loessM, and generalized procrustes analysis (GPA)) on intra- and inter-platform performance of two distinct and commonly used miRNA profiling platforms. We included data from miRNA profiling analyses derived from a hybridization-based platform (Agilent Technologies, AGL) and an RT-qPCR platform (Applied Biosystems TaqMan low-density arrays, TLDA). Furthermore, we validated a subset of miRNAs by individual RT-qPCR assays. Our analyses incorporated data on the effect of differentiation and tumor necrosis factor alpha treatment on primary human skeletal muscle cells and a murine skeletal muscle cell line. Distinct normalization methods differed in their impact on (i) standard deviations, (ii) the area under the receiver operating characteristic (ROC) curve, and (iii) the similarity of differential expression. Loess, loessM, and quantile normalization were most effective in minimizing standard deviations on the Agilent and TLDA platforms. Moreover, loess, loessM, invariant selection and GPA increased the area under the ROC curve, a measure of the statistical performance of a test. The Jaccard index revealed that inter-platform concordance of differential expression tended to be increased by loess, loessM, quantile, and GPA normalization of AGL and TLDA data as well as RGI normalization of TLDA data.
    We recommend the application of loess, or loessM, and GPA normalization for miRNA Agilent arrays and qPCR cards, as these normalization approaches were shown to (i) effectively reduce standard deviations, (ii) increase sensitivity and accuracy of differential miRNA expression detection, and (iii) increase inter-platform concordance. Results showed the successful adoption of loessM and generalized procrustes analysis to one-color miRNA profiling experiments.
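The abstract above uses the Jaccard index to quantify inter-platform concordance. As a minimal sketch, the index for two lists of differentially expressed miRNAs is the size of their intersection divided by the size of their union. The function name `jaccard_index`, the convention for two empty lists, and the identifiers in the usage lines are assumptions made for illustration; they are not data from the study.

```python
def jaccard_index(list_a, list_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of identifiers."""
    a, b = set(list_a), set(list_b)
    union = a | b
    if not union:
        # Assumed convention: two empty lists are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Placeholder identifiers standing in for miRNAs called on each platform.
platform_1_hits = ["mir_a", "mir_b", "mir_c"]
platform_2_hits = ["mir_b", "mir_c", "mir_d"]
print(jaccard_index(platform_1_hits, platform_2_hits))
```

A value of 1 means both platforms report exactly the same miRNAs and 0 means no overlap, so comparing the index before and after normalization shows whether a method improves concordance.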

    Stability and aggregation of ranked gene lists

    Ranked gene lists are highly unstable in the sense that similar measures of differential gene expression may yield very different rankings, and that a small change in the data set usually affects the obtained gene list considerably. Stability issues have long been under-considered in the literature, but they have become a hot topic in the last few years, perhaps as a consequence of the increasing skepticism about the reproducibility and clinical applicability of molecular research findings. In this article, we review existing approaches for the assessment of stability of ranked gene lists and the related problem of aggregation, give some practical recommendations, and warn against potential misuse of these methods. This overview is illustrated through an application to a recent leukemia data set using the freely available Bioconductor package GeneSelector.

    The Reproducibility of Lists of Differentially Expressed Genes in Microarray Studies

    Reproducibility is a fundamental requirement in scientific experiments and clinical contexts. Recent publications raise concerns about the reliability of microarray technology because of the apparent lack of agreement between lists of differentially expressed genes (DEGs). In this study we demonstrate that (1) such discordance may stem from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion, the lists become much more reproducible, especially when fewer genes are selected; and (3) the instability of short DEG lists based on P cutoffs is an expected mathematical consequence of the high variability of the t-values. We recommend the use of FC ranking plus a non-stringent P cutoff as a baseline practice in order to generate more reproducible DEG lists. The FC criterion enhances reproducibility while the P criterion balances sensitivity and specificity.
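The recommended baseline in the abstract above, FC ranking combined with a non-stringent P cutoff, can be sketched in a few lines. The sketch assumes each gene comes with a precomputed log2 fold change and p-value. The function name `select_degs`, the default cutoff of 0.05, and the list length are illustrative assumptions, since the abstract does not give specific values.

```python
def select_degs(genes, p_cutoff=0.05, top_n=10):
    """Select DEGs by fold-change ranking after a permissive P filter.

    `genes` is an iterable of (name, log2_fold_change, p_value) tuples.
    Genes failing the P cutoff are dropped; the rest are ranked by
    absolute log2 fold change and the top `top_n` names are returned.
    """
    passing = [g for g in genes if g[2] < p_cutoff]
    ranked = sorted(passing, key=lambda g: abs(g[1]), reverse=True)
    return [name for name, _, _ in ranked[:top_n]]


# Made-up values, used only to show the call.
toy = [
    ("g1", 3.0, 0.04),
    ("g2", -2.5, 0.001),
    ("g3", 4.0, 0.20),    # largest change, but fails the P filter
    ("g4", 0.2, 0.0001),  # most significant, but smallest change
]
print(select_degs(toy, p_cutoff=0.05, top_n=2))
```

In this toy input, ranking by P alone would put `g4` first despite its negligible fold change, which is the kind of behaviour the abstract links to poorly reproducible lists.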

    Software and methods for oligonucleotide and cDNA array data analysis.

    Two HTML-based programs were developed to analyze and filter gene-expression data: 'Bullfrog' for Affymetrix oligonucleotide arrays and 'Spot' for custom cDNA arrays. The programs provide intuitive data-filtering tools through an easy-to-use interface. A background subtraction and normalization program for cDNA arrays was also built that provides an informative summary report with data-quality assessments. These programs are freeware to aid in the analysis of gene-expression results and facilitate the search for genes responsible for interesting biological processes and phenotypes.

    A comparative analysis of existing oligonucleotides selection algorithms for microarray technology

    In systems biology, DNA microarray technology is an indispensable tool for biological analysis at the level of the whole genome. Among the sophisticated analytical problems in microarray technology, at the front and back ends respectively, are the selection of optimal DNA oligonucleotides (henceforth oligos) and the computational analysis of gene expression data. A computational comparative analysis of the methods used to select oligos is important since the design and quality of the microarray probes are of critical importance for the hybridization experiments as well as the subsequent analysis of the data. In an attempt to enhance efficient and effective design at the front end, a computational comparative analysis was performed on oligo selection tools using barley ESTs as well as the Saccharomyces cerevisiae, Encephalitozoon cuniculi and human genomes. The analysis shows that a large number of the existing tools are difficult to install and configure. For cross-hybridization testing, most rely on BLAST and therefore design oligonucleotides with poor specificity. Furthermore, most are non-intuitive to use and lack important oligo design and software features.

    Application of Volcano Plots in Analyses of mRNA Differential Expressions with Microarrays

    A volcano plot displays unstandardized signal (e.g. log-fold-change) against noise-adjusted/standardized signal (e.g. t-statistic or -log10(p-value) from the t test). We review basic and interactive uses of the volcano plot, and its crucial role in understanding the regularized t-statistic. The joint filtering gene selection criterion based on regularized statistics has a curved discriminant line in the volcano plot, as compared to the two perpendicular lines for the "double filtering" criterion. This review attempts to provide a unifying framework for discussions on alternative measures of differential expression, improved methods for estimating variance, and visual display of a microarray analysis result. We also discuss the possibility of applying volcano plots to fields beyond microarrays. Comment: 8 figures
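The two selection criteria contrasted in the abstract above can be sketched for a single gene. The sketch makes several assumptions that the abstract does not state: expression values are already log-transformed, the standardized signal is a Welch-type t-statistic, and the regularized statistic has the form d / (s + s0), where the constant `s0` is added to the standard error. All function names and thresholds are hypothetical.

```python
import math
import statistics


def volcano_point(group_a, group_b):
    """Return (diff, t, se) for one gene from two groups of log values.

    diff is the log-fold-change (x axis), t the Welch-type t-statistic
    (y axis), and se the standard error of diff. Each group needs at
    least two values and se must be non-zero.
    """
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    return diff, diff / se, se


def double_filter(diff, t, min_diff=1.0, min_t=2.0):
    """Double filtering: separate cutoffs on both axes (two perpendicular lines)."""
    return abs(diff) >= min_diff and abs(t) >= min_t


def regularized_t(diff, se, s0):
    """Assumed regularized form: s0 > 0 damps genes with very small se."""
    return diff / (se + s0)


treated = [8.0, 9.0, 10.0]   # made-up log2 expression values
control = [5.0, 6.0, 7.0]
diff, t, se = volcano_point(treated, control)
print(diff, t, double_filter(diff, t), regularized_t(diff, se, s0=0.5))
```

Under this assumed form, a single cutoff on the regularized statistic depends on both `diff` and `se`, so its boundary in the (diff, t) plane is curved rather than a pair of straight lines, which matches the contrast drawn in the abstract.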