
    A mathematical and computational framework for quantitative comparison and integration of large-scale gene expression data

    Analysis of large-scale gene expression studies usually begins with gene clustering. A ubiquitous problem is that different algorithms applied to the same data inevitably give different results, and the differences are often substantial, involving a quarter or more of the genes analyzed. This raises a series of important but nettlesome questions: How are different clustering results related to each other and to the underlying data structure? Is one clustering objectively superior to another? Which differences, if any, are likely candidates to be biologically important? A systematic and quantitative way to address these questions is needed, together with an effective way to integrate and leverage expression results with other kinds of large-scale data and annotations. We developed a mathematical and computational framework to help quantify, compare, visualize and interactively mine clusterings. We show that by coupling confusion matrices with appropriate metrics (linear assignment and normalized mutual information scores), one can quantify and map differences between clusterings. A version of receiver operating characteristic analysis proved effective for quantifying and visualizing cluster quality and overlap. These methods, plus a flexible library of clustering algorithms, can be called from a new expandable set of software tools called CompClust 1.0. CompClust also makes it possible to relate expression clustering patterns to DNA sequence motif occurrences, protein–DNA interaction measurements and various kinds of functional annotations. Test analyses used yeast cell cycle data and revealed data structure not obvious under all algorithms. These results were then integrated with transcription motif and global protein–DNA interaction data to identify G1 regulatory modules.
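The comparison step described in the abstract can be sketched in a few lines. The following is a minimal illustration, not CompClust's API: the function names are hypothetical, the linear assignment is solved by brute force (adequate only for a handful of clusters; `scipy.optimize.linear_sum_assignment` is the usual choice at scale), and the mutual information is normalized by the geometric mean of the two entropies, which is an assumption and may differ from the normalization used in the paper.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def confusion_matrix(labels_a, labels_b):
    """Count genes shared by each (cluster in A, cluster in B) pair."""
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    counts = Counter(zip(labels_a, labels_b))
    return [[counts[(r, c)] for c in cols] for r in rows]


def linear_assignment_score(matrix):
    """Fraction of genes on the best one-to-one pairing of clusters."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    size = max(n_rows, n_cols)
    # Pad with zeros so the matrix is square and every row can be paired.
    padded = [row + [0] * (size - n_cols) for row in matrix]
    padded += [[0] * size for _ in range(size - n_rows)]
    total = sum(map(sum, matrix))
    best = max(
        sum(padded[i][perm[i]] for i in range(size))
        for perm in permutations(range(size))
    )
    return best / total


def normalized_mutual_information(matrix):
    """Mutual information divided by sqrt(H(A) * H(B)) (assumed normalization)."""
    total = sum(map(sum, matrix))
    row_p = [sum(row) / total for row in matrix]
    col_p = [sum(col) / total for col in zip(*matrix)]
    mi = 0.0
    for i, row in enumerate(matrix):
        for j, n in enumerate(row):
            if n:
                p = n / total
                mi += p * log(p / (row_p[i] * col_p[j]))
    h_a = -sum(p * log(p) for p in row_p if p)
    h_b = -sum(p * log(p) for p in col_p if p)
    if h_a == 0 or h_b == 0:
        return 0.0  # a single-cluster result carries no information
    return mi / sqrt(h_a * h_b)
```

Both scores depend only on the confusion matrix, so they are unaffected by how each algorithm happens to name its clusters.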

    FPKM values computed from RNA-seq measurements of single cells taken from developing forelimbs of C57BL/6 mice

    This table displays FPKM values computed from RNA-seq measurements of single cells taken from developing forelimbs of C57BL/6 mice. The reads were aligned with STAR version 2.5.2a and quantified with RSEM version 1.2.15. We used index files provided by www.encodeproject.org: the STAR index files from ENCFF483PAE and the RSEM index files from ENCFF064YNQ, which were built from male mm10 and the GENCODE M4 comprehensive set with tRNAs and ERCC spike-ins, which are all available from ENCFF533JRE.
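For readers unfamiliar with the unit, FPKM is fragments per kilobase of transcript per million mapped fragments. The sketch below shows the standard definition only; the gene names and numbers are hypothetical, and RSEM itself works from estimated expected counts and effective transcript lengths, so its reported values will not generally match this naive calculation.

```python
def fpkm(fragment_counts, lengths_bp):
    """FPKM = count * 1e9 / (length in bp * total mapped fragments)."""
    total = sum(fragment_counts.values())
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total)
        for gene, count in fragment_counts.items()
    }


# Hypothetical counts and transcript lengths, for illustration only.
values = fpkm({"geneA": 100, "geneB": 300}, {"geneA": 1000, "geneB": 2000})
```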

    Integrating expression data, regulatory motif conservation and protein–DNA binding information

    Copyright information: Taken from "A mathematical and computational framework for quantitative comparison and integration of large-scale gene expression data". Nucleic Acids Research 2005;33(8):2580-2594. Published online 10 May 2005. PMCID: PMC1092273. © The Author 2005. Published by Oxford University Press. All rights reserved.

    (A) Binding site enrichment in genes from the four confusion matrix cells that dissect genes in the G1 cell cycle phase. Shown in red is the observed number of genes with an MCS score above threshold for each motif. Shown in blue is the number of genes expected by chance, as computed by bootstrap simulations. The total number of genes each cell contains is in the upper left. (B–D) Heat-map displays showing expression data on the left, followed by MCS scores for a specified motif, followed by protein–DNA binding data for transcription factors implicated in binding to the specified consensus. Color scales for each panel are at the bottom of the figure. For the MCS scores, the color map ranges from 0 to the 99th percentile to minimize the influence of extreme outliers on interpretation. (B) Shown are 14 genes that fall within the EM1/Early G1 intersection cell and have a conserved enrichment in the presence of the SWI5 consensus as measured by MCS scores (see Methods). (C) Shown are 79 genes that fall within the EM2/Late G1 intersection cell and have a high MCS score for MCB. (D) Shown are 20 genes that fall within the EM2/Late G1 intersection cell and have a high MCS score for SCB. In each heat-map, genes are ordered by decreasing MCS score. Significant correlation can be seen between a high MCS score, protein–DNA binding and the expected expression pattern.
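The "expected by chance" counts in panel (A) come from bootstrap simulations whose details are not given here. One plausible reading, offered only as a sketch under that assumption, is to resample as many genes as the cell contains from the full score list and average the number that exceed the threshold. The function name and parameters below are hypothetical.

```python
import random


def expected_hits_by_chance(all_scores, threshold, cell_size,
                            n_rounds=1000, seed=0):
    """Mean count of resampled genes whose score exceeds the threshold."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    total_hits = 0
    for _ in range(n_rounds):
        # Bootstrap: draw cell_size scores with replacement.
        sample = rng.choices(all_scores, k=cell_size)
        total_hits += sum(score > threshold for score in sample)
    return total_hits / n_rounds
```

Comparing this expectation with the observed count for a cell gives a rough sense of whether a motif is enriched there.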