A mathematical and computational framework for quantitative comparison and integration of large-scale gene expression data
Analysis of large-scale gene expression studies usually begins with gene clustering. A ubiquitous problem is that different algorithms applied to the same data inevitably give different results, and the differences are often substantial, involving a quarter or more of the genes analyzed. This raises a series of important but nettlesome questions: How are different clustering results related to each other and to the underlying data structure? Is one clustering objectively superior to another? Which differences, if any, are likely candidates to be biologically important? A systematic and quantitative way to address these questions is needed, together with an effective way to integrate and leverage expression results with other kinds of large-scale data and annotations. We developed a mathematical and computational framework to help quantify, compare, visualize and interactively mine clusterings. We show that by coupling confusion matrices with appropriate metrics (linear assignment and normalized mutual information scores), one can quantify and map differences between clusterings. A version of receiver operating characteristic analysis proved effective for quantifying and visualizing cluster quality and overlap. These methods, plus a flexible library of clustering algorithms, can be called from a new expandable set of software tools called CompClust 1.0. CompClust also makes it possible to relate expression clustering patterns to DNA sequence motif occurrences, protein–DNA interaction measurements and various kinds of functional annotations. Test analyses used yeast cell cycle data and revealed data structure not obvious under all algorithms. These results were then integrated with transcription motif and global protein–DNA interaction data to identify G1 regulatory modules.
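The comparison of two clusterings through a confusion matrix, a linear assignment score and a normalized mutual information (NMI) score can be sketched in a few lines. This is a minimal illustration and not the CompClust API: the function names are hypothetical, the linear assignment step is a brute-force search over permutations (a stand-in that is only practical for a handful of clusters), and the NMI normalization by the geometric mean of the two entropies is an assumption, since the abstract does not state which normalization is used.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def confusion_matrix(a, b):
    """Count the genes shared by each (cluster in a, cluster in b) pair."""
    rows, cols = sorted(set(a)), sorted(set(b))
    counts = Counter(zip(a, b))
    return [[counts[(r, c)] for c in cols] for r in rows]


def linear_assignment_score(matrix):
    """Fraction of genes covered by the best one-to-one pairing of clusters.

    Brute force over permutations (hypothetical stand-in for a proper
    assignment solver); the matrix is zero-padded to a square first.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    size = max(n_rows, n_cols)
    padded = [
        [matrix[i][j] if i < n_rows and j < n_cols else 0 for j in range(size)]
        for i in range(size)
    ]
    total = sum(map(sum, matrix))
    best = max(
        sum(padded[i][perm[i]] for i in range(size))
        for perm in permutations(range(size))
    )
    return best / total


def normalized_mutual_information(matrix):
    """Mutual information divided by sqrt(H(a) * H(b)) (assumed normalization)."""
    total = sum(map(sum, matrix))
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    mi = 0.0
    for i, row in enumerate(matrix):
        for j, n in enumerate(row):
            if n:
                mi += (n / total) * log(n * total / (row_sums[i] * col_sums[j]))
    h_a = -sum(s / total * log(s / total) for s in row_sums if s)
    h_b = -sum(s / total * log(s / total) for s in col_sums if s)
    if h_a == 0 or h_b == 0:
        # Degenerate case: at least one clustering has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mi / sqrt(h_a * h_b)


# Toy cluster labels for eight genes under two algorithms (made-up data).
labels_a = [0, 0, 0, 1, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 0, 0, 2, 2]
cm = confusion_matrix(labels_a, labels_b)
print(cm)
print(linear_assignment_score(cm))
print(normalized_mutual_information(cm))
```

The off-diagonal cells of the matrix, taken after the best pairing, are where the two algorithms disagree, so they point to the genes worth inspecting further.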
FPKM values computed from RNA-seq measurements of single cells taken from developing forelimbs of C57BL/6 mice
This table displays FPKM values computed from RNA-seq measurements of single cells taken from developing forelimbs of C57BL/6 mice. The reads were aligned with STAR version 2.5.2a and quantified with RSEM version 1.2.15.
We used index files provided by www.encodeproject.org: the STAR index files from ENCFF483PAE and the RSEM index files from ENCFF064YNQ, which were built from male mm10 and the GENCODE M4 comprehensive set with tRNAs and ERCC spike-ins, all of which are available from ENCFF533JRE.
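For readers unfamiliar with the unit, the textbook definition of FPKM (fragments per kilobase of transcript per million mapped fragments) can be sketched as below. This is only the standard formula, not a reproduction of RSEM's internal estimation procedure; the function name, gene names, counts and lengths are hypothetical values chosen for illustration.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase per million mapped fragments."""
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical fragment counts and transcript lengths for one cell.
counts = {"GeneA": 500, "GeneB": 1000}
lengths_bp = {"GeneA": 1000, "GeneB": 4000}
total_fragments = 1_000_000

table = {
    gene: fpkm(counts[gene], lengths_bp[gene], total_fragments)
    for gene in counts
}
print(table)
```

Dividing by transcript length is what lets a long gene with many fragments (GeneB here) end up with a lower value than a short gene with fewer fragments.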
Integrating expression data, regulatory motif conservation and protein–DNA binding information
Copyright information: Taken from "A mathematical and computational framework for quantitative comparison and integration of large-scale gene expression data", Nucleic Acids Research 2005;33(8):2580-2594. Published online 10 May 2005. PMCID: PMC1092273. © The Author 2005. Published by Oxford University Press. All rights reserved.
(A) Binding site enrichment in genes from the four confusion matrix cells that dissect genes in the G1 cell cycle phase. Shown in red is the observed number of genes with an MCS score above threshold for each motif. Shown in blue is the number of genes expected by chance, as computed by bootstrap simulations. The total number of genes each cell contains is in the upper left. (B–D) Heat-map displays showing expression data on the left, followed by MCS scores for a specified motif, followed by protein–DNA binding data for transcription factors implicated in binding to the specified consensus. Color scales for each panel are at the bottom of the figure. For the MCS scores, the color map ranges from 0 to the 99th percentile to minimize the influence of extreme outliers on interpretation. (B) Shown are 14 genes that fall within the EM1/Early G1 intersection cell and have a conserved enrichment in the presence of the SWI5 consensus as measured by MCS scores (see Methods). (C) Shown are 79 genes that fall within the EM2/Late G1 intersection cell and have a high MCS score for MCB. (D) Shown are 20 genes that fall within the EM2/Late G1 intersection cell and have a high MCS score for SCB. In each heat-map, genes are ordered by decreasing MCS score. Significant correlation can be seen between a high MCS score, protein–DNA binding and the expected expression pattern.
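The idea of estimating, by bootstrap simulation, how many genes in a confusion matrix cell would exceed a motif score threshold by chance can be sketched as follows. The caption does not describe the authors' exact procedure, so this is an assumed version: it resamples gene sets of the same size as the cell, with replacement, from a genome-wide score table and averages the number of hits. The function name, the score table and the threshold are all hypothetical.

```python
import random


def expected_hits_by_bootstrap(scores, cell_size, threshold, n_rounds=2000, seed=0):
    """Average number of above-threshold genes in random sets of cell_size genes.

    Assumed procedure: sample genes with replacement from the genome-wide
    score table and count how many exceed the threshold in each round.
    """
    rng = random.Random(seed)
    above = [score > threshold for score in scores.values()]
    total_hits = 0
    for _ in range(n_rounds):
        total_hits += sum(rng.choices(above, k=cell_size))
    return total_hits / n_rounds


# Made-up genome of 100 genes, 20 of which score above the threshold.
genome_scores = {f"gene{i}": (5.0 if i < 20 else 0.5) for i in range(100)}
expected = expected_hits_by_bootstrap(genome_scores, cell_size=10, threshold=1.0)
print(expected)
```

An observed count well above this expected value for a given cell and motif is the kind of enrichment the red and blue bars in panel (A) are contrasting.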
Spatiotemporal DNA methylome dynamics of the developing mouse fetus.
Cytosine DNA methylation is essential for mammalian development but understanding of its spatiotemporal distribution in the developing embryo remains limited [1,2]. Here, as part of the mouse Encyclopedia of DNA Elements (ENCODE) project, we profiled 168 methylomes from 12 mouse tissues or organs at 9 developmental stages from embryogenesis to adulthood. We identified 1,808,810 genomic regions that showed variations in CG methylation by comparing the methylomes of different tissues or organs from different developmental stages. These DNA elements predominantly lose CG methylation during fetal development, whereas the trend is reversed after birth. During late stages of fetal development, non-CG methylation accumulated within the bodies of key developmental transcription factor genes, coinciding with their transcriptional repression. Integration of genome-wide DNA methylation, histone modification and chromatin accessibility data enabled us to predict 461,141 putative developmental tissue-specific enhancers, the human orthologues of which were enriched for disease-associated genetic variants. These spatiotemporal epigenome maps provide a resource for studies of gene regulation during tissue or organ progression, and a starting point for investigating regulatory elements that are involved in human developmental disorders.
Author Correction: An atlas of dynamic chromatin landscapes in mouse fetal development.
A Correction to this paper has been published: https://doi.org/10.1038/s41586-020-03089-4