
    Cell-Type-Specific Predictive Network Yields Novel Insights into Mouse Embryonic Stem Cell Self-Renewal and Cell Fate

    <div><p>Self-renewal, the ability of a stem cell to divide repeatedly while maintaining an undifferentiated state, is a defining characteristic of all stem cells. Here, we clarify the molecular foundations of mouse embryonic stem cell (mESC) self-renewal by applying a proven Bayesian network machine learning approach to integrate high-throughput data for protein function discovery. By focusing on a single stem-cell system, at a specific developmental stage, within the context of well-defined biological processes known to be active in that cell type, we produce a consensus predictive network that reflects biological reality more closely than those made by prior efforts using more generalized, context-independent methods. In addition, we show how machine learning efforts may be misled if the tissue-specific role of mammalian proteins is not defined in the training set and circumscribed in the evidential data. For this study, we assembled an extensive compendium of mESC data: ∼2.2 million data points, collected from 60 different studies, under 992 conditions. We then integrated these data into a consensus mESC functional relationship network focused on biological processes associated with embryonic stem cell self-renewal and cell fate determination. Computational evaluations, literature validation, and analyses of predicted functional linkages show that our results are highly accurate and biologically relevant. Our mESC network predicts many novel players involved in self-renewal and serves as the foundation for future pluripotent stem cell studies. This network can be used by stem cell researchers (at <a href="http://StemSight.org" target="_blank">http://StemSight.org</a>) to explore hypotheses about gene function in the context of self-renewal and to prioritize genes of interest for experimental validation.</p> </div>

    Comparison of mESC-Specific and Test Network Connectivity.

    <p><i>Notes:</i> The percentage of strongly connected gene hubs (those with a mean degree greater than 0.2 out of a total of 21,291 protein coding genes) is markedly higher in the mESC-specific network than in the superset or negative control networks. Degree is a measure that reflects the number of genes within the network that are predicted to be functionally linked to a given gene. In these networks, which were trained using a gold standard focused on mESC self-renewal, a higher mean degree indicates that a gene is more likely to interact with multiple other genes, and such genes tended to be enriched for mESC-specific self-renewal processes. Highly connected genes in the negative control network were predominantly annotated to biological processes related to self-renewal functions that are active in all cell types, such as transcriptional regulation and cell proliferation, as well as developmental processes associated with multiple cell types, such as embryonic morphogenesis.</p>
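The hub criterion in these notes can be illustrated with a small sketch. The notes do not give the formula for mean degree, so the code below assumes it is a gene's summed edge weight divided by the number of other genes in the network. The function names and the toy weights are hypothetical; only the gene names and the 0.2 threshold come from the captions.

```python
from collections import defaultdict


def mean_degree(edge_weights, n_genes):
    """Return each gene's summed incident edge weight divided by (n_genes - 1).

    Assumption: "mean degree" is the average edge weight between a gene and
    every other gene in the network. The notes do not state the formula.
    """
    totals = defaultdict(float)
    for (gene_a, gene_b), weight in edge_weights.items():
        totals[gene_a] += weight
        totals[gene_b] += weight
    return {gene: total / (n_genes - 1) for gene, total in totals.items()}


def hub_fraction(edge_weights, n_genes, threshold=0.2):
    """Fraction of all genes whose mean degree exceeds the threshold."""
    degrees = mean_degree(edge_weights, n_genes)
    hubs = [gene for gene, degree in degrees.items() if degree > threshold]
    return len(hubs) / n_genes


# Toy network: gene names come from the captions, weights are invented.
toy_edges = {
    ("Pou5f1", "Sox2"): 0.99,
    ("Pou5f1", "Nanog"): 0.98,
    ("Sox2", "Nanog"): 0.97,
    ("Tdh", "Pou5f1"): 0.90,
    ("Phc1", "Tdh"): 0.01,
}
toy_hub_fraction = hub_fraction(toy_edges, n_genes=5)
```

In this toy case four of the five genes clear the threshold. On the real network the same calculation would run over all 21,291 protein coding genes.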

    Summary of Integrated mESC Genomic Data.

    *<p>Top Ranked Edges | Top 0.01% of Edges.</p><p><i>Notes:</i> A total of 77 high-throughput datasets were collected from various public sources to create a compendium of mESC-specific data that included 992 conditions (e.g., columns in a microarray matrix) and ∼2.2 million data points (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0056810#pone.0056810.s008" target="_blank">Table S3</a>). These data were standardized and integrated into ∼6 billion gene/protein pairs, and used as evidential data to generate a predictive mESC-specific network focused on mESC self-renewal and cell fate. Datasets were weighted based on the amount of mutual information each shares with all other evidential datasets used by the Bayes net. A low mean redundancy indicates that a dataset contributes largely unique information. As observed in other similar Bayesian network data integration efforts (including integration of human data), genetic and physical interaction data were the most reliable, but also the least common <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0056810#pone.0056810-Charniak1" target="_blank">[11]</a>. We strove to assemble a diverse and comprehensive set of mESC data that would provide the most coverage and be highly informative. Protein-DNA interaction data included chromatin immunoprecipitation (ChIP) followed by microarray hybridization (ChIP-Chip) and ChIP followed by high-throughput sequencing (ChIP-Seq).
Top ranked edges were the 639 edges with a rank order of 1 and an inferred edge weight ≥0.9999 (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0056810#pone-0056810-g003" target="_blank">Figure 3A</a>, <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0056810#pone.0056810.s016" target="_blank">Table S11</a>); the top 0.01% of the network consists of the 22,664 edges with an inferred edge weight ≥0.9966 (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0056810#pone-0056810-g003" target="_blank">Figure 3B</a>, dataset contributions to top 0.01% edges available at StemSight.org/stemdata.html).</p>
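As a rough illustration of how a "top 0.01% of edges" cut can be taken, the sketch below ranks edges by weight and keeps a fixed fraction. It also checks the 22,664 figure under the assumption that the network holds one edge per unordered pair of the 21,291 protein coding genes mentioned in the connectivity notes. The notes do not confirm that assumption. The function name, edge labels, and toy weights are hypothetical.

```python
import math


def top_fraction(edge_weights, fraction):
    """Return the highest-weighted share of edges and the weight cutoff."""
    ranked = sorted(edge_weights.items(), key=lambda item: item[1], reverse=True)
    k = max(1, math.floor(len(ranked) * fraction))
    top = ranked[:k]
    return top, top[-1][1]


# Assumption: a complete graph over 21,291 genes, one edge per gene pair.
n_pairs = math.comb(21291, 2)
n_top = n_pairs // 10000  # 0.01% of all pairs, rounded down

# Toy example with invented weights: keep the top 25% of eight edges.
toy_weights = {
    "e1": 0.9999, "e2": 0.9966, "e3": 0.95, "e4": 0.70,
    "e5": 0.50, "e6": 0.30, "e7": 0.10, "e8": 0.01,
}
top_edges, cutoff = top_fraction(toy_weights, 0.25)
```

Under the complete-graph assumption there are 226,642,695 gene pairs, and 0.01% of them rounds down to 22,664, which agrees with the edge count reported in the notes.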

    Cell-Type-Specific Data Integration and Machine-Learning Methodology.

    <p>Our approach is designed to generate reliable and relevant predictive biological networks using high-throughput data limited to a specific cell type and a training gold standard focused on biological processes active in that cell type. This process can be distilled into four basic steps: <b>1.</b> Collect and standardize cell-type-specific data from studies using diverse high-throughput experimental techniques, including microarray gene expression, chromatin immunoprecipitation (ChIP) on chip (ChIP-Chip), ChIP followed by high-throughput sequencing (ChIP-Seq), affinity purification followed by mass spectrometry (AP-MS), whole-genome small interfering RNA (siRNA) screens, and phylogenetic sequence similarity. For this case study, we focused on mouse embryonic stem cell (mESC) data. <b>2.</b> Curate a process-specific gold standard training set to provide a baseline for assessing data reliability and significance for related biological processes known to be active in the cell type of interest. Our gold standard training set consists of experimentally validated pair-wise associations between genes and proteins known to be involved in mESC self-renewal, pluripotency, and cell fate determination. <b>3.</b> Iteratively test and validate networks. <i>A.</i> Use a naïve Bayesian network classifier to perform inference and predict novel gene and protein relationships. Our network predicts pairwise functional associations that influence mESC self-renewal and early developmental processes. <i>B.</i> Validate the accuracy of predicted functional relationships using standard machine learning performance metrics, cross validation, and bootstrapping, followed by evaluation of biological content. Our protocol for assessing networks ensures our results are highly reliable and relevant to mESC self-renewal. <b>4.</b> Provide community access to analyses and tools.
Through StemSight.org, we provide access to network analyses and visualization tools to enable users to further explore networks centered on their genes of interest.</p>
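Step 3B refers to standard machine learning performance metrics without naming them. One common choice, assumed here for illustration only, is the area under the ROC curve (AUC), computed by comparing predicted edge weights for gold standard positive pairs against those for negative pairs. The function name, scores, and labels below are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive pair receives a
    higher score than a randomly chosen negative pair (ties count as 0.5).
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Invented held-out predictions: 1 marks a gold standard positive pair.
toy_scores = [0.99, 0.80, 0.60, 0.30, 0.10]
toy_labels = [1, 1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
```

In a cross-validation setting, this score would be computed on gold standard pairs held out from training and then averaged across folds.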

    Data Visualization for Mining mESC Self-Renewal Gene Predictions.

    <p><b>A.</b> Views of <i>Tdh</i>-centric networks created using our StemSight Scout visualization tool, available at StemSight.org. Adjusting Scout network views to display only edges with inference scores of 0.5 and 0.9997 shows that the novel gene <i>Tdh</i> is tightly connected to many well-known self-renewal genes in our training gold standard, including Pou5f1, Sox2, Nanog, Nr0b1, and Phc1. <b>B.</b> Supporting edge information for the <i>Tdh</i>–<i>Pou5f1</i> edge. This edge is supported by several protein-DNA interaction (PDI) assays as well as gene expression datasets from a study investigating mESC differentiation in different mESC lines. For supporting edge detail between <i>Tdh</i> and other gold standard genes, see <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0056810#pone.0056810.s005" target="_blank">Figure S5</a> or explore the <i>Tdh</i> interactome online at StemSight.org/scout. <b>C.</b> SPELL for StemSight Search Results. From a supporting edge information window, you can drill down to the individual gene expression levels in microarray datasets. This view shows the rank-ordered expression correlations observed between <i>Tdh</i> and gold standard genes <i>Gbf3, Fbxo15, Nr0b1, Phc1, Pou5f1,</i> and <i>Sox2.</i></p>

    Advantages of Statistically Principled Approach.

    <p><b>A.</b> The iScMiD Core20 subnetwork of transcription factors used as bait in the 12 studies included in the iScMiD integrated mESC database <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0056810#pone.0056810-Wang1" target="_blank">[34]</a>, <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0056810#pone.0056810-Kim2" target="_blank">[35]</a>, recreated as an undirected graph using edges available from the iScMiD website. In the iScMiD network, all edges have equal weight and all high-throughput data is considered equally reliable; hence the authors note there may be many false positives. <b>B.</b> The fully connected clique of mESC network posteriors for the iScMiD Core20 transcription factors predicts connections not shown in iScMiD and reveals potential false positives, as not all connections are equally supported by the evidential data. For comparison, we checked underlying data for two edges, highlighted in yellow, one of which is not supported in the iScMiD subnetwork (Suz12–Sox2), and one of which is only weakly supported in our mESC-only network (Suz12–Myc). <b>C.</b> Contrasting detailed information about underlying data supporting the strong functional linkage between <i>Suz12</i>–<i>Sox2</i> (Edge Weight: 0.9998) versus the weak linkage between <i>Suz12</i>–<i>Myc</i> (Edge Weight: 0.0007) shows that top supporting datasets vary from edge to edge and that the strength of dataset contribution to edge weight may differ significantly. (Highlighted rows are datasets that support both edges.)</p>

    Naïve Bayesian Networks for Genomic Data Integration.

    <p>A Bayesian network is a machine learning tool for organizing and encoding statistical dependence relationships among pieces of knowledge. A naïve Bayesian network is a simplified version of a Bayesian network in which all child nodes are dependent on the parent and conditionally independent of each other. This type of graphical device may be used to combine different types of evidential data and prior knowledge to generate probabilistic models of biological functional relationship networks. In our naïve Bayes net structure, the functional relationship between the pair of proteins <i>i</i> and <i>j</i> (FR<sub>ij</sub>) is a hidden conditional variable (indicating the unknown or hidden probability that these two gene products are functionally associated), on which all dataset evidence variables are dependent; each evidence variable represents the discretized, observed similarity score in dataset <i>k</i> for proteins <i>i</i> and <i>j</i>. The edge weight (e<sub>ij</sub>) represents the probability that the proteins <i>i</i> and <i>j</i> are functionally related given the evidence observed in different high-throughput datasets. Strong evidence of a functional relationship between protein pairs, measured by edge weight, indicates the proteins behave in a similar way given observed patterns in the high-throughput data. The specific nature of that relationship can be deduced by evaluating the type of datasets that contribute to that edge weight, followed by experimental validation.</p>
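The edge weight described here is a naïve Bayes posterior: the prior probability of a functional relationship is multiplied by the likelihood of each dataset's discretized score and then normalized against the unrelated case. The sketch below is a minimal illustration of that calculation and not the authors' implementation. The function name, dataset names, bin labels, prior, and likelihood tables are all hypothetical; in practice the tables would be estimated from the gold standard.

```python
import math


def edge_posterior(prior, likelihoods, observed_bins):
    """P(functionally related | evidence) under a naive Bayes model.

    likelihoods[dataset] is a pair of dicts mapping a discretized score bin
    to P(bin | related) and P(bin | unrelated). Datasets are treated as
    conditionally independent given the hidden relationship variable.
    """
    log_related = math.log(prior)
    log_unrelated = math.log(1.0 - prior)
    for dataset, score_bin in observed_bins.items():
        p_related, p_unrelated = likelihoods[dataset]
        log_related += math.log(p_related[score_bin])
        log_unrelated += math.log(p_unrelated[score_bin])
    # Normalize in log space to avoid underflow when many datasets are used.
    peak = max(log_related, log_unrelated)
    related = math.exp(log_related - peak)
    unrelated = math.exp(log_unrelated - peak)
    return related / (related + unrelated)


# Hypothetical likelihood tables for two datasets with two score bins each.
toy_likelihoods = {
    "expression": ({"low": 0.2, "high": 0.8}, {"low": 0.9, "high": 0.1}),
    "chip_seq": ({"low": 0.3, "high": 0.7}, {"low": 0.8, "high": 0.2}),
}
toy_prior = 0.01
weight_high = edge_posterior(
    toy_prior, toy_likelihoods, {"expression": "high", "chip_seq": "high"}
)
weight_low = edge_posterior(
    toy_prior, toy_likelihoods, {"expression": "low", "chip_seq": "low"}
)
```

With these invented numbers, agreement between both datasets lifts the edge weight well above the prior, while two low scores push it below the prior. This is the sense in which the contributing datasets explain an edge.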

    Comparing Subnetworks of WNT Signaling Pathway Participants.

    <p><b>A.</b> WNT Signaling Pathway Subnetwork. A model of the WNT signaling pathway adapted from the curated KEGG pathway for <i>M. musculus</i> (Mmu) includes SRCs for Wnt, Frizzled, and Dishevelled pathway participants, illustrating that not all family members are equally supported by evidential data. Curated pathways, which are cell-type agnostic, cannot capture these differences in connectivity. A corresponding network of mESC posterior edges involved in this view of WNT signaling (created in Cytoscape) demonstrates the variance in edge weights and SRCs in the signaling cascade. <b>B.</b> The same WNT signaling subnetwork produced using Mmu superset and negative control posterior edge weights and SRCs shows a different picture of connectivity than the mESC network. Far more WNT signaling activity between different WNT family member ligands and Frizzled receptors is evident in the test subnetworks. This may reflect WNT signaling activity observed in data from both mESCs and other cellular contexts in the Mmu superset of features. The influence of WNT signaling in other cellular contexts is even stronger in the negative control subnetwork.</p>

    Example of RNA editing at microRNA target sites.

    <p>One hundred nucleotides of the 3′UTR of <i>Rpa1</i> are shown. Multiple microRNA target sites form a dense cluster in this region and contain many A-to-I editing sites. Red bases represent RNA editing sites, and blue and green bases represent different microRNA seed locations.</p>