
    A Machine Learning Approach for Identifying Novel Cell Type–Specific Transcriptional Regulators of Myogenesis

    Transcriptional enhancers integrate the contributions of multiple classes of transcription factors (TFs) to orchestrate the myriad spatio-temporal gene expression programs that occur during development. A molecular understanding of enhancers with similar activities requires the identification of both their unique and their shared sequence features. To address this problem, we combined phylogenetic profiling with a DNA-based enhancer sequence classifier that analyzes the TF binding sites (TFBSs) governing the transcription of a co-expressed gene set. We first assembled a small number of enhancers that are active in Drosophila melanogaster muscle founder cells (FCs) and other mesodermal cell types. Using phylogenetic profiling, we increased the number of enhancers by incorporating orthologous but divergent sequences from other Drosophila species. Functional assays revealed that the diverged enhancer orthologs were active in patterns largely similar to those of their D. melanogaster counterparts, although there was extensive evolutionary shuffling of known TFBSs. We then built and trained a classifier using this enhancer set and identified additional related enhancers based on the presence or absence of known and putative TFBSs. Predicted FC enhancers were over-represented in proximity to known FC genes, and many of the TFBSs learned by the classifier were found to be critical for enhancer activity, including POU homeodomain, Myb, Ets, Forkhead, and T-box motifs. Empirical testing also revealed that the T-box TF encoded by org-1 is a previously uncharacterized regulator of muscle cell identity. Finally, we found extensive diversity in the composition of TFBSs within known FC enhancers, suggesting that motif combinatorics plays an essential role in the cellular specificity exhibited by such enhancers.
In summary, machine learning combined with evolutionary sequence analysis is useful for recognizing novel TFBSs and for facilitating the identification of cognate TFs that coordinate cell type–specific developmental gene expression patterns.

    Municipal Corporations, Homeowners, and the Benefit View of the Property Tax


    The enhancer classifier performs with high specificity and sensitivity.

    <p>(A) Over-representation of TFBSs in the training set including only <i>D. melanogaster</i> enhancers and in the set extended using phylogenetic profiling, as compared with background sequence. P-values were adjusted for multiple testing using the method of Benjamini and Hochberg (BH) <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Benjamini1" target="_blank">[120]</a>. (B) Average ROC curve for the 10-fold cross-validation. Our method achieves an area under the ROC curve of 0.89 (shaded in gray). FPR: false-positive rate; TPR: true-positive rate. (C) Distribution of FC enhancer scores for the genome-wide scan. Scores assigned by the classifier for each evaluated sequence are shown in red. We used an FPR of 5% to define a cut-off for putative enhancers (dotted blue line; see <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#s4" target="_blank">Materials and Methods</a> for details). (D) Fold-enrichment in 180 validated FC genes in the neighborhood of putative FC enhancers, as determined for different FPRs. Intergenic putative FC enhancers were associated with the closest gene, whereas intronic sequences were associated with their host gene. P-values were computed using the binomial test.</p>
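The Benjamini and Hochberg adjustment named in panel (A) can be sketched in a few lines. This is a minimal pure-Python illustration of the standard procedure, not the authors' code; the function name `benjamini_hochberg` and the example p-values are assumptions made up for illustration and do not come from the paper.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values, in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five motifs (illustrative only).
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(raw)
print(adj)
```

A motif would then be called over-represented when its adjusted value falls below the chosen false discovery rate. The threshold itself is a choice left to the analyst and is not specified in this caption.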

    DNA binding domains of the TFs most relevant to FC enhancer classification.

    <p>Only DNA binding domains for the fifty most relevant TFs have been included. TFs were ranked according to the SVM weights of their respective motifs, which represent their discriminating power. Only the highest scoring motif for each TF was considered (median ranks computed across 10 random partitions of the training data varied between 12 and 117). <i>De novo</i> motifs were explicitly excluded from this analysis. TF domains and sequences have been clustered using average linkage and Euclidean distance. The dendrogram on top of the heatmap represents the relationships among the sequences in the training data, built on the presence/absence of TFBSs recognized by a specific class of TF DNA binding domain. The dendrogram on the left of the heatmap shows the relationships among the different TF DNA binding domains.</p>
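The ranking step described in this caption (keep only the highest scoring motif per TF, then take the top TFs) can be sketched as below. This is a hedged illustration, not the authors' pipeline: the function name `rank_tfs_by_best_motif`, the use of the absolute weight as the measure of discriminating power, and all motif names, TF names and weights are assumptions introduced for the example.

```python
def rank_tfs_by_best_motif(motif_weights, motif_to_tf, top_n=50):
    """Rank TFs by the strongest weight among their motifs.

    motif_weights: dict mapping motif name -> linear SVM weight.
    motif_to_tf:   dict mapping motif name -> TF name.
    Assumption: discriminating power is taken as the absolute weight.
    """
    best = {}  # TF name -> (motif name, score)
    for motif, weight in motif_weights.items():
        tf = motif_to_tf[motif]
        score = abs(weight)
        # Keep only the highest scoring motif for each TF.
        if tf not in best or score > best[tf][1]:
            best[tf] = (motif, score)
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    return [(tf, motif, score) for tf, (motif, score) in ranked[:top_n]]


# Hypothetical weights and motif-to-TF assignments (illustrative only).
weights = {"motif_1": 0.8, "motif_2": -1.2, "motif_3": 0.3, "motif_4": 0.5}
motif_tf = {"motif_1": "TF_A", "motif_2": "TF_A",
            "motif_3": "TF_B", "motif_4": "TF_C"}
ranking = rank_tfs_by_best_motif(weights, motif_tf, top_n=2)
print(ranking)
```

With `top_n=50` the same function would return the fifty most relevant TFs, matching the cut-off used for the heatmap.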

    The wild-type activities of FC enhancers require input from classifier-defined Myb and POUHD TF binding motifs.

    <p>(A) TRANSFAC position weight matrices for Myb (V$MYB_Q6) and POUHD (V$POU1F1_Q6) enriched motifs identified by the classifier. (B) Binding site sequences in the <i>Ndg</i> enhancer for Myb and POUHD and versions in which those sites are selectively mutated. Motifs were defined by searching for matches to the vertebrate homologues in the UniPROBE database <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Robasky1" target="_blank">[99]</a>. The identification of these binding sites and the designs of the mutant versions are described in <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531.s013" target="_blank">Table S4</a>. (C) GFP (green) and β-Gal (magenta) are co-expressed when driven by the wild-type (WT) <i>Ndg</i> enhancer (<i>Ndg<sup>WT</sup>-GFP</i> and <i>Ndg<sup>WT</sup>-lacZ</i>, respectively). (D) GFP (green) expression driven by a version of the <i>Ndg</i> enhancer in which POUHD sites are selectively inactivated (<i>Ndg<sup>POUHD</sup>-GFP</i>) is significantly reduced compared to β-Gal (magenta) driven by <i>Ndg<sup>WT</sup>-lacZ</i>. (E) β-Gal driven by a version of the <i>Ndg</i> enhancer in which Myb binding sites are selectively inactivated (<i>Ndg<sup>Myb</sup>-lacZ</i>) is de-repressed into additional somatic mesodermal cells compared to GFP driven by a WT version of the <i>Ndg</i> enhancer (<i>Ndg<sup>WT</sup>-GFP</i>).</p>

    Candidate enhancers predicted by the classifier are active in FCs.

    <p><i>In situ</i> hybridization of <i>dve</i> in wild-type (WT) embryos and embryos over-expressing Ras (Twi>Ras) in the mesoderm (A). Note the increased activity of <i>dve</i> in Twi>Ras embryos, indicative of an FC gene <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Estrada1" target="_blank">[22]</a>. GFP driven by the classifier-predicted enhancers associated with the upstream sequences of <i>slou</i> (arrows in B) and <i>slp1</i> (arrows in C). Slou protein (magenta) co-expresses with GFP (green) in <i>slou-GFP</i> embryos (B). Duf (magenta), which marks all FCs, co-expresses with <i>slp1</i>-<i>GFP</i> (green) (C). GFP (D) driven by the classifier-predicted intronic sequence associated with the <i>dve</i> gene co-expresses with Mef2 (D′) in myotubes at stage 15 in <i>dve-GFP</i> embryos.</p>

    TFBS combinatorics within FC enhancers.

    <p>(A) Distribution of Tcf, Mad, Pnt, Twi, Tin, POUHD, Tbx, Myb, Fkh, HD and Mef2 TFBSs in FC enhancers. Binding sites for Tcf, Mad, Pnt, Twi and Tin were previously published <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Philippakis1" target="_blank">[5]</a>. Motif matches for motifs most relevant to the classification for a given DNA binding domain class: POUHD (V$OCT_01, V$POU1F1_Q6, V$OCT4_02), Tbx (V$TBX5_01, I$BYN_Q6), Myb (V$MYB_Q6), Fkh (V$FOXO3_01, V$FOXO1_Q5, V$FREAC2_01), HD (I$ABDA_Q6, V$CDX5_Q5, V$IFP_03, V$PAX4_02), and Mef2 (V$AMEF2_Q6, V$HMEF2_Q6). These sites were mapped using MAST under default parameters <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Bailey1" target="_blank">[118]</a>. (B) A generic FC enhancer receives differential input from signal-activated, ubiquitous, tissue-restricted and cell type-specific TFs.
HD binding motifs are represented as both tissue-restricted and cell type-specific classes since these motifs receive input from both Hox TFs, which are widely expressed in the mesoderm <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Capovilla1" target="_blank">[35]</a>, <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Michelson1" target="_blank">[59]</a>, <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Enriquez1" target="_blank">[67]</a>, and muscle identity HD TFs, such as Slou, Msh and Ap, which are cell type-specific <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Knirr2" target="_blank">[54]</a>, <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Bourgouin1" target="_blank">[68]</a>, <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Nose1" target="_blank">[69]</a>. For this diagram, HD binding sites were not subdivided into the distinct binding profiles that have been identified for each individual HD TF (<a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Noyes1" target="_blank">[83]</a>, <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1002531#pgen.1002531-Berger1" target="_blank">[126]</a> and B. W. Busser, L. Shokri, S. A. Jaeger, S. S. Gisselbrecht, A. Singhania, M. F. Berger, B. Zhou, M. L. Bulyk and A. M. Michelson, unpublished data).</p>