
    The relative importance of features for predicting binding regions of different TFs.

    <p>Importance was defined as the decrease in accuracy after dropping a feature. The accuracy range was normalized to [0, 1] for each TF, where 0 is blue and 1 is red. The TFs were grouped into three classes as shown in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#pcbi.1004418.g003" target="_blank">Fig 3</a>. Arrowheads indicate the most important features for predicting binding regions for most TFs.</p>
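    <p>The importance measure described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the accuracy values are hypothetical, and min-max scaling is assumed as the way the range is mapped onto [0, 1] for each TF.</p>

```python
def drop_feature_importance(full_accuracy, accuracy_without):
    """Importance of a feature = accuracy lost when that feature is dropped."""
    return {name: full_accuracy - acc for name, acc in accuracy_without.items()}


def normalize_unit_range(scores):
    """Min-max scale one TF's importance scores onto [0, 1] (assumed scaling)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All features equally important: map everything to 0.
        return {name: 0.0 for name in scores}
    return {name: (s - lo) / (hi - lo) for name, s in scores.items()}


# Hypothetical accuracies for a single TF: the full model, and the model
# retrained with one feature left out at a time.
full = 0.80
without = {"motif": 0.78, "H3": 0.70, "PC1": 0.75}

importance = normalize_unit_range(drop_feature_importance(full, without))
for name, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

    <p>Scaling per TF makes the color scale comparable across TFs, at the cost of hiding differences in absolute accuracy between them.</p>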

    Expression patterns of genes with accessible motif occurrences are more highly correlated than those with inaccessible motifs.

    <p>1. The number of time points in the time-series gene expression experiment.</p><p>2. The significance of the difference between the two correlation coefficient distributions (within-group pairwise correlations of the bound and unbound sets), from a one-sided KS test.</p>
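    <p>A rough sketch of the comparison named in note 2, under stated assumptions: Pearson correlation is assumed (the caption does not name the coefficient), the expression profiles are invented, and the p-value uses the standard large-sample approximation for the one-sided two-sample KS statistic, which is only approximate for small samples.</p>

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_correlations(profiles):
    """Within-group correlations over all unordered gene pairs."""
    return [pearson(a, b) for a, b in combinations(profiles, 2)]


def ks_one_sided(higher, lower):
    """One-sided two-sample KS test that `higher` is shifted toward larger values.

    Returns (D+, p), where D+ is the largest amount by which the empirical CDF
    of `lower` exceeds that of `higher`, and p is the asymptotic approximation
    exp(-2 * D+^2 * n*m / (n + m)).
    """
    n, m = len(higher), len(lower)
    d_plus = 0.0
    for t in sorted(set(higher) | set(lower)):
        cdf_high = sum(v <= t for v in higher) / n
        cdf_low = sum(v <= t for v in lower) / m
        d_plus = max(d_plus, cdf_low - cdf_high)
    p_value = math.exp(-2.0 * d_plus * d_plus * n * m / (n + m))
    return d_plus, p_value


# Hypothetical time-series profiles (rows = genes, columns = time points).
accessible = [[1, 2, 3, 4], [2, 4, 6, 8.5], [1, 3, 4, 6]]
inaccessible = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 1, 3]]

d_plus, p_value = ks_one_sided(
    pairwise_correlations(accessible), pairwise_correlations(inaccessible)
)
print(d_plus, p_value)
```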

    The 23 features used in this study.

    <p>1. Values of the two sequence motif features were generated by scanning sequences with Position Weight Matrices from ScerTF [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#pcbi.1004418.ref043" target="_blank">43</a>].</p><p>2. p bound or unbound genomic region for a TF in question based on ChIP data</p><p>3. The 11 chromatin state features (CS) were obtained from Pokholok <i>et al</i>. [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#pcbi.1004418.ref044" target="_blank">44</a>]. The values for H3, ESA1, and GCN5 are averages over the analyzed genomic region.</p><p>4. The 10 DNA structure features (DS) were generated by principal component analysis (PCA, see <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#sec009" target="_blank">Methods</a>) of 125 DNA structure properties from DiProDB [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#pcbi.1004418.ref041" target="_blank">41</a>]. PC1-5: the average of the principal component values over each target genomic region. ΔPC1-5: the difference between the average value of a particular principal component over a target genomic region and the average value of the same principal component over the regions flanking the target.</p><p>5. The biological meaning was interpreted from the 10 dinucleotide properties with the highest PCA loading coefficients.</p>
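    <p>The PC and ΔPC definitions in note 4 amount to a region mean and a difference of means. The sketch below assumes a per-position track of values for one principal component and a symmetric flank width; both the track layout and the <code>flank</code> parameter are hypothetical, since the caption does not specify them.</p>

```python
from statistics import mean


def pc_features(pc_track, start, end, flank):
    """Return (PC, delta_PC) for the target region [start, end).

    PC       : mean principal component value over the target region.
    delta_PC : PC minus the mean over the regions flanking the target.
    `flank` (positions taken on each side) is an assumed parameter.
    Raises statistics.StatisticsError if the target or both flanks are empty.
    """
    target = pc_track[start:end]
    left = pc_track[max(0, start - flank):start]
    right = pc_track[end:end + flank]
    pc = mean(target)
    delta_pc = pc - mean(left + right)
    return pc, delta_pc


# Hypothetical track for one principal component along a short sequence.
track = [0.2, 0.1, 0.9, 1.1, 1.0, 0.3, 0.2]
pc1, delta_pc1 = pc_features(track, start=2, end=5, flank=2)
print(pc1, delta_pc1)
```

    <p>A positive ΔPC means the target region scores higher on that component than its surroundings, so the feature captures local contrast rather than absolute level.</p>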

    Contribution of Sequence Motif, Chromatin State, and DNA Structure Features to Predictive Models of Transcription Factor Binding in Yeast

    <div><p>Transcription factor (TF) binding is determined by the presence of specific sequence motifs (SM) and chromatin accessibility, where the latter is influenced by both chromatin state (CS) and DNA structure (DS) properties. Although SM, CS, and DS have been used to predict TF binding sites, a predictive model that jointly considers CS and DS has not been developed to predict either TF-specific binding or general binding properties of TFs. Using budding yeast as a model, we found that machine learning classifiers trained with either CS or DS features alone perform better in predicting TF-specific binding compared to SM-based classifiers. In addition, simultaneously considering CS and DS further improves the accuracy of the TF binding predictions, indicating the highly complementary nature of these two properties. The contributions of SM, CS, and DS features to binding site predictions differ greatly between TFs, allowing TF-specific predictions and potentially reflecting different TF binding mechanisms. In addition, a "TF-agnostic" predictive model based on three DNA “intrinsic properties” (<i>in silico</i> predicted nucleosome occupancy, major groove geometry, and dinucleotide free energy) that can be calculated from genomic sequences alone has performance that rivals the model incorporating experiment-derived data. This intrinsic property model allows prediction of binding regions not only across TFs, but also across DNA-binding domain families with distinct structural folds. Furthermore, these predicted binding regions can help identify TF binding sites that have a significant impact on target gene expression. Because the intrinsic property model allows prediction of binding regions across DNA-binding domain families, it is TF-agnostic and likely describes the general binding potential of TFs. Thus, our findings suggest that it is feasible to establish a TF-agnostic model for identifying functional regulatory regions in potentially any sequenced genome.</p></div>

    Evaluation of features distinguishing between bound and unbound regions and between regions bound by a single TF compared to the other TFs.

    <p><i>(A</i>,<i>B)</i> The <i>p</i>-values (color scale shown, adjusted by false discovery rate control for multiple testing) from two-sided Wilcoxon rank sum tests of differences in feature values (<i>A</i>) between bound and unbound regions of all the 40 analyzed TFs jointly (ALL) and separately, and <i>(B)</i> between bound regions of a single vs. the remaining TFs. The <i>p</i>-values for (<i>A</i>) and <i>(B)</i> are shown in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#pcbi.1004418.s001" target="_blank">S1 Fig</a> and <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#pcbi.1004418.s002" target="_blank">S2 Fig</a>, respectively. <i>(C</i>,<i>D)</i> The value distributions of the 23 features for regions bound (black) and not bound (white) by (<i>C</i>) RAP1 and (<i>D</i>) ZAP1, respectively. The values were normalized into [0, 1] for each feature. The <i>p</i>-values of two-tailed Wilcoxon rank sum tests are shown below the boxplots: red, <i>p</i> < 10<sup>−3</sup>; white, <i>p</i> = 10<sup>−3</sup>; blue, <i>p</i> > 10<sup>−3</sup>.</p>
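    <p>The caption states that the <i>p</i>-values were adjusted by false discovery rate control but does not name the procedure. As an illustration only, the sketch below uses the Benjamini-Hochberg step-up adjustment, which is an assumption here; the input <i>p</i>-values are invented.</p>

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumed procedure: the source only says "false discovery rate control".
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per feature, from rank sum tests.
raw = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(raw))
```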

    Contribution of SM, CS, and DS features to overall and individual TF binding region prediction.

    <p>(<i>A</i>) The F-measure distributions of random forest classifications with different individual features or combinations of features. The y-axis indicates the probability of a given F-measure. The arrowheads indicate the average F-measures. (<i>B</i>) The relationship between F-measures of binding predictions based on CS features only and DS features only. The dotted line shows the 1-to-1 relationship. Points in green (yellow) represent the TFs for which performance is better in the DS-only (CS-only) model. (<i>C</i>) Heat map showing the relative performance (<i>i</i>.<i>e</i>. F-measures standardized to mean zero and variance one) in predicting the binding regions of each TF using individual features or combinations of features. The TFs are grouped into three classes: TFs with binding regions that can be predicted by either CS or DS (Group 1); TFs with binding regions that cannot be predicted well with only CS (Group 2) or only DS (Group 3) features.</p>
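    <p>For reference, the two quantities used in this caption can be sketched as below. Two assumptions are made that the caption does not confirm: the F-measure is taken to be the balanced F1 score, and the standardization uses the population variance. The counts and scores are invented.</p>

```python
import math


def f_measure(tp, fp, fn):
    """Balanced F-measure (F1, assumed): harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def standardize(values):
    """Rescale values to mean zero and variance one (population variance assumed)."""
    n = len(values)
    mu = sum(values) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    if sd == 0:
        return [0.0] * n
    return [(v - mu) / sd for v in values]


# Hypothetical F-measures for one TF under three feature sets (SM, CS, DS).
scores = [f_measure(60, 40, 40), f_measure(70, 30, 30), f_measure(80, 20, 20)]
print(standardize(scores))
```

    <p>Standardizing within each TF puts every row of the heat map on the same scale, so the colors show which feature set works best for that TF rather than which TFs are easiest to predict.</p>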

    Performance improvement in binding region prediction models by incorporating chromatin state (CS) and DNA structure (DS) features.

    <p>(<i>A</i>,<i>C</i>) The relationship between binding region prediction performance of models using sequence motif (SM) only and SM+CS+DS for each TF when contrasting <i>(A)</i> bound and unbound regions of a TF and <i>(C)</i> regions bound by one TF compared to regions bound by the other TFs. The triangle indicates the average performance. The line indicates the 1-to-1 relationship. (<i>B</i>,<i>D</i>) The relationship between the improvement in F-measure when incorporating CS and DS and the F-measures of random forest classifications using SM only when contrasting <i>(B)</i> bound and unbound regions of a TF and <i>(D)</i> regions bound by one TF compared to regions bound by the other TFs.</p>

    The performances of cross-DBD validations based on predictions using <i>in silico</i> predicted nucleosome occupancy, DNA major groove geometry, and dinucleotide free energy.

    <p>The five DBD families examined were helix-turn-helix (HTH, 6130 sites), zinc finger (ZF, 8372 sites), leucine zipper (LZ, 3560 sites), winged helix (WH, 1070 sites), and helix-loop-helix (HLH, 2944 sites). Each value in the heat map is the F-measure of a model trained with the dataset of DBD<sub>x</sub> family member binding regions to predict the test dataset consisting of binding regions of TFs with DBD<sub>y</sub>. The F-measures on the diagonal are obtained by 10-fold cross-validation.</p>