8 research outputs found

The relative importance of features for predicting binding regions of different TFs.

Author: Huai-Kuang Tsai (35999)
Shin-Han Shiu (24396)
Zing Tsung-Yeh Tsai (785338)

Importance was defined as the decrease in accuracy after dropping a feature. The accuracy range was normalized to [0, 1] for each TF, where 0 is blue and 1 is red. The TFs were grouped into three classes as shown in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#pcbi.1004418.g003" target="_blank">Fig 3</a>. Arrowheads indicate the most important features for predicting binding regions for most TFs.
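The importance definition in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the per-TF normalization is a min-max rescaling, and the function name, feature names, and accuracy values are hypothetical.

```python
def drop_feature_importance(full_accuracy, accuracy_without):
    """Return per-feature importance rescaled to [0, 1] for one TF.

    full_accuracy: accuracy of the model trained with all features.
    accuracy_without: dict mapping feature name -> accuracy after
        dropping that feature and retraining.
    """
    # Importance = decrease in accuracy when the feature is dropped.
    raw = {name: full_accuracy - acc for name, acc in accuracy_without.items()}
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    if span == 0:
        # All features have the same importance; map them all to 0.
        return {name: 0.0 for name in raw}
    # Assumption: min-max rescaling is how the [0, 1] range is obtained.
    return {name: (value - lo) / span for name, value in raw.items()}


# Hypothetical accuracies for a single TF (illustrative numbers only).
scores = drop_feature_importance(
    0.90, {"motif": 0.88, "H3": 0.80, "PC1": 0.85}
)
```

With this rescaling, the feature whose removal hurts accuracy most maps to 1 (red in the heat map) and the one whose removal hurts least maps to 0 (blue).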

The Francis Crick Institute

Expression patterns of genes with accessible motif occurrences are more highly correlated than those with inaccessible motifs.

Author: Huai-Kuang Tsai (35999)
Shin-Han Shiu (24396)
Zing Tsung-Yeh Tsai (785338)

1. The number of time points in the time-series gene expression experiment.
2. The significance of the difference between two correlation coefficient distributions (within-group pairwise correlations of bound and unbound sets) by one-sided KS test.
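A one-sided two-sample KS comparison of two correlation distributions might look like the sketch below. It is written in plain Python so that it is self-contained. The p-value uses the standard large-sample approximation exp(-2·nm/(n+m)·D²), not an exact computation, and the direction of the test (first sample shifted toward larger values) is an assumption.

```python
import math
from bisect import bisect_right


def ks_one_sided(sample_a, sample_b):
    """One-sided two-sample KS statistic and asymptotic p-value.

    Tests whether values in sample_a tend to be larger than those in
    sample_b, i.e. whether the ECDF of sample_a lies below that of
    sample_b. Returns (d_plus, p_value).
    """
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    d_plus = 0.0
    # Both ECDFs are step functions, so checking the data points suffices.
    for x in a + b:
        ecdf_a = bisect_right(a, x) / n
        ecdf_b = bisect_right(b, x) / m
        d_plus = max(d_plus, ecdf_b - ecdf_a)
    # Large-sample approximation; inaccurate for very small samples.
    p_value = math.exp(-2.0 * (n * m / (n + m)) * d_plus ** 2)
    return d_plus, p_value


# Hypothetical pairwise correlations (illustrative numbers only).
bound = [0.6, 0.7, 0.8, 0.9]
unbound = [0.1, 0.2, 0.3, 0.4]
d, p = ks_one_sided(bound, unbound)
```

In practice a library routine such as SciPy's `ks_2samp` with its `alternative` argument would normally be preferred over a hand-rolled version.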

The Francis Crick Institute

The 23 features used in this study.

Author: Huai-Kuang Tsai (35999)
Shin-Han Shiu (24396)
Zing Tsung-Yeh Tsai (785338)

1. Values of the two sequence motif features were generated by scanning sequences with Position Weight Matrices from ScerTF [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#pcbi.1004418.ref043" target="_blank">43</a>].
2. p bound or unbound genomic region for a TF in question based on ChIP data
3. The 11 chromatin state features (CS) were obtained from Pokholok et al. [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#pcbi.1004418.ref044" target="_blank">44</a>]. The values for H3, ESA1, and GCN5 are averaged over the analyzed genomic region.
4. The 10 DNA structure features (DS) were generated from principal component analysis (PCA, see <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#sec009" target="_blank">Methods</a>) on 125 DNA structure properties from DiProDB [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#pcbi.1004418.ref041" target="_blank">41</a>]. PC1-5: the average of principal component values over each target genomic region. ΔPC1-5: the difference between the average value of a particular principal component over a target genomic region and the average value of the same principal component over regions flanking the target.
5. The biological meaning was interpreted from the top 10 dinucleotide properties with the highest PCA loading coefficients.
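The PC and ΔPC definitions in footnote 4 can be sketched as follows, assuming a per-position track of principal component values is already available. The function name, the `flank` width parameter, and the example values are hypothetical; the listing does not state the flank size used in the study.

```python
def pc_features(track, start, end, flank):
    """Compute (PC, delta_PC) for one target region.

    track: list of per-position values of one principal component.
    start, end: target region as a half-open interval [start, end).
    flank: number of positions taken on each side of the target
        (hypothetical parameter; the flank size is an assumption).
    """
    target = track[start:end]
    # Flanking regions on both sides, clipped at the sequence ends.
    flanking = track[max(0, start - flank):start] + track[end:end + flank]
    if not target or not flanking:
        raise ValueError("target and flanking regions must be non-empty")
    pc = sum(target) / len(target)
    # delta_PC = target average minus flanking average.
    delta_pc = pc - sum(flanking) / len(flanking)
    return pc, delta_pc


# Illustrative values only.
pc, delta_pc = pc_features([0.0, 0.0, 2.0, 4.0, 1.0, 1.0], 2, 4, 2)
```

A positive ΔPC then means the structural property is elevated in the target region compared with its immediate surroundings.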

The Francis Crick Institute

Contribution of Sequence Motif, Chromatin State, and DNA Structure Features to Predictive Models of Transcription Factor Binding in Yeast

Author: Huai-Kuang Tsai (35999)
Shin-Han Shiu (24396)
Zing Tsung-Yeh Tsai (785338)
Publication date: 01/08/2015

Transcription factor (TF) binding is determined by the presence of specific sequence motifs (SM) and chromatin accessibility, where the latter is influenced by both chromatin state (CS) and DNA structure (DS) properties. Although SM, CS, and DS have been used to predict TF binding sites, a predictive model that jointly considers CS and DS has not been developed to predict either TF-specific binding or general binding properties of TFs. Using budding yeast as a model, we found that machine learning classifiers trained with either CS or DS features alone perform better in predicting TF-specific binding compared to SM-based classifiers. In addition, simultaneously considering CS and DS further improves the accuracy of the TF binding predictions, indicating the highly complementary nature of these two properties. The contributions of SM, CS, and DS features to binding site predictions differ greatly between TFs, allowing TF-specific predictions and potentially reflecting different TF binding mechanisms. In addition, a "TF-agnostic" predictive model based on three DNA “intrinsic properties” (in silico predicted nucleosome occupancy, major groove geometry, and dinucleotide free energy) that can be calculated from genomic sequences alone has performance that rivals the model incorporating experiment-derived data. This intrinsic property model allows prediction of binding regions not only across TFs, but also across DNA-binding domain families with distinct structural folds. Furthermore, these predicted binding regions can help identify TF binding sites that have a significant impact on target gene expression. Because the intrinsic property model allows prediction of binding regions across DNA-binding domain families, it is TF agnostic and likely describes general binding potential of TFs. Thus, our findings suggest that it is feasible to establish a TF agnostic model for identifying functional regulatory regions in potentially any sequenced genome.

Directory of Open Access Journals

PubMed Central

The Francis Crick Institute

Evaluation of features distinguishing between bound and unbound regions and between regions bound by a single TF compared to the other TFs.

Author: Huai-Kuang Tsai (35999)
Shin-Han Shiu (24396)
Zing Tsung-Yeh Tsai (785338)

(A,B) The p-values (color scale shown, adjusted by false discovery rate control for multiple testing) from two-sided Wilcoxon rank sum tests of differences in feature values (A) between bound and unbound regions of all the 40 analyzed TFs jointly (ALL) and separately, and (B) between bound regions of a single vs. the remaining TFs. The p-values for (A) and (B) are shown in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#pcbi.1004418.s001" target="_blank">S1 Fig</a> and <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004418#pcbi.1004418.s002" target="_blank">S2 Fig</a>, respectively. (C,D) The value distributions of the 23 features for regions bound (black) and not bound (white) by (C) RAP1 and (D) ZAP1, respectively. The values were normalized into [0, 1] for each feature. The p-values of two-tailed Wilcoxon rank sum tests are shown below the boxplots: red, p < 10^-3; white, p = 10^-3; blue, p > 10^-3.
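The caption mentions false discovery rate control without naming the procedure. The sketch below assumes the Benjamini-Hochberg step-up adjustment, which is a common choice but not confirmed by this listing. The input p-values are illustrative only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is the FDR procedure; the source does not say which
    one was used.
    """
    m = len(p_values)
    # Indices of p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from per-feature rank sum tests.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each adjusted value can then be compared with the 10^-3 threshold used for the color coding in panels C and D.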

The Francis Crick Institute

Contribution of SM, CS, and DS features to overall and individual TF binding region prediction.

Author: Huai-Kuang Tsai (35999)
Shin-Han Shiu (24396)
Zing Tsung-Yeh Tsai (785338)

(A) The F-measure distributions of random forest classifications with different individual features or combinations of features. The y-axis indicates the probability of a given F-measure. The arrowheads indicate the average F-measures. (B) The relationship between F-measures of binding predictions based on CS features only and DS features only. The dotted line shows the 1-to-1 relationship. Points in green (yellow) represent the TFs in which performance is better in the DS-only (CS-only) model. (C) Heat map showing the relative performance (i.e., F-measures standardized to mean zero and variance one) in predicting the binding regions of each TF using individual features or combinations of features. The TFs are grouped into three classes: TFs with binding regions that can be predicted by either CS or DS (Group 1); TFs with binding regions that cannot be predicted well with only CS (Group 2) or only DS (Group 3) features.
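The two quantities in this caption, the F-measure and its standardized form in panel C, can be sketched as below. The sketch assumes the balanced F-measure (F1, the harmonic mean of precision and recall) and the population variance (dividing by n) for standardization. Neither detail is stated in the listing, and the counts and scores are hypothetical.

```python
import math


def f_measure(tp, fp, fn):
    """Balanced F-measure (F1) from true/false positive and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def standardize(values):
    """Rescale values to mean zero and variance one (population variance)."""
    n = len(values)
    mu = sum(values) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    if sd == 0:
        # Identical values carry no relative information.
        return [0.0] * n
    return [(v - mu) / sd for v in values]


# Hypothetical F-measures of one TF under three feature sets.
relative = standardize([0.6, 0.7, 0.8])
```

Standardizing within each TF makes the rows of the heat map comparable even when the absolute F-measures differ widely between TFs.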

The Francis Crick Institute

Performance improvement in binding region prediction models by incorporating chromatin state (CS) and DNA structure (DS) features.

Author: Huai-Kuang Tsai (35999)
Shin-Han Shiu (24396)
Zing Tsung-Yeh Tsai (785338)

(A,C) The relationship between binding region prediction performance of models using sequence motif (SM) only and SM+CS+DS for each TF when contrasting (A) bound and unbound regions of a TF and (C) regions bound by one TF compared to regions bound by the other TFs. The triangle indicates the average performance. The line indicates the 1-to-1 relationship. (B,D) The relationship between the improvement in F-measure when incorporating CS and DS and the F-measures of random forest classifications using SM-only when contrasting (B) bound and unbound regions of a TF and (D) regions bound by one TF compared to regions bound by the other TFs.

The Francis Crick Institute

The performances of cross-DBD validations based on predictions using in silico predicted nucleosome occupancy, DNA major groove geometry, and dinucleotide free energy.

Author: Huai-Kuang Tsai (35999)
Shin-Han Shiu (24396)
Zing Tsung-Yeh Tsai (785338)

The five DBD families examined were helix-turn-helix (HTH, 6130 sites), zinc finger (ZF, 8372 sites), leucine zipper (LZ, 3560 sites), winged helix (WH, 1070 sites), and helix-loop-helix (HLH, 2944 sites). Each value in the heat map is the F-measure of a model trained with the dataset of DBDx family member binding regions to predict the test dataset consisting of binding regions of TFs with DBDy. The F-measures on the diagonal are obtained by 10-fold cross-validation.
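The layout of the cross-DBD heat map can be sketched as a generic loop over family pairs. The classifier and scoring are deliberately left as caller-supplied functions, so `train`, `evaluate`, and `cross_validate` are hypothetical placeholders and not part of any specific library or of the authors' pipeline.

```python
def cross_family_matrix(datasets, train, evaluate, cross_validate):
    """Build a {(train_family, test_family): score} matrix.

    datasets: dict mapping family name -> dataset of binding regions.
    train(data) -> model; evaluate(model, data) -> F-measure;
    cross_validate(data) -> F-measure from within-family cross-validation.
    All three callables are supplied by the caller (hypothetical here).
    """
    matrix = {}
    for train_family, train_data in datasets.items():
        # Train once per family and reuse the model for every other family.
        model = train(train_data)
        for test_family, test_data in datasets.items():
            if train_family == test_family:
                # Diagonal: within-family cross-validation, e.g. 10-fold.
                score = cross_validate(train_data)
            else:
                score = evaluate(model, test_data)
            matrix[(train_family, test_family)] = score
    return matrix
```

Off-diagonal entries measure how well a model trained on one DBD family transfers to another, which is the property the heat map is meant to show.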

The Francis Crick Institute