CP Viability Prediction Performance of Various Procedures.
a<p>Random forest was applied in this experiment to assess the predictive power of closeness and farness.</p>b<p>The four machine learning methods (HI, ANN, RF and SVM) were combined by averaging their probability scores into a single score. See the main text for details.</p>c<p>These results were obtained with 10-fold cross-validation.</p>
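Footnote b describes averaging the probability scores of the four methods into a single score. A minimal sketch of that averaging step follows; the function name `combine_scores` and the per-residue values are hypothetical and are not taken from the paper.

```python
from statistics import mean


def combine_scores(scores_by_method):
    """Average per-residue probability scores across methods.

    scores_by_method maps a method name to a list of probability scores,
    one per residue, with all lists in the same residue order.
    """
    lengths = {len(v) for v in scores_by_method.values()}
    if len(lengths) != 1:
        raise ValueError("every method must score the same residues")
    n_residues = lengths.pop()
    return [
        mean(scores[i] for scores in scores_by_method.values())
        for i in range(n_residues)
    ]


# Hypothetical scores for three residues (illustrative values only).
scores = {
    "HI": [0.9, 0.2, 0.6],
    "ANN": [0.8, 0.4, 0.5],
    "RF": [0.7, 0.1, 0.4],
    "SVM": [0.6, 0.3, 0.5],
}
combined = combine_scores(scores)
```

An unweighted mean treats the four methods equally, which is how the footnote reads; any weighting scheme would be a separate assumption.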
Probability Scores of DHFR.
<p>The structure of the dihydrofolate reductase from <i>Escherichia coli</i> (PDB entry: 1RX4) is shown as a cross-eye stereo image, in which the backbone thickness at each residue is proportional to the probability score computed for that residue by our prediction system. The probability scores are also color-coded: a color closer to red represents a higher score, and gray- to black-colored residues have scores increasingly lower than 0.5. Among the 67 residues with probability scores ≥0.5, only 6 are inviable CP sites (shown in blue). The other 61 residues are experimentally verified viable CP sites <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0031791#pone.0031791-Iwakura1" target="_blank">[29]</a>. Thus, at a probability score threshold of 0.5, the precision of the developed prediction system for this independent evaluation dataset is 90% (61/67).</p>
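The precision quoted above is the fraction of residues scoring at or above the threshold that are truly viable. A small sketch of that calculation follows, using the counts given in the caption (61 viable and 6 inviable sites among 67 predicted positives); the helper name `precision_at_threshold` and the toy score list are hypothetical. The ratio 61/67 evaluates to roughly 0.91, which the caption reports as 90%.

```python
def precision_at_threshold(scores, labels, threshold=0.5):
    """Precision = true positives / all residues predicted positive.

    scores: probability score per residue.
    labels: 1 for an experimentally viable CP site, 0 for an inviable one.
    """
    predicted_positive = [lab for s, lab in zip(scores, labels) if s >= threshold]
    if not predicted_positive:
        raise ValueError("no residue reaches the threshold")
    return sum(predicted_positive) / len(predicted_positive)


# Counts stated in the caption: 67 residues score >= 0.5, 61 of them viable.
true_positives, false_positives = 61, 6
precision = true_positives / (true_positives + false_positives)

# Hypothetical per-residue data, only to show the helper in use.
toy_precision = precision_at_threshold(
    scores=[0.9, 0.7, 0.6, 0.4, 0.2],
    labels=[1, 1, 0, 1, 0],
)
```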
Performance of Predictions for Dataset T at Various Decision Thresholds of the Probability Score.
Distributions and ROC Curves of Propensity Scores.
<p>Here, a propensity score is calculated as the relative propensity of a pattern between the background and viable CP sites, weighted by 1 – <i>p</i>-value (see Formula 1). A high relative propensity and a small <i>p</i>-value give a high score. A score of zero means either that the frequencies of the pattern in the background and in viable CP sites do not differ appreciably or that the difference is statistically insignificant. The plots show the distributions of several propensity scores for the viable (red bars) and inviable (blue bars) CP sites of Dataset L, together with their ROC curves. Plots (<b>a</b>)–(<b>c</b>) show sequence-based propensity scores and plots (<b>d</b>)–(<b>f</b>) show secondary structure-based propensity scores. The distributions of the sequence-based scores differ little between viable and inviable CP sites, and their AUCs are only ∼0.6. The distributions of the secondary structure-based scores differ more clearly between viable and inviable CP sites, so their AUCs are higher than those of the sequence-based scores. In each plot, the lower <i>x</i> axis indicates the propensity score and the left <i>y</i> axis indicates the frequency, <i>i.e.</i>, the proportion of residues falling into each score group. The upper <i>x</i> axis and right <i>y</i> axis give the false positive rate and true positive rate, respectively, for the ROC curve.</p>
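The link between how well two score distributions separate and the resulting AUC can be made concrete: the AUC equals the probability that a randomly chosen viable site scores higher than a randomly chosen inviable one, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name `auc` and the score lists are hypothetical and do not come from Dataset L.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical propensity scores (illustrative values only).
viable = [0.8, 0.4, 0.3]
inviable = [0.5, 0.2, 0.1]
separated_auc = auc(viable, inviable)

# Identical distributions give an AUC of 0.5, i.e. no discrimination.
overlapping_auc = auc([1, 2], [1, 2])
```

The pairwise form is quadratic in the number of residues, which is acceptable for a sketch; a rank-based computation gives the same value on larger datasets.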
Classification Tree of the 46 Selected Features.
<p>These features were selected for their ability to discriminate between viable and inviable CPs in Dataset T. Redundant features (correlation coefficient >0.7) were screened out. The features were grouped manually according to the similarity of their biological meaning, so that the hierarchical feature integration procedure developed in this work could be applied. The number following each feature abbreviation is the weight of that feature in the hierarchical integration procedure. These weights were determined with the training Dataset T by exhaustive performance screening (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0031791#s3" target="_blank"><b>Materials and Methods</b></a>). <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0031791#pone.0031791.s013" target="_blank">Table S2</a> lists the complete meanings of the features abbreviated here.</p>
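One common way to screen out redundant features with a correlation cutoff is a greedy pass: visit features from best to worst and keep a feature only if it does not correlate too strongly with any feature already kept. The sketch below illustrates that idea under stated assumptions; the caption does not say which procedure the authors used, and the greedy order, the use of the absolute Pearson coefficient, the names `pearson` and `drop_redundant`, and the toy feature values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant feature has no defined correlation
    return cov / (norm_x * norm_y)


def drop_redundant(features, ranked_names, cutoff=0.7):
    """Keep features whose |r| with every kept feature is <= cutoff.

    ranked_names is assumed to be ordered from most to least
    discriminative, so the better feature of a correlated pair survives.
    """
    kept = []
    for name in ranked_names:
        if all(abs(pearson(features[name], features[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature values for five residues.
features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10.5],  # nearly proportional to f1
    "f3": [3, 1, 4, 1, 5],
}
selected = drop_redundant(features, ranked_names=["f1", "f2", "f3"])
```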
Binary Classification Performance of Several Propensity Scores and Tertiary Structure-derived Residue Measures.
a<p>The values of these measures are all presented here in the format: mean ± standard deviation. A plus (+) after an abbreviation for certain measures indicates that hydrogen atoms were restored/added before those measures were calculated. See <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0031791#pone-0031791-g002" target="_blank">Figures 2</a> and <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0031791#pone-0031791-g003" target="_blank">3</a> for the meaning of abbreviations used for these measures.</p>b<p>For convenience, the optimal decision threshold of a score was determined as the score value corresponding to the point nearest to point (0,1) on the ROC curve <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0031791#pone.0031791-vanErkel1" target="_blank">[98]</a>. The sensitivity, specificity and MCC were obtained at the listed decision thresholds.</p>
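Footnote b picks the decision threshold whose ROC point lies nearest to (0,1), the corner where the false positive rate is 0 and the true positive rate is 1. A minimal sketch of that selection rule follows; the function names and the scores and labels are hypothetical, and Euclidean distance is assumed as the measure of nearness.

```python
from math import hypot


def roc_points(scores, labels):
    """Return (threshold, false positive rate, true positive rate) tuples.

    A residue is predicted positive when its score is >= the threshold.
    labels: 1 for viable, 0 for inviable.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, lab in zip(scores, labels) if s >= t and lab == 1)
        fp = sum(1 for s, lab in zip(scores, labels) if s >= t and lab == 0)
        points.append((t, fp / n_neg, tp / n_pos))
    return points


def nearest_to_corner(scores, labels):
    """Pick the ROC point with the smallest Euclidean distance to (0, 1)."""
    return min(roc_points(scores, labels), key=lambda p: hypot(p[1], 1.0 - p[2]))


# Hypothetical scores and labels (illustrative values only).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
threshold, fpr, tpr = nearest_to_corner(scores, labels)
sensitivity, specificity = tpr, 1.0 - fpr
```

With these toy values the chosen threshold is 0.6, where every viable site is recovered at the cost of one false positive out of six inviable sites.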