In Silico Target Predictions: Defining a Benchmarking
Data Set and Comparison of Performance of the Multiclass Naïve
Bayes and Parzen-Rosenblatt Window
Abstract
In this study, two probabilistic machine-learning algorithms were compared
for in silico target prediction of bioactive molecules, namely the
well-established Laplacian-modified Naïve Bayes classifier
(NB) and the more recently introduced (to cheminformatics) Parzen-Rosenblatt
Window. Both classifiers were trained in conjunction with circular
fingerprints on a large data set of bioactive compounds extracted
from ChEMBL, covering 894 human protein targets with more than 155,000
ligand-protein pairs. This data set is also provided as a benchmark
data set for future target prediction methods due to its size as well
as the number of bioactivity classes it contains. In addition to evaluating
the methods, different performance measures were explored. This is
not as straightforward as in binary classification settings, due to
the number of classes, the possibility of multiple class memberships,
and the need to translate model scores into “yes/no”
predictions for assessing model performance. Both algorithms achieved
a recall of correct targets that exceeds 80% in the top 1% of predictions.
Performance depends significantly on the underlying diversity and
size of a given class of bioactive compounds, with small classes and
low structural similarity affecting both algorithms to different degrees.
When tested on an external test set extracted from WOMBAT, which covers
more than 500 targets and excludes all compounds with Tanimoto similarity
above 0.8 to compounds from the ChEMBL data set, the current methodologies
achieved a recall of 63.3% and 66.6% among the top 1% for Naïve
Bayes and Parzen-Rosenblatt Window, respectively. While those numbers
seem to indicate lower performance, they are also more realistic for
settings where protein targets need to be established for novel chemical
substances.
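
The "recall among the top 1%" measure reported above can be illustrated with a minimal sketch. The abstract does not spell out the exact procedure, so the following rests on assumptions: targets are ranked per compound by model score, the cutoff is the top 1% of targets rounded up (9 of 894), and recall is pooled over all known compound-target pairs. The function name and data layout are hypothetical and not taken from the study.

```python
import math

def recall_in_top_percent(scores, known_targets, top_percent=1):
    """Pooled recall of known targets within the top-ranked share of targets.

    scores:        one dict per compound, mapping target id -> model score
    known_targets: one set per compound, holding its known target ids
    top_percent:   share of the ranked target list counted as a "yes"
                   prediction (assumed cutoff rule, rounded up)
    """
    hits = 0
    total = 0
    for compound_scores, known in zip(scores, known_targets):
        # Number of top-ranked targets treated as positive predictions.
        k = max(1, math.ceil(len(compound_scores) * top_percent / 100))
        # Rank targets from highest to lowest score.
        ranked = sorted(compound_scores, key=compound_scores.get, reverse=True)
        top = set(ranked[:k])
        # A compound may belong to several classes, so count every known target.
        hits += len(known & top)
        total += len(known)
    return hits / total if total else 0.0

# Toy example with 200 hypothetical targets, so the top 1% is 2 targets.
targets = [f"T{i}" for i in range(200)]
scores = [
    {t: 200 - i for i, t in enumerate(targets)},  # ranks T0, T1 first
    {t: i for i, t in enumerate(targets)},        # ranks T199, T198 first
]
known_targets = [{"T0", "T5"}, {"T198"}]
print(recall_in_top_percent(scores, known_targets))
```

Pooling over compound-target pairs is only one possible design choice; averaging recall per compound or per target class would weight small classes differently, which matters given the class-size effects noted above.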