Benchmarking Study of Parameter Variation When Using Signature Fingerprints Together with Support Vector Machines
Abstract
QSAR modeling using
molecular signatures and support vector machines
with a radial basis function kernel is increasingly used for virtual screening
in the drug discovery field. This method has three free parameters: <i>C</i>, γ, and signature height. <i>C</i> is
a penalty parameter that limits overfitting, γ controls the
width of the radial basis function kernel, and the signature height
determines how much of the molecule is described by each atom signature.
Determination of optimal values for these parameters is time-consuming.
Good default values could therefore save considerable computational
cost. The goal of this project was to investigate whether such default
values could be found by using seven public QSAR data sets spanning
a wide range of end points and using both a bit version and a count
version of the molecular signatures. On the basis of the experiments
performed, we recommend a parameter set of heights 0 to 2 for the
count version of the signature fingerprints and heights 0 to 3 for
the bit version. These are in combination with a support vector machine
using <i>C</i> in the range of 1 to 100 and γ in the
range of 0.001 to 0.1. When data sets are small or longer run times
are acceptable, adding height 3 to the count fingerprint and using a
wider grid search may be worth considering. However,
marked improvements should not be expected.
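
To make the recommended search space concrete, the following is a minimal sketch rather than the authors' implementation. It enumerates a grid over <i>C</i> and γ within the recommended ranges and shows how γ scales the radial basis function kernel for count and bit fingerprints. The power-of-ten grid spacing, the helper names, and the two toy fingerprints (`mol_a`, `mol_b`, with made-up signature keys) are assumptions for illustration only.

```python
import math
from itertools import product

# Ranges taken from the abstract; the power-of-ten spacing is an assumption.
C_VALUES = [1, 10, 100]
GAMMA_VALUES = [0.001, 0.01, 0.1]


def parameter_grid(c_values=C_VALUES, gamma_values=GAMMA_VALUES):
    """Return every (C, gamma) combination as a list of dicts."""
    return [{"C": c, "gamma": g} for c, g in product(c_values, gamma_values)]


def to_bits(counts):
    """Convert a sparse count fingerprint to its bit (presence/absence) version."""
    return {key: 1 for key, n in counts.items() if n > 0}


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) for sparse dict fingerprints."""
    keys = set(x) | set(y)
    squared_distance = sum((x.get(k, 0) - y.get(k, 0)) ** 2 for k in keys)
    return math.exp(-gamma * squared_distance)


# Hypothetical sparse count fingerprints: signature key -> occurrence count.
mol_a = {"sig_1": 3, "sig_2": 1}
mol_b = {"sig_1": 1, "sig_3": 2}

grid = parameter_grid()
print(f"{len(grid)} (C, gamma) combinations to evaluate")

for gamma in GAMMA_VALUES:
    k_count = rbf_kernel(mol_a, mol_b, gamma)
    k_bit = rbf_kernel(to_bits(mol_a), to_bits(mol_b), gamma)
    print(f"gamma={gamma}: count kernel={k_count:.4f}, bit kernel={k_bit:.4f}")
```

Because count fingerprints generally lie farther apart than their bit versions, the same γ produces a smaller kernel value for counts. In practice each grid point would be scored by cross-validation with an SVM library, a step this sketch omits.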