Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling?
Prior to using a quantitative structure-activity relationship (QSAR)
model for external predictions, its predictive power should be established
and validated. In the absence of a true external data set, the best
way to validate the predictive ability of a model is to perform its
statistical external validation. In statistical external validation,
the overall data set is divided into training and test sets. Commonly,
this splitting is performed using random division. Rational splitting
methods can divide data sets into training and test sets in an intelligent
fashion. The purpose of this study was to determine whether rational
division methods lead to more predictive models compared to random
division. A special data splitting procedure was used to facilitate
the comparison between random and rational division methods. For each
toxicity end point, the overall data set was divided into a modeling
set (80% of the overall set) and an external evaluation set (20% of
the overall set) using random division. The modeling set was then
subdivided into a training set (80% of the modeling set) and a test
set (20% of the modeling set) using rational division methods and
by using random division. The Kennard-Stone, minimal test set dissimilarity,
and sphere exclusion algorithms were used as the rational division
methods. The hierarchical clustering, random forest, and k-nearest neighbor (kNN) methods were used to develop
QSAR models based on the training sets. For kNN QSAR,
multiple training and test sets were generated, and multiple QSAR
models were built. The results of this study indicate that models
based on rational division methods generate better statistical results
for the test sets than models based on random division, but the predictive
power of both types of models is comparable.
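
The two-level splitting procedure described above can be illustrated with a minimal sketch. It is not the study's implementation: the synthetic descriptor matrix, the Euclidean distance metric, the function names `random_split` and `kennard_stone_split`, and the choice to treat the Kennard-Stone selections as the training set are all assumptions made for illustration. The sketch first draws a random 80/20 modeling/evaluation split, then subdivides the modeling set 80/20 both randomly and with a common max-min formulation of the Kennard-Stone algorithm.

```python
import numpy as np


def random_split(n_samples, test_fraction, rng):
    """Randomly split indices 0..n_samples-1 into (train_idx, test_idx)."""
    perm = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    return np.sort(perm[n_test:]), np.sort(perm[:n_test])


def kennard_stone_split(X, test_fraction):
    """Max-min Kennard-Stone selection (assumed Euclidean distance).

    Selected points form the training set; the remainder is the test set.
    """
    n = X.shape[0]
    n_train = n - int(round(test_fraction * n))
    if n_train < 2:
        raise ValueError("Kennard-Stone needs at least two training points")

    # Full pairwise Euclidean distance matrix (fine for small data sets).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Seed the selection with the two most distant points.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]

    # For every point, track the distance to its nearest selected point.
    min_dist = np.minimum(dist[i], dist[j])
    min_dist[selected] = -1.0  # mark selected points so they are never re-picked

    while len(selected) < n_train:
        # Add the point that is farthest from everything selected so far.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
        min_dist[nxt] = -1.0

    train_idx = np.sort(np.array(selected))
    test_idx = np.setdiff1d(np.arange(n), train_idx)
    return train_idx, test_idx


# Hypothetical data: 100 compounds described by 5 synthetic descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Level 1: random 80/20 split into modeling and external evaluation sets.
model_idx, eval_idx = random_split(X.shape[0], 0.2, rng)
X_model = X[model_idx]

# Level 2: split the modeling set 80/20, randomly and by Kennard-Stone.
# The returned indices refer to rows of X_model, not of X.
train_rand, test_rand = random_split(X_model.shape[0], 0.2, rng)
train_ks, test_ks = kennard_stone_split(X_model, 0.2)
```

Because the external evaluation set is drawn randomly before any rational method is applied, it is the same for both kinds of model, which is what allows their predictive power to be compared on equal footing.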