Virtual Screening of Bioassay Data
Background: There are three main problems associated with the virtual screening of bioassay
data. The first is access to freely available curated data, the second is the number of false positives
that occur in the physical primary screening process, and the third is that the data is highly imbalanced,
with a low ratio of Active compounds to Inactive compounds. This paper first discusses these three
problems and then applies a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and
Random Forest) to a variety of bioassay datasets.
Results: Pharmaceutical bioassay data is not readily available to the academic community. The data
held at PubChem is not curated and there is a lack of detailed cross-referencing between Primary
and Confirmatory screening assays. With regard to the number of false positives that occur in the
primary screening process, the analysis carried out has been limited by the lack of cross-referencing
mentioned above. In the six cases found, the average percentage of false positives from the
high-throughput primary screen is high, at 64%. For the cost-sensitive classification, Weka's
implementations of the Support Vector Machine and C4.5 decision tree learner performed
relatively well. It was also found that the appropriate setting of the Weka cost matrix depends on the
base classifier used and not solely on the ratio of class imbalance.
Conclusions: Understandably, pharmaceutical data is hard to obtain. However, it would be
beneficial to both the pharmaceutical industry and to academics for curated primary screening and
corresponding confirmatory data to be provided. Applying virtual screening techniques to bioassay
data could bring two benefits: first, reducing the search space of compounds to be screened, and
second, improving the screening technology by analysing the false positives that occur in the primary
screening process. The number of false positives arising from primary screening raises the question
of whether this type of data should be used for virtual screening. Care is needed when using
Weka's cost-sensitive classifiers: across-the-board misclassification costs based on class
ratios should not be used when comparing different classifiers on the same dataset.
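
The point about cost matrices can be illustrated with a minimal sketch of minimum-expected-cost prediction, one common way of making a classifier cost-sensitive. This is not Weka's implementation and is not taken from the paper: the function name, the cost values and the two sets of class probabilities are hypothetical, chosen only to show that the same cost matrix can lead to different decisions when two base classifiers produce differently scaled probability estimates for the same compound. The abstract does not state the mechanism behind its finding, so this is one plausible reading only.

```python
from typing import Sequence


def min_expected_cost_class(
    probs: Sequence[float], cost: Sequence[Sequence[float]]
) -> int:
    """Return the class index with the lowest expected misclassification cost.

    probs[i]   : estimated probability that the true class is i
    cost[i][j] : cost of predicting class j when the true class is i
    """
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=expected.__getitem__)


# Class indices (assumption for this sketch): 0 = Inactive, 1 = Active.
INACTIVE, ACTIVE = 0, 1

# Hypothetical cost matrices: equal costs versus a 20x penalty for a missed Active.
uniform_cost = [[0, 1], [1, 0]]
weighted_cost = [[0, 1], [20, 0]]

# Hypothetical probability estimates for the same compound from two base classifiers.
probs_classifier_a = [0.90, 0.10]
probs_classifier_b = [0.97, 0.03]

print(min_expected_cost_class(probs_classifier_a, uniform_cost))   # Inactive
print(min_expected_cost_class(probs_classifier_a, weighted_cost))  # Active
print(min_expected_cost_class(probs_classifier_b, weighted_cost))  # Inactive
```

Under these assumed numbers, the weighted matrix flips the decision for classifier A but not for classifier B, which is why a single cost setting derived from the class ratio alone may not be comparable across base classifiers.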
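Support-Vektor-Maschinen und ihre Anwendung auf Datensätze aus der pharmazeutischen Forschung [Support Vector Machines and Their Application to Data Sets from Pharmaceutical Research]
Dreistufig parallele Software zur Parameteroptimierung von Support-Vektor-Maschinen mit kostensensitiven Gütemaßen [Three-Level Parallel Software for Parameter Optimization of Support Vector Machines with Cost-Sensitive Quality Measures]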
Analysis of Support Vector Machine Training Costs for Large and Unbalanced Data from Pharmaceutical Industry
On the Advantages of Weighted L1-Norm Support Vector Learning for Unbalanced Binary Classification Problems
Parallel Tuning of Support Vector Machine Learning Parameters for Large and Unbalanced Data Sets
- …