
    Robust optimization of SVM hyperparameters in the classification of bioactive compounds

    Background: Support Vector Machine has become one of the most popular machine learning tools used in virtual screening campaigns aimed at finding new drug candidates. Although it can be extremely effective in finding new potentially active compounds, its application requires the optimization of the hyperparameters with which the assessment is run, particularly the C and γ values. This requirement in turn establishes the need for fast and effective approaches to the optimization procedure that provide the best predictive power of the constructed model. Results: In this study, we investigated Bayesian and random search optimization of Support Vector Machine hyperparameters for classifying bioactive compounds. The effectiveness of these strategies was compared with the most popular optimization procedures: grid search and heuristic choice. We demonstrated that Bayesian optimization not only provides better, more efficient classification but is also much faster; the number of iterations it required to reach optimal predictive performance was the lowest of all tested optimization methods. Moreover, for the Bayesian approach, the choice of parameters in subsequent iterations is directed and justified; therefore, the results it produces improve steadily, and the range of hyperparameters tested provides the best overall performance of the Support Vector Machine. Additionally, we showed that random search optimization of hyperparameters leads to significantly better performance than grid search and heuristic-based approaches. Conclusions: The Bayesian approach to the optimization of Support Vector Machine parameters was demonstrated to outperform other optimization methods for tasks concerned with the bioactivity assessment of chemical compounds. This strategy not only provides higher classification accuracy but is also much faster and more directed than the other approaches. Despite its simplicity, random search should be used as the second choice when applying the Bayesian approach is not feasible.
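    As a rough illustration of the strategies compared above, the sketch below pits random search against Bayesian optimization of the SVM C and γ values. It is not the paper's original code: scikit-learn and scikit-optimize (skopt) are assumed to be installed, and make_classification stands in for the compound fingerprint data.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from skopt import BayesSearchCV
from skopt.space import Real

# Placeholder data standing in for compound fingerprints and activity labels.
X, y = make_classification(n_samples=500, n_features=100, random_state=0)

# Random search: C and gamma are drawn log-uniformly under a fixed budget.
random_search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions={"C": loguniform(1e-2, 1e3),
                         "gamma": loguniform(1e-4, 1e1)},
    n_iter=25, cv=5, scoring="matthews_corrcoef", random_state=0,
).fit(X, y)

# Bayesian optimization: each new (C, gamma) pair is proposed by a surrogate
# model fitted to previous results, so the search is directed rather than blind.
bayes_search = BayesSearchCV(
    SVC(kernel="rbf"),
    search_spaces={"C": Real(1e-2, 1e3, prior="log-uniform"),
                   "gamma": Real(1e-4, 1e1, prior="log-uniform")},
    n_iter=25, cv=5, scoring="matthews_corrcoef", random_state=0,
).fit(X, y)

print("random search:", random_search.best_score_, random_search.best_params_)
print("bayesian     :", bayes_search.best_score_, bayes_search.best_params_)
```

    With an equal iteration budget, the Bayesian run typically reaches its best cross-validated score in fewer iterations, mirroring the efficiency argument made in the abstract.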

    The influence of negative training set size on machine learning-based virtual screening

    Background: The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods. Results: The impact of this rather neglected aspect of applying machine learning methods was examined for sets containing a fixed number of positive examples and a varying number of negative examples randomly selected from the ZINC database. An increase in the ratio of positive to negative training instances was found to greatly influence most of the investigated evaluation parameters of ML methods in simulated virtual screening experiments. In the majority of cases, substantial increases in precision and MCC were observed, in conjunction with some decreases in hit recall. The analysis of the dynamics of these variations allowed us to recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naïve Bayes, IBk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with the SMO or Random Forest algorithms. The Naïve Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set. Conclusions: In conclusion, the ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of a particular classifier. Moreover, the optimization of negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
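    A minimal sketch of the kind of experiment described, assuming scikit-learn is available: the number of negatives per active is scanned while precision, recall, and MCC are recorded. The random arrays merely stand in for MACCS fingerprints of actives and ZINC-derived inactives, so the printed numbers are not meaningful, only the procedure is.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef, precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
actives = rng.random((200, 166))     # stand-in for MACCS keys of actives
zinc_pool = rng.random((5000, 166))  # stand-in for random ZINC inactives

for ratio in (1, 2, 5, 10):          # negatives per active in the training data
    idx = rng.choice(len(zinc_pool), 200 * ratio, replace=False)
    X = np.vstack([actives, zinc_pool[idx]])
    y = np.array([1] * len(actives) + [0] * (200 * ratio))
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    pred = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
    print(f"1:{ratio}  precision={precision_score(y_te, pred, zero_division=0):.2f}  "
          f"recall={recall_score(y_te, pred):.2f}  "
          f"MCC={matthews_corrcoef(y_te, pred):.2f}")
```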

    The influence of the inactives subset generation on the performance of machine learning methods

    Background: The application of machine learning methods in virtual screening, in both classification and regression tasks, has grown in popularity in the past few years. However, their effectiveness is strongly dependent on many different factors. Results: In this study, the influence of the way the set of inactives is formed on the classification process was examined: random and diverse selection from the ZINC database, the MDDR database, and libraries generated according to the DUD methodology. All learning methods were tested in two modes: using one test set, the same for each method of inactive molecule generation, and using test sets with inactives prepared in a way analogous to the training sets. The experiments were carried out for 5 different protein targets, 3 fingerprints for molecule representation, and 7 classification algorithms with varying parameters. It appeared that the process of inactive set formation had a substantial impact on the performance of the machine learning methods. Conclusions: The level of chemical space limitation determined the ability of the tested classifiers to select potentially active molecules in virtual screening tasks; for example, DUDs (widely applied in docking experiments) did not provide proper selection of active molecules from databases with diverse structures. The study clearly showed that the inactive compounds forming the training set should be representative, to the highest possible extent, of the libraries that undergo screening.
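    Two of the inactive-set construction strategies mentioned above, random and diverse selection, could be sketched with RDKit as follows. The three SMILES are placeholders for a real ZINC pool, and MaxMin picking is used here as one common diversity-selection technique; the paper itself may have used a different picker.

```python
import random
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

zinc_smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # placeholder pool
mols = [Chem.MolFromSmiles(s) for s in zinc_smiles]
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [gen.GetFingerprint(m) for m in mols]

n_pick = 2
# Strategy 1: random selection of inactives from the pool.
random_inactives = random.sample(range(len(fps)), n_pick)

# Strategy 2: diverse selection; MaxMin picking maximizes pairwise distance
# between the chosen fingerprints, spreading picks across chemical space.
picker = MaxMinPicker()
diverse_inactives = list(picker.LazyBitVectorPick(fps, len(fps), n_pick))

print("random :", random_inactives)
print("diverse:", diverse_inactives)
```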

    Multiple conformational states in retrospective virtual screening: homology models vs. crystal structures: beta-2 adrenergic receptor case study

    Background: Distinguishing active from inactive compounds is one of the crucial problems of molecular docking, especially in the context of virtual screening experiments. The randomization of poses and the natural flexibility of the protein make this discrimination even harder. Some of the recent approaches to post-docking analysis use an ensemble of receptor models to mimic this naturally occurring conformational diversity. However, the optimal number of receptor conformations is yet to be determined. In this study, we compare the results of a retrospective screening of beta-2 adrenergic receptor ligands performed on both an ensemble of receptor conformations extracted from the ten available crystal structures and an equal number of homology models. An additional analysis was also performed for homology models with up to 20 receptor conformations considered. Results: The docking results were encoded into Structural Interaction Fingerprints and automatically analyzed by a support vector machine. The use of homology models in such a virtual screening application proved superior to the use of crystal structures. Additionally, increasing the number of receptor conformational states led to more effective discrimination between active and inactive compounds. Conclusions: For virtual screening purposes, the use of homology models was found to be the most beneficial, even in the presence of crystallographic data on the conformational space of the receptor. The results also showed that increasing the number of receptor conformations considered improves the effectiveness of identifying active compounds by the machine learning method.
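    A hedged sketch of the post-docking analysis pipeline described above: per-conformation interaction fingerprints are concatenated across the receptor ensemble and fed to an SVM. Real Structural Interaction Fingerprints encode per-residue contact types derived from docking poses; random bits stand in for them here, so the reported AUC is only illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_ligands, n_conformations, sift_len = 300, 10, 420  # e.g. 60 residues x 7 bits

# One fingerprint per (ligand, receptor conformation); flattening the ensemble
# axis describes each ligand by its interactions with every conformation.
sifts = rng.integers(0, 2, size=(n_ligands, n_conformations, sift_len))
X = sifts.reshape(n_ligands, -1)
y = rng.integers(0, 2, size=n_ligands)               # 1 = active, 0 = inactive

scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5, scoring="roc_auc")
print("mean ROC AUC:", scores.mean())
```

    Under this encoding, adding receptor conformations simply widens the feature vector, which is one plausible reading of why a larger ensemble can sharpen the active/inactive decision boundary.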

    The age of data-driven proteomics: how machine learning enables novel workflows

    Much energy in the field of proteomics is dedicated to the application of challenging experimental workflows, which include metaproteomics, proteogenomics, data-independent acquisition (DIA), non-specific proteolysis, immunopeptidomics, and open modification searches. These workflows are all challenging because of ambiguity in the identification stage; they either expand the search space and thus increase the ambiguity of identifications or, in the case of DIA, generate data that is inherently more ambiguous. In this context, machine learning-based predictive models are now generating considerable excitement in the field of proteomics because they hold great potential to drastically reduce the ambiguity in the identification process of the above-mentioned workflows. Indeed, the field has already produced classical machine learning and deep learning models to predict almost every aspect of a liquid chromatography-mass spectrometry (LC-MS) experiment. Yet despite all the excitement, thorough integration of predictive models into these challenging LC-MS workflows is still limited, and further improvements to the modeling and validation procedures can still be made. In this viewpoint, we therefore point out highly promising recent machine learning developments in proteomics, alongside some of the remaining challenges.
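    To make the idea of such predictive models concrete, the sketch below fits a toy retention time predictor from amino acid composition, assuming scikit-learn is available. The peptides and retention times are synthetic placeholders, and published predictors use far richer features or deep networks; only the shape of the approach is illustrated.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

AA = "ACDEFGHIKLMNPQRSTVWY"

def composition(peptide: str) -> np.ndarray:
    """Encode a peptide as its normalized amino acid composition vector."""
    counts = np.array([peptide.count(a) for a in AA], dtype=float)
    return counts / max(len(peptide), 1)

rng = np.random.default_rng(0)
peptides = ["".join(rng.choice(list(AA), size=rng.integers(8, 20)))
            for _ in range(200)]

# Synthetic target: retention loosely driven by hydrophobic residue content,
# a crude proxy for reversed-phase LC behavior.
hydrophobic = set("AILMFWVY")
rts = np.array([sum(a in hydrophobic for a in p) / len(p) * 60 + rng.normal(0, 2)
                for p in peptides])

X = np.array([composition(p) for p in peptides])
model = GradientBoostingRegressor(random_state=0).fit(X, rts)
print("predicted RT (min):", model.predict(X[:3]))
```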