Search CORE

4 research outputs found

LightGBM: An Effective and Scalable Algorithm for Prediction of Chemical Toxicity – Application to the Tox21 and Mutagenicity Datasets

Author: Mucs D
Norinder U
Svensson F
Zhang J
Publication venue: 'American Chemical Society (ACS)'
Publication date: 28/10/2019

Machine learning algorithms have attained widespread use in assessing the potential toxicities of pharmaceuticals and industrial chemicals because they are faster and cheaper than experimental bioassays. Gradient boosting is an effective algorithm that often achieves high predictivity, but historically its relatively long computational time limited its application to predicting large compound libraries or developing in silico predictive models that require frequent retraining. LightGBM, a recent improvement of the gradient boosting algorithm, inherits its high predictivity but resolves its scalability and long computational time by adopting a leaf-wise tree growth strategy and introducing novel techniques. In this study, we compared the predictive performance and the computational time of LightGBM to deep neural networks, random forests, support vector machines, and XGBoost. All algorithms were rigorously evaluated on publicly available Tox21 and mutagenicity datasets using a nested 10-fold cross-validation scheme with integrated Bayesian optimization, which performs hyperparameter optimization while examining model generalizability and transferability to new data. The evaluation results demonstrated that LightGBM is an effective and highly scalable algorithm, offering the best predictive performance while consuming significantly less computational time than the other investigated algorithms across all Tox21 and mutagenicity datasets. We recommend LightGBM for applications in in silico safety assessment and in other areas of cheminformatics to meet the ever-growing demand for accurate and rapid prediction of various toxicity- or activity-related endpoints for the large compound libraries present in the pharmaceutical and chemical industries.
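
The nested 10-fold cross-validation mentioned in this abstract can be illustrated with a minimal sketch of the data partitioning alone. It assumes plain shuffled folds and uses hypothetical helper names (`k_fold_indices`, `nested_cv_splits`); the Bayesian hyperparameter search that the study runs in the inner loop, and the models themselves, are not shown, and the authors' actual splitting procedure may differ.

```python
import random


def k_fold_indices(indices, k, seed=0):
    """Shuffle the indices and split them into k nearly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=10, inner_k=10, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only from
    outer_train, so hyperparameters are never tuned on the outer test fold.
    """
    outer_folds = k_fold_indices(range(n_samples), outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        # Everything outside the current outer fold is available for tuning.
        outer_train = [idx for j, fold in enumerate(outer_folds)
                       if j != i for idx in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, seed + i + 1)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [idx for n, fold in enumerate(inner_folds)
                           if n != m for idx in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


# Hypothetical dataset size, chosen only for illustration.
splits = list(nested_cv_splits(n_samples=200))
```

In such a scheme, the inner splits would be used to choose hyperparameters, and the outer test fold would be touched only once, to estimate how the tuned model generalizes.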

UCL Discovery

The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching

Author: Alvarsson Jonathan
Berg Arvid
Carlsson Lars
Evelo Chris T.
Guha Rajarshi
Jeliazkova Nina
Kuhn Stefan
Mayfield John W.
Pluskal Tomas
Rojas-Cherto Miquel
Spjuth Ola
Steinbeck Christoph
Torrence Gilleain
Willighagen Egon L.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Background: The Chemistry Development Kit (CDK) is a widely used open source cheminformatics toolkit, providing data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. The library implements a wide variety of cheminformatics algorithms ranging from chemical structure canonicalization to molecular descriptor calculations and pharmacophore perception. It is used in drug discovery, metabolomics, and toxicology. Over the last 10 years, however, the code base has grown significantly, resulting in many complex interdependencies among components and poor performance of many algorithms. Results: We report improvements to the CDK v2.0 since the v1.2 release series, specifically addressing the increased functional complexity and poor performance. We first summarize the addition of new functionality, such as atom typing and molecular formula handling, and improvements to existing functionality that have led to significantly better performance for substructure searching, molecular fingerprints, and rendering of molecules. Second, we outline how the CDK has evolved with respect to quality control and the approaches we have adopted to ensure stability, including a code review mechanism. Conclusions: This paper highlights our continued efforts to provide a community-driven, open source cheminformatics library, and shows that such collaborative projects can thrive over extended periods of time, resulting in a high-quality and performant library. By taking advantage of community support and contributions, we show that an open source cheminformatics project can act as a peer-reviewed publishing platform for scientific computing software.

Maastricht University Research Portal

Crossref

Publikationer från Uppsala Universitet

Directory of Open Access Journals

De Montfort University Open Research Archive

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Leicester Research Archive

Benchmarking Study of Parameter Variation When Using Signature Fingerprints Together with Support Vector Machines

Author: Claes Andersson
Jarl E. S. Wikberg
Jonathan Alvarsson
Lars Carlsson
Martin Eklund
Ola Spjuth

QSAR modeling using molecular signatures and support vector machines with a radial basis function is increasingly used for virtual screening in the drug discovery field. This method has three free parameters: C, γ, and signature height. C is a penalty parameter that limits overfitting, γ controls the width of the radial basis function kernel, and the signature height determines how much of the molecule is described by each atom signature. Determining optimal values for these parameters is time-consuming, so good default values could save considerable computational cost. The goal of this project was to investigate whether such default values could be found by using seven public QSAR data sets spanning a wide range of end points and using both a bit version and a count version of the molecular signatures. On the basis of the experiments performed, we recommend a parameter set of heights 0 to 2 for the count version of the signature fingerprints and heights 0 to 3 for the bit version. These are used in combination with a support vector machine with C in the range of 1 to 100 and γ in the range of 0.001 to 0.1. When data sets are small or longer run times are not a problem, there is reason to consider adding height 3 to the count fingerprint and using a wider grid search. However, marked improvements should not be expected.
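
As a rough illustration of the recommended defaults, the sketch below writes out the standard radial basis function kernel, exp(-γ‖x − y‖²), and enumerates a small search grid over the ranges quoted in the abstract. The grid spacing (powers of ten) and the names `rbf_kernel`, `default_grid`, and `HEIGHT_RANGES` are assumptions made for this example; the abstract gives only the end points of each range, and signature fingerprint generation is not shown.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# End points come from the abstract; the spacing between them is assumed.
C_VALUES = [1, 10, 100]
GAMMA_VALUES = [0.001, 0.01, 0.1]
HEIGHT_RANGES = {"count": (0, 2), "bit": (0, 3)}


def default_grid(fingerprint):
    """List every (C, gamma) combination for the given fingerprint type."""
    lo, hi = HEIGHT_RANGES[fingerprint]
    heights = list(range(lo, hi + 1))
    return [
        {"fingerprint": fingerprint, "heights": heights, "C": c, "gamma": g}
        for c, g in product(C_VALUES, GAMMA_VALUES)
    ]


count_grid = default_grid("count")
bit_grid = default_grid("bit")
```

A larger γ makes the kernel value fall off faster with distance, which is the sense in which γ controls the kernel width; each grid entry would then be scored by cross-validation to pick the best combination.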

FigShare

Benchmarking Study of Parameter Variation When Using Signature Fingerprints Together with Support Vector Machines

Author: Bradley A. P.
Bruce C. L.
Burbidge R.
Carlsson L.
Chavatte P.
Chen H.
Claes Andersson
Eklund M.
Faulon J.-L.
Gold L. S.
Hansch C.
Hansen K.
Jarl E. S. Wikberg
Jonathan Alvarsson
Lapins M.
Lars Carlsson
Martin Eklund
Matthews E. J.
Norinder U.
Ola Spjuth
R Development Core Team
Robin X.
Rostkowski M.
Russom C. L.
Schumi J.
Smola A. J.
Spjuth O.
Steinbeck C.
Sutherland J. J.
Vanii K.
Weis D. C.
Wickham H.
Publication venue: 'American Chemical Society (ACS)'

Crossref