14 research outputs found

    Virtual screening for PPAR-gamma ligands using the ISOAK molecular graph kernel and Gaussian processes

    For a virtual screening study, we introduce a combination of machine learning techniques, employing a graph kernel, Gaussian process regression and clustered cross-validation. The aim was to find ligands of peroxisome proliferator-activated receptor gamma (PPAR-γ). The receptors in the PPAR family belong to the steroid-thyroid-retinoid superfamily of nuclear receptors and act as transcription factors. They play a role in the regulation of lipid and glucose metabolism in vertebrates and are linked to various human processes and diseases. For this study, we used a dataset of 176 PPAR-γ agonists published by Ruecker et al. ..

    How to Explain Individual Classification Decisions

    After building a classifier with modern machine learning tools, we typically have a black box at hand that predicts well for unseen data. Thus, we get an answer to the question of what the most likely label of a given unseen data point is. However, most methods provide no answer as to why the model predicted a particular label for a single instance, or which features were most influential for that instance. The only methods currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows one to explain the decisions of any classification method. Comment: 31 pages, 14 figures

    Kernel learning for ligand-based virtual screening: discovery of a new PPARgamma agonist

    Poster presentation at the 5th German Conference on Cheminformatics: 23. CIC-Workshop, Goslar, Germany, 8-10 November 2009. We demonstrate the theoretical and practical application of modern kernel-based machine learning methods to ligand-based virtual screening by successful prospective screening for novel agonists of the peroxisome proliferator-activated receptor gamma (PPARgamma) [1]. PPARgamma is a nuclear receptor involved in lipid and glucose metabolism, and related to type-2 diabetes and dyslipidemia. Applied methods included a graph kernel designed for molecular similarity analysis [2], kernel principal component analysis [3], multiple kernel learning [4], and Gaussian process regression [5]. In the machine learning approach to ligand-based virtual screening, one uses the similarity principle [6] to identify potentially active compounds based on their similarity to known reference ligands. Kernel-based machine learning [7] uses the "kernel trick", a systematic approach to the derivation of non-linear versions of linear algorithms like separating hyperplanes and regression. Prerequisites for kernel learning are similarity measures with the mathematical property of positive semidefiniteness (kernels). The iterative similarity optimal assignment graph kernel (ISOAK) [2] is defined directly on the annotated structure graph, and was designed specifically for the comparison of small molecules. In our virtual screening study, its use improved results, e.g., in principal component analysis-based visualization and Gaussian process regression. Following a thorough retrospective validation using a data set of 176 published PPARgamma agonists [8], we screened a vendor library for novel agonists. Subsequent testing of 15 compounds in a cell-based transactivation assay [9] yielded four active compounds.
The most interesting hit, a natural product derivative with a cyclobutane scaffold, is a selective full PPARgamma agonist (EC50 = 10 ± 0.2 µM, inactive on PPARalpha and PPARbeta/delta at 10 µM). We demonstrate how the interplay of several modern kernel-based machine learning approaches can successfully improve ligand-based virtual screening results.
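
The positive semidefiniteness requirement mentioned in this abstract can be illustrated with a minimal sketch. It does not implement ISOAK; as an assumption for illustration it uses a Gaussian (RBF) kernel on random descriptor vectors, and the helper names `rbf_kernel` and `is_positive_semidefinite` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Y, width=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * width**2))

def is_positive_semidefinite(K, tol=1e-8):
    """True if K is symmetric and all eigenvalues are >= -tol."""
    if not np.allclose(K, K.T):
        return False
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))      # assumed toy descriptors, not real molecules
K = rbf_kernel(X, X)              # Gram matrix of a valid kernel
psd = is_positive_semidefinite(K)

# A symmetric similarity matrix that is NOT a valid kernel matrix
# (its eigenvalues are 3 and -1).
not_a_kernel = np.array([[1.0, 2.0], [2.0, 1.0]])
bad = is_positive_semidefinite(not_a_kernel)
```

A similarity measure that fails this check on some data set cannot be used directly in kernel methods such as Gaussian process regression or kernel PCA.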

    Maschinelles Lernen zur Entwicklung von Medikamenten

    This thesis presents seven studies about constructing predictive models for application in drug discovery and drug design. Three new algorithms have been developed to improve the accuracy of predictions, explain individual predictions and elicit hints for compound optimization.
More specifically, predictive models for the following properties of chemical compounds have been developed: Metabolic Stability, Ames Mutagenicity, Aqueous Solubility, Partition Coefficients, Cytochrome P450 Inhibition, PPAR gamma binding and the hERG Channel Blockade Effect. From the point of view of machine learning, chemoinformatics is a very challenging field, not only because no existing representation adequately captures the dynamic three-dimensional nature of chemical molecules, but also because in typical drug discovery applications, fundamental assumptions common to most machine learning algorithms are severely violated. Training and test data are not drawn as identically distributed samples from the same underlying probability density, nor is the conditional distribution of labels (measurements) given the input features (descriptors) the same in training and test data. Lastly, all properties concerned with molecular recognition can exhibit sudden extreme changes, so-called activity cliffs. To cope with the fact that, regardless of the learning algorithm employed, many predictions for test compounds may not be correct, Gaussian Process models have been introduced into the field of chemoinformatics, because their predictive variances can directly serve as individual confidence estimates. The practical usefulness of predictive variances has been established in studies on Partition Coefficients, Aqueous Solubility and Metabolic Stability. Two separate algorithms for explaining individual predictions of (possibly non-linear) machine learning models are presented. The first method explains predictions by visualizing relevant objects (molecules) from the training set of the model. For all machine learning methods covered by the generalized representer theorem, one can calculate the normalized contribution of each training data point analytically.
In a case study on Ames Mutagenicity, it was found that by tuning the width parameter of radial basis function kernels, Gaussian Process Classification models can be obtained where the prediction for each test compound is almost completely determined by very few training compounds, leading to intuitively understandable visualizations that were found to be convincing from a chemist's point of view. The second algorithm utilizes local gradients of the model's predictions to obtain the locally most relevant features. In the case of Gaussian Process models, local gradients can be calculated analytically. In a case study on Ames Mutagenicity, toxicophores and detoxicophores were identified correctly, and even local peculiarities in chemical space (the extraordinary behavior of steroids) were discovered. While drug design served as the original motivation and testbed for developing algorithms for explaining individual predictions, both new methods can be applied to a wide range of modeling tasks. Wherever human experts are to be supported in making decisions, explanations of predictions will be valuable.
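
The analytic local gradient mentioned in this abstract can be sketched for a simplified case. The sketch below is not the thesis code: as assumptions for illustration it uses Gaussian process regression (not classification) with an RBF kernel, toy data, and the hypothetical helper names `rbf`, `fit_gp`, `predict_mean` and `local_gradient`. For the predictive mean m(x) = Σ_i α_i k(x, x_i) with k(x, x_i) = exp(-||x - x_i||² / (2w²)), the gradient is Σ_i α_i k(x, x_i) (x_i - x) / w².

```python
import numpy as np

def rbf(A, B, width):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * width**2))

def fit_gp(X, y, width=1.0, noise=1e-2):
    """Return the weights alpha = (K + noise * I)^-1 y of the GP predictive mean."""
    K = rbf(X, X, width) + noise * np.eye(len(X))
    return np.linalg.solve(K, y)

def predict_mean(x, X, alpha, width=1.0):
    """GP predictive mean at a single query point x."""
    return float(rbf(x[None, :], X, width)[0] @ alpha)

def local_gradient(x, X, alpha, width=1.0):
    """Analytic gradient of the predictive mean with respect to the features of x."""
    k = rbf(x[None, :], X, width)[0]                       # shape (n,)
    return ((X - x) * (alpha * k)[:, None]).sum(axis=0) / width**2

# Assumed toy data: the target depends only on the first of two features.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(40, 2))
y = np.sin(2.0 * X[:, 0])

alpha = fit_gp(X, y, width=1.0)
x_query = np.array([0.1, 0.2])
grad = local_gradient(x_query, X, alpha, width=1.0)

# Features ordered by the magnitude of their local gradient entry.
ranking = np.argsort(-np.abs(grad))
```

Entries of `grad` with large magnitude mark the features to which the prediction at `x_query` is locally most sensitive; the sign shows whether increasing the feature raises or lowers the prediction.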

    Programm der Schroeter'schen Erziehungsschule zu Jena (Elementar- und Realschule) : mit welchem zu der am Dienstag, den 20 März 1877, stattfindenden öffentlichen Prüfung im Namen des Lehrercollegiums ergebenst einladet Dr. Timon Schroeter, Rector. (1876/77)

    Contents: binding; title page; introduction; curriculum of the elementary school and Realschule; statistical information; news from school life; teaching materials / public examinations and festivities.

    Assessing the respective contributions of dietary flavanol monomers and procyanidins in mediating cardiovascular effects in humans: Randomized-controlled, double-masked intervention trial

    Background: Flavanols are an important class of food bioactives that can improve vascular function even in healthy subjects. Cocoa flavanols (CFs) consist principally of the monomer (−)-epicatechin (~20%), with a degree of polymerisation of 1 (DP1), and oligomeric procyanidins (~80%, DP2-10). Objective: To investigate the relative contribution of procyanidins and (−)-epicatechin to CF intake-related improvements in vascular function in healthy volunteers. Design: In a randomized, controlled, double-masked, parallel-group dietary intervention trial, 45 healthy men (18-35 years) consumed once daily for 1 month (a) a DP1-10 cocoa extract containing 130 mg of (−)-epicatechin and 560 mg of procyanidins, (b) a DP2-10 cocoa extract containing 20 mg of (−)-epicatechin and 540 mg of procyanidins, or (c) a Control that was flavanol-free with identical micro- and macronutrient composition (ClinicalTrials.gov NCT02728466). Results: Consumption of DP1-10, but neither DP2-10 nor the Control, significantly increased flow-mediated vasodilation (primary endpoint) and the level of structurally-related (−)-epicatechin metabolites (SREMs) in the circulatory system, while decreasing pulse wave velocity and blood pressure. Total cholesterol significantly decreased after daily intake of both DP1-10 and DP2-10 as compared to the Control. Conclusions: CF-related improvements in vascular function predominantly relate to intake of flavanol monomers and circulating SREMs in healthy humans, but not to the more abundant procyanidins and gut microbiome-derived CF-catabolites. Reduction in total cholesterol was linked to consumption of procyanidins but not necessarily that of (−)-epicatechin.

    Machine Learning Models for Lipophilicity and their Domain of Applicability

    Unfavorable lipophilicity and water solubility cause many drug failures; therefore, these properties have to be taken into account early on in lead discovery. Commercial tools for predicting lipophilicity have usually been trained on small and neutral molecules, and are thus often unable to accurately predict in-house data. Using a modern Bayesian machine learning algorithm (a Gaussian Process model), this study constructs a log D7 model based on 14556 drug discovery compounds of Bayer Schering Pharma. Performance is compared with support vector machines, decision trees, ridge regression and four commercial tools. In a blind test on 7013 new measurements from recent months (including compounds from new projects), 81 % were predicted correctly within one log unit, compared to only 44 % achieved by commercial software. Additional evaluations using public data are presented. We consider error bars for each method (model-based, ensemble-based, and distance-based approaches), and investigate how well they quantify the domain of applicability of each model.
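
The idea of model-based error bars from a Gaussian Process can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's model: it uses an RBF kernel with unit prior variance, one-dimensional toy data in place of molecular descriptors, and the hypothetical helper names `rbf` and `gp_predict`.

```python
import numpy as np

def rbf(A, B, width):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * width**2))

def gp_predict(X_train, y_train, X_test, width=1.0, noise=0.1):
    """Return predictive mean and standard deviation (error bar) per test point."""
    K = rbf(X_train, X_train, width) + noise * np.eye(len(X_train))
    K_star = rbf(X_test, X_train, width)                 # shape (m, n)
    mean = K_star @ np.linalg.solve(K, y_train)
    v = np.linalg.solve(K, K_star.T)                     # shape (n, m)
    # Prior variance k(x, x) = 1 for the RBF kernel, plus observation noise.
    var = 1.0 + noise - np.sum(K_star * v.T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Assumed toy data: training points cover [-1, 1] only.
X_train = np.linspace(-1.0, 1.0, 15)[:, None]
y_train = np.sin(3.0 * X_train[:, 0])

# One query inside the training range, one far outside it.
X_test = np.array([[0.0], [5.0]])
mean, std = gp_predict(X_train, y_train, X_test)
```

The error bar for the query far from the training data is larger than for the query inside the training range, which is the behaviour that makes predictive variances usable as a domain-of-applicability estimate.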

    Estimating the Domain of Applicability for Machine Learning QSAR Models: A Study on Aqueous Solubility of Drug Discovery Molecules

    We investigate the use of different Machine Learning methods to construct models for aqueous solubility. Models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules of Bayer Schering Pharma. For each method, we also consider an appropriate way to obtain error bars, in order to estimate the domain of applicability for each model. Here, we investigate error bars from a Bayesian model (Gaussian Process), an ensemble-based approach (Random Forest), and approaches based on the Mahalanobis distance to training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation, and on an external validation set of 536 molecules) and how faithfully the individual error bars represent the actual prediction error.
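
A distance-based applicability score of the kind named in this abstract can be sketched as follows. The exact variant used in the study is not given here, so as an assumption the sketch measures the Mahalanobis distance of each query to the mean of the training descriptors; the function name `mahalanobis_to_training` and the small `ridge` term are hypothetical choices for illustration.

```python
import numpy as np

def mahalanobis_to_training(X_train, X_query, ridge=1e-6):
    """Mahalanobis distance of each query row to the training mean."""
    mu = X_train.mean(axis=0)
    # A small ridge term keeps the covariance matrix invertible.
    cov = np.cov(X_train, rowvar=False) + ridge * np.eye(X_train.shape[1])
    diff = X_query - mu
    solved = np.linalg.solve(cov, diff.T).T              # cov^-1 (x - mu) per row
    return np.sqrt(np.sum(diff * solved, axis=1))

# Assumed toy descriptors in place of real molecular descriptors.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(200, 3))

# One query at the training mean, one far away from all training data.
X_query = np.vstack([X_train.mean(axis=0), X_train.mean(axis=0) + 10.0])
distances = mahalanobis_to_training(X_train, X_query)
```

Queries with a large distance lie outside the region covered by the training data, so predictions for them (for example from a Support Vector Machine or Ridge Regression model) would be assigned wider error bars.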