12 research outputs found

    Application of Support Vector Machines in Virtual Screening

    Traditionally, drug discovery has been a labor-intensive effort, since it is difficult to identify a promising drug candidate in an extremely large small-molecule library for any given target. Most small molecules fail to show activity against the target because of electrochemical, structural, and other incompatibilities. Virtual screening is an in-silico approach for filtering out candidates that are unlikely to show activity against a given target, thereby avoiding an enormous amount of experimentation that would most likely end in failure. Important approaches to virtual screening include docking studies and classification techniques. Classifiers based on support vector machines (SVMs), grounded in statistical learning theory, have found several applications in virtual screening. In this paper, the theory and main principles of SVMs are first briefly outlined. Thereafter, a few successful applications of SVMs in virtual screening are discussed. The paper further underlines the pitfalls of existing approaches and highlights the areas that need further contributions to improve the state of the art in applying SVMs to virtual screening.
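
As a minimal sketch of the kind of SVM screening classifier discussed here, assuming synthetic fingerprints and illustrative hyperparameters (none of which come from the paper):

```python
# Hedged sketch: SVM-based virtual screening on binary molecular
# fingerprints with scikit-learn. All data and settings are synthetic
# stand-ins, not the paper's setup.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for fingerprints of screened compounds.
n, n_bits = 400, 128
X = rng.integers(0, 2, size=(n, n_bits)).astype(float)
# Synthetic "active" label: a majority of three hypothetical substructure bits.
y = ((X[:, 0] + X[:, 1] + X[:, 2]) >= 2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# class_weight="balanced" guards against the active/inactive imbalance
# typical of real screening libraries.
clf = SVC(kernel="rbf", C=10.0, gamma="scale", class_weight="balanced")
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

In practice the choice of kernel and descriptor matters as much as the classifier itself; Tanimoto-style fingerprint kernels are a common alternative to the RBF kernel used here.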

    Prediction of Drug-Likeness Using Deep Autoencoder Neural Networks

    Due to diverse reasons, most drug candidates cannot eventually become marketed drugs. Developing reliable computational methods for predicting the drug-likeness of candidate compounds is of vital importance for improving the success rate of drug discovery and development. In this study, we used a fully connected neural network (FNN) to construct drug-likeness classification models, with a deep autoencoder used to initialize the model parameters. We collected datasets of drugs (represented by ZINC World Drug), bioactive molecules (represented by MDDR and WDI), and common molecules (represented by ZINC All Purchasable and ACD). Compounds were encoded with Mold2 two-dimensional structure descriptors. The classification accuracies of the drug-like/non-drug-like models are 91.04% on the WDI/ACD databases and 91.20% on MDDR/ZINC, respectively, outperforming previously reported models. In addition, we developed a drug/non-drug model (ZINC World Drug vs. ZINC All Purchasable), which distinguishes drugs from common compounds, with a classification accuracy of 96.99%. Our work shows that, using high-dimensional molecular descriptors, deep learning can be applied to establish state-of-the-art drug-likeness prediction models.
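
The pretraining scheme can be illustrated with a tiny tied-weight linear autoencoder in NumPy; the dimensions, learning rates, and synthetic "descriptors" below are placeholder assumptions, not the paper's actual architecture:

```python
# Hedged sketch: unsupervised autoencoder pretraining followed by a
# supervised classifier head, on synthetic descriptor data.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-in for a compound-descriptor matrix and drug-like labels.
n, d, h = 300, 32, 8
X = rng.normal(size=(n, d))
y = (X[:, :4].sum(axis=1) > 0).astype(float)

# Stage 1: pretrain a linear autoencoder with tied weights to
# reconstruct X from an h-dimensional code.
W = 0.1 * rng.normal(size=(d, h))
lr = 0.01
losses = []
for _ in range(200):
    E = X @ W @ W.T - X                      # reconstruction residual
    losses.append((E ** 2).mean())
    W -= lr * (2.0 / n) * (X.T @ E + E.T @ X) @ W

# Stage 2: the learned encoder initializes the feature layer; here
# only a logistic head on the codes is trained.
Z = X @ W
w, b = np.zeros(h), 0.0
for _ in range(500):
    p = sigmoid(Z @ w + b)
    w -= 0.1 * Z.T @ (p - y) / n
    b -= 0.1 * (p - y).mean()
acc = ((sigmoid(Z @ w + b) > 0.5).astype(float) == y).mean()
```

A real model of this kind would use nonlinear layers and fine-tune the encoder weights jointly with the classifier; the sketch only shows the two-stage initialize-then-train idea.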

    Predicting a small molecule-kinase interaction map: A machine learning approach

    Background: We present a machine learning approach to the problem of protein-ligand interaction prediction. We focus on a set of binding data obtained from 113 different protein kinases and 20 inhibitors. It was obtained through ATP-site-dependent binding competition assays and constitutes the first available dataset of this kind. We extract information about the investigated molecules from various data sources to obtain an informative set of features. Results: A Support Vector Machine (SVM) as well as a decision tree algorithm (C5/See5) is used to learn models based on the available features, which in turn can be used for the classification of new kinase-inhibitor pair test instances. We evaluate our approach using different feature sets and parameter settings for the employed classifiers. Moreover, the paper introduces a new way of evaluating predictions in such a setting, where different amounts of information about the binding partners can be assumed to be available for training. Results on an external test set are also provided. Conclusions: In most cases, the presented approach clearly outperforms the baseline methods used for comparison. Experimental results indicate that the applied machine learning methods are able to detect a signal in the data and predict binding affinity to some extent. For SVMs, the binding prediction can be improved significantly by using features that describe the active site of a kinase. For C5, besides diversity in the feature set, alignment scores of conserved regions turned out to be very useful.
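
The pair-classification setup can be sketched as follows; the feature construction (a simple concatenation) and all data are illustrative stand-ins for the paper's sequence-, active-site-, and compound-derived features:

```python
# Hedged sketch: casting kinase-inhibitor binding as binary
# classification over pairs, with one feature vector per pair built by
# concatenating protein-side and ligand-side features (synthetic data).
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)

n_kinases, n_inhib = 20, 10
kin_feats = rng.normal(size=(n_kinases, 16))   # e.g. active-site descriptors
inh_feats = rng.normal(size=(n_inhib, 8))      # e.g. chemical descriptors

# One row per kinase-inhibitor pair.
pairs = [(i, j) for i in range(n_kinases) for j in range(n_inhib)]
X = np.array([np.concatenate([kin_feats[i], inh_feats[j]]) for i, j in pairs])
# Synthetic binds / does-not-bind label.
y = (X[:, 0] * X[:, 16] > 0).astype(int)

svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
```

The paper's evaluation scenarios (new inhibitors vs. new kinases vs. new pairs) correspond to different ways of splitting this pair matrix into training and test sets.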

    Statistical methods for analysis and correction of high-throughput screening data

    In high-throughput screening (HTS), the first step in drug discovery, the activity level of thousands of chemical compounds is measured in order to identify among them potential candidates for future drugs (i.e., hits). A large number of environmental and procedural factors can negatively affect the screening process by introducing systematic errors into the measurements. Systematic errors can significantly alter the outcome of hit selection, producing large numbers of false positives and false negatives. HTS data-correction methods have been developed to adjust the measurements produced by screening and compensate for the negative effect that systematic errors have on the data (Heyse 2002, Brideau et al. 2003, Heuer et al. 2005, Kevorkov and Makarenkov 2005, Makarenkov et al. 2006, Malo et al. 2006, Makarenkov et al. 2007). In this thesis, we first evaluate the applicability of several statistical methods for detecting systematic errors in experimental HTS data, including the χ² goodness-of-fit test, the t-test, and the Kolmogorov-Smirnov test preceded by a Fourier transform. We first show that detecting systematic errors in raw HTS data is feasible, and that it is also possible to determine the exact location (rows, columns, and plates) of the systematic errors in an assay. We recommend using a specialized version of the t-test to detect systematic error before hit selection, in order to determine whether error correction is needed at all. Typically, systematic errors affect only a few rows or columns, on some but not all plates of the assay.
All existing error-correction methods were designed to modify all the data of the plate to which they are applied and, in some cases, even all the data of the assay. When applied, they therefore modify not only the measurements biased by systematic error but also many correct measurements. In this context, we propose two new, effective systematic-error-correction methods designed to modify only selected rows and columns of a given plate, i.e., those where the presence of a systematic error has been confirmed. After correction, the corrected measurements remain comparable with the unmodified values of the given plate and of the whole assay. Both new methods rely on the results of an error-detection test to determine which rows and columns of each plate in the assay should be corrected. A general procedure for correcting high-throughput screening data is also suggested. Current hit-selection methods in high-throughput screening generally provide no way to assess the reliability of the obtained results. In this thesis, we describe a methodology for estimating the probability of each chemical compound being a hit when the assay contains more than one replicate. Using this new methodology, we define a probability-based hit-selection procedure that estimates a confidence level for each hit. In addition, new measures for estimating the changes in false-positive and false-negative rates, as a function of the number of assay replicates, are proposed.
Furthermore, we study the possibility of defining accurate statistical models for the computational prediction of HTS measurements. Note that the experimental screening process is very expensive; virtual, in-silico screening could lead to a substantial reduction in costs. We focused on finding relationships between experimental HTS measurements and a set of chemical descriptors characterizing the compounds under consideration. We performed polynomial redundancy analysis to establish the existence of these relationships. At the same time, we applied two machine learning methods, neural networks and decision trees, to test their ability to predict experimental screening results. AUTHOR'S KEYWORDS: high-throughput screening (HTS), statistical modeling, predictive modeling, systematic error, error-correction methods, machine learning methods
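
A minimal sketch of the two-step idea, detecting systematic error with a t-test and then correcting only the flagged rows or columns, might look as follows (synthetic 8x12 plate; the thesis's actual tests and correction methods are more elaborate):

```python
# Hedged sketch: flag plate rows carrying systematic error with a
# t-test, then re-centre only those rows, leaving clean wells untouched.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
plate = rng.normal(loc=100.0, scale=5.0, size=(8, 12))
plate[2, :] += 25.0          # inject a row-wise systematic error

def flag_rows(plate, alpha=0.01):
    """t-test each row against all remaining wells of the plate."""
    flagged = []
    for r in range(plate.shape[0]):
        rest = np.delete(plate, r, axis=0).ravel()
        t, p = stats.ttest_ind(plate[r], rest, equal_var=False)
        if p < alpha:
            flagged.append(r)
    return flagged

def correct_rows(plate, rows):
    """Shift only flagged rows so their median matches the plate median."""
    corrected = plate.copy()
    target = np.median(plate)
    for r in rows:
        corrected[r] += target - np.median(plate[r])
    return corrected

rows = flag_rows(plate)
clean = correct_rows(plate, rows)
```

The same per-row test extends to columns (transpose the plate), and repeating it plate by plate localizes errors across a whole assay, in line with the detect-before-correct procedure described above.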

    Classifying ‘drug-likeness’ with kernel-based learning methods

    In this article we report on a successful application of modern machine learning technology, namely Support Vector Machines, to the problem of assessing the ‘drug-likeness’ of a chemical from a given set of descriptors of the substance. We were able to drastically improve on the recent result by Byvatov et al. (2003) on this task and achieved an error rate of about 7% on unseen compounds using Support Vector Machines. We see very high potential for such machine learning techniques in a variety of computational chemistry problems that occur in the drug discovery and drug design process.

    Development and Interpretation of Machine Learning Models for Drug Discovery

    In drug discovery, domain experts from fields such as medicinal chemistry, biology, and computer science collaborate to develop novel pharmaceutical agents. Computational models developed in this process must be correct and reliable, but at the same time interpretable: their findings have to be accessible to experts from fields other than computer science, who validate and improve them with domain knowledge. Only then can interdisciplinary teams communicate their scientific results both precisely and intuitively. This work is concerned with the development and interpretation of machine learning models for drug discovery. To this end, it describes the design and application of computational models for specialized use cases such as compound profiling and hit expansion. Novel insights into machine learning for ligand-based virtual screening are presented, and limitations in the modeling of compound potency values are highlighted. It is shown that compound activity can be predicted from high-dimensional target profiles alone, without access to molecular structures. Moreover, support vector regression for potency prediction is carefully analyzed, and a systematic misprediction of highly potent ligands is discovered. A further key aspect is the interpretation and chemically accessible representation of the models; this thesis therefore focuses especially on methods to better understand and communicate modeling results. To this end, two interactive visualizations for the assessment of naive Bayes and support vector machine models on molecular fingerprints are presented. These visual representations of virtual screening models are designed to provide an intuitive chemical interpretation of the results.
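
One concrete quantity such a naive Bayes visualization could expose is the per-bit log-odds contribution of a fingerprint model; the sketch below uses synthetic data and is an illustrative assumption, not the thesis's implementation:

```python
# Hedged sketch: per-bit log-odds contributions of a Bernoulli naive
# Bayes model on binary fingerprints -- the kind of score a chemical
# visualization could map back onto substructures (synthetic data).
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=(200, 64))
# Synthetic "active" label driven by two hypothetical fingerprint bits.
y = (X[:, 3] | X[:, 7]).astype(int)

nb = BernoulliNB(alpha=1.0).fit(X, y)

# log P(bit=1 | active) - log P(bit=1 | inactive): a signed score per
# bit indicating whether a set bit pushes a compound toward "active".
contrib = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
top_bits = np.argsort(contrib)[::-1][:5]
```

Mapping the highest-scoring bits back to the atoms that set them is what turns such a model from a black box into a chemically interpretable one.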

    Kernel-Based Learning Methods for Virtual Screening (Kern-basierte Lernverfahren für das virtuelle Screening)

    We investigate the utility of modern kernel-based machine learning methods for ligand-based virtual screening. In particular, we introduce a new graph kernel based on iterative graph similarity and optimal assignments, apply kernel principal component analysis to projection-error-based novelty detection, and discover a new selective agonist of the peroxisome proliferator-activated receptor gamma using Gaussian process regression. Virtual screening, the computational ranking of compounds with respect to a predicted property, is a cheminformatics problem relevant to the hit generation phase of drug development. Its ligand-based variant relies on the similarity principle, which states that (structurally) similar compounds tend to have similar properties. We describe the kernel-based machine learning approach to ligand-based virtual screening; in this, we stress the role of molecular representations, including the (dis)similarity measures defined on them, investigate effects in high-dimensional chemical descriptor spaces and their consequences for similarity-based approaches, review literature recommendations on retrospective virtual screening, and present an example workflow. Graph kernels are formal similarity measures that are defined directly on graphs, such as the annotated molecular structure graph, and correspond to inner products. We review graph kernels, in particular those based on random walks, subgraphs, and optimal vertex assignments. Combining the latter with an iterative graph similarity scheme, we develop the iterative similarity optimal assignment graph kernel, give an iterative algorithm for its computation, prove convergence of the algorithm and the uniqueness of the solution, and provide an upper bound on the number of iterations necessary to achieve a desired precision. In a retrospective virtual screening study, our kernel consistently improved performance over chemical descriptors as well as other optimal assignment graph kernels.
Chemical data sets often lie on manifolds of lower dimensionality than the embedding chemical descriptor space. Dimensionality reduction methods try to identify these manifolds, effectively providing descriptive models of the data. For spectral methods based on kernel principal component analysis, the projection error is a quantitative measure of how well new samples are described by such models. This can be used for the identification of compounds structurally dissimilar to the training samples, leading to projection-error-based novelty detection for virtual screening using only positive samples. We provide proof of principle by using principal component analysis to learn the concept of fatty acids. The peroxisome proliferator-activated receptor (PPAR) is a nuclear transcription factor that regulates lipid and glucose metabolism, playing a crucial role in the development of type 2 diabetes and dyslipidemia. We establish a Gaussian process regression model for PPAR gamma agonists using a combination of chemical descriptors and the iterative similarity optimal assignment kernel via multiple kernel learning. Screening of a vendor library and subsequent testing of 15 selected compounds in a cell-based transactivation assay resulted in 4 active compounds. One compound, a natural product with a cyclobutane scaffold, is a full selective PPAR gamma agonist (EC50 = 10 ± 0.2 µM, inactive on PPAR alpha and PPAR beta/delta at 10 µM). The study delivered a novel PPAR gamma agonist, de-orphanized a natural bioactive product, and hints at the natural-product origins of pharmacophore patterns in synthetic ligands.
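
The projection-error novelty-detection idea can be sketched with scikit-learn's KernelPCA; the data, kernel choice, and parameters below are illustrative assumptions rather than the thesis's configuration:

```python
# Hedged sketch: one-class novelty detection via kernel PCA projection
# error. Compounds far from the manifold learned from positive samples
# reconstruct poorly and receive high novelty scores (synthetic data).
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(5)
# "Known class" lies near a 2-D manifold inside a 10-D descriptor space.
basis = rng.normal(size=(2, 10))
train = rng.normal(size=(100, 2)) @ basis + 0.05 * rng.normal(size=(100, 10))

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.05,
                 fit_inverse_transform=True).fit(train)

def projection_error(model, X):
    """Input-space reconstruction error, used as a novelty score."""
    X_hat = model.inverse_transform(model.transform(X))
    return np.linalg.norm(X - X_hat, axis=1)

inlier = rng.normal(size=(20, 2)) @ basis       # on the manifold
outlier = rng.normal(size=(20, 10)) * 5.0       # off the manifold
scores_in = projection_error(kpca, inlier)
scores_out = projection_error(kpca, outlier)
```

Thresholding the score separates compounds well described by the learned model from structurally dissimilar ones, which is exactly the one-class screening setting the abstract describes.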