8 research outputs found

    Winnow based identification of potent hERG inhibitors in silico: comparative assessment on different datasets

    RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are. Peer Reviewed

    Tuning hERG Out: Antitarget QSAR Models for Drug Development

    Several non-cardiovascular drugs have been withdrawn from the market due to their inhibition of hERG K+ channels, which can potentially lead to severe heart arrhythmia and death. As hERG safety testing is a mandatory FDA-required procedure, there is considerable interest in developing predictive computational tools to identify and filter out potential hERG blockers early in the drug discovery process. In this study, we aimed to generate predictive and well-characterized quantitative structure–activity relationship (QSAR) models for hERG blockage using the largest publicly available dataset of 11,958 compounds from the ChEMBL database. The models have been developed and validated according to OECD guidelines using four types of descriptors and four different machine-learning techniques. The classification accuracies discriminating blockers from non-blockers were as high as 0.83–0.93 on an external set. Model interpretation revealed several SAR rules, which can guide structural optimization of some hERG blockers into non-blockers. We have also applied the generated models to screen the World Drug Index (WDI) database and identify putative hERG blockers and non-blockers among currently marketed drugs. The developed models can reliably identify blockers and non-blockers, which could be useful for the scientific community. A freely accessible web server has been developed allowing users to identify putative hERG blockers and non-blockers in chemical libraries of their interest (http://labmol.farmacia.ufg.br/predherg).

    Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Datasets

    The ability to interpret the predictions made by quantitative structure activity relationships (QSARs) offers a number of advantages. Whilst QSARs built using non-linear modelling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modelling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting non-linear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to two widely used linear modelling approaches: linear Support Vector Machines (SVM), or Support Vector Regression (SVR), and Partial Least Squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions, using novel scoring schemes for assessing Heat Map images of substructural contributions. We critically assess different approaches to interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed, public domain benchmark datasets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modelling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpreting non-linear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using Open Source programs, which we have made available to the community. 
These programs are the rfFC package [https://r-forge.r-project.org/R/?group_id=1725] for the R Statistical Programming Language, along with a Python program HeatMapWrapper [https://doi.org/10.5281/zenodo.495163] for Heat Map generation.

    Integration of Data Quality, Kinetics and Mechanistic Modelling into Toxicological Assessment of Cosmetic Ingredients

    In our modern society we are exposed to many natural and synthetic chemicals. The assessment of chemicals with regard to human safety is difficult but nevertheless of high importance. Besides clinical studies, which are restricted to potential pharmaceuticals only, most toxicity data relevant for regulatory decision-making are based on in vivo data. Due to the ban on animal testing of cosmetic ingredients in the European Union, alternative approaches, such as in vitro and in silico tests, have become more prevalent. In this thesis, existing non-testing approaches (i.e. studies without additional experiments), e.g. QSAR models, have been extended, and new non-testing approaches, e.g. structural alert systems supported by in vitro data, have been created. The thesis centres on determining data quality, improving modelling performance and supporting Adverse Outcome Pathways (AOPs) with definitions of structural alerts and physico-chemical properties. There was a clear focus on the transparency of models, i.e. approaches using algorithmic feature selection, machine learning etc. have been avoided. Furthermore, structural alert systems have been written in an understandable and transparent manner. Besides the methodological aspects of this work, cosmetically relevant examples of models have been chosen, e.g. skin penetration and hepatic steatosis. Interpretations of models, as well as the possibility of adjustments and extensions, have been discussed thoroughly. As models usually do not depict reality flawlessly, consensus approaches combining various non-testing approaches and in vitro tests should be used to support decision-making in the regulatory context. For example, within read-across it is feasible to use supporting information from QSAR models, docking, in vitro tests etc. Applying a variety of models should lead to conclusions that are more usable and acceptable within toxicology.
Within this thesis (and associated publications), novel methodologies on how to assess and employ statistical data quality and how to screen for potential liver toxicants have been described. Furthermore, computational tools, such as models for skin permeability and dermal absorption, have been created.

    Development and Evaluation of ADME Models Using Proprietary and Opensource Data

    Absorption, Distribution, Metabolism and Elimination (ADME) properties are important factors in the drug discovery pipeline. Literature ADME data are often collected in large chemical databases like ChEMBL, which might be an asset for improving the prediction of ADME properties. Pharmaceutical companies build ADME Quantitative Structure Property Relationship (QSPR) models using proprietary data, and thus literature data might be a valuable additional source for the development of predictive models. The aim of this study was to investigate whether merging literature and proprietary data could improve the predictive ability of proprietary models and enlarge their applicability domain (AD). ADME predictive models for Caco-2 (A to B) permeability and LogD7.4 were built with data extracted from the Evotec and ChEMBL databases. Predictive models were developed for each property, and three different training sets were used, based on: proprietary compounds (Evotec models), literature compounds (ChEMBL models) and a merged set of proprietary and literature compounds (Evotec+ChEMBL models). Random Forest (RF), Partial Least Squares (PLS) and Support Vector Regression (SVR) were used to develop the models. The performance of the models was evaluated using two types of test sets: a diverse test set (20% of the available compounds, randomly selected) and a temporal test set (data published after the models were built). The descriptors used were physicochemical descriptors, structural Molecular Access System (MACCS) descriptors and Partial Equalisation of Orbital Electronegativity – van der Waals surface area (PEOE-VSA) descriptors. The AD of the models was evaluated with four distance-to-model metrics: kNN with Euclidean distance, kNN with Manhattan distance, leverage and Mahalanobis distance. The ability of an existing Evotec Caco-2 permeability model to assess literature compounds (extracted from ChEMBL) was evaluated.
The literature test set was predicted with a higher RMSE than the internal compounds. Additionally, a number of literature compounds were found to be outside the AD of the Evotec model, thus highlighting an area of improvement for proprietary Evotec models. Furthermore, the effect of including literature data in the existing Caco-2 permeability and LogD7.4 Evotec proprietary models was evaluated. The RF algorithm was the highest-performing method for the development of Caco-2 permeability models, and SVR for the LogD7.4 models. In addition, the leverage method proved to be the most appropriate for evaluating the models' AD. The permeability model built by merging literature and proprietary data (Evotec+ChEMBL model) predicted a literature temporal test set with an RMSE of 0.68, while the Evotec model showed an RMSE of 0.74. On the Evotec temporal test set the two models performed similarly, but the AD of the mixed model (incorporating both literature and proprietary data) was enlarged: 86.15% of the compounds in the proprietary temporal test set were within the AD of the Evotec+ChEMBL model, while 76.50% of the compounds of the same test set were within the AD of the Evotec model. Similarly, the LogD7.4 Evotec+ChEMBL model predicted a literature temporal test set with an RMSE of 0.77, while the Evotec model showed an RMSE of 0.83. On the Evotec temporal test set the two models again performed similarly, but the AD of the mixed model was enlarged: 94.86% of the compounds in the proprietary temporal test set were within the AD of the Evotec+ChEMBL model, while 88.49% of the compounds of the same test set were within the AD of the Evotec model. This study demonstrated that including public ADME data in proprietary models improved their performance and at the same time enlarged their AD.
The methodology presented herein will be applied by Evotec computational scientists to re-build the Caco-2 and LogD7.4 Evotec proprietary models considering literature data as discussed in this thesis.
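The two quantities this abstract reports, RMSE on a test set and the share of test compounds inside the applicability domain, can be illustrated with a minimal sketch. It assumes a kNN distance-to-model with Euclidean distance, one of the four metrics named above. The function names, `k`, the 95% quantile threshold and the toy descriptor vectors are illustrative assumptions and are not taken from the thesis.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def knn_distance(query, training, k=3, skip_self=False):
    """Mean Euclidean distance from `query` to its k nearest training points."""
    dists = sorted(math.dist(query, x) for x in training)
    if skip_self:
        dists = dists[1:]  # drop the zero distance of a training point to itself
    return sum(dists[:k]) / k


def in_domain(query, training, k=3, quantile=0.95):
    """Assumed AD rule: the query is inside the domain if its kNN distance does
    not exceed the chosen quantile of the training set's own kNN distances."""
    ref = sorted(knn_distance(x, training, k, skip_self=True) for x in training)
    threshold = ref[min(len(ref) - 1, int(quantile * len(ref)))]
    return knn_distance(query, training, k) <= threshold


def fraction_in_domain(queries, training, k=3, quantile=0.95):
    """Share of query compounds that fall inside the applicability domain."""
    inside = sum(in_domain(q, training, k, quantile) for q in queries)
    return inside / len(queries)


# Toy two-descriptor training set (hypothetical values).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
test = [(0.4, 0.6), (5.0, 5.0)]
print(fraction_in_domain(test, train, k=2))
```

Adding literature compounds to `train` can only widen the region covered by the training set, which is the sense in which a merged model's AD is "enlarged"; the threshold rule itself is a design choice and other cut-offs are equally defensible.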

    In silico screening on the herg potassium channel

    During the drug development process, almost 35% of the compounds fail due to poor absorption, distribution, metabolism, excretion and toxicity (ADMET). An important role in these failures is played by improper interactions with antitarget proteins, such as cytochrome P450, P-glycoprotein and the hERG potassium channel. The hERG potassium channel is expressed in various cells and tissues, such as heart, neurons and smooth muscle. In the heart, the hERG channel plays an important role in the third phase of heart repolarization, due to the conduction of the rapid delayed rectifier K+ current (IKr). A delay of this phase of repolarization leads to Long QT syndrome (LQTS), which might cause a potentially fatal arrhythmia called Torsade de Pointes (TdP). Many different classes of compounds were withdrawn from the market in the past decade due to their interaction with the hERG channel. Similar to other antitarget proteins, the hERG channel is polyspecific in ligand recognition, hence it can interact with many classes of compounds, such as psychiatric, antihistaminic, antiarrhythmic and antimicrobial drugs. Several studies show that some molecules do not dissociate during channel gating and are trapped in the closed state of the hERG channel. In this study, propafenone and its derivatives were docked into homology models of the hERG channel in the closed and open states to shed more light on hERG inhibition and on drug trapping. To investigate the interactions between the hERG channel in the closed state and the compounds, a series of trapped propafenone derivatives was docked into the homology model of the hERG channel in the closed conformation using Dock, the docking tool of MOE, and Glide, the docking tool of Schrödinger.
An SVL script called ROTALI was used to generate RMSD matrices, with which duplicate poses lying in different directions of the central cavity were detected and deleted, allowing possible binding modes to be identified through agglomerative hierarchical clustering. This analysis led to the identification of two possible binding modes. The same process was applied to the poses obtained by docking the propafenones into a homology model of the hERG channel in the open state. Three possible binding modes were selected through agglomerative cluster analysis of the RMSD matrix generated taking into account the propafenone derivatives' common scaffold and the amino acids that might interact. Finally, in order to take protein flexibility into account, nine propafenone derivatives were docked into eight models of the hERG channel in the open state obtained from snapshots of molecular dynamics simulations. Clustering according to both the common scaffold RMSD and the RMSD matrix of the amino acids interacting with the poses, two binding modes were selected. Biological studies suggest that non-trapped propafenones hinder hERG channel gating through a mechanism called "foot in the door". In four out of the five selected clusters, it is possible to explain the "foot in the door" mechanism. Interestingly, ranking the poses of the five above-mentioned clusters according to the potential energy of the R1 substituent, and according to this value divided by the number of heavy atoms, makes it possible to distinguish between trapped and non-trapped propafenones. In the non-trapped compounds, this value is always higher than in the trapped ones. The fact that this also works in cluster five, where the R1 substituents are placed under the ring formed by the four Phe656, might indicate that drug trapping depends more on intrinsic properties of the R1 substituent than on its conformation when it interacts with the hERG channel.
Hence, this might indicate that the rigidity and the bulkiness of the substituent determine whether a propafenone derivative is trapped or not, independently of the binding mode in the hERG channel.
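The pose-clustering step described in this abstract can be sketched as agglomerative hierarchical clustering on a precomputed pairwise RMSD matrix. This is a minimal illustration and not the ROTALI script. The abstract does not state the linkage criterion or the cut-off, so average linkage, the 2.0 cut-off and the toy matrix are assumptions.

```python
def agglomerative_clusters(rmsd, cutoff):
    """Average-linkage agglomerative clustering on a symmetric RMSD matrix.

    Repeatedly merges the two closest clusters until the smallest
    inter-cluster distance exceeds `cutoff`. Returns lists of pose indices.
    """
    clusters = [[i] for i in range(len(rmsd))]

    def linkage(a, b):
        # Average pairwise RMSD between the members of two clusters.
        return sum(rmsd[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cutoff:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid after the pop
        clusters[i].extend(merged)
    return [sorted(c) for c in clusters]


# Hypothetical pairwise RMSD values for four docking poses.
toy_rmsd = [
    [0.0, 0.5, 4.0, 4.2],
    [0.5, 0.0, 3.9, 4.1],
    [4.0, 3.9, 0.0, 0.6],
    [4.2, 4.1, 0.6, 0.0],
]
print(agglomerative_clusters(toy_rmsd, cutoff=2.0))
```

Each resulting cluster would correspond to one candidate binding mode; the number of modes found depends directly on the chosen cut-off.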

    Classifier Design to Improve Pattern Classification and Knowledge Discovery for Imbalanced Datasets

    Imbalanced dataset mining is a nontrivial issue. It has extensive applications in a variety of fields, such as scientific research, medical diagnosis, business and multiple industries. Standard machine learning algorithms fail to produce satisfactory classifiers: they tend to over-fit the larger class but ignore the smaller class. Numerous algorithms have been developed to handle class imbalance, and limited progress has been achieved in improving prediction accuracy for the smaller class. However, real-world datasets may have hidden detrimental characteristics other than class imbalance. Those characteristics are usually dataset specific, and can cause algorithms that are robust on other imbalanced datasets to fail. Mining such datasets can only be improved by algorithms tailored to domain characteristics (Weiss, 2004); therefore, it is important and necessary to do exploratory data analysis before classifier design. On the other hand, unmet needs in knowledge discovery, such as lead optimization during drug discovery, demand novel algorithms. In this study, we have developed a framework for imbalanced dataset mining tailored to data characteristics and adapted to knowledge discovery in chemical datasets. First, we explored the dataset and visualized domain characteristics, and then we designed different classifiers accordingly: for class imbalance, active learning (AL), cost sensitive learning (CSL) and re-sampling methods were designed; for class overlap, Class Boundary Cleaning (CBC) and Class Boundary Mining (CBM) were developed. CBM was also designed for lead optimization: ideally it would detect fine structural differences between different classes of compounds, and these differences could be options for lead optimization. The methods developed were applied to two datasets, hERG and CPDB. The results from the imbalanced hERG liability dataset showed that CBC, CBM and AL were effective in correcting class imbalance/overlap and improving the classifier's performance.
Highly predictive models were built; discriminating patterns were discovered; and lead optimization options were proposed. The methodology developed and knowledge discovered will benefit drug discovery and improve hazard test prioritization, risk assessment, and governmental regulatory work on human health and environmental protection. Doctor of Philosophy
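As a rough illustration of the re-sampling idea mentioned in this abstract, the sketch below balances a dataset by random undersampling of the larger class. The abstract does not say which re-sampling scheme was used, so random undersampling is an assumption, and this is not the CBC or CBM method. The function name, labels and toy data are hypothetical.

```python
import random
from collections import Counter


def undersample(samples, labels, seed=0):
    """Randomly drop samples from larger classes until every class has as
    many samples as the smallest one. Returns shuffled (sample, label) pairs."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # rng.sample draws without replacement, so no sample is duplicated.
        balanced.extend((x, y) for x in rng.sample(xs, n_min))
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced set: 8 non-blockers and 2 blockers (hypothetical compound ids).
compounds = [f"cpd{i}" for i in range(10)]
classes = ["non-blocker"] * 8 + ["blocker"] * 2
print(Counter(label for _, label in undersample(compounds, classes)))
```

Undersampling discards information from the larger class, which is one reason the alternatives named in the abstract, such as cost-sensitive learning or active learning, may be preferred.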