8 research outputs found

    TeachOpenCADD: a teaching platform for computer-aided drug design using open source packages and data

    Owing to the increase in freely available software and data for cheminformatics and structural bioinformatics, research in computer-aided drug design (CADD) is increasingly built on modular, reproducible, and easy-to-share pipelines. While documentation for such tools is available, there are only a few freely accessible examples that teach the underlying CADD concepts, especially for users new to the field. Here, we present TeachOpenCADD, a teaching platform developed by students for students, using open source compound and protein data as well as basic and CADD-related Python packages. We provide interactive Jupyter notebooks for central CADD topics, integrating theoretical background and practical code. TeachOpenCADD is freely available on GitHub: https://github.com/volkamerlab/TeachOpenCADD

    KnowTox: pipeline and case study for confident prediction of potential toxic effects of compounds in early phases of development

    Risk assessment of newly synthesised chemicals is a prerequisite for regulatory approval. In this context, in silico methods have great potential to reduce time, cost, and ultimately animal testing, as they make use of the ever-growing amount of available toxicity data. Here, KnowTox is presented, a novel pipeline that combines three different in silico toxicology approaches to allow for confident prediction of potentially toxic effects of query compounds, i.e. machine learning models for 88 endpoints, alerts for 919 toxic substructures, and computational support for read-across. It is mainly based on the ToxCast dataset, which after preprocessing contains a sparse matrix of 7912 compounds tested against 985 endpoints. When applying machine learning models, the applicability and reliability of predictions for new chemicals are of utmost importance. Therefore, first, the conformal prediction technique was deployed, which comprises an additional calibration step and by definition creates internally valid predictors at a given significance level. Second, to further improve validity and information efficiency, two adaptations are suggested, exemplified on the androgen receptor antagonism endpoint. An absolute increase in validity of 23% on the in-house dataset of 534 compounds could be achieved by introducing KNNRegressor normalisation. This increase in validity comes at the cost of efficiency, which could in turn be improved by 20% for the initial ToxCast model by balancing the dataset during model training. Finally, the value of the developed pipeline for risk assessment is discussed using two in-house triazole molecules. Compared to a single toxicity prediction method, complementing the outputs of different approaches can have a higher impact on guiding toxicity testing and on de-selecting likely harmful development-candidate compounds early in the development process.
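The calibration step mentioned above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (a plain inductive conformal predictor, not necessarily the exact KnowTox setup): a conformal p-value is the fraction of calibration nonconformity scores at least as large as the query's score, with the query itself counted once.

```python
import numpy as np

def conformal_p_values(cal_scores, test_scores):
    """Inductive conformal p-values: for each test nonconformity score,
    the fraction of calibration scores at least as large (the test point
    itself counts once, hence the +1 in numerator and denominator)."""
    cal = np.asarray(cal_scores, dtype=float)
    return np.array([(np.sum(cal >= s) + 1) / (len(cal) + 1)
                     for s in test_scores])

# Example: with calibration scores 0.1-0.4, a query score of 0.25 exceeds
# the significance threshold at any level below its p-value of 0.6.
print(conformal_p_values([0.1, 0.2, 0.3, 0.4], [0.25, 0.5]))  # [0.6 0.2]
```

At significance level eps, a label is kept in the prediction set whenever its p-value exceeds eps; this is what yields the internally valid predictors at a given significance level mentioned in the abstract.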

    Income Taxes, Sorting, and the Costs of Housing: Evidence from Municipal Boundaries in Switzerland

    This paper provides novel evidence on the role of income taxes for residential rents and spatial sorting. Drawing on comprehensive apartment-level data, we identify the effects of tax differentials across municipal boundaries in Switzerland. The boundary discontinuity design (BDD) corrects for unobservable location characteristics such as environmental amenities or access to public goods and thereby reduces the estimated response of housing prices by one half compared to conventional estimates: we identify an income tax elasticity of rents of about 0.26. We complement this approach with census data on local sociodemographic characteristics and show that about one third of this effect can be traced back to a sorting of high-income households into low-tax municipalities. These findings are robust to a matching approach (MBDD), which compares identical residences on opposite sides of the boundary, and to a number of further sensitivity checks.

    Strategies to enhance the applicability of in silico toxicity prediction methods

    Given the ubiquity of synthetic chemicals in our daily life, it is crucial to assess the hazardous effects of new chemical substances on humans, animals, and the environment. Toxicity assessment has traditionally been based on in vitro and in vivo studies, but ethical and economic arguments call for the reduction and replacement of animal testing. Therefore, computational toxicity prediction has gained momentum to support toxicity studies and to ultimately reduce animal testing. In silico approaches are comparably fast and inexpensive, and many of them can be applied prior to synthesis and in vitro testing of new chemicals. Computational methods such as machine learning (ML), similarity search, and structural alerts are already in use during the development of new chemicals. They are, however, often restrained by limited data availability for training and by the need to determine the applicability domain for predictions on new data. In this thesis, novel in silico strategies for guiding in vivo and in vitro toxicity testing were developed. Means to maximise the gain from the limited available data were explored, as well as strategies to improve the applicability of in silico toxicity prediction approaches. A special focus was placed on the potential of the conformal prediction (CP) framework, which is built on top of an ML model to allow for confidence estimation. The CP framework utilises an extra calibration set to compare the predicted probabilities of new query compounds to those previously seen. The calibrated probabilities are returned in the form of so-called p-values. To support toxicity testing of new chemicals, the Python-based KnowTox pipeline was developed. Following a holistic approach, a compound of interest can be examined in silico from three perspectives: KnowTox searches for known toxic substructures, reports how similar compounds were tested in vitro, and queries pre-trained CP models.
The value of complementing the outputs of different in silico approaches was demonstrated in a retrospective case study on two triazole molecules from industry. Focusing on the estrogen receptor (ER), an important target for endocrine disruption, we further explored whether in silico predictions can help to pre-select compounds for in vitro experiments. Starting from nine newly discovered ER-active compounds (identified with the recently developed E-Morph Screen ER assay), it was prospectively shown that similarity search and CP models can help to increase the hit rate of in vitro screens, enabling fast and efficient identification of novel endocrine disruptors. In the studies described above, CP was used because it outputs valid confidence estimates and guarantees pre-defined error rates (under the exchangeability assumption). Moreover, because CP allows class-wise calibration, data imbalances are usually handled well. The potential of CP was further investigated in this thesis for the generation of bioactivity descriptors and for mitigating data drift effects. The ChemBioSim project addressed the challenge of predicting in vivo toxicological effects by informing CP models with bioactivity descriptors originating from in vitro data. Compared to chemical descriptors, bioactivity descriptors may provide more mechanistic information and help to better capture complex in vivo outcomes. To avoid in vitro testing of every query molecule, p-values returned by CP models trained on in vitro datasets served as bioactivity descriptors. For the investigated MNT and cardiotoxicity endpoints, in vivo toxicity prediction could be improved by using bioactivity (instead of chemical) descriptors. The CP framework is designed to yield valid predictions, provided that training and test sets are exchangeable. This assumption is not always fulfilled; data drifts may occur, e.g. when the chemical space or assay conditions change.
To mitigate the effects of data drifts, we developed a recalibration strategy in which the calibration set is exchanged for data closer to the test data. The strategy, developed on the Tox21 data, was further analysed for temporal data drifts using ChEMBL data and for differences between public and proprietary data. In most cases, recalibration restored validity, a prerequisite for model applicability. Besides applications of computational toxicity prediction methods and CP, this thesis also discusses general aspects of data and the applicability domain in the context of in silico toxicology. While regulatory agencies still require animal studies, the computational strategies discussed in this work aim to foster the reliability of predictions and the applicability of models, to ultimately reduce animal testing.
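The bioactivity-descriptor idea from the ChemBioSim paragraph can be sketched as follows. This is a hypothetical sketch, not the thesis code: the helper names, the random-forest choice, and the use of 1 - P(label) as nonconformity score are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cp_p_values(model, X_cal, y_cal, X_new, label=1):
    """Class-conditional (Mondrian) conformal p-values for `label`,
    using 1 - P(label) as the nonconformity score."""
    cal_nc = 1 - model.predict_proba(X_cal[y_cal == label])[:, label]
    new_nc = 1 - model.predict_proba(X_new)[:, label]
    return np.array([(np.sum(cal_nc >= s) + 1) / (len(cal_nc) + 1)
                     for s in new_nc])

def bioactivity_descriptors(in_vitro_models, X_new):
    """One descriptor column per in vitro assay model: the CP p-value of
    each new compound under that assay's model and calibration data.
    `in_vitro_models` is a list of (model, X_cal, y_cal) triples."""
    return np.column_stack([cp_p_values(m, Xc, yc, X_new)
                            for m, Xc, yc in in_vitro_models])

# Toy example: one in vitro assay model on random descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)
rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X[:40], y[:40])
desc = bioactivity_descriptors([(rf, X[40:50], y[40:50])], X[50:])
```

The resulting p-value columns can then replace (or extend) chemical descriptors when training the downstream in vivo CP model, so that no query molecule needs to be tested in vitro first.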

    Assessing the calibration in toxicological in vitro models with conformal prediction

    Machine learning methods are widely used in drug discovery and toxicity prediction. While showing overall good performance in cross-validation studies, their predictive power often drops in cases where the query samples have drifted away from the descriptor space of the training data. Thus, the assumption for applying machine learning algorithms, that training and test data stem from the same distribution, might not always be fulfilled. In this work, conformal prediction is used to assess the calibration of the models. Deviations from the expected error rate may indicate that training and test data originate from different distributions. Exemplified on the Tox21 datasets, composed of the chronologically released Tox21Train, Tox21Test, and Tox21Score subsets, we observed that while internally valid models could be trained using cross-validation on Tox21Train, predictions on the external Tox21Score data resulted in higher error rates than expected. To improve the predictions on the external sets, a strategy that exchanges the calibration set with more recent data, such as Tox21Test, was successfully introduced. We conclude that conformal prediction can be used to diagnose data drifts and other issues related to model calibration. The proposed improvement strategy, which exchanges only the calibration data, is convenient as it does not require retraining of the underlying model. (Note: the last two authors share last authorship.)
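The diagnosis step described in this abstract, comparing the observed error rate on held-out data against the expected significance level, can be sketched as follows (a minimal sketch with hypothetical helper names, not the published study's code):

```python
import numpy as np

def p_value(cal_nc, s):
    """Conformal p-value of a nonconformity score s under calibration
    scores cal_nc."""
    cal_nc = np.asarray(cal_nc, dtype=float)
    return (np.sum(cal_nc >= s) + 1) / (len(cal_nc) + 1)

def empirical_error_rate(cal_nc, true_label_nc, eps):
    """Fraction of test compounds whose TRUE label is rejected
    (p-value <= eps).  For a calibrated model this stays near eps;
    a much larger value signals a data drift."""
    return float(np.mean([p_value(cal_nc, s) <= eps
                          for s in true_label_nc]))
```

If the rate on the newest subset far exceeds eps, the recalibration strategy above swaps `cal_nc` for scores computed on more recent data (e.g. Tox21Test) and re-checks, leaving the trained model untouched.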

    Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data

    Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Models can be integrated in a conformal prediction (CP) framework, which adds a calibration step to estimate the confidence of the predictions. CP models present the advantage of ensuring a predefined error rate under the assumption that test and calibration set are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption might not be fulfilled and the models are not guaranteed to be valid. In this study, the performance of internally valid CP models when applied to either newer time-split data or to external data was evaluated. In detail, temporal data drifts were analysed based on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly available data and applied to proprietary data were investigated for the liver toxicity and MNT in vivo endpoints. In most cases, a drastic decrease in the validity of the models was observed when they were applied to the time-split or external (holdout) test sets. To overcome the decrease in model validity, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved the validity, restoring it completely to its expected value in many cases. The restored validity is the first prerequisite for applying the CP models with confidence. However, the increased validity comes at the cost of a decrease in model efficiency, as more predictions are identified as inconclusive. This study presents a strategy to recalibrate CP models to mitigate the effects of data drifts. Updating the calibration sets without having to retrain the model has proven to be a useful approach to restore the validity of most models.
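The validity/efficiency trade-off described above can be made concrete for the binary case. The sketch below is illustrative (hypothetical helper names): a label enters the prediction set when its p-value exceeds the significance level; validity counts sets containing the true label, while efficiency counts conclusive single-label sets.

```python
import numpy as np

def prediction_sets(p0, p1, eps):
    """Binary CP prediction sets: include each label whose p-value > eps.
    A two-label set is an inconclusive ('both') prediction."""
    return [tuple(label for label, p in ((0, a), (1, b)) if p > eps)
            for a, b in zip(p0, p1)]

def validity_efficiency(sets, y_true):
    """Validity: fraction of sets containing the true label.
    Efficiency: fraction of conclusive single-label sets."""
    validity = float(np.mean([y in s for s, y in zip(sets, y_true)]))
    efficiency = float(np.mean([len(s) == 1 for s in sets]))
    return validity, efficiency

# Example: the third compound is inconclusive at eps = 0.1.
sets = prediction_sets([0.8, 0.05, 0.5], [0.05, 0.9, 0.5], eps=0.1)
```

After recalibration, validity is typically restored to about 1 - eps, but more compounds may fall into the two-label set, which is exactly the efficiency cost reported in the abstract.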

    A Machine Learning Framework to Improve Rat Clearance Predictions and Inform Physiologically Based Pharmacokinetic Modeling

    During drug discovery and development, achieving appropriate pharmacokinetics is key to establishing the efficacy and safety of new drugs. Physiologically based pharmacokinetic (PBPK) models integrating in vitro-to-in vivo extrapolation have become an essential in silico tool to achieve this goal. In this context, the most important and probably most challenging pharmacokinetic parameter to estimate is the clearance. Recent work on high-throughput PBPK modeling during drug discovery has shown that a good estimate of the unbound intrinsic clearance (CLint,u) is the key factor for useful PBPK application. In this work, three different machine learning-based strategies were explored to predict the rat CLint,u as the input into PBPK. To this end, in vivo and in vitro data were collected for a total of 2639 proprietary compounds. The strategies were compared to the standard in vitro bottom-up approach. Using the well-stirred liver model to back-calculate the in vivo CLint,u from the in vivo rat clearance and then training a machine learning model on this CLint,u led to more accurate clearance predictions (absolute average fold error (AAFE) of 3.1 in temporal cross-validation) than the bottom-up approach (AAFE 3.6-16, depending on the scaling method) and has the advantage that no experimental in vitro data are needed. However, building a machine learning model on the bias between the back-calculated in vivo CLint,u and the bottom-up scaled in vitro CLint,u also performed well. For example, using unbound hepatocyte scaling, adding the bias prediction improved the AAFE in the temporal cross-validation from 16 for the bottom-up approach alone to 2.9. Similarly, the log Pearson r2 improved from 0.1 to 0.29. Although this approach still requires in vitro measurement of CLint,u, using unbound scaling for the bottom-up step circumvents the need to correct fu,inc with fu,p data. 
While the above-described ML models were built on all data points available per approach, comparison across all approaches could only be performed on a subset, because ca. 75% of the molecules had missing or unquantifiable measurements of the fraction unbound in plasma or of the in vitro unbound intrinsic clearance, or dropped out due to the blood-flow limitation assumed by the well-stirred model. Advantageously, by predicting CLint,u as the input into PBPK, existing workflows can be reused and the prediction of the in vivo clearance and other PK parameters can be improved.
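The back-calculation and the AAFE metric mentioned in the abstract can be sketched as follows. This is a simplified illustration, not the authors' workflow: the well-stirred liver model CL = Q * fu * CLint,u / (Q + fu * CLint,u) is inverted for hepatic blood clearance, and the rat hepatic blood flow value is an assumed, typical literature number, not taken from the paper.

```python
import numpy as np

Q_H = 70.0  # ASSUMED rat hepatic blood flow [mL/min/kg]; values vary by source

def clint_u_from_in_vivo(cl_blood, fu_blood, q_h=Q_H):
    """Back-calculate unbound intrinsic clearance from in vivo blood
    clearance via the inverted well-stirred liver model.  Requires
    cl_blood < q_h (the blood-flow limitation noted in the text)."""
    return q_h * cl_blood / (fu_blood * (q_h - cl_blood))

def aafe(pred, obs):
    """Absolute average fold error: 10 ** mean(|log10(pred / obs)|)."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return float(10 ** np.mean(np.abs(np.log10(pred / obs))))

# Round trip: forward well-stirred model, then back-calculation.
cl_b = Q_H * 0.5 * 100.0 / (Q_H + 0.5 * 100.0)  # CLint,u = 100, fu = 0.5
```

Compounds whose blood clearance approaches the hepatic blood flow cannot be inverted this way, which is one reason for the subset drop-out described above.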

    Strong Impact of Smoking on Multimorbidity and Cardiovascular Risk Among Human Immunodeficiency Virus-Infected Individuals in Comparison With the General Population.

    Background. Although acquired immune deficiency syndrome-associated morbidity has diminished due to excellent viral control, multimorbidity may be increasing among human immunodeficiency virus (HIV)-infected persons compared with the general population. Methods. We assessed the prevalence of comorbidities and multimorbidity in participants of the Swiss HIV Cohort Study (SHCS) compared with the population-based CoLaus study and the primary care-based FIRE (Family Medicine ICPC-Research using Electronic Medical Records) records. The incidence of the respective endpoints was assessed among SHCS and CoLaus participants. Poisson regression models were adjusted for age, sex, body mass index, and smoking. Results. Overall, 74 291 participants contributed data to the prevalence analyses (3230 HIV-infected; 71 061 controls). In CoLaus, FIRE, and the SHCS, multimorbidity was present among 26%, 13%, and 27% of participants, respectively. Compared with nonsmoking individuals from CoLaus, the incidence of cardiovascular disease was elevated among smoking individuals but independent of HIV status (HIV-negative smoking: incidence rate ratio [IRR] = 1.7, 95% confidence interval [CI] = 1.2-2.5; HIV-positive smoking: IRR = 1.7, 95% CI = 1.1-2.6; HIV-positive nonsmoking: IRR = 0.79, 95% CI = 0.44-1.4). Compared with nonsmoking HIV-negative persons, multivariable Poisson regression identified associations of HIV infection with hypertension (nonsmoking: IRR = 1.9, 95% CI = 1.5-2.4; smoking: IRR = 2.0, 95% CI = 1.6-2.4), kidney disease (nonsmoking: IRR = 2.7, 95% CI = 1.9-3.8; smoking: IRR = 2.6, 95% CI = 1.9-3.6), and liver disease (nonsmoking: IRR = 1.8, 95% CI = 1.4-2.4; smoking: IRR = 1.7, 95% CI = 1.4-2.2). No evidence was found for an association of HIV infection or smoking with diabetes mellitus. Conclusions. Multimorbidity is more prevalent and more frequently incident in HIV-positive compared with HIV-negative individuals. Smoking, but not HIV status, has a strong impact on cardiovascular risk and multimorbidity.