
    Prediction of aqueous intrinsic solubility of druglike molecules using Random Forest regression trained with Wiki-pS0 database

    The accurate prediction of the solubility of drugs is still problematic. It was thought for a long time that the shortfalls were due to the lack of high-quality solubility data from the chemical space of drugs. This study considers the quality of solubility data, particularly of ionizable drugs. A database is described, comprising 6355 entries of intrinsic solubility for 3014 different molecules, drawing on 1325 citations. In an earlier publication, many factors affecting the quality of the measurement had been discussed, and suggestions were offered to improve ways of extracting more reliable information from legacy data. Many of those suggestions have been implemented in this study. By correcting solubility for ionization (i.e., deriving the intrinsic solubility, S0) and by normalizing temperature (by transforming measurements performed in the range 10–50 °C to 25 °C), the average interlaboratory reproducibility can now be estimated at 0.17 log unit. Empirical methods to predict solubility have at best hovered around a root mean square error (RMSE) of 0.6 log unit. Three prediction methods are compared here: (a) Yalkowsky’s general solubility equation (GSE), (b) the Abraham solvation equation (ABSOLV), and (c) Random Forest regression (RFR), a statistical machine learning method. The latter two were trained using the new database. The RFR method outperforms the other two models, as anticipated. However, the ability to predict the solubility of drugs to the level of the quality of the data is still out of reach. Data quality is not the limiting factor in prediction, and the statistical machine learning methodologies are probably up to the task. Possibly what’s missing are solubility data from a few sparsely covered regions of the chemical space of drugs (particularly of research compounds). Also, new descriptors that can better differentiate the factors affecting solubility between molecules could be critical for narrowing the gap between the accuracy of the prediction models and that of the experimental data.
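As a point of reference for the comparison above, Yalkowsky's general solubility equation can be evaluated directly from a melting point and log P. The following minimal sketch uses the standard published form of the GSE; the input values are illustrative, not taken from the study:

```python
def gse_log_s0(mp_celsius: float, log_p: float) -> float:
    """Yalkowsky's General Solubility Equation:
    log S0 = 0.5 - 0.01*(mp - 25) - log P
    where mp is the melting point in degrees Celsius and S0 is the
    intrinsic aqueous solubility in mol/L.
    """
    return 0.5 - 0.01 * (mp_celsius - 25.0) - log_p

# Illustrative druglike solid (values are hypothetical): mp 150 C, log P 2.5
print(gse_log_s0(150.0, 2.5))  # -3.25, i.e. about 0.6 mM
```

A simple two-parameter model like this is the baseline that trained methods such as ABSOLV and RFR are measured against.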

    Three machine learning models for the 2019 Solubility Challenge

    We describe three machine learning models submitted to the 2019 Solubility Challenge. All are founded on tree-like classifiers, with one model based on Random Forest and another on the related Extra Trees algorithm. The third model is a consensus predictor combining the former two with a Bagging classifier. We call this consensus classifier Vox Machinarum, and here discuss how it benefits from the Wisdom of Crowds. On the first 2019 Solubility Challenge test set of 100 low-variance intrinsic aqueous solubilities, Extra Trees is our best classifier. On the other, a high-variance set of 32 molecules, we find that Vox Machinarum and Random Forest both perform a little better than Extra Trees, and almost equally well to one another. We also compare the gold-standard solubilities from the 2019 Solubility Challenge with a set of literature-based solubilities for most of the same compounds.
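The consensus idea described here, combining the votes of Random Forest, Extra Trees, and Bagging classifiers, can be sketched with scikit-learn's soft-voting ensemble. The descriptor matrix and class labels below are random placeholders, not the Challenge data, and the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier, VotingClassifier)

# Toy stand-in data: 200 "molecules" with 10 descriptors each, and a
# binary solubility class derived from the first two descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Soft voting averages the predicted class probabilities of the three
# tree-based ensembles, in the spirit of a "Wisdom of Crowds" consensus.
consensus = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=100, random_state=0)),
        ("bag", BaggingClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)
consensus.fit(X, y)
print(consensus.predict(X[:5]))
```

In practice the consensus tends to match its best member on easy sets and to be more robust than any single member on noisy, high-variance sets, which mirrors the behavior reported above.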

    Solvation thermodynamics of organic molecules by the molecular integral equation theory : approaching chemical accuracy

    The integral equation theory (IET) of molecular liquids has been an active area of academic research in theoretical and computational physical chemistry for over 40 years because it provides a consistent theoretical framework to describe the structural and thermodynamic properties of liquid-phase solutions. The theory can describe pure and mixed solvent systems (including anisotropic and nonequilibrium systems) and has already been used for theoretical studies of a vast range of problems in chemical physics/physical chemistry, molecular biology, colloids, soft matter, and electrochemistry. A considerable advantage of IET is that it can be used to study specific solute–solvent interactions, unlike continuum solvent models, yet it requires considerably less computational expense than explicit solvent simulations.

    Can small drugs predict the intrinsic aqueous solubility of ‘beyond Rule of 5’ big drugs?

    The aim of the study was to explore to what extent small molecules (mostly from the Rule of 5 chemical space) can be used to predict the intrinsic aqueous solubility, S0, of big molecules from beyond the Rule of 5 (bRo5) space. It was demonstrated that the General Solubility Equation (GSE) and the Abraham Solvation Equation (ABSOLV) underpredict solubility in systematic but slight ways. The Random Forest regression (RFR) method predicts solubility more accurately, albeit in the manner of a ‘black box.’ It was discovered that the GSE improves considerably in the case of big molecules when the coefficient of the log P term (octanol-water partition coefficient) in the equation is set to -0.4 instead of the traditional -1 value. The traditional GSE underpredicts solubility for molecules with experimental S0 < 50 ”M. In contrast, the ABSOLV equation (trained with small molecules) underpredicts the solubility of big molecules in all cases tested. It was found that the errors in the ABSOLV-predicted solubilities of big molecules correlate linearly with the number of rotatable bonds, which suggests that flexibility may be an important factor in differentiating the solubility of small molecules from that of big ones. Notably, most of the 31 big molecules considered have a negative enthalpy of solution: these big molecules become less soluble with increasing temperature, which is compatible with the ‘molecular chameleon’ behavior associated with intramolecular hydrogen bonding. The X‑ray structures of many of these molecules reveal void spaces in their crystal lattices large enough to accommodate many water molecules when such solids are in contact with aqueous media. The water sorbed into crystals suspended in aqueous solution may enhance solubility by way of intra-lattice solute-water interactions involving the numerous H‑bond acceptors in the big molecules studied. A ‘Solubility Enhancement–Big Molecules’ index was defined, which embodies many of the above findings.
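The coefficient change reported above is easy to state in code. This sketch parameterizes the GSE's log P coefficient so the traditional form (-1) and the big-molecule adjustment (-0.4) can be compared; the example molecule's melting point and log P are hypothetical, chosen only to illustrate the direction of the effect:

```python
def gse(mp_c: float, log_p: float, logp_coeff: float = -1.0) -> float:
    """General Solubility Equation with an adjustable log P coefficient.
    logp_coeff = -1.0 is the traditional GSE; the study reports that
    -0.4 works considerably better for big (bRo5) molecules.
    """
    return 0.5 - 0.01 * (mp_c - 25.0) + logp_coeff * log_p

# Hypothetical bRo5-like molecule: mp 180 C, log P 5.0
traditional = gse(180.0, 5.0)                 # log S0 = -6.05
adjusted = gse(180.0, 5.0, logp_coeff=-0.4)   # log S0 = -3.05
```

Because the adjusted coefficient penalizes lipophilicity less, it predicts a higher solubility for the same molecule, which is consistent with the finding that the traditional GSE systematically underpredicts S0 for big molecules.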

    ADME prediction with KNIME: In silico aqueous solubility consensus model based on supervised recursive random forest approaches

    In-silico prediction of aqueous solubility plays an important role during the drug discovery and development processes. For many years, the limited performance of in-silico solubility models has been attributed to the lack of high-quality solubility data for pharmaceutical molecules. However, some studies suggest that the poor accuracy of solubility prediction is not related to the quality of the experimental data and that more precise methodologies (algorithms and/or sets of descriptors) are required for predicting aqueous solubility for pharmaceutical molecules. In this study a large and diverse database was generated with aqueous solubility values collected from two public sources; two new recursive machine-learning approaches were developed for data cleaning and variable selection, and a consensus model based on regression and classification algorithms was created. The modeling protocol, which includes the curation of chemical and experimental data, was implemented in KNIME, with the aim of obtaining an automated workflow for the prediction of new databases. Finally, we compared several methods and models available in the literature with our consensus model, showing results comparable to or even outperforming previously published models.

    Blinded Predictions and Post Hoc Analysis of the Second Solubility Challenge Data: Exploring Training Data and Feature Set Selection for Machine and Deep Learning Models

    Accurate methods to predict solubility from molecular structure are highly sought after in the chemical sciences. To assess the state of the art, the American Chemical Society organized a "Second Solubility Challenge" in 2019, in which competitors were invited to submit blinded predictions of the solubilities of 132 drug-like molecules. In the first part of this article, we describe the development of two models that were submitted to the Blind Challenge in 2019 but which have not previously been reported. These models were based on computationally inexpensive molecular descriptors and traditional machine learning algorithms and were trained on a relatively small data set of 300 molecules. In the second part of the article, to test the hypothesis that predictions would improve with more advanced algorithms and higher volumes of training data, we compare these original predictions with those made after the deadline using deep learning models trained on larger solubility data sets consisting of 2999 and 5697 molecules. The results show that there are several algorithms that are able to obtain near state-of-the-art performance on the solubility challenge data sets, with the best model, a graph convolutional neural network, resulting in an RMSE of 0.86 log units. Critical analysis of the models reveals systematic differences between the performance of models using certain feature sets and training data sets. The results suggest that careful selection of high-quality training data from relevant regions of chemical space is critical for prediction accuracy but that other methodological issues remain problematic for machine learning solubility models, such as the difficulty of modeling complex chemical spaces from sparse training data sets.

    Predicting small molecules solubilities on endpoint devices using deep ensemble neural networks

    Aqueous solubility is a valuable yet challenging property to predict. Computing solubility using first-principles methods requires accounting for the competing effects of entropy and enthalpy, resulting in long computations for relatively poor accuracy. Data-driven approaches, such as deep learning, offer improved accuracy and computational efficiency but typically lack uncertainty quantification. Additionally, ease of use remains a concern for any computational technique, resulting in the sustained popularity of group-based contribution methods. In this work, we addressed these problems with a deep learning model with predictive uncertainty that runs on a static website (without a server). This approach moves computing needs onto the website visitor without requiring installation, removing the need to pay for and maintain servers. Our model achieves satisfactory results in solubility prediction. Furthermore, we demonstrate how to create molecular property prediction models that balance uncertainty and ease of use. The code is available at https://github.com/ur-whitelab/mol.dev, and the model is usable at https://mol.dev.
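The deep-ensemble idea behind the predictive uncertainty described above can be illustrated without any deep learning framework: several independently trained models make predictions, and the spread of those predictions serves as the uncertainty estimate. This toy sketch substitutes randomly perturbed linear predictors for the trained neural networks:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_model():
    """Stand-in for one independently trained network: a noisy linear
    predictor. In the paper's setting each member would be a deep net
    trained from a different random initialization."""
    w = 1.0 + 0.1 * rng.normal()
    b = 0.05 * rng.normal()
    return lambda x: w * x + b

ensemble = [make_model() for _ in range(8)]

def predict_with_uncertainty(x: float):
    """Return the ensemble mean (the prediction) and the standard
    deviation across members (the predictive uncertainty)."""
    preds = np.array([model(x) for model in ensemble])
    return preds.mean(), preds.std()

mean, std = predict_with_uncertainty(2.0)
```

The same mean-and-spread combination applies whatever the member models are, which is why deep ensembles are a popular lightweight route to uncertainty quantification.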

    Solubility Temperature Dependence Predicted from 2D Structure

    The objective of the study was to find a computational procedure to normalize solubility data determined at various temperatures (e.g., 10–50 °C) to values at a “reference” temperature (e.g., 25 °C). A simple procedure was devised to predict enthalpies of solution, ΔHsol, from which the temperature dependence of the intrinsic (uncharged form) solubility, log S0, could be calculated. As dependent variables, values of ΔHsol at 25 °C were subjected to multiple linear regression (MLR) analysis, using melting points (mp) and Abraham solvation descriptors. Also, the enthalpy data were subjected to random forest regression (RFR) and recursive partition tree (RPT) analyses. A total of 626 molecules were examined, drawing on 2040 published solubility values measured at various temperatures, along with 77 direct calorimetric measurements. The three different prediction methods (RFR, RPT, MLR) all indicated that the estimated standard deviations in the enthalpy data are 11–15 kJ mol-1, which is concordant with the 10 kJ mol-1 propagation error estimated from solubility measurements (assuming 0.05 log S errors), and consistent with the 7 kJ mol-1 average reproducibility in enthalpy values from interlaboratory replicates. According to the MLR model, higher values of mp, H-bond acidity, polarizability/dipolarity, and dispersion forces relate to more positive (endothermic) enthalpy values. However, molecules that are large and have high H-bond basicity are likely to possess negative (exothermic) enthalpies of solution. With log S0 values normalized to 25 °C, it was shown that the interlaboratory average standard deviations in solubility measurement are reduced to 0.06–0.17 log unit, with higher errors for the least-soluble druglike molecules. Such improvements in data mining are expected to contribute to more reliable in silico prediction models of solubility for use in drug discovery.
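The temperature normalization described above follows from the van't Hoff relation, ln S = -ΔHsol/(RT) + const. A minimal sketch, assuming ΔHsol is constant over the 10–50 °C range (the example inputs are illustrative, not from the study):

```python
import math

R = 8.314  # gas constant, J mol^-1 K^-1

def normalize_log_s(log_s_t: float, t_celsius: float, dh_sol_kj: float,
                    t_ref_celsius: float = 25.0) -> float:
    """Shift a log S value measured at t_celsius to t_ref_celsius using
    the van't Hoff relation:
        log S(T2) = log S(T1) + dHsol/(R ln 10) * (1/T1 - 1/T2)
    dh_sol_kj is the enthalpy of solution in kJ/mol.
    """
    t1 = t_celsius + 273.15
    t2 = t_ref_celsius + 273.15
    dh = dh_sol_kj * 1000.0  # kJ/mol -> J/mol
    return log_s_t + (dh / (math.log(10) * R)) * (1.0 / t1 - 1.0 / t2)

# Endothermic example (dHsol = +20 kJ/mol, a typical magnitude per the
# error estimates above): a value measured at 37 C, referred back to 25 C,
# is adjusted downward, since endothermic dissolution means solubility
# falls as temperature falls.
print(normalize_log_s(-4.0, 37.0, 20.0))
```

A measurement already at 25 °C passes through unchanged, so the function can be applied uniformly to a mixed-temperature data set.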

    The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS

    BACKGROUND: Melting point (MP) is an important property with regard to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure-activity relationship studies. Success in this area of research critically depends on the availability of high-quality MP data as well as accurate chemical structure representations in order to develop models. Currently, available datasets for MP prediction have been limited to around 50k molecules, while far more data are routinely generated following the synthesis of novel materials. Significant amounts of MP data are freely available within the patent literature and, if they were available in the appropriate form, could potentially be used to develop predictive models. RESULTS: We have developed a pipeline for the automated extraction and annotation of chemical data from published patents. Almost 300,000 data points have been collected and used to develop models to predict melting and pyrolysis (decomposition) points using tools available on the OCHEM modeling platform (http://ochem.eu). A number of technical challenges were simultaneously solved to develop models based on these data. These included the handling of sparse data matrices with >200,000,000,000 entries and parallel calculations using 32 × 6 cores per task with 13 descriptor sets totaling more than 700,000 descriptors. We showed that models developed using data collected from patents had similar or better prediction accuracy compared to the highly curated data used in previous publications. Data for chemicals that decomposed rather than melted were separated from those for compounds that underwent a normal melting transition, and models for both pyrolysis and MPs were developed. The accuracy of the consensus MP models for molecules from the drug-like region of chemical space was similar to their estimated experimental accuracy, 32 °C. Last but not least, important structural features related to the pyrolysis of chemicals were identified, and a model to predict whether a compound will decompose instead of melting was developed. CONCLUSIONS: We have shown that automated tools for the analysis of chemical information have reached a mature stage, allowing for the extraction and collection of high-quality data to enable the development of structure-activity relationship models. The developed models and data are publicly available at http://ochem.eu/article/99826