71 research outputs found
Prediction of aqueous intrinsic solubility of druglike molecules using Random Forest regression trained with Wiki-pS0 database
The accurate prediction of solubility of drugs is still problematic. It was thought for a long time that shortfalls had been due to the lack of high-quality solubility data from the chemical space of drugs. This study considers the quality of solubility data, particularly of ionizable drugs. A database is described, comprising 6355 entries of intrinsic solubility for 3014 different molecules, drawing on 1325 citations. In an earlier publication, many factors affecting the quality of the measurement had been discussed, and suggestions were offered to improve ways of extracting more reliable information from legacy data. Many of the suggestions have been implemented in this study. By correcting solubility for ionization (i.e., deriving intrinsic solubility, S0) and by normalizing temperature (by transforming measurements performed in the range 10-50 °C to 25 °C), it can now be estimated that the average interlaboratory reproducibility is 0.17 log unit. Empirical methods to predict solubility at best have hovered around the root mean square error (RMSE) of 0.6 log unit. Three prediction methods are compared here: (a) Yalkowsky's general solubility equation (GSE), (b) Abraham solvation equation (ABSOLV), and (c) Random Forest regression (RFR) statistical machine learning. The latter two methods were trained using the new database. The RFR method outperforms the other two models, as anticipated. However, the ability to predict the solubility of drugs to the level of the quality of data is still out of reach. The data quality is not the limiting factor in prediction. The statistical machine learning methodologies are probably up to the task. Possibly what's missing are solubility data from a few sparsely covered regions of the chemical space of drugs (particularly of research compounds).
Also, new descriptors which can better differentiate the factors affecting solubility between molecules could be critical for narrowing the gap between the accuracy of the prediction models and that of the experimental data
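The ionization correction mentioned in this abstract can be illustrated for the simplest case. The sketch below assumes a monoprotic acid or base that follows the Henderson-Hasselbalch relationship, with no salt precipitation or aggregation. The function name and the example values are hypothetical, and this is not presented as the procedure actually used to curate the database.

```python
import math


def intrinsic_log_s(log_s: float, ph: float, pka: float, kind: str) -> float:
    """Estimate log S0 from a total solubility (log S) measured at a given pH.

    Assumes a monoprotic compound obeying Henderson-Hasselbalch:
      acid: log S = log S0 + log10(1 + 10**(pH - pKa))
      base: log S = log S0 + log10(1 + 10**(pKa - pH))
    """
    if kind == "acid":
        exponent = ph - pka
    elif kind == "base":
        exponent = pka - ph
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    # Subtract the contribution of the ionized form to the total solubility.
    return log_s - math.log10(1.0 + 10.0 ** exponent)


# Hypothetical example: an acid with pKa 4.0 whose solubility was measured at pH 6.0.
print(round(intrinsic_log_s(-2.0, 6.0, 4.0, "acid"), 3))
```

Two pH units above the pKa of an acid, the ionized form dominates, so the intrinsic value lies roughly two log units below the measured one.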
Multi-lab intrinsic solubility measurement reproducibility in CheqSol and shake-flask methods
This commentary compares 233 CheqSol intrinsic solubility values (log S0) reported in the Wiki-pS0 database for 145 different druglike molecules to the 838 log S0 values determined mostly by the saturation shake-flask (SSF) method for 124 of the molecules from the CheqSol set. The range of log S0 spans from -1.0 to -10.6 (log molar units), averaging at -3.8. The correlation plot between the two methods indicates r2 = 0.96, RMSE = 0.34 log unit, and a slight bias of -0.07 log unit. The average interlaboratory standard deviation (SDi) is slightly better for the CheqSol set than that of the SSF set: SDiCS = 0.15 and SDiSSF = 0.24. The intralaboratory errors reported in the CheqSol method (0.05 log) need to be multiplied by a factor of 3 to match the expected interlaboratory errors for the method. The scale factor, in part, relates to the hidden systematic errors in the single-lab values. It is expected that improved standardizations in the "gold standard" SSF method, as suggested in the recent "white paper" on solubility measurement methodology, should bring the SDi of both methods to about 0.15 log unit. The multi-lab averaged log S0 (and the corresponding SDi) values could be helpful additions to existing training-set molecules used to predict the intrinsic solubility of drugs and druglike molecules
Do you know your r2?
The prediction of solubility of drugs usually calls on the use of several open-source/commercially-available computer programs in the various calculation steps. Popular statistics to indicate the strength of the prediction model include the coefficient of determination (r2), Pearson's linear correlation coefficient (rPearson), and the root-mean-square error (RMSE), among many others. When a program calculates these statistics, slightly different definitions may be used. This commentary briefly reviews the definitions of three types of r2 and RMSE statistics (model validation, bias compensation, and Pearson) and how systematic errors due to shortcomings in solubility prediction models can be differently indicated by the choice of statistical indices. The indices we have employed in recently published papers on the prediction of solubility of druglike molecules were unclear, especially in cases of drugs from "beyond the Rule of 5" chemical space, as simple prediction models showed distinctive "bias-tilt" systematic-type scatter
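How the choice of definition changes the reported statistic can be seen in a small sketch. The three variants below are assumptions about what "model validation", "bias compensation", and "Pearson" mean (residuals about the identity line, residuals after removing the mean offset, and the squared linear correlation). They are not taken from the commentary itself, and the data are made up for illustration.

```python
import math


def fit_statistics(obs, pred):
    """Return three flavors of r2/RMSE for observed vs. predicted values."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_pred = sum(pred) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)

    # Model validation: residuals measured against the identity line.
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))

    # Bias compensation: remove the mean offset before computing residuals.
    bias = mean_pred - mean_obs
    ss_res_bc = sum((o - (p - bias)) ** 2 for o, p in zip(obs, pred))

    # Pearson: squared linear correlation, insensitive to offset and slope.
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    ss_pred = sum((p - mean_pred) ** 2 for p in pred)

    return {
        "bias": bias,
        "r2_validation": 1.0 - ss_res / ss_tot,
        "rmse_validation": math.sqrt(ss_res / n),
        "r2_bias_compensated": 1.0 - ss_res_bc / ss_tot,
        "rmse_bias_compensated": math.sqrt(ss_res_bc / n),
        "r2_pearson": cov ** 2 / (ss_tot * ss_pred),
    }


# Made-up log S0 values: every prediction is too low by 0.5 log unit.
observed = [-2.0, -3.0, -4.0, -5.0]
predicted = [-2.5, -3.5, -4.5, -5.5]
for name, value in fit_statistics(observed, predicted).items():
    print(f"{name}: {value:.3f}")
```

With a constant offset like this, the Pearson and bias-compensated statistics report a perfect fit, while the validation statistics expose the systematic error.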
Solubility Temperature Dependence Predicted from 2D Structure
The objective of the study was to find a computational procedure to normalize solubility data determined at various temperatures (e.g., 10-50 °C) to values at a "reference" temperature (e.g., 25 °C). A simple procedure was devised to predict enthalpies of solution, ΔHsol, from which the temperature dependence of intrinsic (uncharged form) solubility, log S0, could be calculated. As dependent variables, values of ΔHsol at 25 °C were subjected to multiple linear regression (MLR) analysis, using melting points (mp) and Abraham solvation descriptors. Also, the enthalpy data were subjected to random forest regression (RFR) and recursive partition tree (RPT) analyses. A total of 626 molecules were examined, drawing on 2040 published solubility values measured at various temperatures, along with 77 direct calorimetric measurements. The three different prediction methods (RFR, RPT, MLR) all indicated that the estimated standard deviations in the enthalpy data are 11-15 kJ mol-1, which is concordant with the 10 kJ mol-1 propagation error estimated from solubility measurements (assuming 0.05 log S errors), and consistent with the 7 kJ mol-1 average reproducibility in enthalpy values from interlaboratory replicates. According to the MLR model, higher values of mp, H-bond acidity, polarizability/dipolarity, and dispersion forces relate to more positive (endothermic) enthalpy values. However, molecules that are large and have high H-bond basicity are likely to possess negative (exothermic) enthalpies of solution. With log S0 values normalized to 25 °C, it was shown that the interlaboratory average standard deviations in solubility measurement are reduced to 0.06-0.17 log unit, with higher errors for the least-soluble druglike molecules. Such improvements in data mining are expected to contribute to more reliable in silico prediction models of solubility for use in drug discovery
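The temperature normalization described in this abstract can be sketched with the integrated van't Hoff relationship. This assumes that ΔHsol is constant over the temperature interval; the function name and the example numbers are hypothetical, and the study's own procedure may differ in detail.

```python
import math

R = 8.314462618  # gas constant, J mol^-1 K^-1


def normalize_log_s(log_s: float, temp_c: float, dh_sol_kj: float,
                    ref_c: float = 25.0) -> float:
    """Shift log S measured at temp_c to the reference temperature ref_c.

    Integrated van't Hoff form (constant enthalpy of solution assumed):
      log S(T_ref) = log S(T) - dH / (R ln 10) * (1/T_ref - 1/T)
    """
    t = temp_c + 273.15
    t_ref = ref_c + 273.15
    dh = dh_sol_kj * 1000.0  # kJ mol^-1 -> J mol^-1
    return log_s - dh / (R * math.log(10.0)) * (1.0 / t_ref - 1.0 / t)


# Hypothetical example: measured at 37 °C with an endothermic dH of +30 kJ/mol.
print(round(normalize_log_s(-4.00, 37.0, 30.0), 2))
```

For an endothermic compound, a value measured above 25 °C is shifted downward. A negative enthalpy of solution reverses the direction of the correction.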
Can small drugs predict the intrinsic aqueous solubility of "beyond Rule of 5" big drugs?
The aim of the study was to explore to what extent small molecules (mostly from the Rule of 5 chemical space) can be used to predict the intrinsic aqueous solubility, S0, of big molecules from beyond the Rule of 5 (bRo5) space. It was demonstrated that the General Solubility Equation (GSE) and the Abraham Solvation Equation (ABSOLV) underpredict solubility in systematic but slightly different ways. The Random Forest regression (RFR) method predicts solubility more accurately, albeit in the manner of a "black box." It was discovered that the GSE improves considerably in the case of big molecules when the coefficient of the log P term (octanol-water partition coefficient) in the equation is set to -0.4 instead of the traditional -1 value. The traditional GSE underpredicts solubility for molecules with experimental S0 < 50 µM. In contrast, the ABSOLV equation (trained with small molecules) underpredicts the solubility of big molecules in all cases tested. It was found that the errors in the ABSOLV-predicted solubilities of big molecules correlate linearly with the number of rotatable bonds, which suggests that flexibility may be an important factor in differentiating solubility of small from big molecules. Notably, most of the 31 big molecules considered have negative enthalpy of solution: these big molecules become less soluble with increasing temperature, which is compatible with "molecular chameleon" behavior associated with intramolecular hydrogen bonding. The X-ray structures of many of these molecules reveal void spaces in their crystal lattices large enough to accommodate many water molecules when such solids are in contact with aqueous media. The water sorbed into crystals suspended in aqueous solution may enhance solubility by way of intra-lattice solute-water interactions involving the numerous H-bond acceptors in the big molecules studied. A "Solubility Enhancement-Big Molecules" index was defined, which embodies many of the above findings.
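The effect of changing the log P coefficient can be illustrated with the commonly cited form of the GSE, log S0 = 0.5 - 0.01 (mp - 25) - log P. The sketch below assumes that only the log P coefficient is changed to -0.4 while the other two constants keep their traditional values, which the abstract does not state. The function name and the example compound are hypothetical.

```python
def gse_log_s0(mp_c: float, log_p: float, logp_coeff: float = -1.0) -> float:
    """General Solubility Equation with an adjustable log P coefficient.

    mp_c is the melting point in °C; the melting-point term is taken as zero
    for compounds melting at or below 25 °C.
    """
    return 0.5 - 0.01 * max(mp_c - 25.0, 0.0) + logp_coeff * log_p


# Hypothetical big molecule: mp 125 °C, log P 3.0.
traditional = gse_log_s0(125.0, 3.0)                # coefficient -1
modified = gse_log_s0(125.0, 3.0, logp_coeff=-0.4)  # coefficient -0.4
print(round(traditional, 2), round(modified, 2))
```

For lipophilic compounds the smaller coefficient raises the predicted solubility, which is in line with the abstract's observation that the traditional equation underpredicts for big molecules.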
Mechanistically transparent models for predicting aqueous solubility of rigid, slightly flexible, and very flexible drugs (MW<2000): Accuracy near that of random forest regression. Alex Avdeef
Yalkowsky's General Solubility Equation (GSE), with its three fixed constants, is popular and easy to apply, but is not very accurate for polar, zwitterionic, or flexible molecules. This review examines the findings of a series of studies, where we have sought to come up with a better prediction model, by comparing the performances of the GSE to Abraham's Solvation Equation (ABSOLV), and Random Forest regression (RFR) machine-learning (ML) method. Large, well-curated aqueous intrinsic solubility databases are available. However, drugs may be sparsely distributed in chemical space, concentrated in clusters. Even a large database might overlook some regions. Test compounds from under-represented portions of space may be poorly predicted, as might be the case with the "loose" set of 32 drugs in the Second Solubility Challenge (2020). There appears to be still a need for better coverage of drug space. Increasingly, current trends in predictions of solubility use calculated input descriptors, which may be an advantage for exploring properties of molecules yet to be synthesized. The risk may be that overall prediction approaches might be based on accumulated uncertainty. The increasing use of ML/AI methods can lead to accurate predictions, but such predictions may not readily suggest the strategies to pursue in selecting yet-to-be-synthesized compounds. Based on our latest findings, we recommend predictions based on both "grouped" ABSOLV(GRP) and "Flexible Acceptor" GSE(Φ,B) models with the provided best-fit parameters, where Φ is the Kier molecular flexibility index and B is the Abraham H-bond acceptor strength. For molecules with Φ < 11, the prudent choice is to pick the Consensus Model, the average of ABSOLV(GRP) and GSE(Φ,B). For more flexible molecules, GSE(Φ,B) is recommended
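The selection rule in the final sentences of this abstract can be written as a small helper. The best-fit parameters of ABSOLV(GRP) and GSE(Φ,B) are not given here, so the sketch takes the two model predictions as inputs. The function name and the example values are hypothetical, and treating Φ = 11 exactly as "more flexible" is an assumption.

```python
def recommended_log_s0(phi: float, absolv_grp: float, gse_phi_b: float,
                       phi_cutoff: float = 11.0) -> float:
    """Pick the recommended log S0 prediction from two model outputs.

    phi         Kier molecular flexibility index of the molecule
    absolv_grp  log S0 predicted by the ABSOLV(GRP) model (computed elsewhere)
    gse_phi_b   log S0 predicted by the GSE(Phi,B) model (computed elsewhere)
    """
    if phi < phi_cutoff:
        # Consensus Model: average of the two predictions.
        return 0.5 * (absolv_grp + gse_phi_b)
    # More flexible molecules: rely on GSE(Phi,B) alone.
    return gse_phi_b


# Hypothetical predictions for a fairly rigid and a very flexible molecule.
print(recommended_log_s0(5.0, -4.0, -5.0))
print(recommended_log_s0(15.0, -4.0, -5.0))
```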
Anomalous Solubility Behavior of Several Acidic Drugs
The "anomalous solubility behavior at higher pH values" of several acidic drugs originally studied by Higuchi et al. in 1953 [1], but hitherto not fully rationalized, has been re-analyzed using a novel solubility-pH analysis computer program, pDISOL-X™. The program internally derives implicit solubility equations, given a set of proposed equilibria and constants (iteratively refined by weighted nonlinear regression), and does not require explicit Henderson-Hasselbalch equations. The re-analyzed original barbital, phenobarbital, oxytetracycline, and sulfathiazole solubility-pH data of Higuchi et al. are consistent with the presence of dimers in saturated solutions. In the case of barbital, phenobarbital and sulfathiazole, the dimers are anionic, reaching peak concentrations near pH 8. However, oxytetracycline indicated a pronounced tendency to form a cationic dimer, peaking near pH 2. Under the conditions of the original study, only barbital indicated a slight tendency to form a salt precipitate at pH > 6.8, with a highly unusual stoichiometry (consistent with a slope of 0.55 in the log S - pH plot): K+ + A2H- + 3HA ⇌ KA5H4(s). Thus the "anomaly" in the Higuchi data can be rationalized by invoking specific aggregated species
Anomalous salting-out, self-association and pKa effects in the practically-insoluble bromothymol blue
Background and Purpose: The widely-used and practically insoluble diprotic acidic dye, bromothymol blue (BTB), is a neutral molecule in strongly acidic aqueous solutions. The Schill (1964) extensive solubility-pH measurement of bromothymol blue in 0.1 and 1.0 M NaCl solutions, with pH adjusted with HCl from 0.0 to 5.4, featured several unusual findings. The data suggest that the solubility of the neutral-form molecule in 1 M NaCl is more than 0.7 log unit lower than the solubility in pure water. This could be considered as uncharacteristically high for a salting-out effect. Also, the study reported two apparent values of pKa1, 1.48 and 1.00, in 0.1 M and 1.0 M NaCl solutions, respectively. The only other measured value found for pKa1 in the literature is -0.66 (Gupta and Cadwallader, 1968). Experimental Approach: It was reasoned that there can be only a single pKa1 for BTB. Also, it was hypothesized that salting-out alone might not account for such a large difference in solubility observed at the two levels of salt. A generalized mass action approach, incorporating activity corrections for charged species using the Stokes-Robinson hydration equation and for neutral species using the Setschenow equation, was selected to analyze the Schill solubility-pH data to seek a rationalization of these unusual results. Key Results: BTB reveals complex speciation chemistry in saturated aqueous solutions which had been poorly understood for many years. The appearance of two different values of pKa1 at different levels of NaCl and the anomalously high value of the empirical salting-out constant could be rationalized to normal values by invoking the formation of a very stable neutral dimer (log K2 = 10.0 ± 0.1 M-1). A "normal" salting-out constant, 0.25 M-1, was then derived. It was also possible to estimate the "self-interaction" constant. The data analysis in the present study critically depended on the pKa1 = -0.66 reported by Gupta and Cadwallader.
Conclusion: A more reasonable salting-out constant and a consistent single value for pKa1 have been determined by considering a self-interacting (aggregation) model involving an uncharged form of the molecule, which is likely a zwitterion, as suggested by literature spectrophotometric studies
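The size of a "normal" salting-out effect can be checked against the Setschenow equation, log(S_water/S_salt) = Ks x C_salt. The sketch below uses the 0.25 M-1 constant and the 1.0 M NaCl concentration quoted in the abstract. The log S value for pure water is a made-up placeholder and the function name is hypothetical.

```python
def setschenow_log_s(log_s_water: float, ks: float, c_salt: float) -> float:
    """Solubility of a neutral species in salt solution (Setschenow equation).

    log S_salt = log S_water - Ks * C_salt, with Ks in M^-1 and C_salt in M.
    """
    return log_s_water - ks * c_salt


# Placeholder value for pure water; Ks and the salt level come from the abstract.
log_s_water = -6.0
log_s_salt = setschenow_log_s(log_s_water, ks=0.25, c_salt=1.0)
print(round(log_s_water - log_s_salt, 2))  # drop expected from salting-out alone
```

A constant of 0.25 M-1 accounts for a drop of only 0.25 log unit at 1 M NaCl, well short of the apparent drop of more than 0.7 log unit, which is why salting-out alone was judged insufficient.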
Phosphate Precipitates and Water-Soluble Aggregates in Re-analyzed Solubility-pH Data of Twenty-five Basic Drugs
The purpose of the study was to assess the stoichiometries of phosphate precipitates and determine the intrinsic solubilities, S0, of 25 basic drugs from their published solubility-pH profiles in the landmark study of Bergström et al. (2004), where 0.15 M phosphate buffer media had been used. A secondary purpose of this study was to attempt to predict phosphate 1:1 and 2:1 solubility products, Ksp, from knowledge of S0. The published data have been re-analyzed using a novel solubility-pH analysis computer program, pDISOL-X™. The program internally derives implicit solubility equations, given a set of proposed equilibria and constants (which are then iteratively refined by weighted nonlinear regression), and does not require explicit Henderson-Hasselbalch equations. The data were tested for the presence of phosphate precipitates of various stoichiometries, as well as the simultaneous presence of aggregated species, either cationic or neutral. The presence of particular species was suggested by the slope characteristics of the log S vs. pH curves. Considerably different intrinsic solubility constants were found, compared to those originally reported, for several drugs (e.g., celiprolol, desipramine, haloperidol). The least soluble molecule, amiodarone, was found to have the extraordinarily low intrinsic solubility of 2 picograms/mL, a moderate salt solubility of 0.82 mg/mL at the Gibbs pKa 5.4, corresponding to the species BH·H2PO4(s), and a substantial presence of the positively-charged pentameric aggregate, (BH)5
- …