Solubility prediction in water and organic solvents through a combination of chemometrics and computational chemistry

Abstract

Accurate solubility prediction is crucial across a range of scientific disciplines, including drug discovery, protein engineering, drug and agrochemical process design, biochemistry, route prediction, crystallisation, and extraction. We herein report a successful approach to predicting solubility, not only in water but also in organic solvents (ethanol, benzene, and acetone), using a combination of machine learning and computational chemistry. Our new approach, named Causal Structure Property Relationship (CSPR), examines the physical chemistry behind dissolution in order to select a small number of chemically relevant descriptors, which yields highly interpretable models. These models gave significantly more accurate predictions than leading open-source and commercial solubility prediction tools, achieving accuracy (60-80%) close to the expected level of noise in the training data (LogS ± 0.7). Reproducing the physicochemical relationships between solubility and molecular properties in different solvents allowed rational improvements to the models to be explored. These improvements included modifying the solvation energy and combining machine learning methods to provide a consensus prediction. A larger dataset in water provided the basis for a discussion of pKa and speciation in water. We conclude that gathering accurate solubility data across a range of solvents is crucial to expanding this work and promoting sustainable chemistry in the future. It is our hope that this methodology will be applied to other problems in chemistry and that our open-access datasets (the first of their kind for benzene and acetone) will stimulate further research in this field.
