Application of Multivariate Adaptive Regression Splines (MARSplines) for Predicting Hansen Solubility Parameters Based on 1D and 2D Molecular Descriptors Computed from SMILES String
A new method for predicting Hansen solubility parameters (HSPs) was developed
by combining the multivariate adaptive regression splines (MARSplines)
methodology with a simple multivariable regression involving 1D and 2D PaDEL
molecular descriptors. To adapt the MARSplines approach to QSPR/QSAR
problems, several optimization procedures were proposed and tested. The
effectiveness of the obtained models was checked via standard QSPR/QSAR
internal validation procedures provided by the QSARINS software and by
predicting the solubility classification of polymers and drug-like solid
solutes in collections of solvents. By utilizing information derived only from
SMILES strings, the obtained models allow all three Hansen solubility
parameters to be computed: dispersion, polarization, and hydrogen bonding.
Although several descriptors are required for proper parameter estimation, the
proposed procedure is simple and straightforward and does not require
molecular geometry optimization. The obtained HSP values are highly correlated
with experimental data, and applying them to solubility problems yields
results of essentially the same quality as the original parameters. Based on
the provided models, it is possible to characterize any solvent and liquid
solute for which HSP data are unavailable.
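For readers unfamiliar with how HSPs are used to classify solubility, the following is a minimal sketch of the conventional Hansen distance and relative energy difference (RED). It is not taken from the article: the function names, the interaction radius, and all numeric values are hypothetical and chosen only for illustration, with parameters assumed to be in MPa^0.5.

```python
import math

def hansen_distance(a, b):
    """Hansen distance Ra between two (dispersion, polar, h-bond) triples.

    Uses the conventional weighting in which the dispersion difference
    is multiplied by 4 before summing the squared differences.
    """
    d_disp = a[0] - b[0]
    d_polar = a[1] - b[1]
    d_hbond = a[2] - b[2]
    return math.sqrt(4.0 * d_disp ** 2 + d_polar ** 2 + d_hbond ** 2)

def relative_energy_difference(solute, solvent, interaction_radius):
    """RED = Ra / R0; values below 1 conventionally suggest good solubility."""
    return hansen_distance(solute, solvent) / interaction_radius

# Hypothetical HSP triples, not measured or predicted data.
solute = (17.0, 10.0, 8.0)
solvent = (16.0, 6.0, 4.0)
red = relative_energy_difference(solute, solvent, interaction_radius=8.0)
print("likely soluble" if red < 1.0 else "likely insoluble")
```

In a workflow like the one described, predicted HSP triples would take the place of the hypothetical values above.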
A confidence predictor for logD using conformal regression and a support-vector machine
Lipophilicity is a major determinant of ADMET properties and overall suitability of drug candidates. We have developed large-scale models to predict the water-octanol distribution coefficient (logD) for chemical compounds, aiding drug discovery projects. Using ACD/logD data for 1.6 million compounds from the ChEMBL database, models are created with a support-vector machine with a linear kernel and evaluated using conformal prediction methodology, which outputs prediction intervals at a specified confidence level. The resulting model shows a predictive ability of [Formula: see text], and the best-performing nonconformity measure has a median prediction interval of [Formula: see text] log units at 80% confidence and [Formula: see text] log units at 90% confidence. The model is available as an online service via an OpenAPI interface and a web page with a molecular editor. We also publish predicted values at the 90% confidence level for 91 M PubChem structures in RDF format, both for download and through a URI resolver service.
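To illustrate how conformal regression turns point predictions into intervals at a chosen confidence level, here is a minimal sketch of the inductive (split) variant. It assumes the absolute residual as the nonconformity measure, which is one common choice and not necessarily the measure used in the article. The function names and calibration residuals are hypothetical, and no support-vector machine is trained here.

```python
import math

def conformal_half_width(calibration_residuals, confidence):
    """Half-width q of the interval [prediction - q, prediction + q].

    calibration_residuals: (true - predicted) values on a held-out
    calibration set that was not used to fit the underlying model.
    confidence: target coverage, strictly between 0 and 1.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Assumed nonconformity measure: absolute residual.
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Rank of the calibration score that bounds the interval.
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        # Too few calibration points to guarantee this confidence level.
        return math.inf
    return scores[rank - 1]

def prediction_interval(point_prediction, half_width):
    """Symmetric interval around a point prediction."""
    return (point_prediction - half_width, point_prediction + half_width)

# Hypothetical calibration residuals in log units.
residuals = [0.1, -0.3, 0.2, 0.5, -0.4, 0.05, 0.6, -0.15, 0.25]
q80 = conformal_half_width(residuals, 0.8)
q90 = conformal_half_width(residuals, 0.9)
print(prediction_interval(2.0, q80), prediction_interval(2.0, q90))
```

Raising the confidence level can only widen the interval, which matches the abstract's report of separate median interval widths at 80% and 90% confidence.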
RDF Dataset for article: A confidence predictor for logD using conformal regression and a support-vector machine
RDF dataset described in article: "A confidence predictor for logD using conformal regression and a support-vector machine" (Manuscript in preparation).
The dataset contains conformal logD values at 90% confidence level, computed for 91M compounds from PubChem, in RDF format.
The .hdt.gz version contains the dataset in RDF HDT format (http://www.rdfhdt.org/), compressed with tar and gzip. The archive contains both the .hdt file and an index file generated by the hdtSearch C++ tool.
The .ttl.gz file is a gzipped file in RDF Turtle format (https://www.w3.org/TR/turtle/).
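Because a Turtle file covering 91M compounds is likely to be large, it may be useful to peek at the first lines without decompressing the whole file. The sketch below uses only the Python standard library. The function name and the file name in the usage comment are hypothetical, since the actual file name is not given here, and no RDF parsing is attempted.

```python
import gzip
from itertools import islice

def head_of_gzipped_turtle(path, n_lines=10):
    """Return the first n_lines of a gzipped text file, read as a stream.

    Turtle documents are UTF-8 encoded, so the file is opened in text mode
    with that encoding. Only the requested lines are decompressed.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]

# Usage with a hypothetical file name:
# for line in head_of_gzipped_turtle("dataset.ttl.gz", 5):
#     print(line)
```

For actual querying, a dedicated RDF library or the HDT tooling mentioned above would be more appropriate than line-based reading.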