3 research outputs found

    Fusion of molecular representations and prediction of biological activity using convolutional neural network and transfer learning

    The basic structural features and physicochemical properties of chemical molecules determine their behaviour during chemical, physical, biological and environmental processes, and therefore need to be investigated in order to determine and model the actions of a molecule. Computational approaches such as machine learning offer an alternative way to predict the physicochemical properties of molecules from their structures. However, the limited accuracy and high error rates of these predictions restrict their use. This study developed three classes of new methods, based on deep convolutional neural networks (CNNs), for predicting the bioactivity of chemical compounds. Molecular structures are encoded in a new matrix format that serves as input to the CNN. The first class of methods introduces three new molecular descriptors: Mol2toxicophore, based on molecular interaction with toxicophore features; Mol2Fgs, based on a distributed representation for constructing abstract feature maps of a selected set of small molecules; and Mol2mat, a molecular matrix representation adapted from the well-known 2D-fingerprint descriptors. The second class of methods merges multiple CNN models to combine all of the molecular representations. The third class of methods learns features automatically, using the values of the neurons in the last layer of the proposed CNN architecture. To evaluate the methods, a series of experiments was conducted on two standard datasets, the MDL Drug Data Report (MDDR) and Sutherland datasets. The MDDR datasets comprised 10 homogeneous and 10 heterogeneous activity classes, whilst the Sutherland datasets comprised four homogeneous activity classes. In these experiments, Mol2toxicophore showed satisfactory prediction rates of 92% and 80% for the homogeneous and heterogeneous activity classes, respectively. Mol2Fgs performed better than Mol2toxicophore, with prediction accuracies of 95% for the homogeneous and 90% for the heterogeneous activity classes. The Mol2mat representation had the highest prediction accuracy of the three descriptors, with 97% and 94% for the homogeneous and heterogeneous datasets, respectively. The combined multi-CNN model, which leverages the knowledge acquired from all three molecular representations, achieved higher accuracy still: 99% for the homogeneous and 98% for the heterogeneous datasets. For molecular similarity searching, using the values of the neurons in the last hidden layer of the multi-CNN model as an automatically learned molecular representation performed well, with an average recall of 88.6% among the 5% of structures most similar to the search target. The results demonstrate that the newly developed methods can be used effectively for bioactivity prediction and molecular similarity searching.
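The recall figure quoted at the end of this abstract (the share of known actives found among the top 5% of structures most similar to a query) can be illustrated with a minimal sketch. The abstract does not say which similarity measure was applied to the learned features, so cosine similarity is an assumption here, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def recall_at_top_fraction(query, library, actives, fraction=0.05):
    """Share of `actives` that appear in the top `fraction` of the ranking.

    query    -- feature vector of the search target (assumed: learned features)
    library  -- dict mapping molecule id -> feature vector
    actives  -- set of molecule ids sharing the query's activity class
    fraction -- portion of the ranked library to inspect (0.05 = top 5%)
    """
    # Rank every library molecule by decreasing similarity to the query.
    ranked = sorted(
        library,
        key=lambda mol_id: cosine_similarity(query, library[mol_id]),
        reverse=True,
    )
    # Always inspect at least one structure, even for tiny libraries.
    cutoff = max(1, math.ceil(fraction * len(ranked)))
    retrieved = set(ranked[:cutoff])
    return len(retrieved & actives) / len(actives)


# Toy data: four molecules with made-up two-dimensional features.
toy_library = {
    "mol_a": [1.0, 0.0],
    "mol_b": [0.9, 0.1],
    "mol_c": [0.0, 1.0],
    "mol_d": [0.1, 0.9],
}
toy_recall = recall_at_top_fraction(
    [1.0, 0.0], toy_library, actives={"mol_a", "mol_c"}, fraction=0.5
)
```

Averaging this value over many queries and activity classes would give a figure comparable in kind to the reported average recall, although the study's exact protocol is not described in the abstract.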

    SMARTS Approach to Chemical Data Mining and Physicochemical Property Prediction.

    The calculation of physicochemical and biological properties is essential to modern drug discovery. Chemical spaces dimensionalized by these descriptors have been used to scaffold-hop in order to discover new lead and drug-like molecules. Broadening the boundaries of structure-based drug design, these molecules are expected to share the same physiological target and have similar efficacy, as do known drug molecules sharing the same region in chemical property space. In the past few decades, physicochemical and ADMET (absorption, distribution, metabolism, elimination, and toxicity) property predictors have been the subject of increased focus in academia and the pharmaceutical industry. Given the ever-increasing attention paid to data mining and property prediction, we first discuss the sources of experimental pKa values and the current methodologies used for pKa prediction in proteins and small molecules. Of particular concern is an analysis of the scope, statistical validity, overall accuracy, and predictive power of these methods. The concerns expressed are not limited to pKa prediction, but apply to all empirical predictive methodologies. In a bottom-up approach, we explored the influence of freely generated SMARTS string representations of molecular fragments on chelation and cytotoxicity. Later investigations, involving the derivation of predictive models, use stepwise regression to determine the optimal pool of SMARTS strings having the greatest influence on the property of interest. By applying a unique scoring system to sets of highly generalized SMARTS strings, we have constructed well-balanced regression trees with predictive accuracy exceeding that of many published and commercially available models for cytotoxicity, pKa, and aqueous solubility. The methodology is robust, extremely adaptable, and can handle any molecular dataset with experimental data. This story details our struggles with data gathering and curation, and the development of a machine learning methodology able to derive and validate highly accurate regression trees capable of extremely fast property predictions. Regression trees created by our method are well suited to calculating descriptors for large in silico molecular libraries, facilitating data mining of chemical spaces in search of new lead molecules in drug discovery.
    Ph.D., Medicinal Chemistry, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/64627/1/adamclee_1.pd
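The stepwise-regression step mentioned in this abstract, which picks the pool of SMARTS strings with the greatest influence on a property, can be sketched as plain forward selection over binary fragment-presence flags. This is only an illustration under stated assumptions. It uses ordinary least squares and a residual-sum-of-squares criterion, not the thesis's own scoring system or its regression trees. The three SMARTS patterns, the presence flags, and the property values are made-up toy data, and all function names are hypothetical.

```python
def fit_least_squares(rows, y):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    X = [[1.0] + [float(v) for v in r] for r in rows]
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(p)]
        + [sum(x[i] * t for x, t in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            return None  # singular system: the chosen columns are collinear
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def residual_sum_of_squares(rows, y, coef):
    """Sum of squared errors of the fitted linear model."""
    return sum(
        (t - (coef[0] + sum(c * v for c, v in zip(coef[1:], r)))) ** 2
        for r, t in zip(rows, y)
    )


def forward_stepwise(features, y, max_terms=3, min_gain=1e-9):
    """Greedily add the fragment that most reduces the residual error.

    features -- dict mapping a SMARTS string to a list of 0/1 presence flags
    y        -- measured property value for each molecule
    """
    n = len(y)
    mean_y = sum(y) / n
    best_rss = sum((t - mean_y) ** 2 for t in y)  # intercept-only baseline
    selected = []
    while len(selected) < max_terms:
        best = None
        for name in features:
            if name in selected:
                continue
            cols = selected + [name]
            rows = [[features[c][i] for c in cols] for i in range(n)]
            coef = fit_least_squares(rows, y)
            if coef is None:
                continue
            score = residual_sum_of_squares(rows, y, coef)
            if best is None or score < best[1]:
                best = (name, score)
        # Stop when no remaining fragment improves the fit meaningfully.
        if best is None or best_rss - best[1] <= min_gain:
            break
        selected.append(best[0])
        best_rss = best[1]
    return selected


# Toy presence flags for six molecules (illustrative SMARTS, invented data).
toy_features = {
    "[OX2H]": [1, 1, 0, 0, 1, 0],             # hydroxyl
    "[CX3](=O)[OX2H1]": [1, 0, 1, 0, 0, 1],   # carboxylic acid
    "c1ccccc1": [0, 1, 1, 0, 0, 0],           # benzene ring
}
# Property constructed as 2 * hydroxyl + 1 * carboxylic acid.
toy_property = [3.0, 2.0, 1.0, 0.0, 2.0, 1.0]
chosen = forward_stepwise(toy_features, toy_property)
```

In practice the presence flags would come from substructure matching with a cheminformatics toolkit; they are hard-coded here to keep the sketch self-contained.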