    Developing and validating predictive decision tree models from mining chemical structural fingerprints and high–throughput screening data in PubChem

    Background: Recent advances in high-throughput screening (HTS) techniques, together with readily available compound libraries generated by combinatorial chemistry or derived from natural products, enable the testing of millions of compounds in a matter of days. Because of the amount of information that HTS assays produce, mining HTS data for leads of potential interest in drug development is very challenging. Computational analysis of HTS results is made difficult by both the large quantity of information and the significant amount of erroneous data produced. Results: In this study, decision tree (DT) based models were developed to discriminate compound bioactivities using the chemical structure fingerprints provided in the PubChem system (http://pubchem.ncbi.nlm.nih.gov). The DT models were examined for filtering biological activity data contained in four assays deposited in the PubChem Bioassay Database, including assays tested for 5HT1a agonists, antagonists, and HIV-1 RT-RNase H inhibitors. The 10-fold cross-validation (CV) sensitivity, specificity and Matthews correlation coefficient (MCC) of the models are 57.2-80.5%, 97.3-99.0% and 0.4-0.5, respectively. A further evaluation was performed for DT models built for two independent bioassays in which inhibitors of the same HIV RNase target were screened using different compound libraries; this experiment yielded enrichment factors of 4.4 and 9.7. Conclusion: Our results suggest that the designed DT models can be used as a virtual screening technique as well as a complement to traditional approaches for hit selection.
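
The evaluation measures named in this abstract (sensitivity, specificity, MCC and enrichment factor) can all be derived from a confusion matrix. The following is a minimal sketch using their standard textbook definitions; the counts are hypothetical and are not taken from the study, and the function names are illustrative only.

```python
import math


def screening_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true actives recovered
    specificity = tn / (tn + fp)  # fraction of true inactives rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


def enrichment_factor(tp, fp, total_actives, total_compounds):
    """Hit rate among model-selected compounds divided by the library hit rate."""
    hit_rate_selected = tp / (tp + fp)
    hit_rate_library = total_actives / total_compounds
    return hit_rate_selected / hit_rate_library


# Hypothetical screen: 10,000 compounds, 100 of them active.
tp, fp, tn, fn = 60, 200, 9700, 40
sens, spec, mcc = screening_metrics(tp, fp, tn, fn)
ef = enrichment_factor(tp, fp, total_actives=tp + fn,
                       total_compounds=tp + fp + tn + fn)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} MCC={mcc:.3f} EF={ef:.1f}")
```

With heavily imbalanced screening data, specificity can be very high while MCC stays moderate, which is consistent with the pattern of values reported in the abstract.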

    Models for the Prediction of Receptor Tyrosine Kinase Inhibitory Activity of Substituted 3-Aminoindazole Analogues

    The inhibition of tumor angiogenesis has become a compelling approach in the development of anticancer drugs. In the present study, topological models were developed through decision tree and moving average analysis using a data set comprising 42 analogues of 3-aminoindazoles. A total of 22 descriptors (distance based, adjacency based, pendenticity based and distance-cum-adjacency based) were used. The values of all 22 topological indices for each analogue in the data set were computed using an in-house computer program. A decision tree was constructed for the receptor tyrosine kinase KDR (kinase insert domain receptor) inhibitory activity to determine the importance of the topological indices. The decision tree learned the information from the input data with an accuracy of 88%. Three independent topological models were also developed for the prediction of KDR inhibitory activity using moving average analysis. The models developed were also found to be sensitive towards the prediction of the inhibitory activity of other receptor tyrosine kinases, i.e. FLT3 (fms-like tyrosine kinase-3) and cKIT. The classification accuracy of the single-index models based on moving average analysis was found to be 88%. The performance of the models was assessed by calculating precision, sensitivity, overall accuracy and the Matthews correlation coefficient (MCC). The significance of the models was also assessed by intercorrelation analysis.

    An effective biomedical document classification scheme in support of biocuration: addressing class imbalance.

    Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying, among a large pool of biomedical publications, the papers that contain information relevant to a specific topic that the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory's Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
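
The abstract does not spell out how its cluster-based under-sampling works, so the following is only a generic sketch of the idea: majority-class documents are grouped into clusters, and each cluster contributes a share of the retained sample proportional to its size. The cluster assignments, the function name `cluster_undersample` and the proportional quota rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def cluster_undersample(majority_ids, cluster_of, n_keep, seed=0):
    """Keep roughly n_keep majority-class items, sampled proportionally per cluster.

    cluster_of maps each item id to a precomputed cluster label (hypothetical;
    any clustering of the majority class could supply it).
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for doc_id in majority_ids:
        clusters[cluster_of[doc_id]].append(doc_id)

    total = len(majority_ids)
    kept = []
    for members in clusters.values():
        # Each cluster contributes in proportion to its size, at least one item.
        quota = max(1, round(n_keep * len(members) / total))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


# Toy example: 100 irrelevant documents in three clusters, reduced to 10.
majority_ids = list(range(100))
cluster_of = {i: (0 if i < 60 else 1 if i < 90 else 2) for i in majority_ids}
kept = cluster_undersample(majority_ids, cluster_of, n_keep=10)
print(sorted(kept))
```

Sampling within clusters rather than uniformly at random is meant to preserve the diversity of the majority class while shrinking it towards the size of the minority class.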

    Interpreting linear support vector machine models with heat map molecule coloring

    Background: Model-based virtual screening plays an important role in the early drug discovery stage. The outcomes of high-throughput screenings are a valuable source from which machine learning algorithms can infer such models. Besides strong performance, the interpretability of a machine learning model is a desired property to guide the optimization of a compound in later drug discovery stages. Linear support vector machines have been shown to perform convincingly on large-scale data sets. The goal of this study is to present a heat map molecule coloring technique to interpret linear support vector machine models. Based on the weights of a linear model, the visualization approach colors each atom and bond of a compound according to its importance for activity. Results: We evaluated our approach on a toxicity data set, a chromosome aberration data set, and the maximum unbiased validation data sets. The experiments show that our method sensibly visualizes structure-property and structure-activity relationships of a linear support vector machine model. The coloring of ligands in the binding pocket of several crystal structures of a maximum unbiased validation data set target indicates that our approach helps to determine the correct ligand orientation in the binding pocket. Additionally, the heat map coloring enables the identification of substructures important for the binding of an inhibitor. Conclusions: In combination with heat map coloring, linear support vector machine models can help to guide the modification of a compound in later stages of drug discovery. In particular, substructures identified as important by our method might be a starting point for the optimization of a lead compound. The heat map coloring should be considered complementary to structure-based modeling approaches. As such, it helps to gain a better understanding of the binding mode of an inhibitor.
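
The idea of coloring atoms according to linear model weights can be illustrated with a small sketch. It assumes that each substructure feature carries one weight from the linear model, that a feature's weight is split evenly over the atoms it covers, and that positive scores map to red and negative scores to blue. The abstract specifies none of these details, and the feature names and weights below are hypothetical.

```python
def atom_scores(feature_weights, feature_atoms, n_atoms):
    """Distribute each feature's linear-model weight evenly over its atoms."""
    scores = [0.0] * n_atoms
    for feature, atoms in feature_atoms.items():
        if not atoms:
            continue
        share = feature_weights.get(feature, 0.0) / len(atoms)
        for atom in atoms:
            scores[atom] += share
    return scores


def to_rgb(score, max_abs):
    """Map a score to a color: white at zero, red for positive, blue for negative."""
    if max_abs == 0:
        return (255, 255, 255)
    t = max(-1.0, min(1.0, score / max_abs))
    fade = int(round(255 * (1 - abs(t))))
    return (255, fade, fade) if t >= 0 else (fade, fade, 255)


# Hypothetical six-atom molecule with three weighted substructure features.
feature_weights = {"nitro": 0.8, "hydroxyl": -0.4, "ring": 0.1}
feature_atoms = {"nitro": [0, 1, 2], "hydroxyl": [5], "ring": [2, 3, 4]}

scores = atom_scores(feature_weights, feature_atoms, n_atoms=6)
max_abs = max(abs(s) for s in scores)
colors = [to_rgb(s, max_abs) for s in scores]
for atom, (score, color) in enumerate(zip(scores, colors)):
    print(atom, round(score, 3), color)
```

Because the model is linear, the per-atom scores sum to the total contribution of the matched features, so the coloring is a direct decomposition of the prediction rather than a post hoc approximation.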

    Models for Antitubercular Activity of 5′-O-[(N-Acyl)sulfamoyl]adenosines

    The relationship between topological indices and the antitubercular activity of 5′-O-[(N-Acyl)sulfamoyl]adenosines has been investigated. A data set consisting of 31 analogues of 5′-O-[(N-Acyl)sulfamoyl]adenosines was selected for the present study. The values of numerous topostructural and topochemical indices for each of the 31 differently substituted analogues of the data set were computed using an in-house computer program. The resulting data were analyzed and suitable models were developed through decision tree, random forest and moving average analysis (MAA). The goodness of the models was assessed by calculating the overall accuracy of prediction, sensitivity, specificity and the Matthews correlation coefficient. The pendentic eccentricity index, a novel, highly discriminating, non-correlating pendenticity-based topochemical descriptor, was also conceptualized and successfully utilized for the development of a model for the antitubercular activity of 5′-O-[(N-Acyl)sulfamoyl]adenosines. The proposed index not only exhibited high sensitivity towards both the presence and the relative position(s) of pendent/heteroatom(s) but also led to a significant reduction in degeneracy. Random forest correctly classified the analogues into active and inactive with an accuracy of 67.74%. A decision tree was also employed for determining the importance of molecular descriptors. The decision tree learned the information from the input data with an accuracy of 100% and correctly predicted the cross-validated (10-fold) data with an accuracy of up to 77.4%. The statistical significance of the proposed models was also investigated using intercorrelation analysis. The accuracy of prediction of the proposed MAA models ranged from 90.4 to 91.6%.

    Network Pharmacology Approaches for Understanding Traditional Chinese Medicine

    Traditional Chinese medicine (TCM) has clear efficacy in disease treatment and is a valuable source for novel drug discovery. However, the underlying mechanism of the pharmacological effects of TCM remains unknown because TCM is a complex system in which multiple herbs and ingredients come together as a prescription. It is therefore urgent to apply computational tools to TCM in order to understand the underlying mechanism of TCM theories at the molecular level, and to use advanced network algorithms to explore potentially effective ingredients and illustrate the principles of TCM from a systems biology perspective. In this thesis, we aim to understand the underlying mechanisms of action in complex TCM systems at the molecular level using bioinformatics and computational tools. In study Ⅰ, a machine learning framework was developed to predict the meridians of herbs and ingredients. We achieved high accuracy in meridian prediction for herbs and ingredients, suggesting an association between meridians and the molecular features of ingredients and herbs, especially the features most important to the machine learning models. Secondly, in study Ⅱ, we proposed a novel network approach to study TCM formulae by quantifying the degree of interaction of herb pairs using five network distance methods: closest, shortest, central, kernel and separation. We demonstrated that the distance between top herb pairs is shorter than that between random herb pairs, suggesting a strong interaction in the human interactome. In addition, center methods at the ingredient level outperformed the other methods, which suggests that the central ingredients play an important role in the herbs. Thirdly, in study III, we explored the associations between herbs or ingredients and their important biological characteristics, such as properties, meridians, structures or targets, via clusters from community analysis of the multipartite network. 
We found that herbal medicines within the same cluster tend to be more similar in properties and meridians. Similarly, ingredients from the same cluster are more similar in structure and protein target. In summary, this thesis intends to build a bridge between the TCM system and modern medicinal systems using computational tools, including the machine learning model for meridian theory, network modelling for TCM formulae, and multipartite network analysis for herbal medicines and their ingredients. We demonstrated that applying novel computational approaches to integrated high-throughput omics data would provide insights into TCM and accelerate novel drug discovery as well as drug repurposing from TCM.
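
Two of the five network distances mentioned in the thesis abstract can be sketched on a toy graph. The definitions used here are assumptions based on common network-proximity formulations, since the abstract itself does not define them: "closest" is taken as the mean, over nodes of one target set, of the distance to the nearest node of the other set, and "shortest" as the mean of all pairwise shortest-path lengths. The graph and target sets are hypothetical, and the graph is assumed to be connected.

```python
from collections import deque


def bfs_distances(graph, source):
    """Unweighted shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closest_distance(graph, targets_a, targets_b):
    """Mean distance from each node in A to its nearest node in B."""
    total = 0
    for a in targets_a:
        dist = bfs_distances(graph, a)
        total += min(dist[b] for b in targets_b)
    return total / len(targets_a)


def shortest_distance(graph, targets_a, targets_b):
    """Mean shortest-path length over all (a, b) pairs."""
    total = 0
    for a in targets_a:
        dist = bfs_distances(graph, a)
        total += sum(dist[b] for b in targets_b)
    return total / (len(targets_a) * len(targets_b))


# Toy interactome: a simple path 1-2-3-4-5 (adjacency lists).
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
herb_a_targets = [1, 2]  # hypothetical protein targets of herb A
herb_b_targets = [4, 5]  # hypothetical protein targets of herb B

d_closest = closest_distance(graph, herb_a_targets, herb_b_targets)
d_shortest = shortest_distance(graph, herb_a_targets, herb_b_targets)
print(d_closest, d_shortest)
```

Comparing such distances for real herb pairs against those of randomly drawn pairs is the kind of test the abstract describes when it reports that top herb pairs are closer than random ones.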

    Cheminformatics Modeling of Diverse and Disparate Biological Data and the Use of Models to Discover Novel Bioactive Molecules

    Ligand-based drug design is a popular and efficient computational approach to facilitate the drug discovery process. Current approaches mainly focus on optimizing the computational algorithms to improve the efficiency or accuracy of virtual screening; however, the success of ligand-based drug design relies not only on the effectiveness and robustness of the underlying algorithms but, much more importantly, on the quality of the data used for model building. Although numerous chemical probe databases have emerged recently, few evaluations of data quality and reliability have been performed. Building upon our lab's experience with Quantitative Structure-Activity Relationship (QSAR) methods and with methods developed in the field of cheminformatics, this dissertation focuses on: 1) investigating and comparing the predictive power of QSAR methods with that of other ligand-based drug discovery approaches, such as the Similarity Ensemble Approach (SEA) and Prediction of Activity Spectra for Substances (PASS); 2) using QSAR methods to validate the consistency and reliability of biomedical data in disparate data sources; and 3) developing a novel, rigorous and dataset-specific QSAR workflow for application to multiple therapeutic targets in order to identify diverse hits with high potency in practical virtual screening projects. These works succeed in thoroughly investigating the current approaches for ligand-based drug discovery, exploring the consistency and quality of major annotated cheminformatics databases, and identifying many pharmaceutically important ligands. The success of our studies strongly challenges some popular multi-target profile prediction methods and contributes to the development of cheminformatics by emphasizing the importance of determining trustworthy data sources.


    Computer-aided drug design (CADD) has become an indispensable component of modern drug discovery projects. The prediction of the physicochemical and pharmacological properties of candidate compounds effectively increases the probability that drug candidates pass the later phases of clinical trials. Ligand-based virtual screening exhibits advantages over structure-based drug design in terms of its wide applicability and high computational efficiency. The established chemical repositories and reported bioassays form a gigantic knowledge base from which to derive quantitative structure-activity relationships (QSAR) and quantitative structure-property relationships (QSPR). In addition, the rapid advance of machine learning techniques suggests new solutions for data-mining huge compound databases. In this thesis, a novel ligand classification algorithm, Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps (LiCABEDS), is reported for the prediction of diverse categorical pharmacological properties. LiCABEDS was successfully applied to model 5-HT1A ligand functionality, ligand selectivity of cannabinoid receptor subtypes, and blood-brain-barrier (BBB) passage. LiCABEDS was implemented and integrated with a graphical user interface, data import/export, automated model training/prediction, and project management. In addition, a non-linear ligand classifier was proposed, using a novel Topomer kernel function in a support vector machine. With the emphasis on green high-performance computing, graphics processing units (GPUs) are alternative platforms for computationally expensive tasks. A novel GPU algorithm was designed and implemented in order to accelerate the calculation of chemical similarities with dense-format molecular fingerprints. Finally, a compound acquisition algorithm is reported for constructing structurally diverse screening libraries in order to enhance hit rates in high-throughput screening.
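
The name LiCABEDS indicates adaptive boosting over decision stumps. The sketch below shows generic discrete AdaBoost with single-bit stumps on binary fingerprints; it is not the LiCABEDS implementation, whose details are not given in the abstract, and the fingerprints and labels are hypothetical.

```python
import math


def stump_output(x, bit, polarity):
    """A stump votes `polarity` if the fingerprint bit is set, else `-polarity`."""
    return polarity if x[bit] else -polarity


def train_adaboost_stumps(X, y, n_rounds):
    """X: list of 0/1 fingerprints; y: labels in {+1, -1}. Returns weighted stumps."""
    n, n_bits = len(X), len(X[0])
    weights = [1.0 / n] * n
    ensemble = []  # entries are (bit index, polarity, alpha)
    for _ in range(n_rounds):
        # Pick the stump with the lowest weighted training error.
        best = None
        for bit in range(n_bits):
            for polarity in (1, -1):
                err = sum(w for x, label, w in zip(X, y, weights)
                          if stump_output(x, bit, polarity) != label)
                if best is None or err < best[0]:
                    best = (err, bit, polarity)
        err, bit, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((bit, polarity, alpha))
        # Increase the weight of misclassified compounds, then renormalize.
        weights = [w * math.exp(-alpha * label * stump_output(x, bit, polarity))
                   for x, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_output(x, bit, polarity)
                for bit, polarity, alpha in ensemble)
    return 1 if score >= 0 else -1


# Toy data: the label is +1 exactly when fingerprint bit 1 is set.
X = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
y = [1, 1, -1, -1, 1, -1]
model = train_adaboost_stumps(X, y, n_rounds=3)
predictions = [predict(model, x) for x in X]
print(predictions)
```

Each stump inspects a single fingerprint bit, so the trained ensemble stays easy to inspect: the bits with the largest accumulated alpha are the substructure keys the classifier relies on most.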

    Machine Learning Methodologies for Interpretable Compound Activity Predictions

    Machine learning (ML) models have gained attention for mining the pharmaceutical data that are currently generated at unprecedented rates and for potentially accelerating the discovery of new drugs. The advent of deep learning (DL) has also raised expectations in pharmaceutical research. A central task in drug discovery is the initial search for compounds with a desired biological activity. ML algorithms are able to find patterns in compound structures that are related to bioactivity, the so-called structure-activity relationships (SARs). ML-based predictions can complement biological testing to prioritize further experiments. Moreover, insights into model decisions are highly desired for further validation and for the identification of activity-relevant substructures. However, the interpretation of complex ML models remains essentially prohibitive. This thesis focuses on ML-based predictions of compound activity against multiple biological targets. Single-target and multi-target models are generated for relevant tasks, including the prediction of profiling matrices from screening data and the discrimination between weak and strong inhibitors for more than a hundred kinases. Moreover, the relative performance of distinct modeling strategies is systematically analyzed under varying training conditions, and practical guidelines are reported. Since explainable model decisions are a clear requirement for the utility of ML bioactivity models in pharmaceutical research, methods for the interpretation and intuitive visualization of activity predictions from any ML or DL model are introduced. Taken together, this dissertation presents contributions that advance the application and rationalization of ML models for biological activity and SAR predictions.