312 research outputs found

    Uncertainty estimation for QSAR models using machine learning methods

    Get PDF

    Machine Learning Approaches for Improving Prediction Performance of Structure-Activity Relationship Models

    Get PDF
    In silico bioactivity prediction studies are designed to complement in vivo and in vitro efforts to assess the activity and properties of small molecules. In silico methods such as Quantitative Structure-Activity/Property Relationship (QSAR) are used to correlate the structure of a molecule to its biological property in drug design and toxicological studies. In this body of work, I started with two in-depth reviews into the application of machine learning based approaches and feature reduction methods to QSAR, and then investigated solutions to three common challenges faced in machine learning based QSAR studies. First, to improve the prediction accuracy of learning from imbalanced data, Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms combined with bagging as an ensemble strategy was evaluated. The Friedman’s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that this method significantly outperformed other conventional methods. SMOTEENN with bagging became less effective when IR exceeded a certain threshold (e.g., \u3e40). The ability to separate the few active compounds from the vast amounts of inactive ones is of great importance in computational toxicology. Deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p \u3c 0.001, ANOVA) by 22-27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Lastly, current features used for QSAR based machine learning are often very sparse and limited by the logic and mathematical processes used to compute them. Transformer embedding features (TEF) were developed as new continuous vector descriptors/features using the latent space embedding from a multi-head self-attention. The significance of TEF as new descriptors was evaluated by applying them to tasks such as predictive modeling, clustering, and similarity search. An accuracy of 84% on the Ames mutagenicity test indicates that these new features has a correlation to biological activity. Overall, the findings in this study can be applied to improve the performance of machine learning based Quantitative Structure-Activity/Property Relationship (QSAR) efforts for enhanced drug discovery and toxicology assessments

    Big-Data Science in Porous Materials: Materials Genomics and Machine Learning

    Full text link
    By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal organic frameworks (MOFs). At present, we have libraries of over ten thousand synthesized materials and millions of in-silico predicted materials. The fact that we have so many materials opens many exciting avenues to tailor make a material that is optimal for a given application. However, from an experimental and computational point of view we simply have too many materials to screen using brute-force techniques. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We emphasize the importance of data collection, methods to augment small data sets, how to select appropriate training sets. An important part of this review are the different approaches that are used to represent these materials in feature space. The review also includes a general overview of the different ML techniques, but as most applications in porous materials use supervised ML our review is focused on the different approaches for supervised ML. In particular, we review the different method to optimize the ML process and how to quantify the performance of the different methods. In the second part, we review how the different approaches of ML have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. The range of topics illustrates the large variety of topics that can be studied with big-data science. Given the increasing interest of the scientific community in ML, we expect this list to rapidly expand in the coming years.Comment: Editorial changes (typos fixed, minor adjustments to figures

    NOVEL ALGORITHMS AND TOOLS FOR LIGAND-BASED DRUG DESIGN

    Get PDF
    Computer-aided drug design (CADD) has become an indispensible component in modern drug discovery projects. The prediction of physicochemical properties and pharmacological properties of candidate compounds effectively increases the probability for drug candidates to pass latter phases of clinic trials. Ligand-based virtual screening exhibits advantages over structure-based drug design, in terms of its wide applicability and high computational efficiency. The established chemical repositories and reported bioassays form a gigantic knowledgebase to derive quantitative structure-activity relationship (QSAR) and structure-property relationship (QSPR). In addition, the rapid advance of machine learning techniques suggests new solutions for data-mining huge compound databases. In this thesis, a novel ligand classification algorithm, Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps (LiCABEDS), was reported for the prediction of diverse categorical pharmacological properties. LiCABEDS was successfully applied to model 5-HT1A ligand functionality, ligand selectivity of cannabinoid receptor subtypes, and blood-brain-barrier (BBB) passage. LiCABEDS was implemented and integrated with graphical user interface, data import/export, automated model training/ prediction, and project management. Besides, a non-linear ligand classifier was proposed, using a novel Topomer kernel function in support vector machine. With the emphasis on green high-performance computing, graphics processing units are alternative platforms for computationally expensive tasks. A novel GPU algorithm was designed and implemented in order to accelerate the calculation of chemical similarities with dense-format molecular fingerprints. Finally, a compound acquisition algorithm was reported to construct structurally diverse screening library in order to enhance hit rates in high-throughput screening

    Machine learning methods for quantitative structure-property relationship modeling

    Get PDF
    Tese de doutoramento, Informática (Bioinformática), Universidade de Lisboa, Faculdade de Ciências, 2014Due to the high rate of new compounds discovered each day and the morosity/cost of experimental measurements there will always be a significant gap between the number of known chemical compounds and the amount of chemical compounds for which experimental properties are available. This research work is motivated by the fact that the development of new methods for predicting properties and organize huge collections of molecules to reveal certain chemical categories/patterns and select diverse/representative samples for exploratory experiments are becoming essential. This work aims to increase the capability to predict physical, chemical and biological properties, using data mining methods applied to complex non-homogeneous data (chemical structures), for large information repositories. In the first phase of this work, current methodologies in quantitative structure-property modelling were studied. These methodologies attempt to relate a set of selected structure-derived features of a compound to its property using model-based learning. This work focused on solving major issues identified when predicting properties of chemical compounds and on the solutions explored using different molecular representations, feature selection techniques and data mining approaches. In this context, an innovative hybrid approach was proposed in order to improve the prediction power and comprehensibility of QSPR/QSAR problems using Random Forests for feature selection. It is acknowledged that, in general, similar molecules tend to have similar properties; therefore, on the second phase of this work, an instance-based machine learning methodology for predicting properties of compounds using the similarity-based molecular space was developed. However, this type of methodology requires the quantification of structural similarity between molecules, which is often subjective, ambiguous and relies upon comparative judgements, and consequently, there is currently no absolute standard of molecular similarity. In this context, a new similarity method was developed, the non-contiguous atom matching (NAMS), based on the optimal atom alignment using pairwise matching algorithms that take into account both topological profiles and atoms/bonds characteristics. NAMS can then be used for property inference over the molecular metric space using ordinary kriging in order to obtain robust and interpretable predictive results, providing a better understanding of the underlying relationship structure-property.Devido ao crescimento exponencial do número de compostos químicos descobertos diariamente e à morosidade/custo de medições experimentais, existe uma diferença significativa entre o número de compostos químicos conhecidos e a quantidade de compostos para os quais estão disponíveis propriedades experimentais. O desenvolvimento de novos métodos para a previsão de propriedades e organização de grandes coleções de moléculas que permitam revelar certas categorias/padrões químicos e selecionar amostras diversas/representativas para estudos exploratórios estão a tornar-se essenciais. Este trabalho tem como objetivo melhorar a capacidade de prever propriedades físicas, químicas e biológicas, através de métodos de aprendizagem automática aplicados a dados complexos não homogeneos (estruturas químicas), para grandes repositórios de informação. Numa primeira fase deste trabalho, foi feito o estudo de metodologias atualmente aplicadas para a modelação quantitativa entre estruturapropriedades. Estas metodologias tentam relacionar um conjunto seleccionado de descritores estruturais de uma molécula com as suas propriedades, utilizando uma abordagem baseada em modelos. Este trabalho centrou-se em solucionar as principais dificuldades identificadas na previsão de propriedades de compostos químicos e nas soluções exploradas utilizando diferentes representações moleculares, técnicas de seleção de descritores e abordagens de aprendizagem automática. Neste contexto, foi proposta uma abordagem híbrida inovadora para melhorar o capacidade de previsão e compreensão de problemas QSPR/QSAR utilizando o algoritmo "Random Forests" (Florestas Aleatórias) para seleção de descritores. É reconhecido que, em geral, moléculas semelhantes tendem a ter propriedades semelhantes; assim, numa segunda fase deste trabalho foi desenvolvida uma metodologia de aprendizagem automática baseada em instâncias para a previsão de propriedades de compostos químicos utilizando o espaço métrico construído a partir da semelhança estrutural entre moléculas. No entanto, este tipo de metodologia requer a quantificação de semelhança estrutural entre moléculas, o que é muitas vezes uma tarefa subjetiva, ambígua e dependente de julgamentos comparativos e, consequentemente, não existe atualmente nenhum padrão absoluto para definir semelhança molecular. Neste âmbito, foi desenvolvido um novo método de semelhança molecular, o “Non-Contiguous Atom Matching Structural Similarity” (NAMS), que se baseia no alinhamento de átomos utilizando algoritmos de emparelhamento que têm em conta os perfis topológicos das ligações e as características dos átomos e ligações. O espaço métrico molecular construído utilizando o NAMS pode ser aplicado à inferência de propriedades usando uma técnica de interpolação espacial, a "krigagem", que tem em conta a relação espacial entre as instâncias, com o objetivo de se obter uma previsão consistente e interpretável, proporcionando uma melhor compreensão da relação entre estrutura-propriedades.Fundação para a Ciência e a Tecnologia (FCT

    Prioritizing Small Molecules for Drug Discovery or Chemical Safety Assessments using Ligand- and Structure-based Cheminformatics Approaches

    Get PDF
    Recent growth in the experimental data describing the effects of chemicals at the molecular, cellular, and organism level has triggered the development of novel computational approaches for the prediction of a chemical's effect on an organism. The studies described in this dissertation research predict chemical activity at three levels of biological complexity: binding of drugs to a single protein target, selective binding to a family of protein targets, and systemic toxicity. Optimizing cheminformatics methods that examine diverse sources of experimental data can lead to novel insight into the therapeutic use and toxicity of chemicals. In the first study, a combinatorial Quantitative Structure-Activity Relationship (QSAR) modeling workflow was successfully applied to the discovery of novel bioactive compound against one specific protein target: histone deacetylase inhibitors (HDACIs). Four candidate molecules were selected from the virtual screening hits to be tested experimentally, and three of them were confirmed active against HDAC. Next, a receptor-based protocol was established and applied to discover target-selective ligands within a family of proteins. This protocol extended the concept of protein/ligand interaction-guided pose selection by employing a binary classifier to discriminate poses of interest from a calibration set. The resulting virtual screening tools were applied for enriching beta2-adrenergic receptor (β2AR) ligands that are selective against other subtypes in the βAR family (i.e. β1AR and β3AR). Moreover, some computational 3D protein structures used in this study have exhibited comparative or even better performance in virtual screening than X-ray crystal structures of β2AR, and therefore computational tools that use these computational structures could complement tools utilizing experimental structures. Finally, a two-step hierarchical QSAR modeling approach was developed to estimate in vivo toxicity effects of small molecules. Besides the chemical structural descriptors, the developed models utilized additional biological information from in vitro bioassays. The derived models were more accurate than traditional QSAR models utilizing chemical descriptors only. Moreover, retrospective analysis of the developed models helped to identify the most informative bioassays, suggesting potential applicability of this methodology in guiding future toxicity experiments. These studies contribute to the development of computational strategies for comprehensive analysis of small molecules' biological properties, and have the potential to be integrated into existing methods for modern rational drug design and discovery.Doctor of Philosoph

    Cheminformatics and artificial intelligence for accelerating agrochemical discovery

    Get PDF
    The global cost-benefit analysis of pesticide use during the last 30 years has been characterized by a significant increase during the period from 1990 to 2007 followed by a decline. This observation can be attributed to several factors including, but not limited to, pest resistance, lack of novelty with respect to modes of action or classes of chemistry, and regulatory action. Due to current and projected increases of the global population, it is evident that the demand for food, and consequently, the usage of pesticides to improve yields will increase. Addressing these challenges and needs while promoting new crop protection agents through an increasingly stringent regulatory landscape requires the development and integration of infrastructures for innovative, cost- and time-effective discovery and development of novel and sustainable molecules. Significant advances in artificial intelligence (AI) and cheminformatics over the last two decades have improved the decision-making power of research scientists in the discovery of bioactive molecules. AI- and cheminformatics-driven molecule discovery offers the opportunity of moving experiments from the greenhouse to a virtual environment where thousands to billions of molecules can be investigated at a rapid pace, providing unbiased hypothesis for lead generation, optimization, and effective suggestions for compound synthesis and testing. To date, this is illustrated to a far lesser extent in the publicly available agrochemical research literature compared to drug discovery. In this review, we provide an overview of the crop protection discovery pipeline and how traditional, cheminformatics, and AI technologies can help to address the needs and challenges of agrochemical discovery towards rapidly developing novel and more sustainable products
    corecore