132 research outputs found

    Machine Learning Approaches for Improving Prediction Performance of Structure-Activity Relationship Models

    In silico bioactivity prediction studies are designed to complement in vivo and in vitro efforts to assess the activity and properties of small molecules. In silico methods such as Quantitative Structure-Activity/Property Relationship (QSAR) modeling are used to correlate the structure of a molecule with its biological properties in drug design and toxicological studies. In this body of work, I started with two in-depth reviews of the application of machine learning based approaches and feature reduction methods to QSAR, and then investigated solutions to three common challenges faced in machine learning based QSAR studies. First, to improve the prediction accuracy of learning from imbalanced data, the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, combined with bagging as an ensemble strategy, were evaluated. Friedman’s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that this method significantly outperformed other conventional methods. SMOTEENN with bagging became less effective when the imbalance ratio (IR) exceeded a certain threshold (e.g., >40). The ability to separate the few active compounds from the vast number of inactive ones is of great importance in computational toxicology. Second, deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p < 0.001, ANOVA) by 22-27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Lastly, current features used for QSAR based machine learning are often very sparse and limited by the logic and mathematical processes used to compute them. Transformer embedding features (TEF) were developed as new continuous vector descriptors/features using the latent space embedding from a multi-head self-attention model. The significance of TEF as new descriptors was evaluated by applying them to tasks such as predictive modeling, clustering, and similarity search. An accuracy of 84% on the Ames mutagenicity test indicates that these new features correlate with biological activity. Overall, the findings in this study can be applied to improve the performance of machine learning based QSAR efforts for enhanced drug discovery and toxicology assessments.
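The imbalance ratio and the SMOTE oversampling step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes IR is defined as the majority-class count divided by the minority-class count (the abstract does not define it) and shows only the interpolation step of SMOTE, not the neighbor search, the ENN cleaning, or the bagging. The function names and toy data are hypothetical.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def smote_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between a minority-class
    sample x and one of its minority-class neighbors."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


# Toy data: 82 inactive compounds (label 0) and 2 active ones (label 1).
labels = [0] * 82 + [1] * 2
ir = imbalance_ratio(labels)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
synthetic = smote_sample([0.0, 1.0], [1.0, 3.0], rng)
```

With these toy labels the ratio is 82 / 2 = 41, which would fall just above the example threshold of 40 quoted in the abstract.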

    Effects of Molecular Representation in Predicting the Biological Activity using SVM and PLS Approaches

    In this work we study and analyze the behavior of different representational spaces for molecular activity prediction. Representational spaces based on fingerprint similarity and on structural similarity using the maximum common subgraph (MCS) and all maximum common subgraphs (AMCS) approaches are compared against representational spaces based on structural fragments and non-isomorphic fragments (NIF), built using different molecular descriptors. A support vector machine is used to study the influence of molecular representation on dataset classification, and PLS regression is proposed to construct a QSAR model for molecular activity prediction.
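As an illustration of what a fingerprint-similarity representation can look like, here is a minimal sketch using the Tanimoto coefficient. This is a common choice for binary fingerprints, but the abstract does not say which similarity measure was used, so treat it as an assumption. Fingerprints are modeled as sets of "on" bit positions, and the example values are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints,
    each given as a collection of 'on' bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(a & b) / union


# Hypothetical fingerprints for two molecules.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
sim = tanimoto(fp1, fp2)  # 2 shared bits out of 5 distinct bits
```

A similarity-based representational space could then describe each molecule by its vector of similarities to a set of reference molecules.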

    Modelization and validation of acute toxicity on aquatic fauna applying classification and regression methods

    The main objective of this project is to apply different classification and regression methods in order to predict, in the future, the toxicity of a chemical substance to a fish species, Pimephales promelas, from its chemical structure. The aim is to avoid the death and suffering of the animals used in toxicological assessments. To this end, a database of 908 molecules with their toxicity to the study fish was obtained and, through the company KODE Chemoinformatics, approximately 4000 molecular descriptors were obtained for each substance, which were reduced to 2500 by filtering. Once the database was assembled, and with the help of Python, different combinations of variables were generated and tested with machine learning regression and classification methods, achieving a normalized mean absolute error of 7.3% at best with the regression methods applied and a precision of 60% with the classification methods. Although the regression results are acceptable, they are not reliable enough to replace the traditional method of assessing a chemical's toxicity, since a calculation error could lead to fatal outcomes.
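The normalized mean absolute error reported in the abstract above can be sketched as follows. The abstract does not state how the error was normalized. This sketch assumes division by the range of the observed values, which is one common convention, and the data are invented for illustration.

```python
def normalized_mae(y_true, y_pred):
    """Mean absolute error divided by the range of the observed values.
    Range normalization is an assumption, not taken from the abstract."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("observed values have zero range")
    return mae / value_range


# Invented toxicity values and predictions.
y_true = [1.0, 2.0, 3.0, 5.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
nmae = normalized_mae(y_true, y_pred)  # MAE 0.5 over a range of 4.0
```

Multiplying the result by 100 gives a percentage comparable in form to the 7.3% figure.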

    Advancing oral delivery of biologics: machine learning predicts peptide stability in the gastrointestinal tract

    The oral delivery of peptide therapeutics could facilitate precision treatment of numerous gastrointestinal (GI) and systemic diseases with simple administration for patients. However, the vast majority of licensed peptide drugs are currently administered parenterally due to prohibitive peptide instability in the GI tract. As such, the development of GI-stable peptides is receiving considerable investment. This study provides researchers with the first tool to predict the GI stability of peptide therapeutics based solely on the amino acid sequence. Both unsupervised and supervised machine learning techniques were trained on literature-extracted data describing peptide stability in simulated gastric and small intestinal fluid (SGF and SIF). Based on 109 peptide incubations, classification models for SGF and SIF were developed. The best models utilized the k-Nearest Neighbor (for SGF) and XGBoost (for SIF) algorithms, with accuracies of 75.1% (SGF) and 69.3% (SIF) and F1 scores of 84.5% (SGF) and 73.4% (SIF) under 5-fold cross-validation. Feature importance analysis demonstrated that peptides’ lipophilicity, rigidity, and size were key determinants of stability. These models are now available to those working on the development of oral peptide therapeutics.
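The evaluation protocol named in the abstract above (5-fold cross-validation with F1 scoring) can be sketched in plain Python. This is only an illustration of the general technique. The helper names are hypothetical, the folds are contiguous rather than shuffled or stratified, and nothing here reproduces the study's models or data.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.
    Earlier folds absorb the remainder when n_samples is not divisible by k."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# 109 samples, matching the number of peptide incubations in the abstract.
folds = list(k_fold_indices(109, 5))
fold_sizes = [len(test_idx) for _, test_idx in folds]
```

In practice the samples would usually be shuffled, and often stratified by class, before splitting. Each sample then appears in exactly one test fold, and the per-fold scores are averaged.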

    Genome-wide Protein-chemical Interaction Prediction

    The analysis of protein-chemical reactions on a large scale is critical to understanding the complex interrelated mechanisms that govern biological life at the cellular level. Chemical proteomics is a new research area aimed at genome-wide screening of such chemical-protein interactions. Traditional approaches to such screening involve in vivo or in vitro experimentation, which, while becoming faster with the application of high-throughput screening technologies, remains costly and time-consuming compared to in silico methods. Early in silico methods are dependent on knowing 3D protein structures (docking) or knowing binding information for many chemicals (ligand-based approaches). Typical machine learning approaches follow a global classification approach in which a single predictive model is trained for an entire data set, but such an approach is unlikely to generalize well to the protein-chemical interaction space considering its diversity and heterogeneous distribution. In response to the global approach, work on local models has recently emerged to improve generalization across the interaction space by training a series of independent models, each localized to predict a single interaction. This work examines current approaches to genome-wide protein-chemical interaction prediction and explores new computational methods based on modifications to the boosting framework for ensemble learning. The methods are described and compared to several competing classification methods. Genome-wide chemical-protein interaction data sets are acquired from publicly available resources, and a series of experimental studies are performed in order to compare the performance of each method under a variety of conditions.

    Machine learning approach in pharmacokinetics and toxicity prediction

    Ph.D., Doctor of Philosophy

    A machine learning based drug discovery pipeline: finding new therapies for Cystic Fibrosis

    Master's thesis, Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2019. Technological progress and the growing availability of public data have led to the development of robust machine learning based methodologies for predicting compound activity. These methodologies are faster, more efficient, and less costly than traditional drug discovery methods. Cystic Fibrosis (CF) is a progressive autosomal disease for which new therapies are urgently needed. Mutations in the CFTR gene in CF patients lead to deficient production of the CFTR anion transport membrane channel, causing ionic imbalances and abnormal fluid transport. CF affects several organs, the lungs most severely, and lung problems are usually the cause of premature death. The most prevalent and relevant mutation in CF is the deletion of phenylalanine 508 (F508del-CFTR). For this reason, the main drug discovery efforts are directed at correcting or mitigating the effects of this mutation. A methodology was created using classification and regression machine learning models based on support vector machines and Random Forests to discover compounds with therapeutic potential in CF from publicly accessible compound databases. The most promising compounds were selected and tested in the laboratory through immunofluorescence assays with automated high-throughput screening microscopy and analysis of their effect on F508del-CFTR, based on the efficiency of F508del-CFTR trafficking to the plasma membrane. The 10 compounds with the best results in this assay were validated by Western blot and compared with two known F508del-CFTR corrector compounds. Four compounds were identified as promising therapeutic compounds for CF.

    Graph-Based Feature Selection Approach for Molecular Activity Prediction

    In the construction of QSAR models for the prediction of molecular activity, feature selection is a common task aimed at improving the results and the understanding of the problem. Feature selection eliminates irrelevant and redundant features, reduces the effect of dimensionality problems, and improves the generalization and interpretability of the models. In many feature selection applications, such as those based on ensembles of feature selectors, it is necessary to combine different selection processes. In this work, we evaluate the application of a new feature selection approach to the prediction of molecular activity, based on the construction of an undirected graph to combine base feature selectors. The experimental results demonstrate the efficiency of the graph-based method in terms of classification performance, reduction, and redundancy compared to the standard voting method. The graph-based method can be extended to different feature selection algorithms and applied to other cheminformatics problems.
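For context, the voting baseline that the abstract above compares against can be sketched as a simple threshold on how many base selectors chose each feature. This is an assumed, generic form of voting and does not reproduce the graph-based method itself, whose details the abstract does not give. The descriptor names and the `min_votes` parameter are hypothetical.

```python
from collections import Counter


def vote_features(selections, min_votes):
    """Keep features chosen by at least min_votes base selectors.
    Each selection is the list of features picked by one selector."""
    votes = Counter(f for selection in selections for f in set(selection))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical outputs of three base feature selectors.
selections = [
    ["logP", "MW", "TPSA"],
    ["logP", "TPSA", "HBD"],
    ["logP", "MW", "HBA"],
]
selected = vote_features(selections, min_votes=2)
```

Voting of this kind counts each feature independently, so it cannot account for redundancy between selected features, which is one of the criteria the abstract evaluates.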

    Statistical learning approaches for predicting pharmacological properties of pharmaceutical agents

    Ph.D., Doctor of Philosophy

    Augmenting Structure/Function Relationship Analysis with Deep Learning for the Classification of Psychoactive Drug Activity at Class A G Protein-Coupled Receptors

    G protein-coupled receptors (GPCRs) initiate intracellular signaling pathways via interaction with external stimuli. [1-5] Despite sharing a similar structure and cellular mechanism, GPCRs participate in a uniquely broad range of physiological functions. [6] Due to the size and functional diversity of the GPCR family, these receptors are a major focus for pharmacological applications. [1,7] Current state-of-the-art pharmacology and toxicology research strategies rely on computational methods to efficiently design highly selective, low-toxicity compounds. [9], [10] GPCR-targeting therapeutics are associated with low selectivity, resulting in increased risk of adverse effects and toxicity. Psychoactive drugs active at Class A GPCRs that are used in the treatment of schizophrenia and other psychiatric disorders display promiscuous binding behavior linked to chronic toxicity and high-risk adverse effects. [16-18] We hypothesized that, by combining physicochemical feature engineering with a feedforward neural network, predictive models can be trained for these specific GPCR subgroups that are more efficient and accurate than current state-of-the-art methods. We combined normal mode analysis with deep learning to create a novel framework for the prediction of Class A GPCR/psychoactive drug interaction activities. Our deep learning classifier achieves high classification accuracy (5-HT F1-score = 0.78; DRD F1-score = 0.93) and a 45% reduction in model training time when structure-based feature selection is applied via guidance from an anisotropic network model (ANM). Additionally, we demonstrate the interpretability and application potential of our framework via evaluation of highly clinically relevant Class A GPCR/psychoactive drug interactions guided by our ANM results and deep learning predictions. Our model offers an increased range of applicability compared to other methods due to accessible data compatibility requirements and low model complexity. While this model can be applied to a multitude of clinical applications, we have presented strong evidence for the impact of machine learning in the development of novel psychiatric therapeutics with improved safety and tolerability.