194 research outputs found

Linguistic measures of chemical diversity and the "keywords" of molecular collections

Author: A Cadeddu
A Kilgarriff
A Roy
B Kowalczyk
B Zhang
C Bian
C Lipinski
D Conte
D Hoover
EJ Martin
F Font-Clos
F Tweedie
FW Goldberg
G Skoraczyński
GM Maggiora
GM Rishton
JW Raymond
K Kettunen
M Krallinger
M Kubát
M Suggitt
MA Covington
ME Welsch
MM Cone
NG Olinghouse
S Soh
WP Walters
Y Cao
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/05/2018

Computerized linguistic analyses have proven of immense value in comparing and searching through large text collections ("corpora"), including those deposited on the Internet. Indeed, it would nowadays be hard to imagine browsing the Web without, for instance, search algorithms extracting the most appropriate keywords from documents. This paper describes how such corpus-linguistic concepts can be extended to chemistry based on characteristic "chemical words" that span more than traditional functional groups and instead capture the common structural fragments that molecules share. Using these words, it is possible to quantify the diversity of chemical collections/databases in new ways and to define molecular "keywords" by which such collections are best characterized and annotated.
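The abstract above does not spell out its measures, so the following is only a minimal sketch of two standard corpus-linguistic quantities applied to fragment "words": a type-token ratio as a diversity measure, and a smoothed frequency ratio against a reference collection as a keyword score. The fragment strings, the function names, and the add-one smoothing are assumptions for illustration, not the paper's method.

```python
from collections import Counter

def type_token_ratio(words):
    """Distinct words divided by total words; higher means more diverse."""
    return len(set(words)) / len(words) if words else 0.0

def keywords(collection, reference, top=3):
    """Rank words by how over-represented they are in `collection`
    relative to `reference`, using add-one smoothed relative frequencies."""
    col, ref = Counter(collection), Counter(reference)
    vocab = set(col) | set(ref)
    n_col = len(collection) + len(vocab)
    n_ref = len(reference) + len(vocab)

    def ratio(w):
        return ((col[w] + 1) / n_col) / ((ref[w] + 1) / n_ref)

    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(col, key=lambda w: (-ratio(w), w))[:top]

# Hypothetical fragment "words" (SMILES-like strings), one entry per occurrence.
collection = ["c1ccccc1", "C(=O)N", "C(=O)N", "C(=O)N", "S(=O)(=O)N"]
reference = ["c1ccccc1", "c1ccccc1", "c1ccccc1", "C(=O)N", "C(=O)N", "CO"]

ttr = type_token_ratio(collection)
top_words = keywords(collection, reference, top=2)
```

In this toy data the sulfonamide fragment ranks first because it never occurs in the reference collection, which is the intuition behind a collection "keyword".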

Crossref

ScholarWorks@UNIST

Reduced collision fingerprints and pairwise molecular comparisons for explainable property prediction using Deep Learning

Author: MacDougall Thomas
Publication date: 01/08/2021

The relationships between the structure of chemical compounds and their properties are complex and high dimensional. In the drug development process, multiple properties of a compound often need to be optimized simultaneously, further complicating the task. This work explores two representations of chemical compounds for property prediction tasks. The goal of these suggested representations is improved explainability to better understand the compound property optimization process. First, we decompose the Extended Connectivity Fingerprint (ECFP) algorithm and make it more straightforward for human understanding. We replace a collision-prone hash function with a one-to-one substructure-to-bit relationship. We find that this change does not translate to higher predictive performance of a multilayer perceptron compared to ECFP. However, if the capacity of the predictor is lowered to that of a linear predictor, it does perform better than ECFP. Second, we apply machine learning to Matched Molecular Pair Analysis (MMPA), a drug development design paradigm. MMPA compares pairs of highly similar compounds, differing in structure by modification at one site. We train prediction models on pairs of compounds to predict differences in activity. We use pairwise similarity constraints like MMPA, but also use randomly sampled pairs to train the models. We find that models perform better on randomly chosen pairs than on pairs with strict similarity constraints. However, the best pairwise models are not able to beat the prediction performance of the simpler baseline single model.
Both of these investigations, reduced collision fingerprints (RCFP) and pairwise comparisons, aim to approach property prediction in a more explainable way. By using the intuition and experience of medicinal chemists within predictive modelling, we hope to encourage explainability as a necessary component of predictive cheminformatic models.
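The contrast between a collision-prone hash and a one-to-one substructure-to-bit mapping can be illustrated with a small sketch. It assumes the substructures have already been enumerated as string identifiers; the identifiers, the function names, and the use of `zlib.crc32` as the folding hash are hypothetical and are not taken from the thesis or from any ECFP implementation.

```python
import zlib

def folded_bits(substructures, n_bits):
    """Hash each substructure identifier into a fixed-length vector.
    Different substructures can land on the same bit (a collision)."""
    return {zlib.crc32(s.encode()) % n_bits for s in substructures}

def build_vocabulary(molecules):
    """Assign every distinct substructure its own bit index."""
    vocab = {}
    for subs in molecules:
        for s in subs:
            vocab.setdefault(s, len(vocab))
    return vocab

def one_to_one_bits(substructures, vocab):
    """Map substructures to bits without collisions (unknown ones are skipped)."""
    return {vocab[s] for s in substructures if s in vocab}

# Hypothetical substructure identifiers for two molecules.
mol_a = ["C-C", "C=O", "C-N", "c:c", "C-O"]
mol_b = ["C-C", "C-Cl", "c:c"]
vocab = build_vocabulary([mol_a, mol_b])

exact = one_to_one_bits(mol_a, vocab)
folded = folded_bits(mol_a, n_bits=4)  # 5 substructures into 4 bits must collide
```

With the vocabulary approach every set bit can be traced back to exactly one substructure, which is what makes the representation easier to explain; the cost is a vector whose length grows with the number of distinct substructures seen.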

Dépôt Institutionnel Numérique

Methods for the Analysis of Matched Molecular Pairs and Chemical Space Representations

Author: de la Vega de León Antonio
Publication venue: Universitäts- und Landesbibliothek Bonn

Compound optimization is a complex process where different properties are optimized to increase the biological activity and therapeutic effects of a molecule. Frequently, the structure of molecules is modified in order to improve their property values. Therefore, computational analysis of the effects of structure modifications on property values is of great importance for the drug discovery process. It is also essential to analyze chemical space, i.e., the set of all chemically feasible molecules, in order to find subsets of molecules that display favorable property values. This thesis aims to expand the computational repertoire to analyze the effect of structure alterations and visualize chemical space. Matched molecular pairs are defined as pairs of compounds that share a large common substructure and only differ by a small chemical transformation. They have been frequently used to study property changes caused by structure modifications. These analyses are expanded in this thesis by studying the effect of chemical transformations on the ionization state and ligand efficiency, both measures of great importance in drug design. Additionally, novel matched molecular pairs based on retrosynthetic rules are developed to increase their utility for prospective use of chemical transformations in compound optimization. Further, new methods based on matched molecular pairs are described to obtain preliminary SAR information of screening hit compounds and predict the potency change caused by a chemical transformation. Visualizations of chemical space are introduced to aid compound optimization efforts. First, principal component plots are used to rationalize a matched molecular pair based multi-objective compound optimization procedure. Then, star coordinate and parallel coordinate plots are introduced to analyze drug-like subspaces, where compounds with favorable property values can be found. 
Finally, a novel network-based visualization of high-dimensional property space is developed. In conclusion, the applications developed in this thesis expand the methodological spectrum of computer-aided compound optimization.
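The definition of a matched molecular pair given above (a shared large substructure plus a small transformation) can be sketched as follows. This toy example assumes each compound has already been split into a core and a single variable group; a real workflow would derive these by fragmenting structures with a cheminformatics toolkit. The compound names, cores, groups, and potency values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical pre-fragmented compounds: (name, shared core, variable group, pIC50).
compounds = [
    ("cpd1", "core_A", "H", 6.0),
    ("cpd2", "core_A", "F", 6.8),
    ("cpd3", "core_A", "OMe", 5.5),
    ("cpd4", "core_B", "H", 7.1),
]

def matched_pairs(records):
    """Pair compounds that share a core and differ only in the variable group."""
    by_core = defaultdict(list)
    for name, core, group, potency in records:
        by_core[core].append((name, group, potency))
    pairs = []
    for members in by_core.values():
        for (n1, g1, p1), (n2, g2, p2) in combinations(members, 2):
            if g1 != g2:
                # Record the transformation and the potency change it causes.
                pairs.append((n1, n2, f"{g1}>>{g2}", round(p2 - p1, 2)))
    return pairs

pairs = matched_pairs(compounds)
```

Collecting the potency change per transformation over many pairs is what allows statements such as "replacing H with F tends to raise potency" in this kind of analysis.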

bonndoc – Der Publikationsserver der Universität Bonn

Development and Interpretation of Machine Learning Models for Drug Discovery

Author: Balfer Jenny
Publication venue: Universitäts- und Landesbibliothek Bonn

In drug discovery, domain experts from different fields such as medicinal chemistry, biology, and computer science often collaborate to develop novel pharmaceutical agents. Computational models developed in this process must be correct and reliable, but at the same time interpretable. Their findings have to be accessible to experts from fields other than computer science so that they can be validated and improved with domain knowledge. Only if this is the case are the interdisciplinary teams able to communicate their scientific results both precisely and intuitively. This work is concerned with the development and interpretation of machine learning models for drug discovery. To this end, it describes the design and application of computational models for specialized use cases, such as compound profiling and hit expansion. Novel insights into machine learning for ligand-based virtual screening are presented, and limitations in the modeling of compound potency values are highlighted. It is shown that compound activity can be predicted based on high-dimensional target profiles, without the presence of molecular structures. Moreover, support vector regression for potency prediction is carefully analyzed, and a systematic misprediction of highly potent ligands is discovered. Furthermore, a key aspect is the interpretation and chemically accessible representation of the models. Therefore, this thesis focuses especially on methods to better understand and communicate modeling results. To this end, two interactive visualizations for the assessment of naive Bayes and support vector machine models on molecular fingerprints are presented. These visual representations of virtual screening models are designed to provide an intuitive chemical interpretation of the results.

bonndoc – Der Publikationsserver der Universität Bonn

Multi-faceted Structure-Activity Relationship Analysis Using Graphical Representations

Author: Iyer Preeti Ramesh
Publication venue: Universitäts- und Landesbibliothek Bonn

A core focus in medicinal chemistry is the interpretation of structure-activity relationships (SARs) of small molecules. SAR analysis is typically carried out on a case-by-case basis for compound sets that share activity against a given target. Although SAR investigations are not a priori dependent on computational approaches, limitations imposed by the steady rise in activity information have necessitated the use of such methodologies. Moreover, understanding SARs in multi-target space is extremely difficult. Conceptually different computational approaches are reported in this thesis for graphical SAR analysis in single- as well as multi-target space. Activity landscape models are often used to describe the underlying SAR characteristics of compound sets. Theoretical activity landscapes that are reminiscent of topological maps intuitively represent distributions of pair-wise similarity and potency difference information as three-dimensional surfaces. These models provide easy access to the identification of various SAR features. Therefore, such landscapes for actual data sets are generated and compared with graph-based representations. Existing graphical data structures are adapted to include mechanism of action information for receptor ligands to facilitate simultaneous SAR and mechanism-related analyses with the objective of identifying structural modifications responsible for switching molecular mechanisms of action. Typically, SAR analysis focuses on systematic pair-wise relationships of compound similarity and potency differences. Therefore, an approach is reported to calculate SAR feature probabilities on the basis of these pair-wise relationships for individual compounds in a ligand set. The consequent expansion of feature categories improves the analysis of local SAR environments. Graphical representations are designed to avoid a dependence on preconceived SAR models. Such representations are suitable for systematic large-scale SAR exploration.
Methods for the navigation of SARs in multi-target space using simple and interpretable data structures are introduced. In summary, multi-faceted SAR analysis aided by computational means forms the primary objective of this dissertation.

bonndoc – Der Publikationsserver der Universität Bonn

Machine Learning Methods for Modeling Synthesizable Molecules

Author: Bradshaw John
Publication venue: University of Cambridge
Publication date: 01/02/2021

The search for new molecules often involves cycles of design-make-test-analyze steps, where new molecules are designed, synthesized in a lab, tested, and then analyzed to inform what is to be designed next. This thesis proposes new machine learning (ML) methods to augment chemists in the design and make steps of this process, focusing on the tasks of (a) how to use ML to predict chemical reaction outcomes, and (b) how to build generative models to search for new molecules. We take a common approach to both tasks, building our ML models around existing powerful tools and abstractions from the field of chemistry, and in doing so, show that the tasks we tackle are intrinsically linked. Reaction prediction is important for validating synthesis plans before carrying them out. Many previous ML approaches to reaction prediction have treated reactions as either a black box translation or a single graph edit operation. Instead, we propose a model (ELECTRO) that predicts the reaction products through modeling a sequence of electron movements. We show how modeling electron movements in this way has the benefit of being easy for chemists to interpret, and also is a natural format in which to incorporate the constraints of chemistry, such as balanced atom counts before and after a reaction. We show that our model achieves excellent performance on an important subset of chemical reactions and recovers a basic knowledge of chemistry without explicit supervision. In designing new models to search for molecules with particular properties, it is important that the models describe not only what molecule to make, but also crucially how to make it. These instructions form a synthesis plan, describing how easy-to-obtain building blocks can be combined together to form more complex molecules of interest through chemical reactions. Inspired by this real-world process, we develop two machine learning approaches that incorporate reactions into the virtual generation of new molecules. 
We show that aligning our model with the real-world process allows us to better link up the design and make steps involved in molecule search, and permits chemists to examine the practicability of both the final molecules we suggest and their synthetic routes. Molecule search is inherently an extrapolation task, and we show that by building our methods around the inductive biases of modeling reactions, we can generalize to new chemical spaces, suggesting molecules that not only perform well, but are synthesizable too.

Apollo (Cambridge)

MPG.PuRe

Leveraging Transformer Models for Accelerated Drug Discovery

Author: Jiang Songhao
Publication venue: University of Chicago
Publication date: 23/07/2024

In the realm of AI-accelerated drug discovery, particularly in de novo drug design, significant challenges include unpredictable drug responses in clinical trials, biases in predictive models, and the opaque nature of AI methodologies that complicate the understanding of a drug's mechanism of action. These issues have limited the progression of AI-discovered drugs into clinical trials and regulatory approval. Concurrently, the development of me-too drugs, which involve modifications of existing drugs within the same therapeutic class, presents a less risky and potentially more effective avenue. However, the potential of AI to enhance their development remains largely underexplored. This dissertation aims to transform the development of me-too drugs through the application of AI, with a focus on transformer and large language models (LLMs). It introduces innovative frameworks that utilize the representation learning and generative capabilities of transformer models to refine and expedite the me-too drug development process. These methodologies, referred to as "drug optimization", seek to further accelerate the production of effective me-too drugs. This work makes four significant contributions to the field: (1) It proposes two fusion methods that integrate transformer models with graph neural networks, enhancing the precision of binding affinity predictions. (2) It assembles a comprehensive dataset of 10 million binding affinity values across a diverse array of proteins and drugs, providing an invaluable resource for model training and validation. (3) It proposes two generative models for drug optimization, fine-tuned through reinforcement learning, with the goal of automating and expediting the creation of effective me-too drugs. (4) It introduces an innovative bidirectional GPT model for molecular textual sequences (SMILES), enabling precise generative mask infilling for targeted drug optimization. 
By conducting comprehensive evaluations on real-world viral and cancer target proteins, we demonstrate that the proposed drug optimization frameworks can consistently enhance existing molecules/drugs.

Knowledge UChicago

Prediction is a balancing act: importance of sampling methods to balance sensitivity and specificity of predictive models based on imbalanced chemical data sets

Author: Banerjee Priyanka
Dehnbostel Frederic O.
Preissner Robert
Publication date: 01/01/2018

The increase in the number of new chemicals synthesized in past decades has resulted in constant growth in the development and application of computational models for the prediction of activity as well as safety profiles of the chemicals. Most of the time, such computational models and their applications must deal with imbalanced chemical data. It is indeed a challenge to construct a classifier using an imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over a non-sampling method to achieve a well-balanced sensitivity and specificity of a machine learning model trained on imbalanced chemical data. Additionally, this study achieved an accuracy of 93.00%, an AUC of 0.94, an F1 measure of 0.90, a sensitivity of 96.00% and a specificity of 91.00% using SMOTE sampling and a Random Forest classifier for the prediction of Drug Induced Liver Injury (DILI). Our results suggest that, irrespective of the data set used, sampling methods can have a major influence on reducing the gap between the sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class imbalance problem using binary chemical data sets.
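To make the two ideas in this abstract concrete, the sketch below shows a simplified SMOTE-style oversampler (interpolating between a minority sample and its single nearest minority neighbour) and the usual definitions of sensitivity and specificity. It is a toy illustration in plain Python under stated assumptions: the descriptor vectors, labels, and function names are hypothetical, the neighbour count is fixed at one, and none of it reproduces the study's pipeline or results.

```python
import random

def smote_like(minority, n_new, seed=0):
    """Create synthetic minority samples by interpolating between a random
    minority point and its nearest minority neighbour (k = 1 for brevity).
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        nn = min(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        t = rng.random()  # position along the segment from x to its neighbour
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nn)))
    return synthetic

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    Assumes both classes are present in y_true."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # toy descriptor vectors
new_points = smote_like(minority, n_new=4)

sens, spec = sensitivity_specificity([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

A classifier trained on heavily imbalanced data can score well on accuracy while missing most minority-class compounds, which is why the gap between these two numbers is the quantity the abstract focuses on.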

Institutional Repository of the Freie Universität Berlin

Directory of Open Access Journals

NOVEL ALGORITHMS AND TOOLS FOR LIGAND-BASED DRUG DESIGN

Author: MA CHAO
Publication date: 04/09/2012

Computer-aided drug design (CADD) has become an indispensable component in modern drug discovery projects. The prediction of physicochemical and pharmacological properties of candidate compounds effectively increases the probability that drug candidates pass the later phases of clinical trials. Ligand-based virtual screening exhibits advantages over structure-based drug design in terms of its wide applicability and high computational efficiency. The established chemical repositories and reported bioassays form a gigantic knowledge base from which to derive quantitative structure-activity relationships (QSAR) and structure-property relationships (QSPR). In addition, the rapid advance of machine learning techniques suggests new solutions for data-mining huge compound databases. In this thesis, a novel ligand classification algorithm, Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps (LiCABEDS), is reported for the prediction of diverse categorical pharmacological properties. LiCABEDS was successfully applied to model 5-HT1A ligand functionality, ligand selectivity of cannabinoid receptor subtypes, and blood-brain-barrier (BBB) passage. LiCABEDS was implemented and integrated with a graphical user interface, data import/export, automated model training/prediction, and project management. In addition, a non-linear ligand classifier is proposed, using a novel Topomer kernel function in a support vector machine. With the emphasis on green high-performance computing, graphics processing units are alternative platforms for computationally expensive tasks. A novel GPU algorithm was designed and implemented in order to accelerate the calculation of chemical similarities with dense-format molecular fingerprints. Finally, a compound acquisition algorithm is reported to construct a structurally diverse screening library in order to enhance hit rates in high-throughput screening.
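Two of the tasks mentioned above, similarity calculation on dense bit fingerprints and selection of a structurally diverse subset, can be sketched in a few lines. The example uses the Tanimoto coefficient on fingerprints stored as Python integers and a greedy MaxMin picker; both are common generic techniques chosen here for illustration and are not claimed to be the thesis's GPU algorithm or its compound acquisition algorithm. The fingerprints and function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two bit fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0

def maxmin_pick(fingerprints, n_pick):
    """Greedy MaxMin selection: repeatedly add the compound whose highest
    similarity to the already picked set is lowest."""
    picked = [0]  # start from the first compound (an arbitrary choice)
    while len(picked) < min(n_pick, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        best = min(
            candidates,
            key=lambda i: max(tanimoto(fingerprints[i], fingerprints[j]) for j in picked),
        )
        picked.append(best)
    return picked

# Toy 8-bit fingerprints (hypothetical).
fps = [0b11110000, 0b11100000, 0b00001111, 0b00111100]
sim = tanimoto(fps[0], fps[1])
subset = maxmin_pick(fps, n_pick=2)
```

Because every pairwise comparison reduces to bitwise AND, OR, and a population count, this kind of calculation parallelizes well, which is consistent with the motivation for a GPU implementation given in the abstract.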

D-Scholarship@Pitt