19 research outputs found

    New developments on the cheminformatics open workflow environment CDK-Taverna

    Background: The computational processing and analysis of small molecules is at the heart of cheminformatics and structural bioinformatics and their application in, e.g., metabolomics or drug discovery. Pipelining or workflow tools allow for the Lego™-like, graphical assembly of I/O modules and algorithms into a complex workflow which can be easily deployed, modified and tested without the hassle of implementing it in a monolithic application. The CDK-Taverna project aims at building a free open-source cheminformatics pipelining solution through the combination of different open-source projects such as Taverna, the Chemistry Development Kit (CDK) and the Waikato Environment for Knowledge Analysis (WEKA). A first integrated version 1.0 of CDK-Taverna was recently released to the public. Results: The CDK-Taverna project was migrated to the most up-to-date versions of its foundational software libraries with a complete re-engineering of its worker architecture (version 2.0). 64-bit computing and multi-core usage by parallel threads are now supported to allow for fast in-memory processing and analysis of large sets of molecules. Earlier deficiencies, like workarounds for iterative data reading, are removed. The combinatorial chemistry related reaction enumeration features are considerably enhanced. Additional functionality for calculating a natural product likeness score for small molecules is implemented to identify possible drug candidates. Finally, the data analysis capabilities are extended with new workers that provide access to the open-source WEKA library for clustering and machine learning as well as training and test set partitioning. The new features are outlined with usage scenarios. Conclusions: CDK-Taverna 2.0, as an open-source cheminformatics workflow solution, has matured into a freely available and increasingly powerful tool for the biosciences. The combination of the new CDK-Taverna worker family with the workflows already developed by a lively Taverna community and published on myexperiment.org enables molecular scientists to quickly calculate, process and analyse molecular data as typically found in, e.g., today's systems biology scenarios.
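
    The worker re-engineering described above centres on multi-threaded, in-memory processing of large molecule sets. As a rough illustration of that idea outside Taverna, the following sketch parallelises a toy analysis over a list of SMILES strings with plain Java streams and the public CDK SmilesParser API; the input strings and the heavy-atom count used as the "analysis" are illustrative placeholders, not part of CDK-Taverna itself.

```java
// Hedged sketch: parallel in-memory processing of a molecule set, in the spirit of
// CDK-Taverna 2.0's multi-threaded workers. Only the public CDK SmilesParser API is used;
// the SMILES strings and the heavy-atom count are placeholders for a real analysis step.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;

public class ParallelMoleculeProcessing {
    public static void main(String[] args) {
        List<String> smiles = List.of("CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O");
        Map<String, Integer> heavyAtomCounts = smiles.parallelStream()
            .collect(Collectors.toMap(s -> s, s -> {
                try {
                    // One parser per task, because SmilesParser is not guaranteed to be thread-safe.
                    SmilesParser sp = new SmilesParser(SilentChemObjectBuilder.getInstance());
                    IAtomContainer mol = sp.parseSmiles(s);
                    return mol.getAtomCount(); // parsed SMILES keep hydrogens implicit, so this counts heavy atoms
                } catch (Exception e) {
                    return -1; // flag unparsable input instead of aborting the whole batch
                }
            }));
        heavyAtomCounts.forEach((s, n) -> System.out.println(s + " -> " + n + " heavy atoms"));
    }
}
```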

    A constructive approach for discovering new drug leads: Using a kernel methodology for the inverse-QSAR problem

    Background: The inverse-QSAR problem seeks to find a new molecular descriptor from which one can recover the structure of a molecule that possesses a desired activity or property. Surprisingly, there are very few papers providing solutions to this problem. It is a difficult problem because the molecular descriptors involved in the inverse-QSAR algorithm must adequately address the forward QSAR problem for a given biological activity if the subsequent recovery phase is to be meaningful. In addition, one should be able to construct a feasible molecule from such a descriptor. The difficulty of recovering the molecule from its descriptor is the major limitation of most inverse-QSAR methods. Results: In this paper, we describe the reversibility of our previously reported descriptor, the vector space model molecular descriptor (VSMMD), which is based on a vector space model suitable for kernel studies in QSAR modeling. Our inverse-QSAR approach can be described in five steps: (1) generate the VSMMD for the compounds in the training set; (2) map the VSMMD in the input space to the kernel feature space using an appropriate kernel function; (3) design or generate a new point in the kernel feature space using a kernel feature space algorithm; (4) map the feature space point back to the input space of descriptors using a pre-image approximation algorithm; (5) build the molecular structure template using our VSMMD molecule recovery algorithm. Conclusion: The empirical results reported in this paper show that our strategy of using kernel methodology for an inverse quantitative structure-activity relationship is sufficiently powerful to find a meaningful solution for practical problems.
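
    The five-step procedure above is essentially a kernel pre-image pipeline. The sketch below walks through steps 1-4 on plain numeric descriptor vectors rather than the authors' VSMMD: the "new" feature-space point is taken as the mean image of the active training compounds under an RBF kernel, and its pre-image is approximated with the standard fixed-point iteration for Gaussian kernels; step 5, rebuilding a molecule from the recovered descriptor, is specific to the VSMMD recovery algorithm and is omitted. All numbers are toy values.

```java
// Hedged sketch of the inverse-QSAR loop on toy descriptor vectors:
// (1) training descriptors, (2) RBF kernel as the feature-space map,
// (3) target point = mean image of the "active" compounds,
// (4) pre-image approximated by the classical fixed-point iteration for Gaussian kernels.
import java.util.Arrays;

public class InverseQsarSketch {
    static final double GAMMA = 0.5; // RBF width, an arbitrary illustrative choice

    static double rbf(double[] a, double[] b) {
        double d2 = 0;
        for (int i = 0; i < a.length; i++) d2 += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.exp(-GAMMA * d2);
    }

    public static void main(String[] args) {
        // Step 1: toy descriptor vectors for three "active" training compounds.
        double[][] actives = {{1.0, 0.2, 3.1}, {1.2, 0.1, 2.9}, {0.9, 0.3, 3.3}};
        // Steps 2-3: the designed feature-space point is the mean image, i.e. equal weights.
        double[] alpha = {1.0 / 3, 1.0 / 3, 1.0 / 3};

        // Step 4: fixed-point iteration z <- sum_i alpha_i k(x_i, z) x_i / sum_i alpha_i k(x_i, z)
        double[] z = actives[0].clone(); // initialise at one training descriptor
        for (int iter = 0; iter < 100; iter++) {
            double[] num = new double[z.length];
            double den = 0;
            for (int i = 0; i < actives.length; i++) {
                double w = alpha[i] * rbf(actives[i], z);
                den += w;
                for (int j = 0; j < z.length; j++) num[j] += w * actives[i][j];
            }
            for (int j = 0; j < z.length; j++) z[j] = num[j] / den;
        }
        System.out.println("Approximate pre-image descriptor: " + Arrays.toString(z));
    }
}
```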

    The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching

    Background: The Chemistry Development Kit (CDK) is a widely used open source cheminformatics toolkit, providing data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. The library implements a wide variety of cheminformatics algorithms ranging from chemical structure canonicalization to molecular descriptor calculations and pharmacophore perception. It is used in drug discovery, metabolomics, and toxicology. Over the last 10 years, however, the code base has grown significantly, resulting in many complex interdependencies among components and poor performance of many algorithms. Results: We report improvements in CDK v2.0 since the v1.2 release series, specifically addressing the increased functional complexity and poor performance. We first summarize the addition of new functionality, such as atom typing and molecular formula handling, and improvements to existing functionality that have led to significantly better performance for substructure searching, molecular fingerprints, and rendering of molecules. Second, we outline how the CDK has evolved with respect to quality control and the approaches we have adopted to ensure stability, including a code review mechanism. Conclusions: This paper highlights our continued efforts to provide a community-driven, open source cheminformatics library, and shows that such collaborative projects can thrive over extended periods of time, resulting in a high-quality and performant library. By taking advantage of community support and contributions, we show that an open source cheminformatics project can act as a peer-reviewed publishing platform for scientific computing software.
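
    As a concrete illustration of the substructure-searching functionality mentioned above, the following minimal sketch parses a molecule from SMILES and matches a SMARTS pattern against it using CDK classes (SmilesParser, SmartsPattern). The SMILES and SMARTS strings are arbitrary examples, and the exact SmartsPattern package path and factory signature may differ between CDK releases.

```java
// Hedged sketch of substructure searching with CDK 2.x public classes; a minimal example,
// not code from the paper. The carboxylic-acid SMARTS and the aspirin SMILES are arbitrary.
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smarts.SmartsPattern;
import org.openscience.cdk.smiles.SmilesParser;

public class SubstructureSearchExample {
    public static void main(String[] args) throws Exception {
        SmilesParser parser = new SmilesParser(SilentChemObjectBuilder.getInstance());
        IAtomContainer aspirin = parser.parseSmiles("CC(=O)Oc1ccccc1C(=O)O");

        // SmartsPattern compiles the query once and can be reused; recent CDK releases
        // prepare the target (ring/aromaticity perception) internally before matching.
        SmartsPattern carboxylicAcid = SmartsPattern.create("C(=O)[OX2H1]");
        System.out.println("Contains carboxylic acid: " + carboxylicAcid.matches(aspirin));
        System.out.println("Number of matches: " + carboxylicAcid.matchAll(aspirin).count());
    }
}
```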

    Building predictive unbound brain-to-plasma concentration ratio (Kp,uu,brain) models

    Abstract: The blood-brain barrier (BBB) constitutes a dynamic membrane primarily evolved to protect the brain from exposure to harmful xenobiotics. The distribution of synthesized drugs across the BBB is a vital parameter to consider in drug discovery projects involving a central nervous system (CNS) target, since the molecules should be capable of crossing this major hurdle. In contrast, peripherally acting drugs have to be designed to minimize brain exposure, which could otherwise result in undue side effects. It is thus important to establish the BBB permeability of molecules early in the drug discovery pipeline. Previously, most in-silico attempts at predicting brain exposure have relied on the total drug distribution between the blood plasma and the brain. However, it is now understood that the unbound brain-to-plasma concentration ratio (Kp,uu,brain) is the parameter that precisely indicates the BBB availability of compounds. Kp,uu,brain describes the free concentration of the drug molecule in the brain, which, according to the free drug hypothesis, is what causes the relevant pharmacological response at the target site. The current work involves revisiting a model built in 2011 and deployed on an in-house server, and checking its performance on the data collected since then. This gave a satisfying result, demonstrating the stability of the model. The old dataset was then extended with the temporal dataset in order to update the model. This is important for maintaining a substantial chemical space and thereby ensuring good predictability for unknown data. Using other methods and descriptors not used in the previous study, a further improvement in model performance was achieved. Attempts were also made to interpret the model by identifying its most influential descriptors.

    Popular science summary: Predictive model for unbound brain-to-plasma concentration ratio. The blood-brain barrier (BBB) is a dynamic interface evolved to protect the brain from exposure to toxic xenobiotics and to maintain homeostasis. The distribution of drugs across the BBB is critical for any drug discovery project. A drug designed for a target in the brain has to pass through the BBB in sufficient concentration to elicit the desired therapeutic effect. On the other hand, a drug designed for a non-CNS target should be kept away from the brain to avoid fatal side effects. The unbound brain-to-plasma concentration ratio, Kp,uu,brain, is a parameter that describes the distribution of a molecule across the BBB. It represents the free drug concentration in the brain, which is the fraction that elicits the pharmacological effect on the CNS. The experimental measurement of this parameter is time-consuming and laborious. Computational prediction of such properties thus proves to be of great utility, reducing the time and resources spent by aiding in the early elimination of compounds with undesirable qualities. This helps to reduce late-stage compound attrition (failure rate), which has always been a major problem for pharmaceutical industries. Quantitative Structure-Activity Relationship (QSAR) modeling is an approach that attempts to establish a meaningful relationship between the chemical structure of a molecule and its chemical/biological activity. Once established, this relationship can be used to predict the activity of a new compound based on its chemical structure.
    In a typical QSAR experiment, the chemical structures are represented in terms of numerical values called molecular descriptors. The thesis work utilized machine learning algorithms (Support Vector Machine and Random Forest) to define the structure-activity relationship. A predictive model for estimating the unbound brain-to-plasma concentration ratio (Kp,uu,brain) was developed based on a training set of in-house compounds and was mounted in an in-house program (C-lab) in 2011 for routine use. The thesis project involved validating the existing model and updating it by extending the dataset with the data collected since 2011. Different combinations of machine learning algorithms, modeling approaches and molecular descriptors (calculated numerical values representing chemical structures) were used to build the models. Further, by combining the predictions from these models, consensus models were built and validated. Two-class classification models were also evaluated, categorizing compounds as BBB positive (crosses the BBB) or negative (does not cross the BBB). The validation of the old model using a temporal test set (Kp,uu,brain data collected since 2011) gave a promising result, showing stability and good predictive power. However, it is very important to keep the chemical space updated, which defines the purpose of updating the model. The new model (a consensus model with five components) shows a significant improvement in predictive power along with an improvement in classification performance. This model will be uploaded to C-lab and will be accessible for use within AstraZeneca.
    Advisors: Hongming Chen, Ola Engkvist (Computational Chemistry, AstraZeneca R&D Mölndal). Master's Degree Project, 60 credits in Bioinformatics (2014), Department of Biology, Lund University.
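
    The modelling workflow described here (precomputed molecular descriptors, a Random Forest regressor, validation against held-out data) can be sketched with the open-source WEKA library that appears elsewhere on this page, rather than the in-house AstraZeneca tooling actually used in the thesis. The ARFF file name, the assumption that the regression target is the last column, and the tree count are hypothetical placeholders.

```java
// Hedged sketch of descriptor-based Kp,uu,brain regression with WEKA; not the thesis code.
// "kpuu_descriptors.arff" is a hypothetical file of precomputed descriptors plus the target.
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KpuuRegressionSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("kpuu_descriptors.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // assume the last column is the regression target

        RandomForest forest = new RandomForest();
        forest.setNumIterations(500); // number of trees (WEKA 3.8; older releases call this setNumTrees)

        // 10-fold cross-validation as a stand-in for the temporal (time-split) validation in the thesis.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(forest, data, 10, new java.util.Random(42));
        System.out.printf("RMSE = %.3f, R = %.3f%n",
                eval.rootMeanSquaredError(), eval.correlationCoefficient());
    }
}
```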

    A retrosynthetic biology approach to metabolic pathway design for therapeutic production

    Background: Synthetic biology is used to develop cell factories for production of chemicals by constructively importing heterologous pathways into industrial microorganisms. In this work we present a retrosynthetic approach to the production of therapeutics with the goal of developing an in situ drug delivery device in host cells. Retrosynthesis, a concept originally proposed for synthetic chemistry, iteratively applies reversed chemical transformations (reversed enzyme-catalyzed reactions in the metabolic space) starting from a target product to reach precursors that are endogenous to the chassis. So far, a wider adoption of retrosynthesis into the manufacturing pipeline has been hindered by the complexity of enumerating all feasible biosynthetic pathways for a given compound. Results: In our method, we efficiently address the complexity problem by coding substrates, products and reactions into molecular signatures. Metabolic maps are represented using hypergraphs and the complexity is controlled by varying the specificity of the molecular signature. Furthermore, our method enables candidate pathways to be ranked to determine which ones are best to engineer. The proposed ranking function can integrate data from different sources such as host compatibility for inserted genes, the estimation of steady-state fluxes from the genome-wide reconstruction of the organism's metabolism, or the estimation of metabolite toxicity from experimental assays. We use several machine-learning tools in order to estimate enzyme activity and reaction efficiency at each step of the identified pathways. Examples of production in bacteria and yeast for two antibiotics and for one antitumor agent, as well as for several essential metabolites, are outlined. Conclusions: We present here a unified framework that integrates diverse techniques involved in the design of heterologous biosynthetic pathways through a retrosynthetic approach in the reaction signature space. Our engineering methodology enables the flexible design of industrial microorganisms for the efficient on-demand production of chemical compounds with therapeutic applications.
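
    Stripped of molecular signatures and hypergraphs, the retrosynthetic enumeration described above is a backward search from the target compound over reversed reactions until every required precursor is endogenous to the chassis. The following sketch shows only that search skeleton on toy string-labelled compounds; pathway ranking, enzyme-activity estimation and flux analysis are left out.

```java
// Hedged sketch of the retrosynthetic search core on toy data; real implementations (as in
// the paper) operate on molecular signatures and hypergraphs, and enumerate/rank all routes.
import java.util.*;

public class RetrosynthesisSketch {
    record Reaction(String product, List<String> precursors) {}

    public static void main(String[] args) {
        Set<String> endogenous = Set.of("A", "B", "C");       // metabolites native to the chassis
        List<Reaction> reactions = List.of(                   // reversed enzyme-catalysed steps
            new Reaction("target", List.of("X", "B")),
            new Reaction("X", List.of("A", "C")));

        // Breadth-first backward expansion from the target compound.
        Deque<String> toResolve = new ArrayDeque<>(List.of("target"));
        List<Reaction> pathway = new ArrayList<>();
        while (!toResolve.isEmpty()) {
            String compound = toResolve.poll();
            if (endogenous.contains(compound)) continue;      // already available in the host
            reactions.stream()
                .filter(r -> r.product().equals(compound))
                .findFirst()                                  // a real tool enumerates and ranks all options
                .ifPresentOrElse(r -> { pathway.add(r); toResolve.addAll(r.precursors()); },
                                 () -> System.out.println("No route to " + compound));
        }
        pathway.forEach(r -> System.out.println(String.join(" + ", r.precursors()) + " -> " + r.product()));
    }
}
```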

    Enumerating molecules.


    Kernel Methods in Computer-Aided Constructive Drug Design

    A drug is typically a small molecule that interacts with the binding site of some target protein. Drug design involves the optimization of this interaction so that the drug effectively binds with the target protein while not binding with other proteins (an event that could produce dangerous side effects). Computational drug design involves the geometric modeling of drug molecules, with the goal of generating similar molecules that will be more effective drug candidates. It is necessary that algorithms incorporate strategies to measure molecular similarity by comparing molecular descriptors that may involve dozens to hundreds of attributes. We use kernel-based methods to define these measures of similarity. Kernels are general functions that can be used to formulate similarity comparisons. The overall goal of this thesis is to develop effective and efficient computational methods that rely on transparent mathematical descriptors of molecules, with applications to affinity prediction, detection of multiple binding modes, and generation of new drug leads. While in this thesis we derive computational strategies for the discovery of new drug leads, our approach differs from the traditional ligand-based approach. We have developed novel procedures to calculate inverse mappings and subsequently recover the structure of a potential drug lead. The contributions of this thesis are the following:
    1. We propose a vector space model molecular descriptor (VSMMD) based on a vector space model that is suitable for kernel studies in QSAR modeling. Our experiments have provided convincing comparative empirical evidence that our descriptor formulation, in conjunction with kernel-based regression algorithms, can provide sufficient discrimination to predict various biological activities of a molecule with reasonable accuracy.
    2. We present a new component selection algorithm, KACS (Kernel Alignment Component Selection), based on kernel alignment for a QSAR study. Kernel alignment has been developed as a measure of similarity between two kernel functions. In our algorithm, we refine kernel alignment as an evaluation tool, using recursive component elimination to eventually select the most important components for classification. We have demonstrated empirically and proven theoretically that our algorithm works well for finding the most important components in different QSAR data sets.
    3. We extend the VSMMD, in conjunction with a kernel-based clustering algorithm, to the prediction of multiple binding modes, a challenging area of research that has previously been studied by means of time-consuming docking simulations. The results reported in this study provide strong empirical evidence that our strategy has enough resolving power to distinguish multiple binding modes through the use of a standard k-means algorithm.
    4. We develop a set of reverse engineering strategies for QSAR modeling based on our VSMMD. These strategies include: (a) the use of a kernel feature space algorithm to design or modify descriptor image points in a feature space; (b) the deployment of a pre-image algorithm to map the newly defined descriptor image points in the feature space back to the input space of the descriptors; (c) the design of a probabilistic strategy to convert new descriptors to meaningful chemical graph templates.
    The most important aspect of these contributions is the presentation of strategies that actually generate the structure of a new drug candidate. While the training set is still used to generate a new image point in the feature space, the reverse engineering strategies just described allow us to develop a new drug candidate that is independent of issues related to probability distribution constraints placed on test set molecules.
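
    Contribution 2 relies on kernel alignment as a similarity measure between kernels. A minimal sketch of the empirical alignment score A(K, yy^T) = <K, yy^T>_F / (||K||_F ||yy^T||_F) between an RBF kernel matrix and the ideal label kernel is given below; the descriptors and labels are toy values, and the recursive component-elimination loop of KACS itself is not reproduced.

```java
// Hedged sketch: empirical kernel alignment between an RBF kernel matrix K and the ideal
// target kernel yy^T for +/-1 labels. Toy data; illustrates the measure KACS builds on.
public class KernelAlignmentSketch {
    static double rbf(double[] a, double[] b) {
        double d2 = 0;
        for (int i = 0; i < a.length; i++) d2 += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.exp(-0.5 * d2);
    }

    public static void main(String[] args) {
        double[][] x = {{0.1, 0.9}, {0.2, 1.1}, {2.0, -0.5}, {2.2, -0.4}};
        double[] y = {+1, +1, -1, -1}; // class labels

        int n = x.length;
        double kyF = 0, kkF = 0, yyF = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double kij = rbf(x[i], x[j]);
                double tij = y[i] * y[j];     // entry of the ideal target kernel yy^T
                kyF += kij * tij;             // <K, yy^T>_F
                kkF += kij * kij;             // ||K||_F^2
                yyF += tij * tij;             // ||yy^T||_F^2 (= n^2 for +/-1 labels)
            }
        }
        double alignment = kyF / Math.sqrt(kkF * yyF);
        System.out.printf("Kernel alignment with labels: %.3f%n", alignment);
    }
}
```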

    Development and implementation of in silico molecule fragmentation algorithms for the cheminformatics analysis of natural product spaces

    Computational methodologies extracting specific substructures like functional groups or molecular scaffolds from input molecules can be grouped under the term “in silico molecule fragmentation”. They can be used to investigate what specifically characterises a heterogeneous compound class, like pharmaceuticals or Natural Products (NP), and in which aspects such classes are similar or dissimilar. The aim is to determine what specifically characterises NP structures in order to transfer patterns favourable for bioactivity to drug development. As part of this thesis, the first algorithmic approach to in silico deglycosylation, the removal of glycosidic moieties for the study of aglycones, was developed in the form of the Sugar Removal Utility (SRU) (Publication A). The SRU has also proven useful for investigating NP glycoside space: it was applied for this purpose to one of the largest open NP databases, COCONUT (COlleCtion of Open Natural prodUcTs) (Publication B). A contribution was made to the Chemistry Development Kit (CDK) by developing the open Scaffold Generator Java library (Publication C). Scaffold Generator can extract different scaffold types and dissect them into smaller parent scaffolds following the scaffold tree or scaffold network approach. Publication D describes the OngLai algorithm, the first automated method to identify homologous series in input datasets, group the member structures of each series, and extract their common core. To support the development of new fragmentation algorithms, the open Java rich-client graphical user interface application MORTAR (MOlecule fRagmenTAtion fRamework) was developed as part of this thesis (Publication E). MORTAR allows users to quickly execute the steps of importing a structural dataset, applying a fragmentation algorithm, and visually inspecting the results in different ways. All software developed as part of this thesis is freely and openly available (see https://github.com/JonasSchaub).
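
    For readers unfamiliar with scaffold extraction, the sketch below shows the basic operation in the CDK ecosystem this thesis builds on, using CDK's long-standing MurckoFragmenter to reduce a molecule to its ring-and-linker framework. This is an illustration of the general technique, not the API of the Scaffold Generator library itself; the SMILES input is an arbitrary example.

```java
// Hedged sketch: Murcko framework extraction with CDK's MurckoFragmenter, as a stand-in
// illustration of scaffold extraction; Scaffold Generator, SRU and MORTAR have their own APIs.
import org.openscience.cdk.fragment.MurckoFragmenter;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.tools.manipulator.AtomContainerManipulator;

public class ScaffoldExtractionSketch {
    public static void main(String[] args) throws Exception {
        SmilesParser parser = new SmilesParser(SilentChemObjectBuilder.getInstance());
        IAtomContainer diphenhydramine = parser.parseSmiles("CN(C)CCOC(c1ccccc1)c1ccccc1");

        // Configure atom types first, as is commonly needed before running CDK algorithms.
        AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(diphenhydramine);

        // true = keep only the single largest framework, 6 = minimum fragment size in atoms.
        MurckoFragmenter fragmenter = new MurckoFragmenter(true, 6);
        fragmenter.generateFragments(diphenhydramine);

        for (String framework : fragmenter.getFrameworks()) {
            System.out.println("Murcko framework: " + framework); // frameworks returned as SMILES
        }
    }
}
```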

    Automated de novo metabolite identification with mass spectrometry and cheminformatics

    In this thesis, new algorithms and methods that enable the de novo identification of metabolites have been developed. The aim was to find methods to propose candidate structures for unknown metabolites using MSn data as the starting point. These methods have been integrated into a semi-automated pipeline to identify new human metabolites. The discovery of new metabolites will improve our capability to understand disease via its metabolic fingerprint, to develop personalized treatments and to discover new drugs. In addition, the cheminformatics methods presented in this thesis increase our understanding of the properties of human metabolites. The research described in this thesis has shown that the success of de novo metabolite identification relies on the synergy between analytical chemistry methods (i.e. LC-MSn) and cheminformatics tools. Netherlands Organization for Applied Scientific Research (TNO); Netherlands Metabolomics Centre.