15 research outputs found

    From Knowledgebases to Toxicity Prediction and Promiscuity Assessment

    Polypharmacology marked a paradigm shift in drug discovery from the traditional ‘one drug, one target’ approach to a multi-target perspective, indicating that highly effective drugs favorably modulate multiple biological targets. This ability of drugs to show activity towards many targets is referred to as promiscuity, an essential phenomenon that may also lead to undesired side effects. While activity at therapeutic targets provides the desired biological response, toxicity often results from non-specific modulation of off-targets. Safety, efficacy and pharmacokinetics have been the primary concerns behind the failure of a majority of candidate drugs. Computer-based (in silico) models that can predict pharmacological and toxicological profiles complement the ongoing efforts to lower the high attrition rates. High-confidence bioactivity data are a prerequisite for the development of robust in silico models. Additionally, data quality has been a key concern when integrating data from publicly accessible bioactivity databases. A majority of the bioactivity data originates from high-throughput screening campaigns and the medicinal chemistry literature. However, large numbers of screening hits are considered false positives for a number of reasons. In stark contrast, many compounds do not demonstrate biological activity despite being tested in hundreds of assays. This thesis work employs cheminformatics approaches to contribute to the aforementioned diverse, yet highly related, aspects that are crucial in rationalizing and expediting drug discovery. Knowledgebase resources of approved and withdrawn drugs were established and enriched with information integrated from multiple databases. These resources are useful not only in small molecule discovery and optimization, but also in the elucidation of mechanisms of action and off-target effects.
In silico models were developed to predict the effects of small molecules on nuclear receptor and stress response pathways and on the potassium channel encoded by the human Ether-à-go-go-Related Gene (hERG). Chemical similarity and machine-learning based methods were evaluated while highlighting the challenges involved in the development of robust models using public domain bioactivity data. Furthermore, the true promiscuity of the potentially frequent hitter compounds was identified and their mechanisms of action were explored at the molecular level by investigating target-ligand complexes. Finally, the chemical and biological spaces of the extensively tested, yet inactive, compounds were investigated to reconfirm their potential to be promising candidates.
Abstract (translated from German): Polypharmacology describes a paradigm shift from "one drug, one target" to "one drug, many targets" and at the same time shows that highly effective drugs can only develop their full effect through interaction with multiple targets. The biological activity of a drug is directly associated with its side effects, which can be explained by its interaction with therapeutic targets and off-targets (promiscuity). An imbalance in these interactions often results in lack of efficacy, toxicity or unfavorable pharmacokinetics, which accounts for the failure of many potential drugs during preclinical and clinical development. Early prediction of the pharmacological and toxicological profile from the chemical structure using computer-based (in silico) models can help to improve the drug development process. Reliable bioactivity data are a prerequisite for successful prediction. However, data quality is often a central problem in data integration.
The cause is the use of different bioassays and readouts, whose data are largely obtained from primary and confirmatory bioassays. While a large proportion of the hits from primary assays are classified as false positives, some substances show no biological activity although they have been tested extensively in both assay types ("extensively assayed compounds"). In this work, various cheminformatics methods were developed and applied to address the aforementioned problems, to point out possible solutions and ultimately to accelerate drug research. To this end, non-redundant, manually validated knowledgebases of approved and withdrawn drugs were created and enriched with additional information in order to advance the discovery and optimization of small organic molecules. A key tool here is the elucidation of their mechanisms of action and off-target interactions. For the further characterization of side effects, the focus was placed on nuclear receptors, pathways involving stress receptors, and the hERG channel, which were simulated with in silico models. These models were built using an integrative approach that combines state-of-the-art algorithms such as similarity comparisons and machine learning. To ensure high predictive quality, explicit attention was paid to data quality and chemical diversity when evaluating the datasets. The in silico models were further extended by examining substructure filters more closely in order to distinguish true mechanisms of action from non-specific binding behavior (false-positive substances).
Finally, the chemical and biological space of extensively tested, yet inactive, small organic molecules ("extensively assayed compounds") was investigated and compared with currently approved drugs to confirm their potential as promising candidates.

    Substructural Analysis Using Evolutionary Computing Techniques

    Substructural analysis (SSA) was one of the very first machine learning techniques to be applied to chemoinformatics in the area of virtual screening. Given a set of compounds, typically described by their fragment occurrence data (such as 2D fingerprints), SSA computes a weight for each fragment that reflects its contribution to the activity (or inactivity) of compounds containing that fragment. The overall probability of activity for a compound is then computed by summing or otherwise combining the weights of the fragments present in the compound. A variety of weighting schemes based on specific equations are available for this purpose. This thesis seeks to improve the effectiveness of SSA using two evolutionary computation methods inspired by genetics, namely the genetic algorithm (GA) and genetic programming (GP). Building on previous studies, ten published SSA weighting schemes were analysed and compared in a simulated virtual screening experiment. The analysis showed the most effective weighting scheme to be the R4 equation, which belongs to the document-based weighting schemes. A second experiment investigated a GA-based weighting scheme for SSA in comparison with the R4 weighting scheme. The GA is simple in concept, focusing purely on generating suitable weights, and effective in operation. The findings show that the GA-based SSA is superior to the R4-based SSA, both in terms of active compound retrieval rate and predictive performance. A third experiment investigated a GP-based SSA. The experimental results showed that the GP-based SSA was superior to the existing SSA weighting schemes. In general, however, the GP-based SSA was found to be less effective than the GA-based SSA.
A final experiment described in this thesis explored the feasibility of applying data fusion to both the GA and the GP. Data fusion produces a final ranking from multiple ranked lists, based on one of several fusion rules. The results indicate that data fusion is a good method to boost GA- and GP-based SSA searching. The RKP rule was considered the most effective fusion rule.
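The fragment-weighting idea described in this abstract can be sketched in a few lines. The example below is a minimal illustration, not the thesis's method: it uses a smoothed log-odds weight in the spirit of relevance-weighting schemes, which is not necessarily the exact R4 equation evaluated in the thesis. The function names, the 0.5 pseudo-counts and the toy fingerprints are all assumptions made for illustration.

```python
import math


def fragment_weights(actives, inactives):
    """Return a log-odds weight per fragment.

    `actives` and `inactives` are lists of fingerprints, each fingerprint
    being a set of fragment identifiers (hypothetical representation).
    """
    n_act, n_inact = len(actives), len(inactives)
    fragments = set().union(*actives, *inactives)
    weights = {}
    for frag in fragments:
        act = sum(frag in fp for fp in actives)      # actives containing frag
        inact = sum(frag in fp for fp in inactives)  # inactives containing frag
        # 0.5 pseudo-counts (an assumption) keep both odds finite and non-zero.
        odds_act = (act + 0.5) / (n_act - act + 0.5)
        odds_inact = (inact + 0.5) / (n_inact - inact + 0.5)
        weights[frag] = math.log(odds_act / odds_inact)
    return weights


def score(fingerprint, weights):
    """Score a compound by summing the weights of the fragments it contains."""
    return sum(weights.get(frag, 0.0) for frag in fingerprint)


# Toy training data: fragment identifiers are arbitrary labels.
actives = [{"a", "b"}, {"a", "c"}]
inactives = [{"c", "d"}, {"d"}]
weights = fragment_weights(actives, inactives)

# Rank unseen compounds by decreasing score.
candidates = {"x": {"a", "b"}, "y": {"c"}, "z": {"d"}}
ranking = sorted(candidates, key=lambda name: score(candidates[name], weights), reverse=True)
```

A GA-based variant, as described in the abstract, would search for the weight values directly instead of computing them from a fixed equation; the scoring step would stay the same.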

    Graph based pattern discovery in protein structures

    The rapidly growing body of 3D protein structure data provides new opportunities to study the relation between protein structure and protein function. Local structure patterns of proteins have been the focus of recent efforts to link structural features found in proteins to protein function. In addition, structure patterns have demonstrated value in applications such as predicting protein-protein interactions, engineering proteins, and designing novel medicines. My thesis introduces graph-based representations of protein structure and new subgraph mining algorithms to identify recurring structure patterns common to a set of proteins. These techniques enable families of proteins exhibiting similar function to be analyzed for structural similarity. Previous approaches to protein local structure pattern discovery operate in a pairwise fashion and have prohibitive computational cost when scaled to families of proteins. The graph mining strategy is robust in the face of errors in the structure and errors in the set of proteins thought to share a function. Two collaborations with domain experts at the UNC School of Pharmacy and the UNC Medical School demonstrate the utility of these techniques. The first is to predict the function of several newly characterized protein structures. The second is to identify conserved structural features in evolutionarily related proteins.

    Metabolomics Data Processing and Data Analysis—Current Best Practices

    Metabolomics data analysis strategies are central to transforming raw metabolomics data files into meaningful biochemical interpretations that answer biological questions or generate novel hypotheses. This book contains a variety of papers from a Special Issue on the theme “Best Practices in Metabolomics Data Analysis”. Reviews and strategies for the whole metabolomics pipeline are included, and key areas such as metabolite annotation and identification, compound and spectral databases and repositories, and statistical analysis are highlighted in various papers. Altogether, this book contains valuable information for researchers just starting their metabolomics careers as well as for more experienced researchers looking for additional knowledge and best practice to complement key parts of their metabolomics workflows.

    Development of Computational Methods to Predict Protein Pocket Druggability and Profile Ligands using Structural Data

    This thesis presents the development of computational methods and tools that take as input three-dimensional structural data of protein-ligand complexes. The tools are useful to mine, profile and predict data from protein-ligand complexes to improve the modeling and the understanding of protein-ligand recognition. This thesis is divided into five sub-projects. In addition, unpublished results about positioning water molecules in binding pockets are also presented. I developed a statistical model, PockDrug, which combines three properties (hydrophobicity, geometry and aromaticity) to predict the druggability of protein pockets, with results that do not depend on the pocket estimation method. The performance on pockets estimated from apo or holo proteins is better than that previously reported in the literature (Publication I). PockDrug is made available through a web server, PockDrug-Server (http://pockdrug.rpbs.univ-paris-diderot.fr), which additionally includes many tools for protein pocket analysis and characterization (Publication II). I developed a customizable computational workflow based on the superimposition of homologous proteins to mine the structural replacements of functional groups in the Protein Data Bank (PDB). Applied to phosphate groups, we identified a surprisingly high number of non-polar phosphate replacements as well as some mechanisms allowing positively charged replacements. In addition, we observed that ligands adopted a U-shaped conformation at nucleotide binding pockets across phylogenetically unrelated proteins (Publication III). I investigated the prevalence of salt bridges in protein-ligand complexes in the PDB for five basic functional groups. The prevalence ranges from around 70% for guanidinium to 16% for tertiary ammonium cations, in the latter case appearing to be connected to a smaller volume available for interacting groups.
In the absence of strong carboxylate-mediated salt bridges, the environment around the basic functional groups studied appeared enriched in functional groups with acidic properties such as hydroxyl and phenol groups or water molecules (Publication IV). I developed a tool that allows the analysis of binding poses obtained by docking. The tool compares a set of docked ligands to a reference bound ligand (which may be a different molecule) and provides a graphical output that plots the shape overlap and a Jaccard score based on a comparison of molecular interaction fingerprints. The tool was applied to analyse the docking poses of active ligands at the orexin-1 and orexin-2 receptors found as a result of a combined virtual and experimental screen (Publication V). The review of the literature focuses on protein-ligand recognition, presenting different concepts and current challenges in drug discovery.
Abstract (translated from Finnish): This dissertation presents computer-aided methods and tools based on the three-dimensional structures of protein-ligand complexes. They are suited to mining, optimizing and predicting structural data of protein-ligand complexes. The aim is to improve both the modeling and the understanding of protein-ligand recognition. The tools are described in the dissertation as five separate sub-projects. In addition, as yet unpublished results on the positioning of water molecules in protein binding pockets are presented. I developed a statistical model, which I call PockDrug, that combines three properties (hydrophobicity, geometry and aromaticity) to predict the suitability of protein pockets as drug development targets, such that the results are independent of the pocket estimation method. Prediction for pockets of apo and holo proteins works better than previously described in the literature (Publication I).
PockDrug is freely available to users from the PockDrug web server (http://pockdrug.rpbs.univ-paris-diderot.fr), which also offers several tools for the analysis and characterization of protein binding sites (Publication II). I also developed a customizable computer-aided workflow, based on the superimposition of similar proteins, to mine structural replacements of functional groups from the Protein Data Bank (PDB). Applying this to phosphate groups, I identified a surprisingly large number of non-polar phosphate replacements and some mechanisms that allow positively charged replacements. In addition, I observed that ligands adopted a U-shaped conformation in the nucleotide binding pockets of phylogenetically unrelated proteins (Publication III). I studied the prevalence of salt bridges in protein-ligand complexes in the PDB for five basic functional groups. The prevalence of salt bridges ranged from 70 percent for the guanidinium ion to 16 percent for tertiary ammonium cations. In the latter case, the scarcity of salt bridges appears to depend on there being less space for interacting groups. Where the basic groups examined did not take part in strong carboxylate-mediated salt bridges, their environment appeared to be rich in acidic functional groups such as hydroxyl and phenol groups as well as water molecules (Publication IV). Finally, I developed a tool that enables the analysis of binding poses obtained from docking. The tool compares a set of docked ligands to a bound reference ligand, which may be a different molecule. The graphical output is a plot of the shape similarity of the ligands and a Jaccard score based on molecular interaction fingerprints.
The tool was applied to the analysis of the binding poses of active ligands found for the orexin-1 and orexin-2 receptors by a combined virtual and experimental screen (Publication V).
Abstract (translated from French): This thesis presents the development of computational methods and tools based on the three-dimensional structure of protein-ligand complexes. These methods are used to extract, optimize and predict data from the structure of the complexes in order to improve the modeling and understanding of the recognition between a protein and a ligand. The thesis work is divided into five projects. In addition, a study on the positioning of water molecules in binding sites was also developed and is presented. In the first part, a statistical model, PockDrug, was established. It combines three protein pocket properties (hydrophobicity, geometry and aromaticity) to predict the druggability of protein pockets, that is, whether a protein pocket can bind a drug-like molecule. The model is optimized to be independent of the different pocket estimation methods. The quality of the predictions is better on pockets estimated from both apo and holo proteins and is superior to that of other models in the literature (Publication I). The PockDrug model is available on a web server, PockDrug-Server (http://pockdrug.rpbs.univ-paris-diderot.fr), which includes other tools for the analysis and characterization of protein pockets. In the second part, a protocol based on the superimposition of homologous proteins was developed to extract structural replacements of functional chemical groups from the Protein Data Bank (PDB). Applied to phosphates, a large number of non-polar replacements were identified, notably some that may be positively charged.
Some replacement mechanisms could thus be analyzed. We observed, for example, that the ligand adopts a U-shaped conformation in nucleotide binding sites regardless of the phylogeny of the proteins (Publication III). In the fourth part, the prevalence of salt bridges for five basic chemical groups was studied in protein-ligand complexes. The percentage of salt bridges ranges from 70% for guanidinium to 16% for the tertiary amine, which has the smallest volume available around it to accommodate an interacting group. The absence of a strong acid such as a carboxylic acid to form a salt bridge is compensated by an environment enriched in functional chemical groups with acidic properties, such as hydroxyl groups, phenol groups or water molecules (Publication IV). Finally, a tool for analyzing ligand poses obtained by molecular docking was developed. This tool compares these poses to a reference ligand, which may be a different molecule, by combining the shape overlap between the pose and the reference ligand with a Jaccard score based on a comparison of the molecular interaction fingerprints of the reference ligand and the pose. This method was used to analyze molecular docking results for ligands active at the orexin 1 and 2 receptors. These active ligands were found from results combining virtual and experimental screening. The associated literature review focuses on the molecular recognition of a ligand by a protein and presents different concepts and challenges in the search for new drugs.
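The Jaccard score on interaction fingerprints mentioned in this abstract can be illustrated with a short sketch. The representation below (a fingerprint as a set of residue and interaction-type pairs), the residue labels, and the convention that two empty fingerprints score 0 are assumptions for illustration; the thesis's tool may encode fingerprints differently.

```python
def jaccard(fp_a, fp_b):
    """Jaccard similarity of two interaction fingerprints given as sets."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty fingerprints are scored as 0
    return len(a & b) / len(union)


# Hypothetical fingerprints: (residue, interaction type) pairs.
reference = {("ASP45", "hbond"), ("PHE102", "pi_stack"), ("TYR200", "hydrophobic")}
pose = {("ASP45", "hbond"), ("TYR200", "hydrophobic"), ("LEU88", "hydrophobic")}

similarity = jaccard(reference, pose)  # 2 shared interactions out of 4 distinct ones
```

Because the score depends only on which interactions are present, it can be computed even when the docked ligand and the reference ligand are different molecules, which matches the use case described above.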

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.

    Characterization, classification and alignment of protein-protein interfaces

    Protein structural models provide essential information for the research on protein-protein interactions. In this dissertation, we describe two projects on the analysis of protein interactions using structural information. The focus of the first is to characterize and classify different types of interactions. We discriminate between biological obligate and biological non-obligate interactions, and crystal packing contacts. To this end, we defined six interface properties and used them to compare the three types of interactions in a hand-curated dataset. Based on the analysis, a classifier, named NOXclass, was constructed using a support vector machine algorithm in order to generate predictions of interaction types. NOXclass was tested on a non-redundant dataset of 243 protein-protein interactions and reaches an accuracy of 91.8%. The program is beneficial for structural biologists for the interpretation of protein quaternary structures and to form hypotheses about the nature of protein-protein interactions when experimental data are yet unavailable. In the second part of the dissertation, we present Galinter, a novel program for the geometrical comparison of protein-protein interfaces. The Galinter program aims at identifying similar patterns of different non-covalent interactions at interfaces. It is a graph-based approach optimized for aligning non-covalent interactions. A scoring scheme was developed for estimating the statistical significance of the alignments. We tested the Galinter method on a published dataset of interfaces. Galinter alignments agree with those delivered by methods based on interface residue comparison and backbone structure comparison. In addition, we applied Galinter on four medically relevant examples of protein mimicry. Our results are consistent with previous human-curated analysis.
The Galinter program provides an intuitive method of comparative analysis and visualization of binding modes and may assist in the prediction of interaction partners, and the design and engineering of protein interactions and interaction inhibitors

    Automatic learning for the classification of chemical reactions and in statistical thermodynamics

    This Thesis describes the application of automatic learning methods for a) the classification of organic and metabolic reactions, and b) the mapping of Potential Energy Surfaces (PES). The classification of reactions was approached with two distinct methodologies: a representation of chemical reactions based on NMR data, and a representation of chemical reactions from the reaction equation based on the physico-chemical and topological features of chemical bonds. NMR-based classification of photochemical and enzymatic reactions. Photochemical and metabolic reactions were classified by Kohonen Self-Organizing Maps (Kohonen SOMs) and Random Forests (RFs) taking as input the difference between the 1H NMR spectra of the products and the reactants. Such a representation can be applied in the automatic analysis of changes in the 1H NMR spectrum of a mixture and their interpretation in terms of the chemical reactions taking place. Examples of possible applications are the monitoring of reaction processes, evaluation of the stability of chemicals, or even the interpretation of metabonomic data. A Kohonen SOM trained with a data set of metabolic reactions catalysed by transferases was able to correctly classify 75% of an independent test set in terms of the EC number subclass. Random Forests improved the correct predictions to 79%. With photochemical reactions classified into 7 groups, an independent test set was classified with 86-93% accuracy. The data set of photochemical reactions was also used to simulate mixtures with two reactions occurring simultaneously. Kohonen SOMs and Feed-Forward Neural Networks (FFNNs) were trained to classify the reactions occurring in a mixture based on the 1H NMR spectra of the products and reactants. Kohonen SOMs allowed the correct assignment of 53-63% of the mixtures (in a test set). Counter-Propagation Neural Networks (CPNNs) gave similar results.
The use of supervised learning techniques allowed an improvement in the results. They were improved to 77% of correct assignments when an ensemble of ten FFNNs were used and to 80% when Random Forests were used. This study was performed with NMR data simulated from the molecular structure by the SPINUS program. In the design of one test set, simulated data was combined with experimental data. The results support the proposal of linking databases of chemical reactions to experimental or simulated NMR data for automatic classification of reactions and mixtures of reactions. Genome-scale classification of enzymatic reactions from their reaction equation. The MOLMAP descriptor relies on a Kohonen SOM that defines types of bonds on the basis of their physico-chemical and topological properties. The MOLMAP descriptor of a molecule represents the types of bonds available in that molecule. The MOLMAP descriptor of a reaction is defined as the difference between the MOLMAPs of the products and the reactants, and numerically encodes the pattern of bonds that are broken, changed, and made during a chemical reaction. The automatic perception of chemical similarities between metabolic reactions is required for a variety of applications ranging from the computer validation of classification systems, genome-scale reconstruction (or comparison) of metabolic pathways, to the classification of enzymatic mechanisms. Catalytic functions of proteins are generally described by the EC numbers that are simultaneously employed as identifiers of reactions, enzymes, and enzyme genes, thus linking metabolic and genomic information. Different methods should be available to automatically compare metabolic reactions and for the automatic assignment of EC numbers to reactions still not officially classified. 
In this study, the genome-scale data set of enzymatic reactions available in the KEGG database was encoded by the MOLMAP descriptors, and was submitted to Kohonen SOMs to compare the resulting map with the official EC number classification, to explore the possibility of predicting EC numbers from the reaction equation, and to assess the internal consistency of the EC classification at the class level. A general agreement with the EC classification was observed, i.e. a relationship between the similarity of MOLMAPs and the similarity of EC numbers. At the same time, MOLMAPs were able to discriminate between EC sub-subclasses. EC numbers could be assigned at the class, subclass, and sub-subclass levels with accuracies up to 92%, 80%, and 70% for independent test sets. The correspondence between chemical similarity of metabolic reactions and their MOLMAP descriptors was applied to the identification of a number of reactions mapped into the same neuron but belonging to different EC classes, which demonstrated the ability of the MOLMAP/SOM approach to verify the internal consistency of classifications in databases of metabolic reactions. RFs were also used to assign the four levels of the EC hierarchy from the reaction equation. EC numbers were correctly assigned in 95%, 90%, 85% and 86% of the cases (for independent test sets) at the class, subclass, sub-subclass and full EC number level, respectively. Experiments for the classification of reactions from the main reactants and products were performed with RFs: EC numbers were assigned at the class, subclass and sub-subclass level with accuracies of 78%, 74% and 63%, respectively. In the course of the experiments with metabolic reactions we suggested that the MOLMAP/SOM concept could be extended to the representation of other levels of metabolic information such as metabolic pathways.
Following the MOLMAP idea, the pattern of neurons activated by the reactions of a metabolic pathway is a representation of the reactions involved in that pathway, that is, a descriptor of the metabolic pathway. This reasoning enabled the comparison of different pathways, the automatic classification of pathways, and a classification of organisms based on their biochemical machinery. The three levels of classification (from bonds to metabolic pathways) made it possible to map and perceive chemical similarities between metabolic pathways, even for pathways of different types of metabolism and pathways that do not share similarities in terms of EC numbers. Mapping of PES by neural networks (NNs). In a first series of experiments, ensembles of Feed-Forward NNs (EnsFFNNs) and Associative Neural Networks (ASNNs) were trained to reproduce PES represented by the Lennard-Jones (LJ) analytical potential function. The accuracy of the method was assessed by comparing the results of molecular dynamics simulations (thermal, structural, and dynamic properties) obtained from the NNs-PES and from the LJ function. The results indicated that for LJ-type potentials, NNs can be trained to generate accurate PES to be used in molecular simulations. EnsFFNNs and ASNNs gave better results than single FFNNs. The NN models show a remarkable ability to interpolate between distant curves and accurately reproduce potentials to be used in molecular simulations. The purpose of the first study was to systematically analyse the accuracy of different NNs. Our main motivation, however, is reflected in the next study: the mapping of multidimensional PES by NNs to simulate, by Molecular Dynamics or Monte Carlo, the adsorption and self-assembly of solvated organic molecules on noble-metal electrodes. Indeed, for such complex and heterogeneous systems the development of suitable analytical functions that fit quantum mechanical interaction energies is a non-trivial or even impossible task.
The data consisted of energy values, from Density Functional Theory (DFT) calculations, at different distances, for several molecular orientations and three electrode adsorption sites. The results indicate that NNs require a data set large enough to cover well the diversity of possible interaction sites, distances, and orientations. NNs trained with such data sets can perform equally well or even better than analytical functions. Therefore, they can be used in molecular simulations, particularly for the ethanol/Au (111) interface which is the case studied in the present Thesis. Once properly trained, the networks are able to produce, as output, any required number of energy points for accurate interpolations
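For reference, the Lennard-Jones analytical function that the networks in the first series of experiments were trained to reproduce has the standard 12-6 form V(r) = 4ε[(σ/r)^12 − (σ/r)^6]. The sketch below evaluates it and samples a small grid of (distance, energy) pairs of the kind that could serve as training targets. The reduced units (ε = σ = 1), the grid range and the variable names are assumptions for illustration, not the settings used in the thesis.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Standard 12-6 Lennard-Jones pair potential at separation r."""
    if r <= 0:
        raise ValueError("separation r must be positive")
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# The potential crosses zero at r = sigma and has its minimum, -epsilon,
# at r = 2**(1/6) * sigma.
r_min = 2 ** (1 / 6)
v_min = lennard_jones(r_min)

# Illustrative training grid in reduced units (assumed range 0.9 to 3.0).
training_points = [
    (0.9 + 0.1 * i, lennard_jones(0.9 + 0.1 * i)) for i in range(22)
]
```

A network fitted to such points is then judged, as in the abstract, by whether simulations driven by the fitted surface reproduce the properties obtained with the analytical function.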

    Exploring Consensus RNA Substructural Patterns Using Subgraph Mining

    © 2017 IEEE. Frequently recurring RNA structural motifs play important roles in the RNA folding process and in interactions with other molecules. Traditional index-based and shape-based schemas are useful in modeling RNA secondary structures but ignore the structural discrepancies between individual RNA family members. Further, in-depth analysis of the underlying substructure patterns is hampered by varied and unnormalized substructure data. This prevents us from understanding RNA functions and their inherent synergistic regulation networks. This article thus proposes a novel labeled graph-based algorithm, RnaGraph, to uncover frequent RNA substructure patterns. Attribute data and graph data are combined to characterize diverse substructures and their correlations, respectively. Further, a top-k graph pattern mining algorithm is developed to extract interesting substructure motifs by integrating frequency and similarity. The experimental results show that our methods assist not only in modelling complex RNA secondary structures but also in identifying hidden but interesting RNA substructure patterns.