8 research outputs found

    Molecular similarity searching based on deep learning for feature reduction

    The concept of molecular similarity has been widely used in rational drug design, where molecular databases are searched for structurally similar molecules in order to retrieve functionally similar ones. The most widely used conventional similarity methods rely on two-dimensional (2D) fingerprints to evaluate the similarity of molecules to a target query. However, these descriptors include redundant and irrelevant features that can reduce the effectiveness of similarity searching. Moreover, most existing similarity searching methods assume that all features are equally important. This study therefore proposed three approaches for identifying the important features of molecules in chemical datasets. The first approach represents the molecular features using an autoencoder (AE), which removes irrelevant and redundant features. The second approach is a feature selection model based on Deep Belief Networks (DBN), which is used to find the subset of features that are most important. The third approach combines descriptors that complement each other: important features from several descriptors were filtered through the DBN and combined to form a new descriptor for molecular similarity searching. The proposed approaches were evaluated on the standard MDL Drug Data Report (MDDR) dataset. Based on the test results, the three proposed approaches outperformed some of the existing benchmark similarity methods, such as Bayesian Inference Networks (BIN), the Tanimoto Similarity Method (TAN), the Adapted Similarity Measure of Text Processing (ASMTP) and the Quantum-Based Similarity Method (SQB). The three approaches performed better than previously used methods in terms of average recall values, especially on structurally heterogeneous datasets.
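
The Tanimoto baseline and the recall measure mentioned in this abstract can be illustrated with a minimal sketch. It assumes fingerprints are stored as sets of on-bit indices; the molecule ids, bit values, active set and the 50% cutoff are hypothetical toy data, not taken from MDDR, and none of the function names come from the paper.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def rank_by_similarity(query, database):
    """Return molecule ids ordered from most to least similar to the query."""
    return sorted(database, key=lambda mol_id: tanimoto(query, database[mol_id]),
                  reverse=True)


def recall_at_fraction(ranking, actives, fraction):
    """Share of known actives found in the top `fraction` of the ranking."""
    cutoff = max(1, int(len(ranking) * fraction))
    retrieved = set(ranking[:cutoff])
    return len(retrieved & actives) / len(actives)


# Hypothetical toy data: four database molecules and one query fingerprint.
database = {
    "m1": {1, 2, 3, 4},
    "m2": {1, 2, 3},
    "m3": {7, 8, 9},
    "m4": {1, 9},
}
query = {1, 2, 3, 4}
actives = {"m1", "m2", "m4"}

ranking = rank_by_similarity(query, database)
recall = recall_at_fraction(ranking, actives, fraction=0.5)
```

With a real screening database the cutoff would typically be a small fraction such as the top 1% or 5% of the ranked list.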

    Ensemble learning method for the prediction of new bioactive molecules

    Pharmacologically active molecules can provide remedies for a range of different illnesses and infections. Therefore, the search for such bioactive molecules has been an enduring mission. As such, there is a need for a more suitable, reliable, and robust classification method to improve the prediction of new bioactive molecules. In this paper, we adopt a recently developed combination of different boosting methods (AdaBoost) for the prediction of new bioactive molecules. We conducted the experiments using the widely used MDL Drug Data Report (MDDR) database. The proposed boosting method generated better results than other machine learning methods. This finding suggests that the method is suitable for inclusion among the in silico tools used in cheminformatics, computational chemistry and molecular biology. This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

    Hybrid-enhanced siamese similarity models in ligand-based virtual screen

    Information technology has become an integral aspect of the drug development process. Virtual screening (VS) is a computational technique for screening chemical compounds within a reasonable amount of time and cost. Similarity searching is one of the primary tasks in VS; it estimates how similar each database molecule is to a target compound. It is predicated on the idea that molecules with similar structures may also have similar activities. Many techniques for comparing the biological similarity between a target compound and each compound in the database have been established. Although these approaches perform strongly, particularly on molecules with structurally homogeneous actives, they do not perform well enough on structurally heterogeneous compounds. Previous work examined many deep learning methods within the enhanced Siamese similarity model and demonstrated that the Enhanced Siamese Multi-Layer Perceptron similarity model (SMLP) and the Siamese one-dimensional Convolutional Neural Network similarity model (SCNN1D) give good results on structurally heterogeneous molecules. To further improve the retrieval effectiveness of the similarity model, we combine the two best models into one hybrid model. The reason is that each method gives good results in some classes, so combining them may improve the retrieval recall. Several designs of the hybrid model are tested in this study. Experiments on real-world data sets demonstrated that the new approaches outperformed the previous method.

    Fusion of molecular representations and prediction of biological activity using convolutional neural network and transfer learning

    Basic structural features and physicochemical properties of chemical molecules determine their behaviour during chemical, physical, biological and environmental processes, and hence need to be investigated in order to determine and model the actions of a molecule. Computational approaches such as machine learning methods offer an alternative way to predict the physicochemical properties of molecules from their structures. However, the limited accuracy and the error rates of these predictions restrict their use. This study developed three classes of new methods based on deep convolutional neural networks (CNN) for bioactivity prediction of chemical compounds. The molecules are presented to the CNN in a new matrix format that represents the molecular structures. The first class of methods introduced three new molecular descriptors: Mol2toxicophore, based on molecular interaction with toxicophore features; Mol2Fgs, based on a distributed representation for constructing abstract feature maps of a selected set of small molecules; and Mol2mat, a molecular matrix representation adapted from the well-known 2D fingerprint descriptors. The second class was based on merging multiple CNN models to combine all the molecular representations. The third class was based on automatic feature learning, using the values of the neurons in the last layer of the proposed CNN architecture. To evaluate the methods, a series of experiments was conducted on two standard datasets, the MDL Drug Data Report (MDDR) and Sutherland datasets. The MDDR datasets comprised 10 homogeneous and 10 heterogeneous activity classes, whilst the Sutherland datasets comprised four homogeneous activity classes. In the experiments, Mol2toxicophore showed satisfactory prediction rates of 92% and 80% for homogeneous and heterogeneous activity classes, respectively. Mol2Fgs performed better than Mol2toxicophore, with a prediction accuracy of 95% for homogeneous and 90% for heterogeneous activity classes. The Mol2mat representation had the highest prediction accuracy of the three, with 97% and 94% for homogeneous and heterogeneous datasets, respectively. The combined multi-CNN model, which leverages the knowledge acquired from the three molecular representations, produced a better accuracy of 99% for the homogeneous and 98% for the heterogeneous datasets. For molecular similarity searching, using the values of the neurons in the last hidden layer of the multi-CNN model as an automatically learned molecular representation performed well, with an average recall of 88.6% in the top 5% of structures most similar to the target. The results demonstrate that the newly developed methods can be used effectively for bioactivity prediction and molecular similarity searching.

    Condorcet and borda count fusion method for ligand-based virtual screening

    Background: No individual similarity measure always gives the best recall of active molecular structures for all types of activity classes. Recently, data fusion has been used to enhance the effectiveness of ligand-based virtual screening approaches. Data fusion can be implemented using two different approaches: group fusion and similarity fusion. Similarity fusion involves searching using multiple similarity measures. The similarity scores, or rankings, from each similarity measure are combined to obtain the final ranking of the compounds in the database. Results: The Condorcet fusion method was examined. This approach combines the outputs of similarity searches from eleven association and distance similarity coefficients, and the winning measure for each class of molecules, based on Condorcet fusion, was then chosen as the best search method. The recall of active molecules retrieved in the top 5% and significance tests were used to evaluate the proposed method. The MDL Drug Data Report (MDDR), Maximum Unbiased Validation (MUV) and Directory of Useful Decoys (DUD) data sets were used for the experiments and were represented by 2D fingerprints. Conclusions: Simulated virtual screening experiments with the two standard data sets show that Condorcet fusion provides a very simple way of improving ligand-based virtual screening, especially when the active molecules being sought have a low degree of structural heterogeneity. However, the effectiveness of Condorcet fusion increased slightly when structurally highly diverse sets of actives were being sought.
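
The two rank-fusion ideas named in this entry's title can be sketched as follows. This is an illustration only: the abstract does not give the paper's exact procedure, so the Condorcet variant below orders molecules by their number of pairwise majority wins (a Copeland-style count, chosen here as an assumption). The three input rankings are hypothetical, and the sketch assumes every ranking contains the same molecules.

```python
from itertools import combinations


def borda_fusion(rankings):
    """Fuse rankings by Borda count: with n items, position p earns n - 1 - p points."""
    n = len(rankings[0])
    scores = {}
    for ranking in rankings:
        for position, mol in enumerate(ranking):
            scores[mol] = scores.get(mol, 0) + (n - 1 - position)
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda m: (-scores[m], m))


def condorcet_fusion(rankings):
    """Fuse rankings by counting pairwise majority wins (Copeland-style ordering)."""
    positions = [{mol: i for i, mol in enumerate(r)} for r in rankings]
    molecules = list(rankings[0])
    wins = {mol: 0 for mol in molecules}
    for a, b in combinations(molecules, 2):
        a_votes = sum(1 for pos in positions if pos[a] < pos[b])
        b_votes = len(positions) - a_votes
        if a_votes > b_votes:
            wins[a] += 1
        elif b_votes > a_votes:
            wins[b] += 1
    return sorted(molecules, key=lambda m: (-wins[m], m))


# Hypothetical rankings of four molecules from three similarity coefficients.
rankings = [
    ["a", "b", "c", "d"],
    ["a", "b", "c", "d"],
    ["b", "c", "d", "a"],
]

borda_result = borda_fusion(rankings)
condorcet_result = condorcet_fusion(rankings)
```

In this toy case the two methods disagree on the top molecule: "a" wins every pairwise majority vote, while "b" collects more Borda points because one ranking places "a" last.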

    Development of Computational Methods to Predict Protein Pocket Druggability and Profile Ligands using Structural Data

    This thesis presents the development of computational methods and tools that take three-dimensional structural data of protein-ligand complexes as input. The tools are useful to mine, profile and predict data from protein-ligand complexes in order to improve the modelling and understanding of protein-ligand recognition. The thesis is divided into five sub-projects. In addition, unpublished results on positioning water molecules in binding pockets are presented. I developed a statistical model, PockDrug, which combines three properties (hydrophobicity, geometry and aromaticity) to predict the druggability of protein pockets, that is, whether a pocket can bind a drug-like molecule, with results that do not depend on the pocket estimation method. The performance on pockets estimated from apo or holo proteins is better than that previously reported in the literature (Publication I). PockDrug is made available through a web server, PockDrug-Server (http://pockdrug.rpbs.univ-paris-diderot.fr), which additionally includes many tools for protein pocket analysis and characterization (Publication II). I developed a customizable computational workflow based on the superimposition of homologous proteins to mine the structural replacements of functional groups in the Protein Data Bank (PDB). Applied to phosphate groups, we identified a surprisingly high number of non-polar phosphate replacements as well as some mechanisms allowing positively charged replacements. In addition, we observed that ligands adopted a U-shaped conformation at nucleotide binding pockets across phylogenetically unrelated proteins (Publication III). I investigated the prevalence of salt bridges in protein-ligand complexes in the PDB for five basic functional groups. The prevalence ranges from around 70% for guanidinium to 16% for tertiary ammonium cations, the latter apparently being connected to the smaller volume available for interacting groups. In the absence of strong carboxylate-mediated salt bridges, the environment around the basic functional groups studied appeared enriched in functional groups with acidic properties, such as hydroxyl and phenol groups or water molecules (Publication IV). I developed a tool for analysing binding poses obtained by docking. The tool compares a set of docked ligands to a reference bound ligand (which may be a different molecule) and provides a graphical output that plots the shape overlap and a Jaccard score based on comparison of molecular interaction fingerprints. The tool was applied to analyse the docking poses of active ligands at the orexin-1 and orexin-2 receptors found in a combined virtual and experimental screen (Publication V). The review of the literature focuses on protein-ligand recognition, presenting different concepts and current challenges in drug discovery.