1,388 research outputs found

    Development and application of fast fuzzy pharmacophore-based virtual screening methods for scaffold hopping

    The goal of this thesis was the development, evaluation and application of novel virtual screening approaches for the rational compilation of high-quality pharmacological screening libraries. The criteria for high quality were a higher probability of activity among the selected molecules than among randomly selected molecules, and diversity in the retrieved chemotypes, so that the library is prepared for the attrition of single lead structures. To meet the latter criterion, the virtual screening approach had to perform “scaffold hopping”. The first molecular descriptor explicitly reported for that purpose was the topological pharmacophore CATS descriptor, a correlation vector (CV) over all pharmacophore points in a molecule. The representation is alignment-free and thus makes fast screening of large databases feasible. In a first series of experiments the CATS descriptor was conceptually extended to the three-dimensional pharmacophore-pair CATS3D descriptor and the molecular-surface-based SURFCATS descriptor. The scaling of the CATS3D descriptor, the combination of CATS3D with different similarity metrics and the dependence of the CATS3D descriptor on the three-dimensional conformations of the molecules in the virtual screening database were evaluated in retrospective screening experiments. The “scaffold hopping” capabilities of CATS3D and SURFCATS were compared to those of CATS and the substructure fingerprint MACCS keys. Prospective virtual screening with CATS3D similarity searching was applied to TAR RNA and the metabotropic glutamate receptor 5 (mGluR5). A combination of supervised and unsupervised neural networks trained on CATS3D descriptors was applied prospectively to compile a focused but still diverse library of mGluR5 modulators. In a second series of experiments the SQUID fuzzy pharmacophore model method was developed, with the aim of providing a more general query for virtual screening than the CATS family descriptors.
A prospective application of the fuzzy pharmacophore models was performed for TAR RNA ligands. In a last experiment a structure-/ligand-based pharmacophore model was developed for taspase1 based on a homology model of the enzyme. This model was applied prospectively to screen for the first inhibitors of taspase1. The effect of different similarity metrics (Euc: Euclidean distance, Manh: Manhattan distance and Tani: Tanimoto similarity) and different scaling methods (unscaled, scaling1: scaling by the number of atoms, and scaling2: scaling by the added incidences of potential pharmacophore points of atom pairs) on CATS3D similarity searching was evaluated in retrospective virtual screening experiments. Twelve target classes of the COBRA database of annotated ligands from recent scientific literature were used for that purpose. Scaling2, a new development for the CATS3D descriptor, was shown to perform best on average in combination with all three similarity metrics (enrichment factor ef (1%): Manh = 11.8 ± 4.3, Euc = 11.9 ± 4.6, Tani = 12.8 ± 5.1). The Tanimoto coefficient was found to perform best with the new scaling method. With the other scaling methods the Manhattan distance performed best (ef (1%): unscaled: Manh = 9.6 ± 4.0, Euc = 8.1 ± 3.5, Tani = 8.3 ± 3.8; scaling1: Manh = 10.3 ± 4.1, Euc = 8.8 ± 3.6, Tani = 9.1 ± 3.8). Since CATS3D is independent of an alignment, its dependence on a “receptor relevant” conformation might also be weaker than that of other methods such as docking. Using such methods might be a way to overcome problems like protein flexibility or the computationally expensive calculation of many conformers. To test this hypothesis, co-crystal structures of 11 target classes served as queries for virtual screening of the COBRA database. Different numbers of conformations were calculated for the COBRA database.
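The three similarity metrics compared above can be illustrated on two toy correlation vectors. This is a minimal sketch, not the thesis implementation: the vectors are invented, the Tanimoto form shown is the common continuous variant (the abstract does not say which form was used), and `scale_by_atoms` is only one plausible reading of "scaling1". Scaling2 is not attempted because the abstract does not define it precisely enough.

```python
import math


def manhattan(a, b):
    """Manhattan (city-block) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def tanimoto(a, b):
    """Continuous Tanimoto similarity: a.b / (a.a + b.b - a.b).

    Assumption: this generalised form is used here for illustration only.
    """
    ab = sum(x * y for x, y in zip(a, b))
    aa = sum(x * x for x in a)
    bb = sum(y * y for y in b)
    denom = aa + bb - ab
    return 1.0 if denom == 0 else ab / denom


def scale_by_atoms(cv, n_atoms):
    """Hypothetical 'scaling1': divide every bin by the number of atoms."""
    return [v / n_atoms for v in cv]


# Invented four-bin correlation vectors for a query and a database molecule.
query = [2, 0, 1, 3]
candidate = [1, 1, 1, 2]

if __name__ == "__main__":
    print("Manhattan:", manhattan(query, candidate))
    print("Euclidean:", euclidean(query, candidate))
    print("Tanimoto:", tanimoto(query, candidate))
```

Note that the two distances rank molecules in ascending order (smaller is more similar), whereas the Tanimoto coefficient ranks in descending order.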
Using only a single conformation already resulted in a significant enrichment of isofunctional molecules on average (ef (1%) = 6.0 ± 6.5). This observation was also made for ligand classes with many rotatable bonds (e.g. HIV-protease: 19.3 ± 6.2 rotatable bonds in COBRA, ef (1%) = 12.2 ± 11.8). On average, using the maximum number of conformations (on average 37 conformations / molecule) instead of single conformations improved enrichment only 1.1-fold. With more conformations, actives and inactives alike became more similar to the reference compounds according to the CATS3D representations. Applying the same parameters as before to calculate conformations for the crystal structure ligands resulted in an average Cartesian RMSD of the single conformations to the crystal structure conformations of 1.7 ± 0.7 Å. For the maximum number of conformations, the RMSD decreased to 1.0 ± 0.5 Å (1.8-fold improvement on average). To assess the virtual screening performance and the scaffold hopping potential of CATS3D and SURFCATS, these descriptors were compared to CATS and the MACCS keys, a fingerprint based on exact chemical substructures. Retrospective screening of ten classes of the COBRA database was performed. According to the average enrichment factors the MACCS keys performed best (ef (1%): MACCS = 17.4 ± 6.4, CATS = 14.6 ± 5.4, CATS3D = 13.9 ± 4.9, SURFCATS = 12.2 ± 5.5). The classes where MACCS performed best had a lower average fraction of different scaffolds relative to the number of molecules (0.44 ± 0.13) than the classes where CATS performed best (0.65 ± 0.13). CATS3D was the best-performing method for only a single target class with an intermediate fraction of scaffolds (0.55). SURFCATS did not perform best for any class. These results indicate that the CATS and CATS3D descriptors might be better suited to finding novel scaffolds than the MACCS keys.
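The enrichment factors ef (1%) quoted above compare the hit rate in the top-ranked fraction of a screened database with the hit rate of the whole database. The sketch below uses the standard definition; the function name, the rounding of the cut-off and the toy ranking are assumptions made for illustration, not details taken from the thesis.

```python
def enrichment_factor(ranked_labels, fraction=0.01):
    """Enrichment factor for the top `fraction` of a ranked list.

    ranked_labels: 1 for an active, 0 for an inactive, best-ranked first.
    ef = (actives in top / size of top) / (all actives / database size)
    """
    n_total = len(ranked_labels)
    n_actives = sum(ranked_labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty ranking with at least one active")
    # Assumption: the cut-off is rounded to the nearest whole molecule.
    n_top = max(1, int(round(n_total * fraction)))
    hits = sum(ranked_labels[:n_top])
    # Written as a single integer ratio to avoid intermediate rounding error.
    return (hits * n_total) / (n_top * n_actives)


# Toy ranking: 1,000 molecules, 20 actives, 3 of them among the top 10.
ranked = [1, 1, 1] + [0] * 7 + [1] * 17 + [0] * 973

if __name__ == "__main__":
    print(enrichment_factor(ranked, 0.01))
```

An enrichment factor of 1 corresponds to random selection; the upper bound depends on the share of actives in the database, which is one reason averages over target classes carry large standard deviations.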
All methods were also shown to complement each other by retrieving scaffolds that were not found by the other methods. A prospective evaluation of CATS3D similarity searching was carried out for allosteric modulators of metabotropic glutamate receptor 5 (mGluR5). Seven known antagonists of mGluR5 with sub-micromolar IC50 were used as reference ligands for virtual screening of the 20,000 most drug-like compounds – as predicted by an artificial neural network approach – of the Asinex vendor database (194,563 compounds). Eight of 29 virtual screening hits were found to have a Ki below 50 µM in a binding assay. Most of the ligands were only moderately selective for mGluR5 (maximum of > 4.2-fold selectivity) relative to mGluR1, the receptor most similar to mGluR5. One ligand even exhibited a better Ki for mGluR1 than for mGluR5 (mGluR5: Ki > 100 µM, mGluR1: Ki = 14 µM). All hits had scaffolds different from those of the reference molecules. It was demonstrated that the compiled library contained molecules that were different from the reference structures – as estimated by MACCS substructure fingerprints – but were still considered isofunctional by both the CATS and CATS3D pharmacophore approaches. Artificial neural networks (ANN) provide an alternative to similarity searching in virtual screening, with the advantage that they incorporate knowledge from a learning procedure. A combination of artificial neural networks was employed prospectively for mGluR5 to compile a focused but still structurally diverse screening library. Ensembles of neural networks were trained on CATS3D representations of the training data to predict “mGluR5-likeness” and “mGluR5/mGluR1 selectivity” (mGluR1 being the receptor most similar to mGluR5), yielding Matthews correlation coefficients between 0.88 and 0.92 and between 0.88 and 0.91, respectively. The best 8,403 hits (the focused library: the intersection of the best hits from both prediction tasks) from virtually ranking the Enamine vendor database (ca. 1,000,000 molecules) were further analyzed by two self-organizing maps (SOMs), trained on CATS3D descriptors and on MACCS substructure fingerprints.
A diverse and representative subset of the hits was obtained by selecting the molecules most similar to each SOM neuron. Binding studies of the selected compounds (16 molecules from each map) showed that three of the molecules from the CATS3D SOM and two of the molecules from the MACCS SOM bound mGluR5. The best hit, with a Ki of 21 µM, was found in the CATS3D SOM. The selectivity of the compounds for mGluR5 over mGluR1 was low. Since the binding pockets in the two receptors are similar, the general CATS3D representation might not have been appropriate for the prediction of selectivity. In both SOMs new active molecules were found in neurons that did not contain molecules from the training set, i.e. the approach was able to enter new areas of chemical space with respect to mGluR5. The combination of supervised and unsupervised neural networks and CATS3D seemed to be suited to the retrieval of dissimilar molecules with the same class of biological activity, rather than to the optimization of molecules with respect to activity or selectivity. A new virtual screening approach was developed with the SQUID (Sophisticated Quantification of Interaction Distributions) fuzzy pharmacophore method. In SQUID, pairs of Gaussian probability densities are used for the construction of a CV descriptor. The Gaussians represent clusters of atoms comprising the same pharmacophoric feature within an alignment of several active reference molecules. The fuzzy representation of the molecules should enhance the performance in scaffold hopping. Pharmacophore models with different degrees of fuzziness (resolution) can be defined, which might be an appropriate means to compensate for ligand and receptor flexibility.
For virtual screening the 3D distribution of Gaussian densities is transformed into a two-point correlation vector representation which describes the probability density for the presence of atom pairs comprising defined pharmacophoric features. The fuzzy pharmacophore CV was used to rank CATS3D representations of molecules. The approach was validated by retrospective screening for cyclooxygenase 2 (COX-2) and thrombin ligands. A variety of models with different degrees of fuzziness were calculated and tested for both classes of molecules. The best performance was obtained with pharmacophore models reflecting an intermediate degree of fuzziness. Appropriately weighted fuzzy pharmacophore models performed better in retrospective screening than CATS3D similarity searching using single query molecules, for both COX-2 and thrombin (ef (1%): COX-2: SQUID = 39.2, best CATS3D result = 26.6; thrombin: SQUID = 18.0, best CATS3D result = 16.7). The new pharmacophore method was shown to complement MOE pharmacophore models. SQUID fuzzy pharmacophore and CATS3D virtual screening were applied prospectively to retrieve novel scaffolds of RNA-binding molecules that inhibit the Tat-TAR interaction. A pharmacophore model was built from one ligand (acetylpromazine, IC50 = 500 µM) and a fragment of another known ligand (CGP40336A), which was assumed to bind with a binding mode comparable to that of acetylpromazine. The fragment was flexibly aligned to the TAR-bound NMR conformation of acetylpromazine. Using an optimized SQUID pharmacophore model, the 20,000 most drug-like molecules from the SPECS database (229,658 compounds) were screened for Tat-TAR ligands. Both reference inhibitors were also used for CATS3D similarity searching. A set of 19 molecules from the SQUID and CATS3D results was selected for experimental testing.
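The two-point correlation vector mentioned earlier in this abstract can be pictured as a histogram of distances between typed pharmacophore points. The sketch below is a crisp, CATS3D-like illustration under stated assumptions: the three feature labels, the bin width and the number of bins are invented for the example and are not the parameters of CATS3D or SQUID, and the Gaussian (fuzzy) weighting of SQUID is not modelled.

```python
import math
from itertools import combinations, combinations_with_replacement

# Hypothetical feature labels: acceptor, donor, lipophilic.
TYPES = ("A", "D", "L")
N_BINS = 5        # assumed number of distance bins
BIN_WIDTH = 2.0   # assumed bin width in Angstrom

# One block of N_BINS counters per unordered pair of feature types.
PAIRS = list(combinations_with_replacement(TYPES, 2))
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def correlation_vector(points):
    """Count typed point pairs per distance bin.

    points: list of (feature_type, (x, y, z)) tuples.
    Only inter-point distances are used, so no alignment is needed.
    """
    cv = [0] * (len(PAIRS) * N_BINS)
    for (t1, c1), (t2, c2) in combinations(points, 2):
        bin_index = int(math.dist(c1, c2) // BIN_WIDTH)
        if bin_index >= N_BINS:
            continue  # pairs beyond the last bin are ignored in this sketch
        pair = tuple(sorted((t1, t2)))
        cv[PAIR_INDEX[pair] * N_BINS + bin_index] += 1
    return cv


# Toy molecule with three pharmacophore points.
molecule = [("A", (0.0, 0.0, 0.0)), ("D", (3.0, 0.0, 0.0)), ("L", (0.0, 4.0, 0.0))]

if __name__ == "__main__":
    print(correlation_vector(molecule))
```

Because the vector depends only on pairwise distances, translating or rotating the molecule leaves it unchanged, which is what makes such descriptors alignment-free.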
In a fluorescence resonance energy transfer (FRET) assay the best SQUID hit showed an IC50 value of 46 µM, which represents an approximately tenfold improvement over the reference acetylpromazine. The best hit from CATS3D similarity searching showed an IC50 comparable to acetylpromazine (IC50 = 500 µM). Both hits contained molecular scaffolds different from those of the reference molecules. Structure-based pharmacophores provide an alternative to ligand-based approaches, with the advantage that no ligands have to be known in advance and no topological bias is introduced. The latter is favorable, for example, for hopping from peptide-like substrates to drug-like molecules. A homology model of the threonine aspartase taspase1 was calculated based on the crystal structures of a homologous isoaspartyl peptidase. Docking studies of the substrate with GOLD identified a binding mode in which the cleaved bond was situated directly above the reactive N-terminal threonine. The predicted enzyme-substrate complex was used to derive a pharmacophore model for virtual screening for novel taspase1 inhibitors. Virtual screening with the pharmacophore model identified 85 molecules as potential taspase1 inhibitors; however, biochemical data were not available before the end of this thesis. In summary, this thesis demonstrated the successful development, improvement and application of pharmacophore-based virtual screening methods for the compilation of molecule libraries for early-phase drug development. The highest potential of such methods seemed to lie in scaffold hopping, the non-trivial task of finding different molecules with the same biological activity.

    Espaloma-0.3.0: Machine-learned molecular mechanics force field for the simulation of protein-ligand systems and beyond

    Molecular mechanics (MM) force fields -- the models that characterize the energy landscape of molecular systems via simple pairwise and polynomial terms -- have traditionally relied on human expert-curated, inflexible, and poorly extensible discrete chemical parameter assignment rules, namely atom or valence types. Recently, there has been significant interest in using graph neural networks to replace this process, while enabling the parametrization scheme to be learned in an end-to-end differentiable manner directly from quantum chemical calculations or condensed-phase data. In this paper, we extend the Espaloma end-to-end differentiable force field construction approach by incorporating both energy and force fitting directly to quantum chemical data into the training process. Building on the OpenMM SPICE dataset, we curate a dataset containing chemical spaces highly relevant to the broad interest of biomolecular modeling, covering small molecules, proteins, and RNA. The resulting force field, espaloma 0.3.0, self-consistently parametrizes these diverse biomolecular species, accurately predicts quantum chemical energies and forces, and maintains stable quantum chemical energy-minimized geometries. Surprisingly, this simple approach produces highly accurate protein-ligand binding free energies when self-consistently parametrizing protein and ligand. This approach -- capable of fitting new force fields to large quantum chemical datasets in one GPU-day -- shows significant promise as a path forward for building systematically more accurate force fields that can be easily extended to new chemical domains of interest.
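    As a rough illustration of the "simple pairwise and polynomial terms" and of fitting to both energies and forces, the sketch below writes out a harmonic bond term, a Lennard-Jones pair term and a combined squared-error loss. It is a textbook-style example under stated assumptions and does not use or reproduce the Espaloma API; the function names, the `force_weight` parameter and the unit-free numbers are hypothetical.

```python
def harmonic_bond_energy(r, k, r0):
    """Polynomial (quadratic) bond term: E = 1/2 * k * (r - r0)^2."""
    return 0.5 * k * (r - r0) ** 2


def harmonic_bond_force(r, k, r0):
    """Force along the bond coordinate, F = -dE/dr = -k * (r - r0)."""
    return -k * (r - r0)


def lennard_jones_energy(r, epsilon, sigma):
    """Pairwise 12-6 term: E = 4 * eps * ((sigma/r)^12 - (sigma/r)^6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def energy_force_loss(pred_e, ref_e, pred_f, ref_f, force_weight=1.0):
    """Mean squared error on energies plus a weighted one on forces.

    Assumption: a plain sum of the two terms; the paper's actual loss
    and weighting are not given in the abstract.
    """
    e_term = sum((p - q) ** 2 for p, q in zip(pred_e, ref_e)) / len(ref_e)
    f_term = sum((p - q) ** 2 for p, q in zip(pred_f, ref_f)) / len(ref_f)
    return e_term + force_weight * f_term


if __name__ == "__main__":
    # Hypothetical bond parameters and reference values, unit-free.
    k, r0 = 100.0, 1.0
    lengths = [0.9, 1.0, 1.2]
    pred_e = [harmonic_bond_energy(r, k, r0) for r in lengths]
    pred_f = [harmonic_bond_force(r, k, r0) for r in lengths]
    print(energy_force_loss(pred_e, [0.6, 0.0, 1.9], pred_f, [9.0, 0.0, -21.0]))
```

    In a learned force field the parameters `k`, `r0`, `epsilon` and `sigma` would be predicted per bond or atom by a model and adjusted to reduce such a loss; here they are fixed by hand.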

    CIN++: Enhancing Topological Message Passing

    Graph Neural Networks (GNNs) have demonstrated remarkable success in learning from graph-structured data. However, they face significant limitations in expressive power, struggling with long-range interactions and lacking a principled approach to modeling higher-order structures and group interactions. Cellular Isomorphism Networks (CINs) recently addressed most of these challenges with a message passing scheme based on cell complexes. Despite their advantages, CINs make use only of boundary and upper messages, which do not consider a direct interaction between the rings present in the underlying complex. Accounting for these interactions might be crucial for learning representations of many real-world complex phenomena such as the dynamics of supramolecular assemblies, neural activity within the brain, and gene regulation processes. In this work, we propose CIN++, an enhancement of the topological message passing scheme introduced in CINs. Our message passing scheme addresses the aforementioned limitations by letting the cells also receive lower messages within each layer. By providing a more comprehensive representation of higher-order and long-range interactions, our enhanced topological message passing scheme achieves state-of-the-art results on large-scale and long-range chemistry benchmarks. Comment: 21 pages, 9 figures

    Cheminformatics Tools to Explore the Chemical Space of Peptides and Natural Products

    Cheminformatics facilitates the analysis, storage, and collection of large quantities of chemical data, such as molecular structures and molecules' properties and biological activity, and it has revolutionized medicinal chemistry for small molecules. However, its application to larger molecules is still underrepresented. This thesis attempts to fill this gap and extend the cheminformatics approach towards large molecules and peptides. The thesis is divided into two parts. The first part presents the implementation and application of two new molecular descriptors: the macromolecule extended atom pair fingerprint (MXFP) and the MinHashed atom pair fingerprint of radius 2 (MAP4). MXFP is an atom pair fingerprint suitable for large molecules, and here it is used to explore the chemical space of non-Lipinski molecules within the widely used PubChem and ChEMBL databases. MAP4 is a MinHashed hybrid of substructure and atom pair fingerprints suitable for encoding small and large molecules. MAP4 is first benchmarked against commonly used atom pair and substructure fingerprints, and then it is used to investigate the chemical space of microbial and plant natural products with the aid of machine learning and chemical space mapping. The second part of the thesis focuses on peptides and is introduced by a review chapter on approaches to discovering novel peptide structures and describing the known peptide chemical space. Then, a genetic algorithm that uses MXFP in its fitness function is described and challenged to generate peptide analogs of peptidic or non-peptidic queries. Finally, supervised and unsupervised machine learning is used to generate novel antimicrobial and non-hemolytic peptide sequences.
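    The "MinHashed" part of MAP4 refers to MinHash, a general technique for estimating the Jaccard similarity of two sets from short signatures. The sketch below shows MinHash on its own, assuming each molecule has already been turned into a set of string shingles; the shingle strings, the salted SHA-1 hashing and the signature length of 128 are illustrative assumptions and do not reproduce the MAP4 implementation.

```python
import hashlib


def _hash(token, seed):
    """Deterministic 64-bit hash of a token, salted with a seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(shingles, n_perm=128):
    """One minimum hash value per seed; seeds stand in for permutations."""
    if not shingles:
        raise ValueError("shingle set must not be empty")
    return [min(_hash(s, seed) for s in shingles) for seed in range(n_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates the Jaccard index."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard index |A and B| / |A or B| for comparison."""
    return len(a & b) / len(a | b)


# Invented shingles of the form "atom|distance|atom" for two molecules.
mol_a = {"C|1|N", "C|2|O", "N|3|O", "C|1|C"}
mol_b = {"C|1|N", "C|2|O", "C|4|S", "C|1|C"}

if __name__ == "__main__":
    sig_a = minhash_signature(mol_a)
    sig_b = minhash_signature(mol_b)
    print("estimated:", estimated_jaccard(sig_a, sig_b))
    print("exact:", exact_jaccard(mol_a, mol_b))
```

    The estimate becomes tighter as the signature grows, while the signature length stays fixed regardless of molecule size, which is what makes the approach attractive for large molecules.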

    Advancing Biomedicine with Graph Representation Learning: Recent Progress, Challenges, and Future Directions

    Graph representation learning (GRL) has emerged as a pivotal field that has contributed significantly to breakthroughs in various fields, including biomedicine. The objective of this survey is to review the latest advancements in GRL methods and their applications in the biomedical field. We also highlight key challenges currently faced by GRL and outline potential directions for future research. Comment: Accepted by 2023 IMIA Yearbook of Medical Informatics

    Scalable Probabilistic Model Selection for Network Representation Learning in Biological Network Inference

    A biological system is a complex network of heterogeneous molecular entities and their interactions, which contribute to various biological characteristics of the system. Although biological networks provide both an elegant theoretical framework and a mathematical foundation to analyze, understand, and learn from complex biological systems, the reconstruction of biological networks is an important and unsolved problem. Current biological networks are noisy, sparse and incomplete, which limits the ability to create a holistic view of the biological reconstructions, and they thus fail to provide a system-level understanding of the biological phenomena. Experimental identification of missing interactions is both time-consuming and expensive. Recent advancements in high-throughput data generation and significant improvement in computational power have led to novel computational methods to predict missing interactions. However, these methods still suffer from several unresolved challenges. It is challenging to extract information about interactions and incorporate that information into the computational model. Furthermore, biological data are not only heterogeneous but also high-dimensional and sparse, which makes modeling from indirect measurements difficult. The heterogeneous nature and sparsity of biological data pose significant challenges to the design of deep neural network structures, which essentially rely on either empirical or heuristic model selection methods. These unscalable methods rely heavily on expertise and experimentation, which is time-consuming and error-prone, and they are prone to overfitting. Furthermore, complex deep networks tend to be poorly calibrated, with high confidence in incorrect predictions. In this dissertation, we describe novel algorithms that address these challenges.
In Part I, we design novel neural network structures to learn representations for biological entities and further expand the model to integrate heterogeneous biological data for biological interaction prediction. In Part II, we develop a novel Bayesian model selection method to infer the most plausible network structures warranted by data. We demonstrate that our methods achieve state-of-the-art performance on tasks across various domains, including interaction prediction. Experimental studies on various interaction networks show that our method makes accurate and calibrated predictions. Our novel probabilistic model selection approach enables the network structures to evolve dynamically to accommodate incrementally available data. In conclusion, we discuss the limitations of and future directions for the proposed work.

    A guide to machine learning for biologists

    The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.