4 research outputs found

    Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction

    The task of learning an expressive molecular representation is central to developing quantitative structure–activity and property relationships. Traditional approaches rely on group additivity rules, empirical measurements or parameters, or the generation of thousands of descriptors. In this paper, we employ a convolutional neural network for this embedding task by treating molecules as undirected graphs with attributed nodes and edges. Simple atom and bond attributes are used to construct atom-specific feature vectors that take the local chemical environment into account using different neighborhood radii. Working directly with the full molecular graph gives models a greater opportunity to identify features relevant to a prediction task. Unlike other graph-based approaches, our atom featurization preserves molecule-level spatial information that significantly enhances model performance. Our models learn to identify important features of atom clusters for the prediction of aqueous solubility, octanol solubility, melting point, and toxicity. Extensions and limitations of this strategy are discussed.
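The idea of atom-specific feature vectors built over several neighborhood radii can be illustrated with a minimal sketch. This is not the authors' model: it has no learned weights, it ignores bond attributes, and the names `atoms_within_radius` and `featurize`, the sum aggregation, and the toy two-element atom features are all assumptions made for illustration.

```python
from collections import deque


def atoms_within_radius(adjacency, start, radius):
    """Return the set of atoms reachable from `start` in at most `radius` bonds."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if distance[atom] == radius:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return set(distance)


def featurize(adjacency, atom_features, radii=(0, 1, 2)):
    """For each atom, concatenate the summed neighborhood features at each radius.

    Hypothetical stand-in for a learned graph convolution: a fixed sum
    replaces the trainable aggregation a neural network would use.
    """
    vectors = {}
    for atom in adjacency:
        dim = len(atom_features[atom])
        vector = []
        for radius in radii:
            hood = atoms_within_radius(adjacency, atom, radius)
            vector.extend(sum(atom_features[a][i] for a in hood) for i in range(dim))
        vectors[atom] = vector
    return vectors


# Heavy atoms of ethanol as an undirected graph: C0 - C1 - O2.
# Toy atom features (assumed): [is_carbon, is_oxygen].
adjacency = {0: [1], 1: [0, 2], 2: [1]}
atom_features = {0: [1, 0], 1: [1, 0], 2: [0, 1]}
vectors = featurize(adjacency, atom_features)
```

Each atom ends up with one block of features per radius, so the two carbons, which have identical raw attributes, receive different vectors because their surroundings differ.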

    Heuristic methods for support vector machines with applications to drug discovery.

    The contributions to computer science presented in this thesis were inspired by the analysis of data generated in the early stages of drug discovery. These data sets are generated by screening compounds against various biological receptors, which gives a first indication of biological activity. To avoid screening inactive compounds, decision rules for selecting compounds are required. Such a decision rule is a mapping from a compound representation to an estimated activity. Hand-coding such rules is time-consuming, expensive and subjective. An alternative is to learn these rules from the available data. This is difficult because the compounds may be characterized by tens to thousands of physical, chemical, and structural descriptors, and it is not known which are most relevant to the prediction of biological activity. Further, the activity measurements are noisy, so the data can be misleading. The support vector machine (SVM) is a statistically well-founded learning machine that is not adversely affected by high-dimensional representations and is robust with respect to measurement inaccuracies. It thus appears to be ideally suited to the analysis of screening data. The novel application of the SVM to this domain highlights some shortcomings of the vanilla SVM. Three heuristics are developed to overcome these deficiencies: a stopping criterion, HERMES, that allows good solutions to be found in less time; an automated method, LAIKA, for tuning the Gaussian kernel SVM; and an algorithm, STAR, that outputs a more compact solution. These heuristics achieve their aims on public domain data and are broadly successful when applied to the drug discovery data. The heuristics and associated data analysis are thus of benefit to both pharmacology and computer science.

    inSARa: Hierarchical Networks for the Analysis, Visualization and Prediction of Structure-Activity Relationships

    The analysis of structure-activity relationships (SARs) of small molecules is a fundamental task in drug discovery, as this knowledge is essential for the medicinal chemist at different stages of drug development. The growing volume of available bioactivity data is a valuable source of this key information, yet organizing and mining these data remains a major challenge. Tackling it requires computational methods that automatically extract SARs and present them as clearly and intuitively as possible. The goal of this thesis was therefore to develop a method called inSARa (short for “intuitive networks for Structure-Activity Relationship analysis”) for intuitive SAR analysis and visualization. The main features of the approach are hierarchical networks of clearly defined substructure relationships based on common pharmacophoric features. The method exploits the synergy that results from combining reduced graphs (RG) with the intuitive concept of the maximum common substructure (MCS). Using inSARa networks, molecular (pharmacophoric) features that are shared or that influence bioactivity are easily identified in large and structurally diverse data sets of hundreds or thousands of molecules. Various analyses (e.g. based on the prediction of bioactivities using kNN regression) show that the molecular representation and perception of similarity used in inSARa are complementary or superior to the commonly used fingerprint-based similarity analysis. The inSARa Hybrid approach, which combines inSARa with fingerprint-based similarity networks in several variants, highlights the advantages that can result from combining both concepts. When analyzing sets of active molecules for a single target, inSARa networks, which are built without using bioactivity information, prove valuable for various essential tasks in SAR analysis. Based on simple rules, not only common pharmacophoric patterns but also bioisosteric exchanges, activity cliffs or “SAR hotspots”, and “activity switches” are easily identified. These different types of SAR information can be identified either by interactive navigation of the hierarchical networks or by automated network analysis (inSARaauto). By analogy with the fingerprint-based SAR Index, the SARdisco Score, which builds on inSARaauto, globally characterizes the distribution of SAR (dis)continuity in inSARa networks. Additionally, inSARa networks of a large number of different targets were compared pairwise on the basis of the proportion of shared RG-MCSs. The results indicate that inSARa networks, which were primarily developed for SAR interpretation, also carry valuable information about polypharmacology. They further show that the RG-MCS-based concept can complement published chemogenomic approaches for the ligand-based analysis of target similarities and the identification of cross-reactivities and off-target relationships. The advantages of the RG-MCS approach are its easy interpretability and its focus on molecular features involved in protein-ligand binding. In summary, owing to its versatility and intuitive concept, the inSARa approach is expected to support and stimulate the development of new and safer drugs.
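The fingerprint-based baseline mentioned in the abstract, bioactivity prediction by kNN regression over similarity, can be sketched as follows. This is only an illustration of the general technique, not the thesis's implementation: fingerprints are modelled as plain sets of "on" bit indices, and the names `tanimoto` and `knn_predict`, the unweighted mean, and the toy activity values are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_predict(query, training, k=2):
    """Predict an activity as the unweighted mean over the k most similar training molecules.

    `training` is a list of (fingerprint, activity) pairs; the data layout is
    a hypothetical choice for this sketch.
    """
    if not training:
        raise ValueError("training set must not be empty")
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    nearest = ranked[:k]
    return sum(activity for _, activity in nearest) / len(nearest)


# Toy data (invented for illustration): two similar molecules and one unrelated one.
training = [
    (frozenset({1, 2, 3}), 6.0),
    (frozenset({1, 2, 4}), 7.0),
    (frozenset({7, 8, 9}), 3.0),
]
prediction = knn_predict(frozenset({1, 2, 3, 4}), training, k=2)
```

Swapping `tanimoto` for a similarity derived from shared reduced-graph substructures would give the kind of comparison the abstract describes, although how inSARa defines that similarity is not specified here.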