16 research outputs found
Efficient Algorithms for Robust Pattern Mining on Structured Objects with Applications to Structure-Based Drug Design
Many complex real-world objects cannot be mapped onto “flat” feature vector representations without a significant loss of information. Recently, the use of graph-based models has gained increasing attention in the field of data mining. This thesis presents efficient methods for fuzzy pattern mining in databases of such graph representations. It is assumed that relatively homogeneous sets of graphs originating from common classes or clusters are given. The class assignment can result either from background knowledge or from a previously performed cluster analysis. The aim of the developed methods is to derive patterns that are characteristic for a given cluster (i.e., that occur in all or most of the members of a cluster). Additionally, it is of interest to derive patterns that are discriminative among different clusters (i.e., that occur in most of the members of one cluster but are missing in most of the members of a different cluster). As a certain variability can also be observed among members of the same cluster, one cannot expect a characteristic pattern to be present in identical form in all members of the cluster. The methods presented in this thesis thus also allow for variations of a characteristic pattern in its different occurrences. This is achieved by arranging the different graph representations in the common framework of a multiple graph alignment. In a graph alignment, nodes with different labels can be assigned to each other (“mismatch”) and dummy nodes can be inserted into the graphs to compensate for “missing” nodes. This increases the robustness of the approach against variations in the considered graphs. A scoring system is used to penalize such constellations, as they should be allowed only if no “valid” match for a certain node exists in the remaining graph instances. The calculation of a graph alignment can thus be interpreted as an optimization problem, and a number of different strategies for the calculation of optimal graph alignments are examined. Once an optimal graph alignment has been calculated, characteristic and discriminative patterns can be derived and a number of different analyses can be performed.
In particular, the concept of multiple graph alignment yields a mechanism for mapping the aligned graphs onto “flat” feature vectors that preserves the correspondence among the individual features of the graph and vector representation.
Relibase+/Cavbase, a huge database of putative protein binding sites, serves as the main motivation for the methods developed in this thesis. In Cavbase, the protein binding pockets are represented through pseudocenters that describe the spatial position of certain physicochemical properties. In a first-order approximation, graphs can be used as descriptors for the Cavbase entries. The methods presented in this thesis can therefore be used to analyze families of related protein binding pockets. The derived patterns can be used to guide the development of novel and selective inhibitors. Regions of a binding pocket that are common to the members of a protein family (i.e., characteristic patterns) could be addressed to gain affinity for the whole protein family. Regions that vary among the different members of a protein family (i.e., discriminative patterns) indicate features that can be addressed to gain selectivity within the examined protein family.
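The alignment idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes an alignment has already been computed and is stored as a list of columns, one entry per graph, with None standing for an inserted dummy node. The sum-of-pairs scoring weights, the 0.75 support threshold, the node labels, and all function names are hypothetical choices made for illustration only.

```python
from collections import Counter

DUMMY = None  # stands for a dummy node inserted to fill a gap


def column_score(column, match=1.0, mismatch=-1.0, dummy=-2.0):
    """Sum-of-pairs score of one alignment column (hypothetical weights)."""
    score = 0.0
    for i in range(len(column)):
        for j in range(i + 1, len(column)):
            a, b = column[i], column[j]
            if a is DUMMY or b is DUMMY:
                score += dummy      # at least one side is a dummy node
            elif a == b:
                score += match      # identical labels
            else:
                score += mismatch   # different labels assigned to each other
    return score


def characteristic_columns(alignment, min_support=0.75):
    """Columns whose most frequent real label occurs in most graphs."""
    hits = []
    for idx, column in enumerate(alignment):
        labels = Counter(label for label in column if label is not DUMMY)
        if not labels:
            continue
        label, count = labels.most_common(1)[0]
        if count / len(column) >= min_support:
            hits.append((idx, label))
    return hits


def feature_vectors(alignment):
    """One vector per graph; position k always refers to alignment column k."""
    n_graphs = len(alignment[0])
    return [[column[g] for column in alignment] for g in range(n_graphs)]


# Toy alignment of four graphs over three columns (labels are made up).
alignment = [
    ("donor", "donor", "donor", "donor"),
    ("acceptor", "acceptor", "donor", "acceptor"),
    ("aromatic", DUMMY, DUMMY, "aliphatic"),
]

scores = [column_score(col) for col in alignment]
patterns = characteristic_columns(alignment)
vectors = feature_vectors(alignment)
```

Because every graph contributes exactly one entry to every column, the resulting vectors have equal length and position k means the same thing in each of them, which is the correspondence the abstract refers to.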
Benchmarking molecular feature attribution methods with activity cliffs
Feature attribution techniques are popular choices within the explainable artificial intelligence toolbox, as they can help elucidate which parts of the provided inputs used by an underlying supervised-learning method are considered relevant for a specific prediction. In the context of molecular design, these approaches typically involve the coloring of molecular graphs, whose presentation to medicinal chemists can be useful for deciding which compounds to synthesize or prioritize. The consistency of the highlighted moieties alongside expert background knowledge is expected to contribute to the understanding of machine-learning models in drug design. Quantitative evaluation of such coloring approaches, however, has so far been limited to substructure identification tasks. We here present an approach that is based on maximum common substructure algorithms applied to experimentally determined activity cliffs. Using the proposed benchmark, we found that molecule coloring approaches in conjunction with classical machine-learning models tend to outperform more modern, deep-learning-based alternatives. However, none of the tested feature attribution methods sufficiently and consistently generalized when confronted with unseen examples.
Benchmarking Molecular Feature Attribution Methods with Activity Cliffs
Feature attribution techniques are popular choices within the explainable artificial intelligence toolbox, as they can help elucidate which parts of the provided inputs used by an underlying supervised-learning method are considered relevant for a specific prediction. In the context of molecular design, these approaches typically involve the coloring of molecular graphs, whose presentation to medicinal chemists can be useful for deciding which compounds to synthesize or prioritize. The consistency of the highlighted moieties alongside expert background knowledge is expected to contribute to the understanding of machine-learning models in drug design. Quantitative evaluation of such coloring approaches, however, has so far been limited to substructure identification tasks. We here present an approach that is based on maximum common substructure algorithms applied to experimentally determined activity cliffs. Using the proposed benchmark, we found that molecule coloring approaches in conjunction with classical machine-learning models tend to outperform more modern, graph-neural-network alternatives. The provided benchmark data are fully open sourced, which we hope will facilitate the testing of newly developed molecular feature attribution techniques. ISSN: 1549-9596, 0095-2338, 1520-514
Coloring Molecules with Explainable Artificial Intelligence for Preclinical Relevance Assessment
Graph neural networks are able to solve certain drug discovery tasks such as molecular property prediction and de novo molecule generation. However, these models are considered 'black-box' and 'hard-to-debug'. This study aimed to improve modeling transparency for rational molecular design by applying the integrated gradients explainable artificial intelligence (XAI) approach for graph neural network models. Models were trained for predicting plasma protein binding, cardiac potassium channel inhibition, passive permeability, and cytochrome P450 inhibition. The proposed methodology highlighted molecular features and structural elements that are in agreement with known pharmacophore motifs, correctly identified property cliffs, and provided insights into unspecific ligand-target interactions. The developed XAI approach is fully open-sourced and can be used by practitioners to train new models on other clinically-relevant endpoints.
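The integrated gradients technique named in the abstract above can be shown on a toy model. The following is a minimal sketch, not the study's implementation: it replaces the graph neural network with a logistic model over a three-element feature vector so that the gradient is available in closed form, and it approximates the path integral with a midpoint Riemann sum. The weights, the input, the all-zero baseline, and the step count are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w):
    """Toy stand-in for a trained model: sigmoid of a weighted sum."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def gradient(x, w):
    """Analytic gradient of the toy model with respect to its input x."""
    s = model(x, w)
    return [s * (1.0 - s) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # position on the straight path
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = gradient(point, w)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the input-minus-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical weights, input, and baseline.
W = [2.0, -1.0, 0.0]
X = [1.0, 1.0, 1.0]
BASELINE = [0.0, 0.0, 0.0]

ATTR = integrated_gradients(X, BASELINE, W)
# Completeness check: attributions should sum to the change in model output.
GAP = abs(sum(ATTR) - (model(X, W) - model(BASELINE, W)))
```

In a molecular setting the per-feature attributions would be mapped back onto atoms or bonds to produce the coloring; how that mapping is done is specific to the model and is not covered by this sketch.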
Artificial intelligence in drug discovery: recent advances and future perspectives
Introduction: Artificial intelligence (AI) has inspired computer-aided drug discovery. The widespread adoption of machine learning, in particular deep learning, in multiple scientific disciplines, and the advances in computing hardware and software, among other factors, continue to fuel this development. Much of the initial skepticism regarding applications of AI in pharmaceutical discovery has started to vanish, consequently benefitting medicinal chemistry.
Areas covered: The current status of AI in chemoinformatics is reviewed. The topics discussed herein include quantitative structure-activity/property relationship and structure-based modeling, de novo molecular design, and chemical synthesis prediction. Advantages and limitations of current deep learning applications are highlighted, together with a perspective on next-generation AI for drug discovery.
Expert opinion: Deep learning-based approaches have only begun to address some fundamental problems in drug discovery. Certain methodological advances, such as message-passing models, spatial-symmetry-preserving networks, hybrid de novo design, and other innovative machine learning paradigms, will likely become commonplace and help address some of the most challenging questions. Open data sharing and model development will play a central role in the advancement of drug discovery with AI. ISSN: 1746-0441, 1746-045
Graph Alignment: Robust Data Mining on Structured Objects
In bioinformatics and chemoinformatics, one is often concerned with the study of objects that have a complex and variable internal structure, e.g., protein structures or small organic molecules. The structure of such objects is frequently correlated with their relevant properties (e.g., biological function or toxicity), although the connection between structural features and properties is often complex. Graphs offer an appropriate formalism for modeling structured objects, as they allow a representation of structural features free from losses. Therefore, we propose a method that allows one to examine relationships among the existence of structural patterns in graphs and their membership in certain classes. The method is designed to be very robust against errors and noise as these occur frequentl
Coloring Molecules with Explainable Artificial Intelligence for Preclinical Relevance Assessment
Graph neural networks are able to solve certain drug discovery tasks such as molecular property prediction and de novo molecule generation. However, these models are considered “black-box” and “hard-to-debug”. This study aimed to improve modeling transparency for rational molecular design by applying the integrated gradients explainable artificial intelligence (XAI) approach for graph neural network models. Models were trained for predicting plasma protein binding, hERG channel inhibition, passive permeability, and cytochrome P450 inhibition. The proposed methodology highlighted molecular features and structural elements that are in agreement with known pharmacophore motifs, correctly identified property cliffs, and provided insights into unspecific ligand–target interactions. The developed XAI approach is fully open-sourced and can be used by practitioners to train new models on other clinically relevant endpoints. © 2021 American Chemical Society. ISSN: 1549-9596, 0095-2338, 1520-514