7 research outputs found

    The Need for a New Generation of Substructure Searching Software

    Advances in synthetic chemistry mean that the molecules now synthesized include increasingly complex entities with mechanical bonds or extensive frameworks. For these complex molecular and supramolecular species, single-crystal X-ray crystallography has proved to be the optimal technique for determining full three-dimensional structures in the solid state. These structures are curated and placed in structural databases, the most comprehensive of which (for organic and metallo-organic structures) is the Cambridge Structural Database (CSD). A question of increasing importance is how users can search such databases effectively for these structures. In this Opinion we highlight some of the classes of complex molecules and supramolecules and the challenges associated with searching for them. We develop the idea of substructure searches that involve topological searches as well as searches for molecular fragments, and we propose significant enhancements to substructure-search programs that are both achievable and highly beneficial to the database user community and the broader chemistry community alike.

    Efficient access methods for very large distributed graph databases

    Subgraph searching is an essential problem in graph databases, but it is also challenging because it involves the NP-complete subgraph isomorphism sub-problem. Filter-Then-Verify (FTV) methods mitigate the performance overhead by using an index to prune out graphs that cannot match the query in a filtering stage, reducing the number of subgraph isomorphism evaluations in a subsequent verification stage. In real applications such as molecular substructure searching, subgraph searching has to be applied to very large databases (tens of millions of graphs). Previous surveys have identified the FTV solutions GraphGrepSX (GGSX) and CT-Index as the best for large databases (thousands of graphs); however, they cannot reach reasonable performance on very large ones (tens of millions of graphs). This paper proposes a generic approach for the distributed implementation of FTV solutions. In addition, three previously proposed methods that improve the performance of GGSX and CT-Index are adapted to run on clusters. The evaluation shows how the resulting solutions provide a substantial performance improvement (a 70-90% reduction in filtering time) in a centralized configuration and how they can be used to achieve efficient subgraph searching over very large databases in cluster configurations. This work has been co-funded by the Ministerio de Economía y Competitividad of the Spanish government and by Mestrelab Research S.L. through the project NEXTCHROM (RTC-2015-3812-2) of the call Retos-Colaboración of the Programa Estatal de Investigación, Desarrollo e Innovación Orientada a los Retos de la Sociedad. The authors wish to thank the financial support provided by Xunta de Galicia under Project ED431B 2018/28S.
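    A minimal sketch of the Filter-Then-Verify pattern described above, assuming RDKit as the toolkit and an in-memory list of SMILES as the "database"; it illustrates the general filter-then-verify idea, not the GGSX or CT-Index index structures themselves.

    # Filter-Then-Verify sketch: a substructure-screening fingerprint prunes
    # candidates before the expensive subgraph-isomorphism check.
    from rdkit import Chem

    def bit_superset(db_fp, query_fp):
        # A molecule can only contain the query if it sets every bit the query sets.
        return (db_fp & query_fp).GetNumOnBits() == query_fp.GetNumOnBits()

    def ftv_search(db_smiles, query_smiles):
        query = Chem.MolFromSmiles(query_smiles)
        query_fp = Chem.PatternFingerprint(query)  # fingerprint designed for substructure screening
        hits = []
        for smi in db_smiles:
            mol = Chem.MolFromSmiles(smi)
            # Filtering stage: cheap bitwise test discards most non-matches.
            if not bit_superset(Chem.PatternFingerprint(mol), query_fp):
                continue
            # Verification stage: exact (NP-complete) subgraph isomorphism test.
            if mol.HasSubstructMatch(query):
                hits.append(smi)
        return hits

    # Toy example: find entries containing a pyridine ring.
    print(ftv_search(["c1ccncc1", "CCO", "c1ccc(cc1)c1ccncc1"], "c1ccncc1"))

    In a production FTV system the fingerprints would be precomputed and held in an index, which could then be partitioned across cluster nodes in the distributed setting the paper targets.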

    Systematic benchmark of substructure search in molecular graphs - From Ullmann to VF2

    Background: Searching for substructures in molecules is among the most elementary tasks in cheminformatics and is nowadays part of virtually every cheminformatics software package. The underlying algorithms, in use for several decades, were designed for application to general graphs, and little effort has been spent on characterizing their performance on molecular graphs. It is therefore not clear how current substructure search algorithms behave on such special graphs; one of the main reasons why such an evaluation was not performed in the past is the absence of appropriate data sets. Results: In this paper, we present a systematic evaluation of Ullmann's and the VF2 subgraph isomorphism algorithms on molecular data. The benchmark set consists of a collection of 1235 SMARTS substructure expressions and selected molecules from the ZINC database. The benchmark evaluates substructure search times for complete database scans as well as for individual substructure-molecule pairs. In detail, we focus on the influence of substructure formulation and size, the impact of molecule size, and the ability of both algorithms to be used on multiple cores. Conclusions: The results show a clear superiority of the VF2 algorithm in all test scenarios. In general, both algorithms solve most instances in less than one millisecond, which we consider acceptable; still, in direct comparison, VF2 is most often several times faster than Ullmann's algorithm. Additionally, Ullmann's algorithm shows a surprising number of run-time outliers.
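    As a concrete illustration of the kind of matching being benchmarked, the sketch below runs a VF2 subgraph-isomorphism test on molecular graphs. It assumes RDKit for parsing and networkx's GraphMatcher (a VF2 implementation) for the matching, with placeholder molecules rather than the ZINC/SMARTS benchmark set.

    import time
    import networkx as nx
    from networkx.algorithms import isomorphism
    from rdkit import Chem

    def mol_to_graph(mol):
        # Node label = element symbol, edge label = bond order.
        g = nx.Graph()
        for atom in mol.GetAtoms():
            g.add_node(atom.GetIdx(), element=atom.GetSymbol())
        for bond in mol.GetBonds():
            g.add_edge(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(),
                       order=bond.GetBondTypeAsDouble())
        return g

    def has_substructure_vf2(target_smiles, query_smiles):
        target = mol_to_graph(Chem.MolFromSmiles(target_smiles))
        query = mol_to_graph(Chem.MolFromSmiles(query_smiles))
        gm = isomorphism.GraphMatcher(
            target, query,
            node_match=isomorphism.categorical_node_match("element", None),
            edge_match=isomorphism.categorical_edge_match("order", None))
        # Note: this is the node-induced variant; chemical substructure search
        # normally uses monomorphism (GraphMatcher.subgraph_is_monomorphic).
        return gm.subgraph_is_isomorphic()

    start = time.perf_counter()
    print(has_substructure_vf2("CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1"))  # benzene in aspirin
    print(f"{(time.perf_counter() - start) * 1000:.2f} ms")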

    Graph-Grammatiken zur Suche und Klassifikation von molekularen Strukturen (Graph grammars for searching and classifying molecular structures)

    We define a graph rewriting system that is easily understandable by humans, yet rich enough to allow very general queries to molecule databases. It is based on the substitution of a single node in a node- and edge-labelled graph (without self-loops) by an arbitrary graph: a rule replaces a single node carrying a given label, together with its incident edges, which also carry specific labels, by a subgraph, and explicitly assigns new endpoints to the edges that lose an endpoint through the replacement. For these graph rewriting systems, we are interested in the subgraph-matching problem. We show that the problem is NP-complete, even on graphs that are stars. As a positive result, we give an algorithm that is polynomial if both the rules and the query graph have bounded degree and bounded cut size, and we demonstrate that the molecular graphs of practically relevant molecules in drug discovery satisfy this property. The algorithm is not a fixed-parameter algorithm; indeed, we show that the problem is W[1]-hard on trees with the degree as the parameter. In addition, different approaches for learning graph grammars are presented, and it is shown to what extent they can be used for substructure search or classification of molecules and what results they achieve. Different aspects of the productions are considered, for example the use of a specific structure recognized in the presented molecules. The results of the presented approaches are compared with, and evaluated against, different machine learning approaches.
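    The sketch below is a minimal, simplified illustration of the single-node replacement rule described above, written with networkx; the labels, the attachment map and the example fragment are assumptions for demonstration, not the formalism or notation used in the thesis.

    import networkx as nx

    def apply_rule(host, node, replacement, attach):
        """Replace `node` in `host` by a copy of `replacement`.

        `attach` maps an incident edge label to the replacement-graph node that
        the redirected edge must now end in (the rule's 'new endpoint').
        """
        g = host.copy()
        incident = [(nbr, data["label"]) for nbr, data in g[node].items()]
        g.remove_node(node)

        # Insert the replacement graph with prefixed node names to avoid clashes.
        mapping = {n: f"r_{n}" for n in replacement.nodes}
        g.update(nx.relabel_nodes(replacement, mapping))

        # Redirect every dangling edge to the endpoint prescribed by the rule.
        for nbr, edge_label in incident:
            g.add_edge(nbr, mapping[attach[edge_label]], label=edge_label)
        return g

    # Toy example: replace a node labelled 'R' by a two-atom fragment C-O,
    # sending single-bond edges to the carbon.
    host = nx.Graph()
    host.add_node("a", label="C")
    host.add_node("x", label="R")
    host.add_edge("a", "x", label="single")

    frag = nx.Graph()
    frag.add_node("c", label="C")
    frag.add_node("o", label="O")
    frag.add_edge("c", "o", label="single")

    print(apply_rule(host, "x", frag, attach={"single": "c"}).edges(data=True))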

    Development of an R-language tool to enhance in silico drug discovery from ethnopharmacologically used plant sources: The example of androgenetic alopecia.

    Herbal medicines have been, are, and will always be a major asset in drug discovery. A freely available, user-friendly suite of computational tools has been developed to enhance a variety of in silico drug discovery processes. The investigation focused on discovering novel leads for the treatment of androgenetic alopecia (AGA) from natural sources already used ethnopharmacologically. A set of twenty-two R code snippets (termed ‘Tool Services’) was created. The Tool Services focus on collecting, manipulating, and analysing data from a variety of sources, including general information, ethnopharmacology, chemistry, pharmacology, targets, diseases, pathways and predictive QSAR modelling data. Sixty-nine plants with established use in AGA were studied and their 2,157 phytochemical ingredients recorded. Taxonomically, more than a third of these plants belong to four families that share many similarities in terms of DNA and phytochemical content. Structural similarity studies on 34 phytochemicals, chosen on the basis of their frequency of occurrence in the plants, revealed similarities among them and with UV-protectants, vascular protectants and anti-inflammatory agents. Seven drugs currently marketed as monotherapies for AGA were structurally compared with our phytochemicals and were also assessed for drug-drug interactions, side effects, adverse drug reactions, and their metabolic fate. During this phase, alopecia as a side effect and as an adverse drug reaction of marketed drugs was also examined. A study of 48 targets revealed a strong relationship with pathways implicated in hair follicle growth and development, and more than half of these genes were linked to diseases such as hypotrichosis. Finally, the binding sites of these targets and the binding affinities of chemicals for these targets were identified. The androgen receptor (AR) is undoubtedly one of the most studied targets in AGA; a QSAR classification model for AR antagonism was therefore built using 206 active and 1,600 inactive compounds, utilising both Random Forest and Naïve Bayes algorithms.
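    A hedged sketch of the kind of QSAR classification workflow mentioned above (a Random Forest over circular fingerprints), assuming RDKit and scikit-learn; the SMILES and activity labels are placeholders, not the 206 active / 1,600 inactive AR antagonism set.

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.ensemble import RandomForestClassifier

    def morgan_fp(smiles, radius=2, n_bits=2048):
        # Circular (Morgan/ECFP-like) fingerprint as a fixed-length bit vector.
        mol = Chem.MolFromSmiles(smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        return np.array(list(fp), dtype=np.int8)

    # Placeholder training data: (SMILES, 1 = AR antagonist, 0 = inactive).
    train = [("CCOc1ccccc1", 1), ("CCN(CC)CC", 0), ("c1ccc2ccccc2c1", 1), ("CCCCCC", 0)]
    X = np.array([morgan_fp(smi) for smi, _ in train])
    y = np.array([label for _, label in train])

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    print(model.predict_proba([morgan_fp("c1ccccc1O")]))  # predicted class probabilities

    A Naïve Bayes counterpart could be swapped in with sklearn.naive_bayes.BernoulliNB, which suits binary fingerprint features.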

    Applications and Variations of the Maximum Common Subgraph for the Determination of Chemical Similarity

    The Maximum Common Substructure (MCS), along with numerous other graph-theoretic techniques, has been widely used in chemoinformatics. A topic that has been studied at Sheffield is the hyperstructure concept, a chemical definition of a superstructure that represents the graph-theoretic union of several molecules. This technique, however, has received little study in the context of similarity-based virtual screening; most hyperstructure literature to date has focused either on construction methodology or on property prediction for small datasets of compounds. The work in this thesis is divided into two parts. The first part describes a method for constructing hyperstructures and then describes the application of a hyperstructure to similarity searching in large compound datasets, comparing it with extended-connectivity fingerprint and MCS similarity. Because hyperstructures performed significantly worse than fingerprints, additional work on various hyperstructure weighting schemes is described. Given the poor performance of hyperstructure and MCS screening compared with fingerprints, it was asked whether the choice of maximum common substructure algorithm and MCS type had an influence. A series of MCS algorithms and MCS types was therefore compared in terms of speed, MCS size, and virtual screening ability. A topologically constrained variant of the MCS was found to be competitive with fingerprints, and fusion of the two techniques improved the overall recall of active compounds.
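    The sketch below contrasts the two similarity measures discussed above, extended-connectivity fingerprint Tanimoto similarity and a simple MCS-based coefficient, assuming RDKit (rdFMCS for the MCS) and two placeholder molecules; the hyperstructure construction and the specific topological constraints used in the thesis are not reproduced here.

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem, rdFMCS

    def tanimoto(smiles_a, smiles_b, radius=2, n_bits=2048):
        # Tanimoto similarity between Morgan (ECFP-like) bit fingerprints.
        fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, nBits=n_bits)
               for s in (smiles_a, smiles_b)]
        return DataStructs.TanimotoSimilarity(*fps)

    def mcs_similarity(smiles_a, smiles_b):
        mols = [Chem.MolFromSmiles(s) for s in (smiles_a, smiles_b)]
        # Ring-aware settings give a crude stand-in for a topology-constrained MCS.
        mcs = rdFMCS.FindMCS(mols, ringMatchesRingOnly=True, completeRingsOnly=True)
        n_a, n_b = (m.GetNumAtoms() for m in mols)
        # Shared atoms over the union of atoms (Tanimoto-style coefficient).
        return mcs.numAtoms / (n_a + n_b - mcs.numAtoms)

    a, b = "CC(=O)Oc1ccccc1C(=O)O", "OC(=O)c1ccccc1O"  # aspirin vs. salicylic acid
    print("ECFP Tanimoto:", round(tanimoto(a, b), 3))
    print("MCS coefficient:", round(mcs_similarity(a, b), 3))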

    Développement de méthodes et d’outils chémoinformatiques pour l’analyse et la comparaison de chimiothèques (Development of chemoinformatics methods and tools for the analysis and comparison of compound libraries)

    New fields have emerged at the interface of biology, chemistry and computer science in response to the many problems associated with drug discovery. This thesis sits at the interface of several of these fields, grouped under the banner of chemoinformatics. Although recent on a human timescale, chemoinformatics is already an integral part of pharmaceutical research. As with bioinformatics, its founding pillar remains the storage, representation, management and computational exploitation of data from chemistry. It is today used mainly in the upstream phases of drug discovery: by combining methods from different fields (chemistry, computer science, mathematics, machine learning, statistics, etc.), it enables the implementation of software tools tailored to the specific problems and data of chemistry, such as the storage of chemical information in databases, substructure searching, data visualisation, and the prediction of physico-chemical and biological properties. Within this multidisciplinary framework, the work presented in this thesis addresses two important aspects of chemoinformatics: (1) the development of new methods to facilitate the visualisation, analysis and interpretation of data associated with sets of molecules, more commonly known as compound libraries (chimiothèques), and (2) the development of software tools implementing these methods.
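    As a small, hedged illustration of one routine library-comparison task of the kind referred to above, the sketch below profiles two compound sets with simple RDKit descriptors; the two example libraries and the choice of descriptors are assumptions for demonstration only.

    from rdkit import Chem
    from rdkit.Chem import Descriptors

    def profile(smiles_list):
        # Summarize a compound library by mean molecular weight and logP.
        mols = [Chem.MolFromSmiles(s) for s in smiles_list]
        mw = [Descriptors.MolWt(m) for m in mols]
        logp = [Descriptors.MolLogP(m) for m in mols]
        return {"mean_MW": sum(mw) / len(mw), "mean_logP": sum(logp) / len(logp)}

    library_a = ["CCO", "CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1"]
    library_b = ["CCCCCCCCCC", "CCCCCCCC(=O)O"]
    for name, lib in (("A", library_a), ("B", library_b)):
        print(name, profile(lib))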