
    Significant Subgraph Mining with Multiple Testing Correction

    The problem of finding itemsets that are statistically significantly enriched in a class of transactions is complicated by the need to correct for multiple hypothesis testing. Pruning untestable hypotheses was recently proposed as a strategy for this task of significant itemset mining. It was shown to lead to greater statistical power, i.e., the discovery of more truly significant itemsets, than the standard Bonferroni correction on real-world datasets. An open question, however, is whether this strategy of excluding untestable hypotheses also leads to greater statistical power in subgraph mining, in which the number of hypotheses is much larger than in itemset mining. Here we answer this question by an empirical investigation on eight popular graph benchmark datasets. We propose a new efficient search strategy, which always returns the same solution as the state-of-the-art approach and is approximately two orders of magnitude faster. Moreover, we exploit the dependence between subgraphs by considering the effective number of tests and thereby further increase the statistical power. Comment: 18 pages, 5 figures, accepted to the 2015 SIAM International Conference on Data Mining (SDM15)
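    The pruning idea above rests on each pattern's minimum attainable p-value: a subgraph occurring in too few graphs can never reach significance, so it need not count toward the correction factor. Below is a minimal sketch of this testability argument in the style of Tarone's correction, assuming a one-sided Fisher's exact test; the supports are hypothetical stand-ins for enumerated subgraph supports, not the paper's algorithm.

        from scipy.special import comb

        def min_attainable_pvalue(x, n, n1):
            # The most extreme 2x2 table puts as many of the x occurrences as
            # possible in the positive class (n1 of n graphs); the one-sided
            # Fisher p-value then equals the hypergeometric pmf at that extreme.
            if x <= n1:
                return comb(n1, x, exact=True) / comb(n, x, exact=True)
            return comb(n - n1, x - n1, exact=True) / comb(n, x, exact=True)

        def tarone_threshold(supports, n, n1, alpha=0.05):
            # Smallest k such that at most k hypotheses are testable at level
            # alpha/k; alpha/k is then the corrected significance threshold.
            psis = [min_attainable_pvalue(x, n, n1) for x in supports]
            k = 1
            while sum(p <= alpha / k for p in psis) > k:
                k += 1
            return alpha / k

        # Hypothetical supports of candidate subgraphs: 50 graphs, 20 positive.
        supports = [2, 3, 3, 5, 8, 12, 20, 33, 40]
        print(tarone_threshold(supports, n=50, n1=20))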

    Mining significant substructure pairs for interpreting polypharmacology in drug-target network

    A key feature of current drug-target networks is that drugs often bind to multiple targets, a phenomenon known as polypharmacology or drug promiscuity. Recent literature has indicated that relatively small fragments in both drugs and targets are crucial in forming polypharmacology. We hypothesize that the principles behind polypharmacology are embedded in paired fragments in the molecular graphs and amino acid sequences of drug-target interactions. We developed a fast, scalable algorithm for mining significantly co-occurring subgraph-subsequence pairs from drug-target interactions. A noteworthy feature of our approach is that it captures significant paired patterns of subgraphs and subsequences, whereas the literature so far has considered patterns of either drugs or targets only. Significant substructure pairs allow the grouping of drug-target interactions into clusters, covering approximately 75% of interactions containing approved drugs. These clusters were highly exclusive to each other, being statistically significant and logically implying that each cluster corresponds to a distinct type of polypharmacology. These exclusive clusters cannot be easily obtained by using either drug or target information alone, but are naturally found by highlighting significant substructure pairs in drug-target interactions. These results confirm the effectiveness of our method for interpreting polypharmacology in drug-target networks.
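    As a hedged illustration of the co-occurrence test at the core of such mining, the sketch below scores one hypothetical subgraph-subsequence pair across a set of drug-target interactions with a one-sided Fisher's exact test; the boolean match arrays are random stand-ins for real substructure and subsequence matching. In the full algorithm this test would be applied to every enumerated pair, with a multiple-testing correction such as the one sketched above.

        import numpy as np
        from scipy.stats import fisher_exact

        rng = np.random.default_rng(0)
        n_interactions = 500
        has_subgraph = rng.random(n_interactions) < 0.3   # drug contains subgraph?
        has_subseq = rng.random(n_interactions) < 0.25    # target contains subsequence?

        # 2x2 contingency table over interactions for this candidate pair.
        a = int(np.sum(has_subgraph & has_subseq))
        b = int(np.sum(has_subgraph & ~has_subseq))
        c = int(np.sum(~has_subgraph & has_subseq))
        d = int(np.sum(~has_subgraph & ~has_subseq))

        odds, p = fisher_exact([[a, b], [c, d]], alternative="greater")
        print(f"odds ratio {odds:.2f}, one-sided p-value {p:.3g}")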

    Clustering files of chemical structures using the Székely-Rizzo generalization of Ward's method

    Ward's method is extensively used for clustering chemical structures represented by 2D fingerprints. This paper compares Ward clusterings of 14 datasets (containing between 278 and 4332 molecules) with those obtained using the Székely-Rizzo clustering method, a generalization of Ward's method. The clusters resulting from the two methods were evaluated by the extent to which the various classifications were able to group active molecules together, using a novel criterion of clustering effectiveness. Analysis of a total of 1400 classifications (Ward and Székely-Rizzo clustering methods, 14 different datasets, 5 different fingerprints and 10 different distance coefficients) demonstrated the general superiority of the Székely-Rizzo method. The distance coefficient first described by Soergel performed extremely well in these experiments, and this was also the case when it was used in simulated virtual screening experiments.
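    For the distance side of such a comparison, the sketch below computes Soergel distances (identical to the Jaccard distance, i.e. 1 minus the Tanimoto coefficient, for binary fingerprints) and feeds them to scipy's Ward linkage. This is only a rough stand-in: scipy's 'ward' formally assumes Euclidean input, and the Székely-Rizzo generalization is precisely what licenses Ward-style agglomeration under other distances. The fingerprints are random placeholders for real 2D fingerprints.

        import numpy as np
        from scipy.spatial.distance import pdist
        from scipy.cluster.hierarchy import linkage, fcluster

        rng = np.random.default_rng(1)
        fps = rng.random((100, 1024)) < 0.1      # hypothetical binary fingerprints

        dists = pdist(fps, metric="jaccard")     # == Soergel for binary vectors
        Z = linkage(dists, method="ward")        # Ward-style agglomeration
        labels = fcluster(Z, t=8, criterion="maxclust")
        print(np.bincount(labels)[1:])           # sizes of the 8 clusters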

    Cheminformatics Models for Inhibitors of Schistosoma mansoni

    Schistosomiasis is a neglected tropical disease caused by the parasite Schistosoma mansoni and affects over 200 million people annually. With the recent emergence of drug resistance, there is an urgent need to discover novel therapeutic options to control the disease. The multifunctional protein thioredoxin glutathione reductase (TGR), an enzyme essential for the survival of the pathogen in the redox environment, has been actively explored as a potential drug target. The recent availability of small-molecule screening datasets against this target provides a unique opportunity to learn molecular properties and apply computational models for the discovery of actives in large molecular libraries. Such a prioritisation approach has the potential to reduce the cost of failures in lead discovery. A supervised learning approach was employed to develop a cost-sensitive classification model to evaluate the biological activity of the molecules. Random forest was identified as the best-performing classifier, with an accuracy of around 80 percent. An independent analysis using maximally occurring substructures revealed 10 highly enriched scaffolds in the actives dataset, and their docking against TGR was also performed. We show that a combined approach of machine learning and other cheminformatics techniques, such as substructure comparison and molecular docking, is efficient for prioritising molecules from large molecular datasets.
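    A minimal sketch of the cost-sensitive random forest step is given below, assuming a binary fingerprint matrix X and activity labels y (both synthetic here); class_weight is one common way to encode asymmetric misclassification costs, though the study's exact cost setup may differ.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(2)
        X = (rng.random((2000, 512)) < 0.05).astype(np.uint8)  # stand-in fingerprints
        y = (rng.random(2000) < 0.1).astype(int)               # ~10% actives

        clf = RandomForestClassifier(
            n_estimators=200,
            class_weight={0: 1, 1: 10},   # penalise missed actives more heavily
            random_state=0,
        )
        print(cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy").mean())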

    Kernel Functions for Graph Classification

    Graphs are information-rich structures, but their complexity makes them difficult to analyze. Given their broad and powerful representation capacity, the classification of graphs has become an intense area of research. Many established classifiers represent objects with vectors of explicit features. When the number of features grows, however, these vector representations suffer from the typical problems of high dimensionality, such as overfitting and high computation time. This work instead focuses on using kernel functions to map graphs into implicitly defined spaces that avoid the difficulties of vector representations. The introduction of kernel classifiers has kindled great interest in kernel functions for graph data. By using kernels, the problem of graph classification changes from finding a good classifier to finding a good kernel function. This work explores several novel uses of kernel functions for graph classification. The first technique is the use of structure-based features to add structural information to the kernel function. A strength of this approach is the ability to identify specific structural features that contribute significantly to the classification process. Discriminative structures can then be passed to domain-specific researchers for additional analysis. The next approach is the use of wavelet functions to represent graph topology as simple real-valued features. This approach achieves order-of-magnitude decreases in kernel computation time by eliminating costly topological comparisons, while retaining competitive classification accuracy. Finally, this work examines the use of even simpler graph representations and their utility for classification. The models produced from the kernel functions presented here yield excellent performance with respect to both efficiency and accuracy, as demonstrated in a variety of experimental studies.
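    As a toy illustration of the kernel viewpoint, the sketch below implements one of the simplest graph kernels, a vertex-label histogram kernel, and plugs its precomputed Gram matrix into an SVM; the labeled random graphs are synthetic stand-ins for real data, and this is not one of the kernels proposed in the work above.

        import numpy as np
        import networkx as nx
        from sklearn.svm import SVC

        def label_histogram(g, n_labels):
            # Count node labels; two graphs are compared via these counts.
            h = np.zeros(n_labels)
            for _, data in g.nodes(data=True):
                h[data["label"]] += 1
            return h

        rng = np.random.default_rng(3)
        graphs, y = [], []
        for i in range(60):
            g = nx.gnp_random_graph(10, 0.3, seed=i)
            cls = i % 2                    # class-dependent label distribution
            for v in g.nodes:
                g.nodes[v]["label"] = int(rng.random() < 0.3 + 0.3 * cls)
            graphs.append(g)
            y.append(cls)

        H = np.array([label_histogram(g, 2) for g in graphs])
        K = H @ H.T                        # linear kernel on label histograms
        clf = SVC(kernel="precomputed").fit(K, y)
        print(clf.score(K, y))             # training accuracy of the toy model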

    Kernelized Hashcode Representations for Relation Extraction

    Kernel methods have produced state-of-the-art results for a number of NLP tasks such as relation extraction, but suffer from poor scalability due to the high cost of computing kernel similarities between natural language structures. A recently proposed technique, kernelized locality-sensitive hashing (KLSH), can significantly reduce the computational cost, but is only applicable to classifiers operating on kNN graphs. Here we propose to use random subspaces of KLSH codes for efficiently constructing an explicit representation of NLP structures suitable for general classification methods. Further, we propose an approach for optimizing the KLSH model for classification problems by maximizing an approximation of the mutual information between the KLSH codes (feature vectors) and the class labels. We evaluate the proposed approach on biomedical relation extraction datasets, and observe significant and robust improvements in accuracy w.r.t. state-of-the-art classifiers, along with a drastic (orders-of-magnitude) speedup compared to conventional kernel methods. Comment: To appear in the proceedings of AAAI-19
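    The sketch below is a simplified stand-in for that pipeline, not the paper's exact construction: binary codes are produced by random hyperplanes over an approximate kernel feature map (real KLSH builds its hyperplanes in kernel space from reference examples), and each random subspace of code bits is packed into one integer feature for a tree ensemble. All data are synthetic.

        import numpy as np
        from sklearn.kernel_approximation import Nystroem
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(4)
        X = rng.normal(size=(1000, 50))                # stand-in structure features
        y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)   # synthetic labels

        # Approximate kernel map, then random hyperplanes -> 64-bit hash codes.
        phi = Nystroem(kernel="rbf", n_components=100, random_state=0).fit_transform(X)
        planes = rng.normal(size=(phi.shape[1], 64))
        codes = (phi @ planes > 0).astype(np.uint8)

        # Each random subspace of 8 bits becomes one integer-valued feature.
        n_subspaces, bits = 30, 8
        weights = 2 ** np.arange(bits)
        feats = np.empty((len(X), n_subspaces), dtype=np.int64)
        for j in range(n_subspaces):
            idx = rng.choice(64, size=bits, replace=False)
            feats[:, j] = codes[:, idx] @ weights

        clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(feats, y)
        print(clf.score(feats, y))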

    Analysis of Multitarget Activities and Assay Interference Characteristics of Pharmaceutically Relevant Compounds

    The availability of large amounts of data in public repositories provides a useful source of knowledge in the field of drug discovery. Given the increasing sizes of compound databases and volumes of activity data, computational data mining can be used to study different characteristics and properties of compounds on a large scale. One of the major sources of identification of new compounds in the early phase of drug discovery is high-throughput screening, where millions of compounds are tested against many targets. The screening data provide opportunities to assess activity profiles of compounds. This thesis aims at systematically mining activity data from publicly available sources in order to study the growth of bioactive compounds and to analyze multitarget activities and assay interference characteristics of pharmaceutically relevant compounds in the context of polypharmacology. In the first study, the growth of bioactive compounds against five major target families is monitored over time, and a compound-scaffold-CSK (cyclic skeleton) hierarchy is applied to investigate the structural diversity of active compounds and the topological diversity of their scaffolds. The next part of the thesis is based on the analysis of screening data. Initially, extensively assayed compounds are mined from the PubChem database and the promiscuity of these compounds is assessed by taking assay frequencies into account. Next, DCM (dark chemical matter), i.e., consistently inactive compounds that have been extensively tested, is systematically extracted and analog relationships between DCM and bioactive compounds are determined in order to derive target hypotheses for DCM. Further, PAINS (pan-assay interference compounds) are identified in the extensively tested set of compounds using substructure filters and their assay interference characteristics are studied. Finally, the limitations of PAINS filters are addressed using machine learning models that can distinguish between promiscuous and DCM PAINS. The structural context dependence of PAINS activities is studied by assessing predictions through feature weighting and mapping.
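    To make the substructure-filter step concrete, here is a minimal sketch of PAINS flagging with RDKit's built-in FilterCatalog; the example SMILES (an ene-rhodanine, a classic PAINS chemotype) is purely illustrative and not taken from the thesis.

        from rdkit import Chem
        from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

        # Build a catalog containing the PAINS substructure filters.
        params = FilterCatalogParams()
        params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
        catalog = FilterCatalog(params)

        # Hypothetical query compound: a 5-benzylidene rhodanine.
        mol = Chem.MolFromSmiles("O=C1C(=Cc2ccccc2)SC(=S)N1")
        entry = catalog.GetFirstMatch(mol)
        print(entry.GetDescription() if entry else "no PAINS alert")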

    Structural Pattern Recognition for Chemical-Compound Virtual Screening

    Molecules are naturally shaped as networks, making them ideal for studying by employing their graph representations, where nodes represent atoms and edges represent the chemical bonds.
    An alternative to this straightforward representation is the extended reduced graph, which summarizes chemical structures using pharmacophore-type node descriptions to encode the relevant molecular properties. Once we have a suitable way to represent molecules as graphs, we need to choose the right tool to compare and analyze them. Graph edit distance is used to solve error-tolerant graph matching; this methodology estimates a distance between two graphs by determining the minimum-cost sequence of modifications required to transform one graph into the other. These modifications (known as edit operations) each have an associated edit cost (also known as a transformation cost), which must be determined depending on the problem. This study investigates the effectiveness of a graph-only driven molecular comparison employing extended reduced graphs and graph edit distance as a tool for ligand-based virtual screening applications. These applications estimate the bioactivity of a chemical by employing the bioactivities of similar compounds. An essential part of this study focuses on using machine learning and natural language processing techniques to optimize the transformation costs used in molecular comparisons with the graph edit distance. Overall, this work presents a framework that combines graph reduction and comparison with optimization tools and natural language processing to identify bioactivity similarities in a structurally diverse group of molecules. We confirm the efficiency of this framework with several chemoinformatic tests applied to regression and classification problems over different publicly available datasets.
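    As a hedged sketch of the comparison step, the code below computes a graph edit distance between two tiny pharmacophore-style graphs with networkx, using a simple substitution cost on node types; the "ptype" attribute and the unit costs are illustrative stand-ins for extended reduced graph nodes and for the optimized costs learned in the study.

        import networkx as nx

        def node_subst_cost(a, b):
            # Free to keep a node of the same pharmacophore type, unit cost otherwise.
            return 0.0 if a["ptype"] == b["ptype"] else 1.0

        g1 = nx.Graph()
        g1.add_nodes_from([(0, {"ptype": "aromatic"}),
                           (1, {"ptype": "donor"}),
                           (2, {"ptype": "acceptor"})])
        g1.add_edges_from([(0, 1), (1, 2)])

        g2 = nx.Graph()
        g2.add_nodes_from([(0, {"ptype": "aromatic"}),
                           (1, {"ptype": "acceptor"})])
        g2.add_edge(0, 1)

        d = nx.graph_edit_distance(g1, g2,
                                   node_subst_cost=node_subst_cost,
                                   node_del_cost=lambda a: 1.0,
                                   node_ins_cost=lambda a: 1.0)
        print(d)   # minimum total edit cost under these toy costs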