9,080 research outputs found

    Substructure Discovery Using Minimum Description Length and Background Knowledge

    Full text link
    The ability to identify interesting and repetitive substructures is an essential component to discovering knowledge in structural data. We describe a new version of our SUBDUE substructure discovery system based on the minimum description length principle. The SUBDUE system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. SUBDUE uses a computationally-bounded inexact graph match that identifies similar, but not identical, instances of a substructure and finds an approximate measure of closeness of two substructures when under computational constraints. In addition to the minimum description length principle, other background knowledge can be used by SUBDUE to guide the search towards more appropriate substructures. Experiments in a variety of domains demonstrate SUBDUE's ability to find substructures capable of compressing the original data and to discover structural concepts important to the domain. Description of Online Appendix: This is a compressed tar file containing the SUBDUE discovery system, written in C. The program accepts as input databases represented in graph form, and will output discovered substructures with their corresponding value.Comment: See http://www.jair.org/ for an online appendix and other files accompanying this articl

    VoG: Summarizing and Understanding Large Graphs

    Get PDF
    How can we succinctly describe a million-node graph with a few simple sentences? How can we measure the "importance" of a set of discovered subgraphs in a large graph? These are exactly the problems we focus on. Our main ideas are to construct a "vocabulary" of subgraph-types that often occur in real graphs (e.g., stars, cliques, chains), and from a set of subgraphs, find the most succinct description of a graph in terms of this vocabulary. We measure success in a well-founded way by means of the Minimum Description Length (MDL) principle: a subgraph is included in the summary if it decreases the total description length of the graph. Our contributions are three-fold: (a) formulation: we provide a principled encoding scheme to choose vocabulary subgraphs; (b) algorithm: we develop \method, an efficient method to minimize the description cost, and (c) applicability: we report experimental results on multi-million-edge real graphs, including Flickr and the Notre Dame web graph.Comment: SIAM International Conference on Data Mining (SDM) 201

    Inductive queries for a drug designing robot scientist

    Get PDF
    It is increasingly clear that machine learning algorithms need to be integrated in an iterative scientific discovery loop, in which data is queried repeatedly by means of inductive queries and where the computer provides guidance to the experiments that are being performed. In this chapter, we summarise several key challenges in achieving this integration of machine learning and data mining algorithms in methods for the discovery of Quantitative Structure Activity Relationships (QSARs). We introduce the concept of a robot scientist, in which all steps of the discovery process are automated; we discuss the representation of molecular data such that knowledge discovery tools can analyse it, and we discuss the adaptation of machine learning and data mining algorithms to guide QSAR experiments

    {VoG}: {Summarizing} and Understanding Large Graphs

    Get PDF
    How can we succinctly describe a million-node graph with a few simple sentences? How can we measure the "importance" of a set of discovered subgraphs in a large graph? These are exactly the problems we focus on. Our main ideas are to construct a "vocabulary" of subgraph-types that often occur in real graphs (e.g., stars, cliques, chains), and from a set of subgraphs, find the most succinct description of a graph in terms of this vocabulary. We measure success in a well-founded way by means of the Minimum Description Length (MDL) principle: a subgraph is included in the summary if it decreases the total description length of the graph. Our contributions are three-fold: (a) formulation: we provide a principled encoding scheme to choose vocabulary subgraphs; (b) algorithm: we develop \method, an efficient method to minimize the description cost, and (c) applicability: we report experimental results on multi-million-edge real graphs, including Flickr and the Notre Dame web graph

    A Graph Theoretic Approach for Object Shape Representation in Compositional Hierarchies Using a Hybrid Generative-Descriptive Model

    Full text link
    A graph theoretic approach is proposed for object shape representation in a hierarchical compositional architecture called Compositional Hierarchy of Parts (CHOP). In the proposed approach, vocabulary learning is performed using a hybrid generative-descriptive model. First, statistical relationships between parts are learned using a Minimum Conditional Entropy Clustering algorithm. Then, selection of descriptive parts is defined as a frequent subgraph discovery problem, and solved using a Minimum Description Length (MDL) principle. Finally, part compositions are constructed by compressing the internal data representation with discovered substructures. Shape representation and computational complexity properties of the proposed approach and algorithms are examined using six benchmark two-dimensional shape image datasets. Experiments show that CHOP can employ part shareability and indexing mechanisms for fast inference of part compositions using learned shape vocabularies. Additionally, CHOP provides better shape retrieval performance than the state-of-the-art shape retrieval methods.Comment: Paper : 17 pages. 13th European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III, pp 566-581. Supplementary material can be downloaded from http://link.springer.com/content/esm/chp:10.1007/978-3-319-10578-9_37/file/MediaObjects/978-3-319-10578-9_37_MOESM1_ESM.pd

    The Structure of Halos: Implications for Group and Cluster Cosmology

    Full text link
    The dark matter halo mass function is a key repository of cosmological information over a wide range of mass scales, from individual galaxies to galaxy clusters. N-body simulations have established that the friends-of-friends (FOF) mass function has a universal form to a surprising level of accuracy (< 10%). The high-mass tail of the mass function is exponentially sensitive to the amplitude of the initial density perturbations, the mean matter density parameter, Omega_m, and to the dark energy controlled late-time evolution of the density field. Observed group and cluster masses, however, are usually stated in terms of a spherical overdensity (SO) mass which does not map simply to the FOF mass. Additionally, the widely used halo models of structure formation -- and halo occupancy distribution descriptions of galaxies within halos -- are often constructed exploiting the universal form of the FOF mass function. This again raises the question of whether FOF halos can be simply related to the notion of a spherical overdensity mass. By employing results from Monte Carlo realizations of ideal Navarro-Frenk-White (NFW) halos and N-body simulations, we study the relationship between the two definitions of halo mass. We find that the vast majority of halos (80-85%) in the mass-range 10^{12.5}-10^{15.5} M_sun/h indeed allow for an accurate mapping between the two definitions (~ 5%), but only if the halo concentrations are known. Nonisolated halos fall into two broad classes: those with complex substructure that are poor fits to NFW profiles and those ``bridged'' by the (isodensity-based)FOF algorithm. A closer investigation of the bridged halos reveals that the fraction of these halos and their satellite mass distribution is cosmology dependent. (abridged)Comment: Submitted to Ap
    • …
    corecore