2,282 research outputs found

    Frequent Subgraph Mining in Outerplanar Graphs

    In recent years there has been increased interest in frequent pattern discovery in large databases of graph-structured objects. While the frequent connected subgraph mining problem for tree datasets can be solved in incremental polynomial time, it becomes intractable for arbitrary graph databases. Existing approaches have therefore resorted to various heuristic strategies and restrictions of the search space, but have not identified a practically relevant tractable graph class beyond trees. In this paper, we define the class of so-called tenuous outerplanar graphs, a strict generalization of trees, develop a frequent subgraph mining algorithm for tenuous outerplanar graphs that works in incremental polynomial time, and evaluate the algorithm empirically on the NCI molecular graph dataset.

    Dynamic load balancing for the distributed mining of molecular structures

    In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, for which no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute’s HIV-screening data set, where we were able to show close-to-linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational grids.

    Efficient mining of discriminative molecular fragments

    Frequent pattern discovery in structured data is receiving increasing attention in many areas of science. However, the computational complexity and the large amount of data to be explored often make sequential algorithms unsuitable. In this context, high-performance distributed computing becomes a very interesting and promising approach. In this paper we present a parallel formulation of the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The application is characterized by a highly irregular tree-structured computation. No estimation is available for task workloads, which show a power-law distribution over a wide range. The proposed approach allows dynamic resource aggregation and provides fault and latency tolerance. These features make the distributed application suitable for multi-domain heterogeneous environments, such as computational Grids. The distributed application has been evaluated on the well-known National Cancer Institute’s HIV-screening dataset.

    Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

    This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table. We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its leaves. We show how the ADtree can be used to accelerate Bayes net structure finding algorithms, rule learning algorithms, and feature selection algorithms, and we provide a number of empirical results comparing ADtree methods against traditional direct counting approaches. We also discuss the possible uses of ADtrees in other machine learning methods, and discuss the merits of ADtrees in comparison with alternative representations such as kd-trees, R-trees and Frequent Sets. Comment: See http://www.jair.org/ for any accompanying files
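For orientation, the two counting tasks named in this abstract can be sketched with plain direct counting, that is, the traditional baseline the abstract compares against, not the ADtree itself. The function names, the record layout (one dict per record) and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def contingency_table(records, attributes):
    """Count how often each combination of values occurs for the given attributes."""
    return Counter(tuple(record[a] for a in attributes) for record in records)


def count_matching(records, query):
    """Count records that satisfy every attribute=value pair of a conjunctive query."""
    return sum(
        1
        for record in records
        if all(record.get(attr) == value for attr, value in query.items())
    )


# Hypothetical toy dataset, not taken from the paper.
records = [
    {"colour": "red", "size": "small"},
    {"colour": "red", "size": "large"},
    {"colour": "blue", "size": "small"},
    {"colour": "red", "size": "small"},
]

table = contingency_table(records, ["colour", "size"])
red_count = count_matching(records, {"colour": "red"})
```

Both functions scan every record, so their cost grows with the dataset size; this is the cost that a cached structure such as the ADtree is described as avoiding. The `Counter` stores only combinations that actually occur, which loosely mirrors point (a) of the abstract.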

    DiffNodesets: An Efficient Structure for Fast Mining Frequent Itemsets

    Mining frequent itemsets is an essential problem in data mining and plays an important role in many data mining applications. In recent years, some itemset representations based on node sets have been proposed, which have been shown to be very efficient for mining frequent itemsets. In this paper, we propose DiffNodeset, a novel and more efficient itemset representation, for mining frequent itemsets. Based on the DiffNodeset structure, we present an efficient algorithm, named dFIN, for mining frequent itemsets. To achieve high efficiency, dFIN finds frequent itemsets using a set-enumeration tree with a hybrid search strategy and, in some cases, directly enumerates frequent itemsets without candidate generation. To evaluate the performance of dFIN, we have conducted extensive experiments comparing it against existing leading algorithms on a variety of real and synthetic datasets. The experimental results show that dFIN is significantly faster than these leading algorithms. Comment: 22 pages, 13 figures

    Tree model guided (TMG) enumeration as the basis for mining frequent patterns from XML documents

    University of Technology, Sydney. Faculty of Information Technology. Association mining consists of two important problems, namely frequent pattern discovery and rule construction. The former task is considered to be the more challenging problem to solve. Because of its importance and application in a number of data mining tasks, it has become the focus of many studies. A substantial amount of research has gone into the development of efficient algorithms for mining patterns from large structured or relational data. Compared with the fruitful achievements in mining structured data, mining in the semi-structured world still remains at a preliminary stage. The most popular representative of semi-structured data is XML. Mining frequent patterns from XML poses more challenges than mining frequent patterns from relational data because XML data is tree-structured and has an ordered data context. Moreover, XML data is in general larger due to richer content and more metadata. Dealing with XML thus involves greater complexity than mining relational data. Mining frequent patterns from XML can be recast as mining frequent tree structures from a database of XML documents. The increase of XML data and the need for mining semi-structured data have sparked a lot of interest in finding frequent rooted trees in forests. In this thesis, we aim to develop a framework to mine frequent patterns from XML documents. The framework utilizes a structure-guided enumeration approach, Tree Model Guided (TMG), for efficient enumeration of tree structures, and it makes use of novel structures for fast enumeration and frequency counting. By utilizing a novel array-based structure, an embedded list (EL), the framework offers a simple sequence-like tree enumeration technique. The effectiveness and extendibility of the framework are demonstrated in that it can be utilized not only for enumerating ordered subtrees but also for enumerating unordered subtrees and subsequences. Furthermore, the framework tackles the complexity of mining frequent tree-structured patterns by generating only valid candidates with non-zero frequency count and employing a constraint-driven approach. Our experimental studies comparing the proposed framework with state-of-the-art algorithms demonstrate the effectiveness and the efficiency of the proposed framework.

    Tree model guided candidate generation for mining frequent subtrees from XML

    Due to the inherent flexibilities in both structure and semantics, XML association rule mining faces a few challenges, such as a more complicated hierarchical data structure and an ordered data context. Mining frequent patterns from XML documents can be recast as mining frequent tree structures from a database of XML documents. In this study, we model a database of XML documents as a database of rooted labeled ordered subtrees. In particular, we are mainly concerned with mining frequent induced and embedded ordered subtrees. Our main contributions are as follows. We describe our unique embedding list representation of the tree structure, which enables efficient implementation of our Tree Model Guided (TMG) candidate generation. TMG is an optimal, non-redundant enumeration strategy which enumerates all the valid candidates that conform to the structural aspects of the data. We show through a mathematical model and experiments that TMG has better complexity compared to the commonly used join approach. In this paper, we propose two algorithms, MB3-Miner and iMB3-Miner. MB3-Miner mines embedded subtrees. iMB3-Miner mines induced and/or embedded subtrees by using the maximum level of embedding constraint. Our experiments with both synthetic and real datasets against two well-known algorithms for mining induced and embedded subtrees demonstrate the effectiveness and the efficiency of the proposed techniques.

    A Survey on Graph Kernels

    Graph kernels have become an established and widely-used technique for solving classification tasks on graphs. This survey gives a comprehensive overview of techniques for kernel-based graph classification developed in the past 15 years. We describe and categorize graph kernels based on properties inherent to their design, such as the nature of their extracted graph features, their method of computation and their applicability to problems in practice. In an extensive experimental evaluation, we study the classification accuracy of a large suite of graph kernels on established benchmarks as well as new datasets. We compare the performance of popular kernels with several baseline methods and study the effect of applying a Gaussian RBF kernel to the metric induced by a graph kernel. In doing so, we find that simple baselines become competitive after this transformation on some datasets. Moreover, we study the extent to which existing graph kernels agree in their predictions (and prediction errors) and obtain a data-driven categorization of kernels as a result. Finally, based on our experimental results, we derive a practitioner's guide to kernel-based graph classification.
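The transformation mentioned in this abstract, a Gaussian RBF kernel applied to the metric induced by a graph kernel, can be sketched as follows. The sketch assumes the standard kernel-induced squared distance d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y) and a precomputed Gram matrix; the function names, the `gamma` parameter and the toy matrix are illustrative assumptions, not the survey's own code.

```python
import math


def kernel_distance_sq(k_xx, k_yy, k_xy):
    """Squared distance induced by a kernel: k(x,x) + k(y,y) - 2*k(x,y)."""
    # Clamp at zero to guard against tiny negative values from rounding.
    return max(k_xx + k_yy - 2.0 * k_xy, 0.0)


def rbf_from_gram(gram, gamma=1.0):
    """Apply exp(-gamma * d^2) to every pair, with d taken from the base Gram matrix."""
    n = len(gram)
    return [
        [
            math.exp(-gamma * kernel_distance_sq(gram[i][i], gram[j][j], gram[i][j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical 2x2 Gram matrix of some base graph kernel.
gram = [[2.0, 1.0], [1.0, 2.0]]
rbf = rbf_from_gram(gram, gamma=0.5)
```

The resulting matrix can be passed to any classifier that accepts a precomputed kernel; `gamma` would normally be chosen by cross-validation.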