Mining Rooted Ordered Trees under Subtree Homeomorphism
Mining frequent tree patterns has many applications in areas such as XML data
processing, bioinformatics, and the World Wide Web. The crucial step in frequent
pattern mining is frequency counting, which involves a matching operator to
find occurrences (instances) of a tree pattern in a given collection of trees.
A widely used matching operator for tree-structured data is subtree
homeomorphism, where an edge in the tree pattern is mapped onto an
ancestor-descendant relationship in the given tree. Tree patterns that are
frequent under subtree homeomorphism are usually called embedded patterns. In
this paper, we present an efficient algorithm for subtree homeomorphism with
application to frequent pattern mining. We propose a compact data-structure,
called occ, which stores only information about the rightmost paths of
occurrences and hence can encode and represent several occurrences of a tree
pattern. We then define efficient join operations on the occ data-structure,
which help us count occurrences of tree patterns according to occurrences of
their proper subtrees. Based on the proposed subtree homeomorphism method, we
develop an effective pattern mining algorithm, called TPMiner. We evaluate the
efficiency of TPMiner on several real-world and synthetic datasets. Our
extensive experiments confirm that TPMiner always outperforms well-known
existing algorithms, and in several cases the improvement with respect to
existing algorithms is significant.

Comment: This paper has been accepted by the Data Mining and Knowledge Discovery journal
(http://www.springer.com/computer/database+management+%26+information+retrieval/journal/10618).
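The matching operator described above can be made concrete with a short sketch. This is not the paper's occ-based algorithm; it is a naive recursive check (exponential in the worst case) for whether a rooted ordered labeled tree pattern occurs in a target tree under subtree homeomorphism, i.e. with pattern edges mapped to ancestor-descendant pairs and left-to-right order preserved. Trees are written as `(label, [children])` tuples; all names are illustrative.

```python
def embeds(pattern, tree):
    """True if `pattern` occurs in `tree` under subtree homeomorphism."""
    return forest_embeds([pattern], [tree])

def forest_embeds(pforest, tforest):
    """Embed an ordered forest of pattern trees into an ordered forest of
    target trees, preserving labels, ancestor-descendant relations, and
    left-to-right order. Naive backtracking sketch, not the occ method."""
    if not pforest:
        return True          # nothing left to match
    if not tforest:
        return False         # pattern nodes left, but no target nodes
    (plabel, pkids), prest = pforest[0], pforest[1:]
    (tlabel, tkids), trest = tforest[0], tforest[1:]
    # Option 1: map the first pattern root onto the first target root;
    # its children must then embed among that node's descendants, and the
    # remaining pattern trees among the remaining target trees.
    if plabel == tlabel and forest_embeds(pkids, tkids) \
            and forest_embeds(prest, trest):
        return True
    # Option 2: skip the target root; its children join the forest, which
    # lets a pattern edge stretch over an ancestor-descendant pair.
    return forest_embeds(pforest, tkids + trest)
```

For example, the pattern A(B, C) embeds in A(X(B, C)) because the edges A-B and A-C map onto ancestor-descendant pairs through X, while A(C, B) does not embed in A(B, C) because sibling order must be preserved.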
Frequent Subgraph Mining in Outerplanar Graphs
In recent years there has been increased interest in frequent pattern discovery in large databases of graph-structured objects. While the frequent connected subgraph mining problem for tree datasets can be solved in incremental polynomial time, it becomes intractable for arbitrary graph databases. Existing approaches have therefore resorted to various heuristic strategies and restrictions of the search space, but have not identified a practically relevant tractable graph class beyond trees. In this paper, we define the class of so-called tenuous outerplanar graphs, a strict generalization of trees, develop a frequent subgraph mining algorithm for tenuous outerplanar graphs that works in incremental polynomial time, and evaluate the algorithm empirically on the NCI molecular graph dataset.
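As a quick illustration of the graph class (not the paper's mining algorithm): every simple outerplanar graph on n >= 2 vertices has at most 2n - 3 edges, which gives a cheap necessary-condition filter before any expensive structural test. The function name is made up for this sketch.

```python
def may_be_outerplanar(n_vertices, edges):
    """Necessary (not sufficient) condition for outerplanarity:
    a simple outerplanar graph on n >= 2 vertices has <= 2n - 3 edges.
    `edges` is an iterable of (u, v) pairs; duplicates and reversed
    pairs are collapsed before counting."""
    if n_vertices < 2:
        return True
    edge_count = len(set(map(frozenset, edges)))
    return edge_count <= 2 * n_vertices - 3
```

For instance, a 4-cycle with one chord (5 edges on 4 vertices) passes the bound, while K4 (6 edges on 4 vertices) fails it, and indeed K4 is not outerplanar.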
Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets
This paper introduces new algorithms and data structures for quick counting
for machine learning datasets. We focus on the counting task of constructing
contingency tables, but our approach is also applicable to counting the number
of records in a dataset that match conjunctive queries. Subject to certain
assumptions, the costs of these operations can be shown to be independent of
the number of records in the dataset and loglinear in the number of non-zero
entries in the contingency table. We provide a very sparse data structure, the
ADtree, to minimize memory use. We provide analytical worst-case bounds for
this structure for several models of data distribution. We empirically
demonstrate that tractably-sized data structures can be produced for large
real-world datasets by (a) using a sparse tree structure that never allocates
memory for counts of zero, (b) never allocating memory for counts that can be
deduced from other counts, and (c) not bothering to expand the tree fully near
its leaves. We show how the ADtree can be used to accelerate Bayes net
structure finding algorithms, rule learning algorithms, and feature selection
algorithms, and we provide a number of empirical results comparing ADtree
methods against traditional direct counting approaches. We also discuss the
possible uses of ADtrees in other machine learning methods, and discuss the
merits of ADtrees in comparison with alternative representations such as
kd-trees, R-trees, and Frequent Sets.

Comment: See http://www.jair.org/ for any accompanying file.
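The counting idea can be sketched in a few lines. This is not the paper's ADtree: it omits the most-common-value and leaf-list optimizations that keep the real structure tractably sized, and it materializes every value combination present in the data. It does show the core trick, though: a sparse tree of cached counts that never allocates a node for a zero count, so a conjunctive query either walks to a node holding its count or falls off the tree and returns 0. All names are illustrative.

```python
def build_adtree(records, n_attrs, first_attr=0):
    """Sparse count-tree sketch. A node stores the count of the records
    it covers plus, for each later attribute, a dict of children keyed
    by value -- children with zero count are simply never allocated.
    Records are tuples of discrete attribute values."""
    node = {"count": len(records), "vary": {}}
    for a in range(first_attr, n_attrs):
        groups = {}
        for r in records:
            groups.setdefault(r[a], []).append(r)
        node["vary"][a] = {v: build_adtree(rs, n_attrs, a + 1)
                           for v, rs in groups.items()}
    return node

def query(node, conditions):
    """Count records matching a conjunctive query given as a dict
    {attribute_index: value}; attributes are visited in index order."""
    for a in sorted(conditions):
        child = node["vary"].get(a, {}).get(conditions[a])
        if child is None:
            return 0      # never-allocated child means a zero count
        node = child
    return node["count"]
```

With `records = [(0, 1), (0, 1), (1, 0)]`, the tree answers `query(t, {0: 0, 1: 1})` as 2 and `query(t, {0: 1, 1: 1})` as 0 without touching the records again, which is the "cached sufficient statistics" idea in miniature.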