Substructure Discovery Using Minimum Description Length and Background Knowledge
The ability to identify interesting and repetitive substructures is an
essential component of discovering knowledge in structural data. We describe a
new version of our SUBDUE substructure discovery system based on the minimum
description length principle. The SUBDUE system discovers substructures that
compress the original data and represent structural concepts in the data. By
replacing previously-discovered substructures in the data, multiple passes of
SUBDUE produce a hierarchical description of the structural regularities in the
data. SUBDUE uses a computationally-bounded inexact graph match that identifies
similar, but not identical, instances of a substructure and finds an
approximate measure of the closeness of two substructures under computational
constraints. In addition to the minimum description length principle, other
background knowledge can be used by SUBDUE to guide the search towards more
appropriate substructures. Experiments in a variety of domains demonstrate
SUBDUE's ability to find substructures capable of compressing the original data
and to discover structural concepts important to the domain. Description of
Online Appendix: This is a compressed tar file containing the SUBDUE discovery
system, written in C. The program accepts as input databases represented in
graph form, and will output discovered substructures with their corresponding
value. Comment: See http://www.jair.org/ for an online appendix and other
files accompanying this article.
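The MDL heuristic described above can be illustrated with a toy scoring function. This is a hedged sketch, not the actual SUBDUE code: the simple bit-cost model, the function names, and the example counts are all assumptions made for illustration.

```python
import math

def description_length(num_vertices, num_edges, num_labels):
    """Crude description length of a labeled graph, in bits: each vertex
    costs one label, each edge costs two endpoint references plus a label.
    (Illustrative model only; SUBDUE's actual encoding is more refined.)"""
    vertex_bits = num_vertices * math.log2(max(num_labels, 2))
    edge_bits = num_edges * (2 * math.log2(max(num_vertices, 2))
                             + math.log2(max(num_labels, 2)))
    return vertex_bits + edge_bits

def mdl_value(graph, substructure, compressed):
    """SUBDUE-style value: DL(G) / (DL(S) + DL(G|S)).
    A value above 1 means the substructure compresses the data."""
    return description_length(*graph) / (
        description_length(*substructure) + description_length(*compressed))

# A 3-vertex substructure with 10 instances in a 100-vertex graph;
# compressing each instance to a single vertex removes 20 vertices
# and 30 edges (assumed counts for illustration).
original = (100, 150, 5)      # (vertices, edges, distinct labels)
sub = (3, 3, 5)
compressed = (80, 120, 5)
```

With these assumed counts `mdl_value(original, sub, compressed)` exceeds 1, so the substructure would be preferred; replacing its instances and rescoring is what yields the hierarchical description over multiple passes.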
From data towards knowledge: Revealing the architecture of signaling systems by unifying knowledge mining and data mining of systematic perturbation data
Genetic and pharmacological perturbation experiments, such as deleting a gene
and monitoring gene expression responses, are powerful tools for studying
cellular signal transduction pathways. However, it remains a challenge to
automatically derive knowledge of a cellular signaling system at a conceptual
level from systematic perturbation-response data. In this study, we explored a
framework that unifies knowledge mining and data mining approaches towards the
goal. The framework consists of the following automated processes: 1) applying
an ontology-driven knowledge mining approach to identify functional modules
among the genes responding to a perturbation in order to reveal potential
signals affected by the perturbation; 2) applying a graph-based data mining
approach to search for perturbations that affect a common signal with respect
to a functional module; and 3) revealing the architecture of a signaling system
by organizing signaling units into a hierarchy based on their relationships.
Applying this framework to a compendium of yeast perturbation-response data, we
have successfully recovered many well-known signal transduction pathways; in
addition, our analysis has led to many hypotheses regarding the yeast signal
transduction system; finally, our analysis automatically organized the
perturbed genes as a graph reflecting the architecture of the yeast signaling
system.
Importantly, this framework transformed molecular findings from a gene level to
a conceptual level, which readily can be translated into computable knowledge
in the form of rules regarding the yeast signaling system, such as "if genes
involved in MAPK signaling are perturbed, genes involved in pheromone responses
will be differentially expressed."
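The quoted rule suggests how the conceptual-level output could be made computable. The following is a minimal sketch of one possible encoding; the rule format, module names, and function are illustrative assumptions, not the study's actual implementation.

```python
# Each rule pairs a perturbed functional module (antecedent) with a module
# expected to be differentially expressed (consequent). Hypothetical format.
RULES = [
    {"if_perturbed": "MAPK signaling",
     "then_differentially_expressed": "pheromone response"},
]

def expected_responses(perturbed_module, rules=RULES):
    """Return the functional modules predicted to respond when the
    given module is perturbed."""
    return [r["then_differentially_expressed"]
            for r in rules
            if r["if_perturbed"] == perturbed_module]
```

Encoding rules at the level of functional modules rather than individual genes is what makes the knowledge transferable: any gene annotated to the antecedent module triggers the same prediction.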
Artifact Lifecycle Discovery
Artifact-centric modeling is a promising approach for modeling business
processes based on the so-called business artifacts - key entities driving the
company's operations and whose lifecycles define the overall business process.
While artifact-centric modeling shows significant advantages, the overwhelming
majority of existing process mining methods cannot be applied (directly) as
they are tailored to discover monolithic process models. This paper addresses
the problem by proposing a chain of methods that can be applied to discover
artifact lifecycle models in Guard-Stage-Milestone notation. We decompose the
problem in such a way that a wide range of existing (non-artifact-centric)
process discovery and analysis methods can be reused in a flexible manner. The
methods presented in this paper are implemented as software plug-ins for ProM,
a generic open-source framework and architecture for implementing process
mining tools.
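The decomposition idea above can be sketched in a few lines: project one event log onto per-artifact sub-logs, so that any conventional discovery algorithm can then be run on each artifact's lifecycle separately. The event-record format below is an assumption for illustration; it is not the ProM plug-in API.

```python
from collections import defaultdict

def split_by_artifact(events):
    """Group events into traces keyed by (artifact type, instance id).
    Each trace is the time-ordered list of lifecycle activities of one
    artifact instance, ready for a standard discovery algorithm."""
    traces = defaultdict(list)
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = (event["artifact"], event["instance"])
        traces[key].append(event["activity"])
    return dict(traces)

# Hypothetical interleaved log spanning two artifact types:
log = [
    {"artifact": "Order",   "instance": 1, "activity": "create",  "timestamp": 1},
    {"artifact": "Order",   "instance": 1, "activity": "approve", "timestamp": 3},
    {"artifact": "Invoice", "instance": 7, "activity": "issue",   "timestamp": 2},
]
```

After this projection, each sub-log describes a single artifact's lifecycle, which is exactly the shape existing (non-artifact-centric) miners expect.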
Structure induction by lossless graph compression
This work is motivated by the necessity to automate the discovery of
structure in vast and ever-growing collections of relational data commonly
represented as graphs, for example genomic networks. A novel algorithm, dubbed
Graphitour, for structure induction by lossless graph compression is presented
and illustrated by a clear and broadly known case of nested structure in a DNA
molecule. This work extends to graphs some well established approaches to
grammatical inference previously applied only to strings. The bottom-up graph
compression problem is related to the maximum cardinality matching problem in
general (non-bipartite) graphs. The algorithm accepts a variety of graph
types including directed graphs and graphs with labeled nodes and arcs. The
resulting structure could be used for representation and classification of
graphs. Comment: 10 pages, 7 figures, 2 tables; published in Proceedings of
the Data Compression Conference, 200
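The bottom-up compression step can be sketched as follows: find the most frequent edge type, select a matching of non-overlapping edges of that type, and contract each matched pair into a new nonterminal node. This greedy matching stands in for the exact maximum cardinality matching the paper relates the problem to; the data representation and names are assumptions for illustration.

```python
from collections import Counter

def most_frequent_edge_type(edges, labels):
    """Edge type = unordered pair of endpoint labels."""
    counts = Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)
    return counts.most_common(1)[0][0]

def compress_step(edges, labels):
    """One bottom-up step: greedily match non-overlapping edges of the most
    frequent type, then contract each matched pair into a new node whose
    label is the edge type (a grammar nonterminal)."""
    target = most_frequent_edge_type(edges, labels)
    matched, pairs = set(), []
    for u, v in edges:
        if u in matched or v in matched:
            continue
        if tuple(sorted((labels[u], labels[v]))) == target:
            matched.update((u, v))
            pairs.append((u, v))
    next_id, mapping = max(labels) + 1, {}
    for u, v in pairs:
        mapping[u] = mapping[v] = next_id
        next_id += 1
    new_labels = {n: lab for n, lab in labels.items() if n not in mapping}
    for u, v in pairs:
        new_labels[mapping[u]] = target
    new_edges = [(mapping.get(u, u), mapping.get(v, v)) for u, v in edges
                 if mapping.get(u, u) != mapping.get(v, v)]
    return new_edges, new_labels

# A labeled path A-B-A-B: one step contracts two A-B pairs into two
# nonterminal nodes joined by a single edge.
edges2, labels2 = compress_step([(0, 1), (1, 2), (2, 3)],
                                {0: "A", 1: "B", 2: "A", 3: "B"})
```

Iterating this step until no edge type repeats yields a hierarchy of nonterminals, which is the string-grammar-induction idea lifted to graphs.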
Distributed Computation as Hierarchy
This paper presents a new computational model of distributed
systems called the phase web that extends V. Pratt's orthocurrence relation
from 1986. The model uses mutual-exclusion to express sequence, and a new kind
of hierarchy to replace event sequences, posets, and pomsets. The model
explicitly connects computation to a discrete Clifford algebra that is in turn
extended into homology and co-homology, wherein the recursive nature of objects
and boundaries becomes apparent and itself subject to hierarchical recursion.
Topsy, a programming environment embodying the phase web, is available from
www.cs.auc.dk/topsy. Comment: 16 pages, 3 figures.
Building XML data warehouse based on frequent patterns in user queries
With the proliferation of XML-based data sources available across the
Internet, it is increasingly important to provide users with a data warehouse
of XML data sources to facilitate decision-making processes. Due to the
extremely large amount of XML data available on the web, unguided warehousing
of XML data turns out to be highly costly and usually cannot well accommodate
users’ needs in XML data acquirement. In this paper, we propose an approach to
materialize XML data warehouses based on frequent query patterns discovered
from historical queries issued by users. The schemas of integrated XML
documents in the warehouse are built using these frequent query patterns,
represented as Frequent Query Pattern Trees (FreqQPTs). Using a hierarchical
clustering technique, the integration approach in the data warehouse is
flexible with respect to obtaining and maintaining XML documents. Experiments
show that the overall processing of the same queries issued against the global
schema becomes much more efficient using the XML data warehouse than by
directly searching the multiple data sources.
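The first step of the approach, mining frequent patterns from a historical query log, can be sketched with query patterns flattened to root-to-leaf paths. The log format and the support threshold are illustrative assumptions; the paper's FreqQPTs are trees rather than flat paths.

```python
from collections import Counter

def frequent_query_paths(query_log, min_support):
    """Return the XPath-like paths whose frequency in the query log meets
    min_support; these would seed the warehouse's global schema."""
    counts = Counter(path for query in query_log for path in query)
    return {path for path, n in counts.items() if n >= min_support}

# Hypothetical query log: each query is the set of paths it touches.
log = [
    ["/catalog/book/title", "/catalog/book/price"],
    ["/catalog/book/title"],
    ["/catalog/book/author"],
]
```

Paths that meet the support threshold would then be merged and clustered into schema trees, while rare paths are left to be fetched from the original sources on demand.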