8 research outputs found
Query-Constraint-Based Mining of Association Rules for Exploratory Analysis of Clinical Datasets in the National Sleep Research Resource
Background: Association Rule Mining (ARM) has been widely used by biomedical researchers to perform exploratory data analysis and uncover potential relationships among variables in biomedical datasets. However, when biomedical datasets are high-dimensional, performing ARM on such datasets will yield a large number of rules, many of which may be uninteresting. Especially for imbalanced datasets, performing ARM directly would result in uninteresting rules that are dominated by certain variables that capture general characteristics.
Methods: We introduce a query-constraint-based ARM (QARM) approach for exploratory analysis of multiple, diverse clinical datasets in the National Sleep Research Resource (NSRR). QARM enables rule mining on a subset of data items satisfying a query constraint. We first perform a series of data-preprocessing steps including variable selection, merging semantically similar variables, combining multiple-visit data, and data transformation. We use Top-k Non-Redundant (TNR) ARM algorithm to generate association rules. Then we remove general and subsumed rules so that unique and non-redundant rules are resulted for a particular query constraint.
Results: Applying QARM on five datasets from NSRR obtained a total of 2517 association rules with a minimum confidence of 60% (using top 100 rules for each query constraint). The results show that merging similar variables could avoid uninteresting rules. Also, removing general and subsumed rules resulted in a more concise and interesting set of rules.
Conclusions: QARM shows the potential to support exploratory analysis of large biomedical datasets. It is also shown as a useful method to reduce the number of uninteresting association rules generated from imbalanced datasets. A preliminary literature-based analysis showed that some association rules have supporting evidence from biomedical literature, while others without literature-based evidence may serve as the candidates for new hypotheses to explore and investigate. Together with literature-based evidence, the association rules mined over the NSRR clinical datasets may be used to support clinical decisions for sleep-related problems
GRAPE: Parallel Graph Query Engine
The need for graph computations is evident in a multitude of use cases. To support
computations on large-scale graphs, several parallel systems have been developed.
However, existing graph systems require users to recast algorithms into new models,
which makes parallel graph computations as a privilege to experienced users only.
Moreover, real world applications often require much more complex graph processing
workflows than previously evaluated. In response to these challenges, the thesis
presents GRAPE, a distributed graph computation system, shipped with various applications
for social network analysis, social media marketing and functional dependencies
on graphs.
Firstly, the thesis presents the foundation of GRAPE. The principled approach of
GRAPE is based on partial evaluation and incremental computation. Sequential graph
algorithms can be plugged into GRAPE with minor changes, and get parallelized as a
whole. The termination and correctness are guaranteed under a monotonic condition.
Secondly, as an application on GRAPE, the thesis proposes graph-pattern association
rules (GPARs) for social media marketing. GPARs help users discover regularities
between entities in social graphs and identify potential customers by exploring social
influence. The thesis studies the problem of discovering top-k diversified GPARs and
the problem of identifying potential customers with GPARs. Although both are NP-
hard, parallel scalable algorithms on GRAPE are developed, which guarantee a polynomial
speedup over sequential algorithms with the increase of processors.
Thirdly, the thesis proposes quantified graph patterns (QGPs), an extension of
graph patterns by supporting simple counting quantifiers on edges. QGPs naturally express
universal and existential quantification, numeric and ratio aggregates, as well as
negation. The thesis proves that the matching problem of QGPs remains NP-complete
in the absence of negation, and is DP-complete for general QGPs. In addition, the
thesis introduces quantified graph association rules defined with QGPs, to identify potential
customers in social media marketing.
Finally, to address the issue of data consistency, the thesis proposes a class of functional
dependencies for graphs, referred to as GFDs. GFDs capture both attribute-value
dependencies and topological structures of entities. The satisfiability and implication
problems for GFDs are studied and proved to be coNP-complete and NP-complete,
respectively. The thesis also proves that the validation problem for GFDs is coNP-
complete. The parallel algorithms developed on GRAPE verify that GFDs provide an
effective approach to detecting inconsistencies in knowledge and social graphs