738 research outputs found
Flexible constrained sampling with guarantees for pattern mining
Pattern sampling has been proposed as a potential solution to the infamous
pattern explosion. Instead of enumerating all patterns that satisfy the
constraints, individual patterns are sampled proportional to a given quality
measure. Several sampling algorithms have been proposed, but each of them has
its limitations when it comes to 1) flexibility in terms of quality measures
and constraints that can be used, and/or 2) guarantees with respect to sampling
accuracy. We therefore present Flexics, the first flexible pattern sampler that
supports a broad class of quality measures and constraints, while providing
strong guarantees regarding sampling accuracy. To achieve this, we leverage the
perspective on pattern mining as a constraint satisfaction problem and build
upon the latest advances in sampling solutions in SAT as well as existing
pattern mining algorithms. Furthermore, the proposed algorithm is applicable to
a variety of pattern languages, which allows us to introduce and tackle the
novel task of sampling sets of patterns. We introduce and empirically evaluate
two variants of Flexics: 1) a generic variant that addresses the well-known
itemset sampling task and the novel pattern set sampling task as well as a wide
range of expressive constraints within these tasks, and 2) a specialized
variant that exploits existing frequent itemset techniques to achieve
substantial speed-ups. Experiments show that Flexics is both accurate and
efficient, making it a useful tool for pattern-based data exploration.Comment: Accepted for publication in Data Mining & Knowledge Discovery journal
(ECML/PKDD 2017 journal track
Grafting for Combinatorial Boolean Model using Frequent Itemset Mining
This paper introduces the combinatorial Boolean model (CBM), which is defined
as the class of linear combinations of conjunctions of Boolean attributes. This
paper addresses the issue of learning CBM from labeled data. CBM is of high
knowledge interpretability but na\"{i}ve learning of it requires exponentially
large computation time with respect to data dimension and sample size. To
overcome this computational difficulty, we propose an algorithm GRAB (GRAfting
for Boolean datasets), which efficiently learns CBM within the
-regularized loss minimization framework. The key idea of GRAB is to
reduce the loss minimization problem to the weighted frequent itemset mining,
in which frequent patterns are efficiently computable. We employ benchmark
datasets to empirically demonstrate that GRAB is effective in terms of
computational efficiency, prediction accuracy and knowledge discovery
Multiple Hypothesis Testing in Pattern Discovery
The problem of multiple hypothesis testing arises when there are more than
one hypothesis to be tested simultaneously for statistical significance. This
is a very common situation in many data mining applications. For instance,
assessing simultaneously the significance of all frequent itemsets of a single
dataset entails a host of hypothesis, one for each itemset. A multiple
hypothesis testing method is needed to control the number of false positives
(Type I error). Our contribution in this paper is to extend the multiple
hypothesis framework to be used with a generic data mining algorithm. We
provide a method that provably controls the family-wise error rate (FWER, the
probability of at least one false positive) in the strong sense. We evaluate
the performance of our solution on both real and generated data. The results
show that our method controls the FWER while maintaining the power of the test.Comment: 28 page
Effective pattern discovery for text mining
Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase) based approaches should perform better than the term-based ones, but many experiments did not support this hypothesis. This paper presents an innovative technique, effective pattern discovery which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance
- …