Multiple Hypothesis Testing in Pattern Discovery
The problem of multiple hypothesis testing arises when more than
one hypothesis is to be tested simultaneously for statistical significance. This
is a very common situation in many data mining applications. For instance,
assessing simultaneously the significance of all frequent itemsets of a single
dataset entails a host of hypotheses, one for each itemset. A multiple
hypothesis testing method is needed to control the number of false positives
(Type I errors). Our contribution in this paper is to extend the multiple
hypothesis framework to be used with a generic data mining algorithm. We
provide a method that provably controls the family-wise error rate (FWER, the
probability of at least one false positive) in the strong sense. We evaluate
the performance of our solution on both real and generated data. The results
show that our method controls the FWER while maintaining the power of the test.
Comment: 28 page
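To make the notion of FWER control concrete, here is a minimal sketch of Holm's step-down procedure, a standard textbook method that controls the FWER in the strong sense. It is not the method proposed in the paper above; the function name `holm_reject` and the example p-values are assumptions made for illustration.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure (illustrative sketch).

    Returns a list of booleans, True where the corresponding
    hypothesis is rejected at family-wise error rate `alpha`.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Stop at the first failure; all larger p-values are retained.
            break
    return reject


if __name__ == "__main__":
    # Hypothetical p-values, e.g. one per candidate itemset.
    print(holm_reject([0.01, 0.04, 0.03, 0.005], alpha=0.05))
```

In a pattern-discovery setting, each p-value would correspond to one tested pattern, so the number of hypotheses `m` can be very large and the per-test thresholds become correspondingly strict.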
From an implicational system to its corresponding D-basis
Closure systems are a fundamental concept appearing in several areas, such as databases, formal concept analysis, and artificial intelligence. It is well known that there exists a connection between a closure operator on a set and the lattice of its closed sets. Furthermore, the closure system can be replaced by a set of implications, but this set usually contains a lot of redundancy, inducing undesired properties.
In the literature, there is a common interest in the minimality of a set of implications because of the importance of bases. The well-known Duquenne-Guigues basis satisfies this minimality condition. However, several authors emphasize the relevance of optimality in order to reduce the size of the implications in the basis. In addition, some bases have been defined to improve the computation of closures by relying on the directness property. Computation with a direct basis is efficient because the closure is computed in one traversal.
In this work, we focus on the D-basis, which is ordered-direct. Obtaining it from an arbitrary implicational system is an open problem, and solving it is our aim in this paper. We introduce a method to compute the D-basis by means of minimal generators calculated using the Simplification Logic for implications.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech. Supported by Grants TIN2011-28084 and TIN2014-59471-P of the Science and Innovation Ministry of Spain, which is co-financed by the European Social Fund
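As an illustration of why directness matters, the following minimal sketch computes the closure of an attribute set under an arbitrary set of implications by naive fixpoint iteration, which may need several passes over the implications. It is not the D-basis construction described above; the function name `closure` and the example implications are assumptions made for illustration.

```python
def closure(attrs, implications):
    """Closure of `attrs` under `implications` by fixpoint iteration.

    `implications` is a list of (premise, conclusion) pairs of sets.
    With an arbitrary implicational system the list may have to be
    scanned several times before nothing new can be added.
    """
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            # Fire the implication if its premise holds and it adds something new.
            if premise <= closed and not conclusion <= closed:
                closed |= conclusion
                changed = True
    return closed


if __name__ == "__main__":
    # Hypothetical system: {b, c} -> {d} is listed before {a} -> {b},
    # so a single pass over the list is not enough for the input {a, c}.
    system = [({"b", "c"}, {"d"}), ({"a"}, {"b"})]
    print(sorted(closure({"a", "c"}, system)))
```

In the example, the first pass adds only `b` and a second pass is needed to add `d`. With a direct basis, by contrast, a single traversal suffices, which is the efficiency gain the abstract refers to.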
Interactive Constrained Association Rule Mining
We investigate ways to support interactive mining sessions, in the setting of
association rule mining. In such sessions, users specify conditions (queries)
on the associations to be generated. Our approach combines the
integration of querying conditions inside the mining phase with the incremental
querying of already generated associations. We present several concrete
algorithms and compare their performance.
Comment: A preliminary report on this work was presented at the Second
International Conference on Knowledge Discovery and Data Mining (DaWaK 2000
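The two ingredients named in this abstract, conditions applied inside the mining phase and querying of already generated associations, can be sketched on a toy scale as follows. This is a brute-force illustration under assumed names (`mine_rules`, `refine`, the `condition` predicate) and hypothetical data, not one of the paper's algorithms.

```python
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence,
               condition=lambda body, head: True):
    """Enumerate rules body -> head with a single-item head.

    `condition` is a user query checked during mining, so rules that
    violate it are never generated. Brute force, for tiny inputs only.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset)
            if s < min_support:
                continue
            for head_item in combo:
                head = frozenset([head_item])
                body = itemset - head
                if not condition(body, head):
                    continue  # query condition applied inside the mining phase
                confidence = s / support(body)
                if confidence >= min_confidence:
                    rules.append((body, head, s, confidence))
    return rules


def refine(rules, condition):
    """Query already generated rules without mining again."""
    return [r for r in rules if condition(r[0], r[1])]


if __name__ == "__main__":
    # Hypothetical transactions over items a, b, c.
    data = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("b")]
    all_rules = mine_rules(data, min_support=0.5, min_confidence=0.6)
    head_is_a = lambda body, head: head == frozenset("a")
    for body, head, s, c in refine(all_rules, head_is_a):
        print(sorted(body), "->", sorted(head), round(s, 2), round(c, 2))
```

Passing the predicate to `mine_rules` directly gives the same rules as filtering afterwards with `refine`; which option is cheaper in an interactive session depends on how selective the query is and on how much has already been mined.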
Multi-Sorted Inverse Frequent Itemsets Mining: On-Going Research
Inverse frequent itemset mining (IFM) consists of generating artificial transactional databases that reflect the patterns of real ones and, in particular, satisfy given frequency constraints on the itemsets. An extension of IFM, called many-sorted IFM, is introduced, in which the schemes for the datasets to be generated are those typical of Big Tables, as required in emerging big data applications, e.g., social network analytics