Mining frequent itemsets in distorted databases with granula computing

Abstract

Data perturbation is a popular method to achieve privacy-preserving data mining. However, distorted databases bring enormous overheads to mining algorithms as compared to original databases. In this paper, we present the GrC-FIM algorithm to address the efficiency problem in mining frequent itemsets from distorted databases. Two measures are introduced to overcome the weakness in existing work: firstly, the concept of independent granule is introduced, and granule inference is used to distinguish between non-independent itemsets and independent itemsets. We further prove that the support counts of non-independent itemsets can be directly derived from subitemsets, so that the error-prone reconstruction process can be avoided. This could improve the efficiency of the algorithm, and bring more accurate results; secondly, through the granular-bitmap representation, the support counts can be calculated in an efficient way. The empirical results on representative synthetic and real-world databases indicate that the proposed GrC-FIM algorithm outperforms the popular EMASK algorithm in both the efficiency and the support count reconstruction accuracy.<br /

    Similar works

    Full text

    thumbnail-image

    Available Versions