
    Impacts of frequent itemset hiding algorithms on privacy preserving data mining

    Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2010. Includes bibliographical references (leaves 54-58). Text in English; abstract in Turkish and English. x, 69 leaves.

    The unstoppable growth of computer capabilities and the collection of large amounts of data in recent years have made data mining a popular analysis tool. Association rules (frequent itemsets), classification, and clustering are the main methods used in data mining research. The first part of this thesis implements and compares two frequent itemset mining algorithms that work without candidate itemset generation: Matrix Apriori and FP-Growth. The comparison revealed that Matrix Apriori performs better thanks to its faster data structure. One of the great challenges of data mining is finding hidden patterns without violating data owners' privacy; privacy preserving data mining came into prominence as a solution. In the second study of the thesis, the Matrix Apriori algorithm is modified and a frequent itemset hiding framework is developed. Four frequent itemset hiding algorithms are proposed such that: i) all versions work without pre-mining, so the privacy breach caused by the knowledge obtained by finding frequent itemsets is prevented in advance; ii) efficiency is increased since no pre-mining is required; iii) supports are found during the hiding process, and at the end the sanitized dataset and the frequent itemsets of this dataset are given as outputs, so no post-mining is required; iv) the heuristics use pattern lengths rather than transaction lengths, eliminating the possibility of distorting more valuable data.
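    A minimal sketch of the support-reduction idea behind itemset hiding follows. It is not the thesis's Matrix Apriori-based framework (which avoids pre- and post-mining); the dataset, threshold, and victim-selection rule here are illustrative placeholders.

```python
# Minimal support-reduction sketch for frequent itemset hiding.
# It hides one sensitive itemset by deleting one of its items from
# supporting transactions until the support drops below the threshold.

def support(dataset, itemset):
    """Count transactions containing every item of `itemset`."""
    return sum(1 for t in dataset if itemset <= t)

def hide_itemset(dataset, sensitive, min_support):
    """Sanitize `dataset` in place so `sensitive` becomes infrequent."""
    while support(dataset, sensitive) >= min_support:
        # Pick any supporting transaction; a real heuristic would pick
        # the victim transaction and item to minimize side effects.
        victim_tx = next(t for t in dataset if sensitive <= t)
        victim_tx.discard(next(iter(sensitive)))
    return dataset

transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
hide_itemset(transactions, {"a", "b"}, min_support=2)
print(support(transactions, {"a", "b"}))  # now below the threshold
```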

    Association rule hiding using integer linear programming

    Privacy preserving data mining has become the focus of attention of government statistical agencies and the database security research community, who are concerned with preventing privacy disclosure during data mining. Repositories of large datasets include sensitive rules that need to be concealed from unauthorized access. Hence, association rule hiding has emerged as one of the powerful techniques for hiding sensitive knowledge that exists in data before it is published. In this paper, we present a constraint-based optimization approach for hiding a set of sensitive association rules, using a well-structured integer linear program formulation. The proposed approach reduces the database sanitization problem to an instance of the integer linear programming problem. The solution of the integer linear program determines the transactions that need to be sanitized in order to conceal the sensitive rules while minimizing the impact of sanitization on the non-sensitive rules. We also present a heuristic sanitization algorithm that performs hiding by reducing the support or the confidence of the sensitive rules. The results of the experimental evaluation of the proposed approach on real-life datasets indicate promising performance in terms of side effects on the original database.
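    A toy version of such a formulation, sketched with the PuLP library, is shown below. The single-itemset constraint and the objective are simplifications of the paper's full formulation; the dataset and threshold are made up.

```python
# Toy ILP for hiding one sensitive itemset, sketched with PuLP.
# x_t = 1 means transaction t gets sanitized (an item of the sensitive
# itemset is removed from it). Objective: sanitize as few supporting
# transactions as possible, a crude proxy for minimizing side effects.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
sensitive = {"a", "b"}
min_support = 2  # the itemset must end up with support < min_support

supporting = [i for i, t in enumerate(transactions) if sensitive <= t]

prob = LpProblem("rule_hiding", LpMinimize)
x = {i: LpVariable(f"x_{i}", cat=LpBinary) for i in supporting}

prob += lpSum(x.values())  # minimize number of sanitized transactions
# Remaining support must fall below the threshold:
prob += len(supporting) - lpSum(x.values()) <= min_support - 1

prob.solve()
to_sanitize = [i for i in supporting if x[i].value() == 1]
print("sanitize transactions:", to_sanitize)
```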

    Privacy preserving data mining

    A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question: since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to the precise information in individual data records? We analyze the possibility of privacy in data mining techniques in two phases: randomization and reconstruction. Data mining services require accurate input data for their results to be meaningful, but privacy concerns may influence users to provide spurious information. To preserve client privacy in the data mining process, techniques based on random perturbation of data records are used. Suppose there are many clients, each having some personal information, and one server, which is interested only in aggregate, statistically significant properties of this information. The clients can protect the privacy of their data by perturbing it with a randomization algorithm and then submitting the randomized version. This approach is called randomization. The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. For the concept of using value distortion to protect privacy to be useful, we need to be able to reconstruct the original data distribution so that data mining techniques can be effectively utilized to yield the required statistics.

    Analysis. Let x_i be the original instance of data at client i. We introduce a random shift y_i using the randomization technique explained below. The server runs the reconstruction algorithm (also explained below) on the perturbed value z_i = x_i + y_i to get an approximation of the original data distribution suitable for data mining applications.

    Randomization. We have used the following randomizing operator for data perturbation: given x, let R(x) = (x + ε) mod 1001, where ε is chosen uniformly at random in {-100, ..., 100}.

    Reconstruction of the discrete data set. Given the distributions f_X(x) = P(X = x), f_Y(y) = P(Y = y), and f_Z(z) = P(Z = z), and using Z = X + Y with X and Y independent:

    P(X = x | Z = z) = P(X = x, Z = z) / P(Z = z)
                     = P(X = x, X + Y = z) / f_Z(z)
                     = P(X = x, Y = z - x) / f_Z(z)
                     = P(X = x) * P(Y = z - x) / f_Z(z)   (X, Y independent)
                     = f_X(x) * f_Y(z - x) / f_Z(z)

    Results. In this project we have addressed two aspects of privacy preserving data mining. The first phase involves perturbing the original data set using the randomization operator, and the second phase deals with reconstructing the randomized data set using the proposed algorithm to get an approximation of the original data set. Performance metrics such as percentage deviation, accuracy, and privacy breaches were calculated. In this project we studied the technical feasibility of realizing privacy preserving data mining. The basic premise was that the sensitive values in a user's record are perturbed using a randomizing function, and an approximation of the original data distribution is recovered from the perturbed data set using the reconstruction algorithm.
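    A runnable sketch of the randomizing operator and of an iterative Bayes reconstruction built from the posterior above (in the style of the standard Agrawal-Srikant reconstruction; the domain size matches the abstract, but the sample count, iteration count, and "original" data are illustrative):

```python
# Sketch of the operator R(x) = (x + eps) mod 1001 and an iterative
# Bayes reconstruction of f_X from the perturbed values z_i, using
# P(X = x | Z = z) proportional to f_X(x) * f_Y((z - x) mod 1001).
import numpy as np

D = 1001                      # domain {0, ..., 1000}
rng = np.random.default_rng(0)

def randomize(x):
    """R(x) = (x + eps) mod 1001, eps uniform in {-100, ..., 100}."""
    return (x + rng.integers(-100, 101, size=x.shape)) % D

# f_Y as a function of the wrapped difference d = (z - x) mod 1001:
f_Y = np.zeros(D)
f_Y[np.r_[0:101, 901:1001]] = 1.0 / 201.0

x = rng.integers(400, 600, size=2000)   # some "original" client values
z = randomize(x)

# Iterative reconstruction: start uniform, repeatedly average the
# posteriors P(X = x | Z = z_i) over all observations.
f_X = np.full(D, 1.0 / D)
support = np.arange(D)
for _ in range(30):
    lik = f_Y[(z[:, None] - support[None, :]) % D]   # shape (N, D)
    post = lik * f_X[None, :]
    post /= post.sum(axis=1, keepdims=True)
    f_X = post.mean(axis=0)

print("estimated mass on the true range [400, 600):", f_X[400:600].sum())
```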

    Abduction and Anonymity in Data Mining

    This thesis investigates two new research problems that arise in modern data mining: reasoning on data mining results, and the privacy implications of data mining results. Most data mining algorithms rely on inductive techniques, trying to infer information that is generalized from the input data. But very often this inductive step on raw data is not enough to answer the user's questions, and the data must be processed again using other inference methods. In order to answer high-level user needs such as explanation of results, we describe an environment able to perform abductive (hypothetical) reasoning, since the solutions of such queries can often be seen as the set of hypotheses that satisfy some requirements. Using cost-based abduction, we show how classification algorithms can be boosted by performing abductive reasoning over the data mining results, improving the quality of the output. Another growing research area in data mining is that of privacy-preserving data mining. Due to the availability of large amounts of data, easily collected and stored via computer systems, new applications are emerging, but unfortunately privacy concerns can make data mining unsuitable. We study the privacy implications of data mining in a mathematical and logical context, focusing on the anonymity of the people whose data are analyzed. A formal theory of anonymity preserving data mining is given, together with a number of anonymity-preserving algorithms for pattern mining. The post-processing improvement of data mining results (w.r.t. utility and privacy) is the central focus of the problems we investigated in this thesis.
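    As a rough illustration of anonymity in pattern mining, the sketch below applies a naive filter that publishes only patterns supported by at least k individuals. This is not the thesis's formal theory, which also accounts for inference channels between published patterns; the data, patterns, and k are made up.

```python
# Naive anonymity filter for pattern mining: publish only patterns
# supported by at least k individuals. A formal treatment would also
# detect inference channels between the published patterns, which
# this sketch ignores.

def support(dataset, pattern):
    """Count transactions (individuals) containing `pattern`."""
    return sum(1 for t in dataset if pattern <= t)

def k_anonymous_patterns(dataset, patterns, k):
    """Keep only the patterns whose support is at least k."""
    return [p for p in patterns if support(dataset, p) >= k]

data = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"b", "c"}]
candidates = [{"a"}, {"a", "b"}, {"a", "b", "c"}]
print(k_anonymous_patterns(data, candidates, k=2))  # drops {'a','b','c'}
```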

    Discovery and Extraction of Protein Sequence Motif Information that Transcends Protein Family Boundaries

    Protein sequence motifs are attracting more and more attention in the field of sequence analysis. These recurring patterns have the potential to determine the conformation, function, and activities of proteins. In our work, we obtained protein sequence motifs that are universally conserved across protein family boundaries. Therefore, unlike most popular motif discovering algorithms, our input dataset is extremely large, and an efficient technique is essential. We use two granular computing models, Fuzzy Improved K-means (FIK) and Fuzzy Greedy K-means (FGK), to efficiently generate protein motif information. After that, we develop an efficient Super Granular SVM Feature Elimination model to further extract the motif information. During the motif searching process, setting up a fixed window size in advance may reduce the computational complexity and increase efficiency. However, due to the fixed size, our model may deliver a number of similar motifs simply shifted by some residues or including mismatches. We develop a new strategy named Positional Association Super-Rule to confront the problem of motifs generated from a fixed window size. It combines super-rule analysis with a novel Positional Association Rule algorithm. We use the super-rule concept to construct a Super-Rule-Tree (SRT) via a modified HHK clustering, which requires no parameter setup, to identify the similarities and dissimilarities between the motifs. The positional association rule is created and applied to find similar motifs that are shifted by some residues. By analyzing the motif results generated by our approaches, we find that these motifs are significant not only in terms of sequence, but also in secondary structure similarity and biochemical properties.