
    Efficient mining of frequent item sets on large uncertain databases

    The data handled in emerging applications such as location-based services, sensor monitoring systems, and data integration is often inexact in nature. In this paper, we study the important problem of extracting frequent item sets from a large uncertain database, interpreted under the Possible World Semantics (PWS). This issue is technically challenging, since an uncertain database contains an exponential number of possible worlds. By observing that the mining process can be modeled as a Poisson binomial distribution, we develop an approximate algorithm, which can efficiently and accurately discover frequent item sets in a large uncertain database. We also study the important issue of maintaining the mining result for a database that is evolving (e.g., by inserting a tuple). Specifically, we propose incremental mining algorithms, which enable Probabilistic Frequent Item set (PFI) results to be refreshed. This reduces the need to re-execute the whole mining algorithm on the new database, which is often more expensive and unnecessary. We examine how an existing algorithm that extracts exact item sets, as well as our approximate algorithm, can support incremental mining. All our approaches support both tuple and attribute uncertainty, which are two common uncertain database models. We also perform extensive evaluation on real and synthetic data sets to validate our approaches. © 1989-2012 IEEE.
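To make the Poisson binomial observation concrete, here is a minimal sketch, assuming each transaction independently contains a given item set with a known probability. The function names and the choice of a plain Poisson approximation are illustrative assumptions, not the paper's actual algorithm. The support count is then a sum of independent Bernoulli variables, and the probability that it reaches a minimum support can be computed exactly by dynamic programming or approximated from the expected support alone.

```python
import math

def frequentness_exact(probs, minsup):
    """P(support >= minsup) under a Poisson binomial model (exact, O(n^2))."""
    # dist[k] = probability that exactly k of the transactions seen so far
    # contain the item set.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)   # this transaction lacks the item set
            nxt[k + 1] += q * p       # this transaction contains it
        dist = nxt
    return sum(dist[minsup:])

def frequentness_poisson(probs, minsup):
    """Approximate P(support >= minsup) with a Poisson of the same mean."""
    lam = sum(probs)  # expected support
    below = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                for k in range(minsup))
    return 1.0 - below

# Hypothetical per-transaction containment probabilities.
probs = [0.9, 0.5, 0.2, 0.7]
exact = frequentness_exact(probs, 2)
approx = frequentness_poisson(probs, 2)
```

The exact version needs one pass per transaction over the whole distribution, while the approximation needs only the sum of the probabilities, which is also easy to update when a tuple is inserted.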

    Study of Association Rule Mining and Different Hiding Techniques

    Data mining is the process of extracting hidden patterns from data. As more data is gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool to transform this data into information. In this paper, we first focused on the Apriori algorithm, a popular data mining technique, and compared the performance of a linked-list-based implementation, used as a baseline, with that of a trie-based implementation for mining frequent item sequences in a transactional database. We examined the data structure, implementation, and algorithmic features, mainly focusing on those that also arise in frequent item set mining. This algorithm has given us new capabilities to identify associations in large data sets. But a key problem, still not sufficiently investigated, is the need to balance the confidentiality of the disclosed data with the legitimate needs of the data users. A rule is characterized as sensitive if its disclosure risk is above a certain privacy threshold. Sometimes, sensitive rules should not be disclosed to the public, since, among other things, they may be used for inferring sensitive data, or they may provide business competitors with an advantage. So, next we worked with some association rule hiding algorithms and examined their performance in order to analyze their time complexity and the impact that they have on the original database. We measured two different side effects: the number of new rules generated during the hiding process and the number of non-sensitive rules lost during the process.
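The two side effects named above can be expressed as set differences between the rules mined before and after hiding. The following is a minimal sketch, assuming rules are represented as hashable values (plain strings here); the function name and the extra "still exposed" output are illustrative assumptions rather than part of the paper.

```python
def hiding_side_effects(rules_before, rules_after, sensitive):
    """Compare rule sets mined before and after a hiding algorithm runs."""
    before = set(rules_before)
    after = set(rules_after)
    sensitive = set(sensitive)
    # Rules that did not exist in the original database but appear afterwards.
    new_rules = after - before
    # Non-sensitive rules that were mined originally but are no longer found.
    lost_rules = (before - sensitive) - after
    # Sensitive rules the hiding process failed to hide (assumed extra metric).
    still_exposed = sensitive & after
    return new_rules, lost_rules, still_exposed

# Hypothetical rule sets, written as "antecedent->consequent".
before = {"a->b", "b->c", "a->c"}
after = {"b->c", "c->a"}
new_rules, lost_rules, still_exposed = hiding_side_effects(before, after, {"a->b"})
```

In this toy example one rule is new, one non-sensitive rule is lost, and the sensitive rule is successfully hidden.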

    Mining very long sequences with PLWAPLong algorithms

    Sequential pattern mining is the process of finding inter-transaction frequent sequential patterns in a sequential database, where records consist of ordered sets of events (or items), by applying data mining techniques. Discovering sequential patterns in web server logs is an example application of sequential mining, which is useful for predicting visiting patterns of web users for such purposes as targeted advertisements. The Position Coded Pre-order Linked Web Access Pattern (PLWAP) mining algorithm is one of the existing efficient web sequential pattern mining algorithms; it stores the frequent sequences of the entire sequential database in a compressed tree form with position-coded nodes. However, for very long sequences exceeding thirty-two nodes, the number of bits an integer position code can hold, the PLWAP algorithm's performance begins to degrade because it employs linked lists to store conjunctions of long position codes, and the linked list traversals slow down the algorithm during both tree construction and mining. The PLWAP algorithm also uses each and every node in the frequent 1-item event queue to test for that event's inclusion in the suffix tree root set during mining. This is a very expensive operation since, except for one node, all other nodes that are its ancestors and descendants are not included in the root set. This thesis proposes two new algorithms, PLWAPLong1 and PLWAPLong2. Both of these new algorithms use a new position code numbering scheme where each node is assigned two numeric variables (startPosition, endPosition) instead of one. Using this scheme, we can determine whether one node is an ancestor of another in O(1) time by comparing the startPosition and endPosition of the two nodes. The PLWAPLong1 algorithm also proposes transforming the linked-list-based tree to an equivalent array representation and using binary search to find the immediate descendant in a suffix tree. PLWAPLong2 uses the existing linked-list-based tree. Both PLWAPLong1 and PLWAPLong2 introduce a new technique called Last Descendant to eliminate unwanted nodes from the ancestor/descendant test when creating the suffix tree root set. Keywords: Data mining, Web Mining, Association Rule Mining, Long Sequences, PLWAP Mining
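The two-number position code can be illustrated with a standard pre-order interval labeling. This is a minimal sketch under stated assumptions: the abstract does not give the thesis's exact numbering rule, so here `start` is the pre-order visit number and `end` is the largest `start` in the node's subtree; the `Node` class and function names are hypothetical.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.children = []
        self.start = 0  # assumed: pre-order visit number
        self.end = 0    # assumed: largest start value inside this subtree

def assign_positions(root):
    """Label every node with (start, end) using an iterative pre-order walk."""
    counter = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            node.end = counter  # all descendants have been numbered by now
            continue
        counter += 1
        node.start = counter
        stack.append((node, True))
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, False))

def is_ancestor(a, b):
    """O(1) test: a is a proper ancestor of b iff b's interval nests in a's."""
    return a.start < b.start and b.end <= a.end

# Tiny example tree: root -> a -> b, and root -> c.
root, a, b, c = Node("root"), Node("a"), Node("b"), Node("c")
root.children = [a, c]
a.children = [b]
assign_positions(root)
```

The walk is iterative rather than recursive so that very deep trees, which long sequences produce, do not hit Python's recursion limit. Unlike a bit-string position code, the two integers do not grow with the depth of the node.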

    A Model-Based Frequency Constraint for Mining Associations from Transaction Data

    Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of the association's significance. A single user-specified support threshold is used to decide whether associations should be further investigated. Support has some known problems with rare items, favors shorter itemsets, and sometimes produces misleading associations. In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model) which allows for transaction data's typically highly skewed item frequency distribution. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier for the user to set and interpret.