25,070 research outputs found

    Mining Frequent Sequential Patterns From Multiple Databases Using Transaction Ids

    Mining frequent sequential patterns from multiple databases, in order to discover more complex patterns from multiple data sources such as multiple E-Commerce (B2C) web sites for comparative, historical and derived analysis, poses the additional challenge of integrating mined patterns from multiple sources during various levels of mining. Existing work on mining frequent patterns from multiple databases (MDBs) includes the ApproxMap algorithm and the TidFP algorithm. The ApproxMap algorithm breaks its input sequences (e.g., the 2-column sequence <(123)(45)>) into columns so that it can find the collection of approximate frequent sequences of all the columns as the approximate sequence of the database. The same method is used to integrate frequent sequences from each MDB, all of which must have the same table structure. The TidFP algorithm mines frequent itemsets from multiple sources that have different table structures and are related through foreign key attributes, using transaction ids to integrate patterns through set operations (e.g., intersect, union) in order to answer global queries involving multiple sources. Existing work on multiple-database sequential pattern mining has two limitations: the ApproxMap algorithm cannot mine frequent sequences to answer exact and historical queries from MDBs of different structure, while the TidFP algorithm can answer queries from MDBs only on itemsets, not on sequences. This thesis proposes the Transaction id frequent sequence pattern (TidFSeq) algorithm, which applies the techniques the TidFP algorithm uses for mining itemsets to the problem of mining frequent sequences from diverse MDBs. To address the challenges of mining frequent sequences from MDBs, the TidFSeq algorithm first computes the element (i.e., sequence item position id) in which each item in each sequence (i.e., sequence id) occurs, by replacing the tuple used in the TidFP algorithm with a tuple.
For every item ‘i’ in the kth sequence of an n-sequence of length ‘n’, the TidFSeq algorithm first transforms it into a tuple that specifies (a) its transaction id (Tid) and (b) the list of the kth sequences in this transaction in which item ‘i’ occurs (called its position id list). Next, a GSP-like candidate generation approach is applied to the transformed sequences to generate frequent sequences with transaction ids, which can be used to answer complex queries from MDBs through set operations. The proposed TidFSeq algorithm, the PrefixSpan algorithm and the ApproxMap algorithm are compared with respect to the results obtained for a given query, processing speed and memory requirement. Experiments show that the proposed TidFSeq algorithm mines the exact frequent sequences (i.e., 100% accuracy) from multiple sequence tables, whereas the ApproxMap algorithm has an accuracy of 79%. The TidFSeq algorithm also has a faster processing time for mining frequent sequences from multiple tables than the PrefixSpan and ApproxMap algorithms.
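
The transformation described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `position_index` and `supporting_ids`, the toy databases `db1` and `db2`, and the assumption that both databases share the same sequence ids are all hypothetical. The sketch records, for each item, the ids of the sequences that contain it together with the element positions where it occurs, and then combines the ids supporting a two-item sequence across databases with set operations.

```python
from collections import defaultdict


def position_index(sequences):
    """Map each item to {sequence id: [element positions where it occurs]}.

    `sequences` maps a sequence id to a list of elements (sets of items).
    Positions are 1-based. This layout is an illustrative assumption.
    """
    index = defaultdict(dict)
    for sid, sequence in sequences.items():
        for pos, element in enumerate(sequence, start=1):
            for item in element:
                index[item].setdefault(sid, []).append(pos)
    return index


def supporting_ids(index, first, second):
    """Ids of sequences where `first` occurs in an earlier element than `second`."""
    ids = set()
    for sid, first_positions in index.get(first, {}).items():
        second_positions = index.get(second, {}).get(sid, [])
        # The earliest `first` must precede the latest `second`.
        if second_positions and min(first_positions) < max(second_positions):
            ids.add(sid)
    return ids


# Two hypothetical sequence databases keyed by the same ids.
db1 = {1: [{"a", "b"}, {"c"}], 2: [{"a"}, {"c"}, {"b"}], 3: [{"c"}, {"a"}]}
db2 = {1: [{"a"}, {"c"}], 2: [{"c"}, {"a"}], 3: [{"a"}, {"b", "c"}]}

ids1 = supporting_ids(position_index(db1), "a", "c")
ids2 = supporting_ids(position_index(db2), "a", "c")

in_both = ids1 & ids2    # ids supporting <(a)(c)> in both databases
in_either = ids1 | ids2  # ids supporting <(a)(c)> in at least one database
```

Because each candidate keeps the set of ids that support it, a query spanning several databases reduces to an intersection or union of those sets, with no need to rescan the source tables.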

    SPAMS: A Novel Incremental Approach for Sequential Pattern Mining in Data Streams

    Mining sequential patterns in data streams is a new and challenging problem for the data mining community, since data arrives sequentially in the form of continuous, rapid and infinite streams. In this paper, we propose a new on-line algorithm, SPAMS, to deal with the sequential pattern mining problem in data streams. This algorithm uses an automaton-based structure, the SPA (Sequential Pattern Automaton), to maintain the set of frequent sequential patterns. The sequential pattern automaton can be smaller than the set of frequent sequential patterns by two or more orders of magnitude, which allows us to overcome the problem of combinatorial explosion of sequential patterns. Current results can be output constantly for any user-specified threshold. In addition, taking into account the characteristics of data streams, we propose a well-suited method that is said to be approximate, since it can provide near-optimal results with high probability. Experimental studies show the relevance of the SPA data structure and the efficiency of the SPAMS algorithm on various datasets. Our contribution opens a promising gateway by using an automaton as a data structure for mining frequent sequential patterns in data streams.

    Privacy Preserving Utility Mining: A Survey

    In the big data era, collected data usually contain rich information and hidden knowledge. Utility-oriented pattern mining and analytics have shown a powerful ability to explore these ubiquitous data, which may be collected from various fields and applications, such as market basket analysis, retail, click-stream analysis, medical analysis, and bioinformatics. However, analysis of data containing sensitive private information raises privacy concerns. To achieve a better trade-off between utility maximization and privacy preservation, Privacy-Preserving Utility Mining (PPUM) has become a critical issue in recent years. In this paper, we provide a comprehensive overview of PPUM. We first present the background of utility mining, privacy-preserving data mining and PPUM, then introduce the related preliminaries and problem formulation of PPUM, as well as some key evaluation criteria for PPUM. In particular, we present and discuss the current state-of-the-art PPUM algorithms, as well as their advantages and deficiencies, in detail. Finally, we highlight and discuss some technical challenges and open directions for future research on PPUM. Comment: 2018 IEEE International Conference on Big Data, 10 pages.