Mining Frequent Sequential Patterns From Multiple Databases Using Transaction Ids
Mining frequent sequential patterns from multiple databases to discover more complex patterns from multiple data sources, such as multiple E-Commerce (B2C) web sites, for comparative, historical and derived analysis poses the additional challenge of integrating mined patterns from multiple sources during various levels of mining. Existing work on mining frequent patterns from multiple databases (MDBs) includes the ApproxMap algorithm and the TidFP algorithm. The ApproxMap algorithm breaks its input sequences (e.g., the 2-column sequence <(123)(45)>) into columns so it can find the collection of approximate frequent sequences of all the columns as the approximate sequence of the database. The same method is used to integrate frequent sequences from each MDB, all of which must have the same table structure. The TidFP algorithm mines frequent itemsets from multiple sources that have different table structures and are related through foreign key attributes, using transaction ids to integrate patterns through set operations (e.g., intersect, union) in order to answer global queries involving multiple sources. The limitation of existing work on multiple-database sequential pattern mining, such as the ApproxMap algorithm, is that it cannot mine frequent sequences to answer exact and historical queries from MDBs of different structures, while the TidFP algorithm can only answer queries from MDBs on itemsets, not on sequences. This thesis proposes the Transaction id frequent sequence pattern (TidFSeq) algorithm, which applies the techniques the TidFP algorithm uses for mining itemsets to the problem of mining frequent sequences from diverse MDBs. To address the challenges of mining frequent sequences from MDBs, the TidFSeq algorithm first computes the element (i.e., sequence item position id) in which each item in each sequence (i.e., sequence id) occurs, replacing the tuple used in TidFP with a different tuple.
For every item 'i' in an n-sequence (a sequence of length n), the TidFSeq algorithm first transforms it into a tuple that specifies (a) its transaction id (Tid) and (b) the list of elements (the kth positions) of the sequence in this transaction in which item 'i' occurs (called its position id list). Next, a GSP-like candidate generation approach is applied to the transformed sequences to generate frequent sequences with transaction ids, which can be used to answer complex queries from MDBs through set operations. The proposed TidFSeq algorithm, the PrefixSpan algorithm and the ApproxMap algorithm are compared with respect to the results obtained for a given query, processing speed and memory requirement. Experiments show that the proposed TidFSeq algorithm mines the exact frequent sequences (i.e., 100% accuracy) from multiple sequence tables, whereas the ApproxMap algorithm has an accuracy of 79%. The TidFSeq algorithm also has a faster processing time for mining frequent sequences from multiple tables than the PrefixSpan and ApproxMap algorithms.
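The transformation described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `position_lists` and `tids_supporting` are hypothetical, the input layout (a dict from transaction id to a list of item sets) is assumed, and candidate patterns are restricted to one item per element for brevity. It only shows how an index of transaction ids and position id lists lets sequence support be checked and then combined with set operations.

```python
from collections import defaultdict


def position_lists(sequences):
    """Build item -> {tid: [positions of the elements containing the item]}.

    `sequences` maps a transaction id to a list of elements (sets of items).
    This layout is an assumption made for illustration.
    """
    index = defaultdict(dict)
    for tid, seq in sequences.items():
        for pos, element in enumerate(seq, start=1):
            for item in element:
                index[item].setdefault(tid, []).append(pos)
    return index


def tids_supporting(index, pattern):
    """Return the tids whose sequence contains the items of `pattern` in order.

    Simplification: each pattern element is a single item, and consecutive
    items must occur in strictly later elements.
    """
    # Only tids that contain every item of the pattern can support it.
    tids = set(index.get(pattern[0], {}))
    for item in pattern[1:]:
        tids &= set(index.get(item, {}))

    supporting = set()
    for tid in tids:
        last = 0
        matched = True
        for item in pattern:
            # Position lists are ascending, so take the earliest later position.
            later = [p for p in index[item][tid] if p > last]
            if not later:
                matched = False
                break
            last = later[0]
        if matched:
            supporting.add(tid)
    return supporting


# Hypothetical toy data: tid -> sequence of elements.
db = {1: [{1, 2, 3}, {4, 5}], 2: [{4}, {1}], 3: [{2}, {5}]}
index = position_lists(db)
a = tids_supporting(index, (1, 4))
b = tids_supporting(index, (2, 5))
print(a & b, a | b)
```

Because each pattern carries the set of transaction ids that support it, queries spanning several patterns or tables reduce to intersections and unions of those sets, which is the role the abstract assigns to transaction ids.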
SPAMS: A Novel Incremental Approach for Sequential Pattern Mining in Data Streams
Mining sequential patterns in data streams is a new challenging problem for the data mining community since data arrives sequentially in the form of continuous, rapid and infinite streams. In this paper, we propose a new on-line algorithm, SPAMS, to deal with the sequential pattern mining problem in data streams. This algorithm uses an automaton-based structure to maintain the set of frequent sequential patterns, i.e. SPA (Sequential Pattern Automaton). The sequential pattern automaton can be smaller than the set of frequent sequential patterns by two or more orders of magnitude, which allows us to overcome the problem of combinatorial explosion of sequential patterns. Current results can be output constantly on any user's specified thresholds. In addition, taking into account the characteristics of data streams, we propose a well-suited method said to be approximate since we can provide near optimal results with a high probability. Experimental studies show the relevance of the SPA data structure and the efficiency of the SPAMS algorithm on various datasets. Our contribution opens a promising gateway, by using an automaton as a data structure for mining frequent sequential patterns in data streams.
Privacy Preserving Utility Mining: A Survey
In the big data era, the collected data usually contains rich information and
hidden knowledge. Utility-oriented pattern mining and analytics have shown a
powerful ability to explore these ubiquitous data, which may be collected from
various fields and applications, such as market basket analysis, retail,
click-stream analysis, medical analysis, and bioinformatics. However, analysis
of these data with sensitive private information raises privacy concerns. To
achieve a better trade-off between utility maximizing and privacy preserving,
Privacy-Preserving Utility Mining (PPUM) has become a critical issue in recent
years. In this paper, we provide a comprehensive overview of PPUM. We first
present the background of utility mining, privacy-preserving data mining and
PPUM, then introduce the related preliminaries and problem formulation of PPUM,
as well as some key evaluation criteria for PPUM. In particular, we present and
discuss the current state-of-the-art PPUM algorithms, as well as their
advantages and deficiencies in detail. Finally, we highlight and discuss some
technical challenges and open directions for future research on PPUM.
Comment: 2018 IEEE International Conference on Big Data, 10 pages