    Max-FISM: Mining (recently) maximal frequent itemsets over data streams using the sliding window model

    Frequent itemset mining from data streams is an important data mining problem with broad applications such as retail market data analysis, network monitoring, web usage mining, and stock market prediction. However, it is also a difficult problem due to the unbounded, high-speed and continuous characteristics of streaming data. Therefore, extracting frequent itemsets from more recent data can enhance the analysis of stream data. In this paper, we propose an efficient algorithm, called Max-FISM (Maximal-Frequent Itemsets Mining), for mining recent maximal frequent itemsets from a high-speed stream of transactions within a sliding window. In our algorithm, whenever a new transaction is inserted into the current window, only its maximum itemset is inserted into a prefix tree-based summary data structure called Max-Set, which maintains the number of independent appearances of each transaction in the current window. Finally, the set of recent maximal frequent itemsets is obtained from the current Max-Set. Experimental studies show that the proposed Max-FISM algorithm is highly efficient in terms of memory and time complexity for mining recent maximal frequent itemsets over high-speed data streams.
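To make the notions of "sliding window" and "maximal frequent itemset" in the abstract above concrete, here is a minimal brute-force sketch. It is not Max-FISM and does not implement the Max-Set prefix tree; it simply recomputes the answer for every window, which is exponential in transaction length and only suitable for tiny examples. The function names, `window_size`, and `min_count` are hypothetical names chosen for illustration.

```python
from collections import deque
from itertools import combinations


def maximal_frequent_itemsets(window, min_count):
    """Brute-force maximal frequent itemsets of the transactions in `window`."""
    # Count every non-empty subset of every transaction (exponential; demo only).
    counts = {}
    for transaction in window:
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                key = frozenset(subset)
                counts[key] = counts.get(key, 0) + 1
    frequent = [s for s, c in counts.items() if c >= min_count]
    # An itemset is maximal if no proper superset of it is also frequent.
    return {s for s in frequent if not any(s < t for t in frequent)}


def mine_stream(stream, window_size, min_count):
    """Yield the maximal frequent itemsets after each arriving transaction."""
    # deque(maxlen=...) drops the oldest transaction automatically.
    window = deque(maxlen=window_size)
    for transaction in stream:
        window.append(frozenset(transaction))
        yield maximal_frequent_itemsets(window, min_count)


stream = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
results = list(mine_stream(stream, window_size=3, min_count=2))
```

With a window of three transactions and a minimum count of two, the third result contains the itemsets {a, b} and {a, c}; once the oldest transaction slides out, only {a} and {b} remain frequent. An incremental summary structure such as the one the abstract describes would avoid recounting the whole window on every arrival.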

    Knowledge discovery in data streams

    Knowing what to do with the massive amount of data collected has been an ongoing issue for many organizations. While data mining has been touted as the solution, it has failed to deliver the expected impact despite its successes in many areas. One reason is that data mining algorithms were not designed for the real world, i.e., they usually assume a static view of the data and a stable execution environment where resources are abundant. The reality, however, is that data are constantly changing and the execution environment is dynamic. Hence, it becomes difficult for data mining to truly deliver timely and relevant results. Recently, the processing of stream data has received much attention. What is interesting is that the methodology used to design stream-based algorithms may well be the solution to the above problem. In this entry, we discuss this issue and present an overview of recent works.

    Techniques for improving clustering and association rules mining from very large transactional databases

    Clustering and association rules mining are two core data mining tasks that have been actively studied by the data mining community for nearly two decades. Though many clustering and association rules mining algorithms have been developed, no algorithm is better than the others in all aspects, such as accuracy, efficiency, scalability, adaptability and memory usage. While more efficient and effective algorithms need to be developed for handling large-scale and complex stored datasets, emerging applications where data takes the form of streams pose new challenges for the data mining community. The existing techniques and algorithms for static stored databases cannot be applied to data streams directly. They need to be extended or modified, or new methods need to be developed to process the data streams. In this thesis, algorithms have been developed for improving the efficiency and accuracy of clustering and association rules mining on very large, high dimensional, high cardinality, sparse transactional databases and data streams. A new similarity measure suitable for clustering transactional data is defined and an incremental clustering algorithm, INCLUS, is proposed using this similarity measure. The algorithm scans the database only once and produces clusters based on the user's expectations of similarities between transactions in a cluster, which are controlled by the user input parameters, a similarity threshold and a support threshold. Intensive testing has been performed to evaluate the effectiveness, efficiency, scalability and order insensitivity of the algorithm. To extend INCLUS for transactional data streams, an equal-width time window model and an elastic time window model are proposed that allow mining of clustering changes in evolving data streams. The minimal width of the window is determined by the minimum clustering granularity for a particular application. 
Two algorithms, CluStream_EQ and CluStream_EL, based on the equal-width window model and the elastic window model respectively, are developed by incorporating these models into INCLUS. Each algorithm consists of an online micro-clustering component and an offline macro-clustering component. The online component writes summary statistics of a data stream to the disk, and the offline component uses those summaries and other user input to discover changes in a data stream. The effectiveness and scalability of the algorithms are evaluated by experiments. This thesis also looks into sampling techniques that can improve the efficiency of mining association rules in a very large transactional database. The sample size is derived based on the binomial distribution and the central limit theorem. The sample size used is smaller than that based on Chernoff bounds, but still provides the same approximation guarantees. The accuracy of the proposed sampling approach is theoretically analyzed and its effectiveness is experimentally evaluated on both dense and sparse datasets. Applications of stratified sampling for association rules mining are also explored in this thesis. The database is first partitioned into strata based on the length of transactions, and simple random sampling is then performed on each stratum. The total sample size is determined by a formula derived in this thesis and the sample size for each stratum is proportionate to the size of the stratum. The accuracy of transaction size based stratified sampling is experimentally compared with that of random sampling. The thesis concludes with a summary of significant contributions and some pointers for further work.
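The abstract above does not state the thesis's sample-size formula, so the following is only a sketch using the textbook forms: the Chernoff-style bound n >= ln(2/delta) / (2 * epsilon^2) and the normal-approximation bound n >= z^2 * p(1 - p) / epsilon^2, evaluated at the worst case p = 0.5. Here `epsilon` is the tolerated error in estimated support and `delta` is the allowed failure probability. All function and parameter names are hypothetical. The last function shows proportional allocation of a total sample across strata, using a largest-remainder rule (an assumption) so the allocations sum exactly to the total.

```python
import math
from statistics import NormalDist


def chernoff_sample_size(epsilon, delta):
    """Textbook Chernoff-style bound: n >= ln(2/delta) / (2 * epsilon^2)."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


def clt_sample_size(epsilon, delta, support=0.5):
    """Normal approximation: n >= z^2 * p(1-p) / epsilon^2 (worst case p = 0.5)."""
    z = NormalDist().inv_cdf(1 - delta / 2)  # two-sided standard normal quantile
    return math.ceil(z ** 2 * support * (1 - support) / epsilon ** 2)


def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size."""
    total = sum(stratum_sizes)
    shares = [size * total_sample / total for size in stratum_sizes]
    alloc = [math.floor(share) for share in shares]
    # Hand the leftover units to the strata with the largest fractional parts.
    leftover = total_sample - sum(alloc)
    by_fraction = sorted(
        range(len(shares)), key=lambda i: shares[i] - alloc[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


n_chernoff = chernoff_sample_size(epsilon=0.01, delta=0.05)
n_clt = clt_sample_size(epsilon=0.01, delta=0.05)
per_stratum = proportional_allocation([500, 300, 200], total_sample=100)
```

For epsilon = 0.01 and delta = 0.05 the normal-approximation size is roughly half the Chernoff-style size, which is consistent with the abstract's claim of a smaller sample. The normal approximation is asymptotic, so how its guarantee compares with the Chernoff bound in practice depends on the analysis in the thesis itself.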

    A Synopsis Based Approach for Itemset Frequency Estimation over Massive Multi-Transaction Stream

    Streams in which multiple transactions are associated with the same key are prevalent in practice, e.g., a customer has multiple shopping records arriving at different times. Itemset frequency estimation on such streams is very challenging since sampling based methods, such as the widely used reservoir sampling, cannot be used. In this article, we propose a novel k-Minimum Value (KMV) synopsis based method to estimate the frequency of itemsets over multi-transaction streams. First, we extract the KMV synopses for each item from the stream. Then, we propose a novel estimator to estimate the frequency of an itemset over the KMV synopses. Compared to the existing estimator, our method is not only more accurate and more efficient to calculate but also follows the downward-closure property. These properties enable the incorporation of our new estimator into existing frequent itemset mining (FIM) algorithms (e.g., FP-Growth) to mine frequent itemsets over multi-transaction streams. To demonstrate this, we implement a KMV synopsis based FIM algorithm by integrating our estimator into existing FIM algorithms, and we prove it is capable of guaranteeing the accuracy of FIM with a bounded size of KMV synopsis. Experimental results on massive streams show that our estimator significantly improves the accuracy of both itemset frequency estimation and FIM compared to the existing estimators.
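As background for the abstract above, here is a minimal sketch of a basic KMV synopsis used for distinct-count estimation. It is the standard building block, not the paper's itemset estimator, which the abstract does not specify. The synopsis keeps the k smallest hash values seen and estimates the number of distinct keys as (k - 1) divided by the k-th smallest hash. The class name, the SHA-1 based `unit_hash` helper, and the per-item usage are assumptions for illustration.

```python
import hashlib
import heapq


def unit_hash(value):
    """Map a value to a pseudo-random float in [0, 1] via SHA-1 (assumed hash)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


class KMVSynopsis:
    """Keeps the k smallest distinct hash values seen so far."""

    def __init__(self, k):
        self.k = k
        self._heap = []        # max-heap via negated hashes
        self._members = set()  # hashes currently held, to skip duplicates

    def add(self, value):
        h = unit_hash(value)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash beats the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        """Estimated number of distinct values added."""
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct hashes: count is exact
        return (self.k - 1) / -self._heap[0]


# Hypothetical usage: one synopsis per item, fed with the keys (e.g. customer
# ids) of the transactions that contain the item.
synopsis = KMVSynopsis(k=256)
for customer_id in range(10_000):
    synopsis.add(customer_id)
    synopsis.add(customer_id)  # repeated keys do not change the synopsis
approx_distinct = synopsis.estimate()
```

Because the synopsis depends only on the set of distinct keys, repeated transactions from the same key do not inflate the estimate, which is the property that makes KMV attractive for multi-transaction streams.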

    Discovering Interesting Patterns and Associations in Data Streams

    A data stream is a sequence of items that arrive in time order. Unlike data in traditional static databases, data streams are continuous, unbounded, usually arrive at high speed, and have a data value distribution that often changes with time (Guha, 2001). As more applications such as web transactions, telephone records, and network flows generate a large number of data streams every day, efficient knowledge discovery from data streams is an active and growing research area in data mining with broad applications. Traditional data mining algorithms are developed to work on a complete static dataset and, thus, cannot be applied directly in data stream applications. One area of data mining research is mining association relationships in a data set. Most association mining techniques for data streams can be categorized into two types: those based on frequent patterns and those based on closed patterns. Because the number of frequent patterns is often huge, the frequent patterns contain redundant, non-informative patterns. An alternative is to develop association mining approaches for data streaming applications based on closed patterns, which generally represent a small subset of all frequent patterns but provide complete and condensed information. In this line of research, closed pattern mining is the prerequisite for non-redundant and informative association mining. In this dissertation, a sliding window technique for dynamic mining of closed patterns in data streams is proposed, and an approach for mining non-redundant and informative associations based on the discovered closed patterns is developed. Closed pattern mining and the related association mining techniques were selected as the research area of this dissertation for the following reasons. 
First, the closed patterns for a given collection of data are currently the most compact representation that can provide complete support information for all data patterns. Compared with other techniques, the proposed closed pattern mining technique has the potential to greatly decrease the number of subsequent combinatorial calculations performed on the data patterns. Second, the memory required to store the closed patterns and relevant associations is generally lower than that required for the corresponding frequent patterns and associations. In some data streaming applications, memory usage is an important measure, because in these applications memory usage is the bottleneck for knowledge discovery. Third, the associations generated for data streams are the knowledge used to identify the relations within the data. The discovered relations have wide applications in many data streaming environments. Unlike closed pattern mining techniques for traditional databases, which require multiple scans of the entire database, the proposed technique determines the closed patterns with a single scan. It is an incremental mining process; as the sliding window advances, new data transactions enter and old data transactions exit the window. Instead of regenerating closed patterns from the entire window, the proposed technique updates the old set of closed patterns whenever a new transaction arrives and/or an old transaction leaves the sliding window to obtain the current set of closed patterns. This incremental feature allows the user to get the most recently updated closed patterns without rescanning the entire updated database, which saves not only computation time but, more importantly, the I/O time needed to load and write data between the database and memory. In addition, the proposed sliding window technique can handle the insertion and deletion operations independently, which allows the user to adjust the sliding window size in different application environments. 
Furthermore, the proposed interesting patterns and association mining framework can handle different users' requests at the same time, each with its own specified support and confidence thresholds and input and output patterns of interest. The research includes both theoretical proofs of correctness for the proposed algorithms and simulation experiments that compare the proposed techniques with those existing in the literature using synthetic and real datasets. The utility of the proposed technique is demonstrated by applying it to sensor network databases from a traffic management site and an environmental monitoring site for missing data estimation.

    Adaptive Learning and Mining for Data Streams and Frequent Patterns

    This thesis is devoted to the design of data mining algorithms for evolving data streams and for the extraction of closed frequent trees. First, we deal with each of these tasks separately, and then we deal with them together, developing classification methods for data streams containing items that are trees. In the data stream model, data arrive at high speed, and the algorithms that must process them have very strict constraints of space and time. In the first part of this thesis we propose and illustrate a framework for developing algorithms that can adaptively learn from data streams that change over time. Our methods are based on using change detectors and estimator modules at the right places. We propose an adaptive sliding window algorithm, ADWIN, for detecting change and keeping updated statistics from a data stream, and use it as a black box in place of counters or accumulators in algorithms initially not designed for drifting data. 
Since ADWIN has rigorous performance guarantees, this opens the possibility of extending such guarantees to learning and mining algorithms. We test our methodology with several learning methods such as Naïve Bayes, clustering, decision trees and ensemble methods. We build an experimental framework for data stream mining with concept drift, based on the MOA framework, similar to WEKA, so that it will be easy for researchers to run experimental data stream benchmarks. Trees are connected acyclic graphs and they are studied as link-based structures in many cases. In the second part of this thesis, we describe a rather formal study of trees from the point of view of closure-based mining. Moreover, we present efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees. We include an analysis of the extraction of association rules of full confidence out of the closed sets of trees, and we have found there an interesting phenomenon: rules whose propositional counterpart is nontrivial are, however, always implicitly true in trees due to the peculiar combinatorics of the structures. Finally, using these results on evolving data stream mining and closed frequent tree mining, we present high performance algorithms for mining closed unlabeled rooted trees adaptively from data streams that change over time. We introduce a general methodology to identify closed patterns in a data stream, using Galois Lattice Theory. Using this methodology, we then develop an incremental method, a sliding-window based one, and finally one that mines closed trees adaptively from data streams. We use these methods to develop classification methods for tree data streams.

    Mining Association Rules Events over Data Streams

    Data streams have gained considerable attention in the data analysis and data mining communities because of the emergence of new classes of applications, such as monitoring, supply chain execution, sensor networks, oilfield and pipeline operations, financial marketing and health data industries. Telecommunication advancements have provided us with easy access to stream data produced by various applications. Data in streams differ from static data stored in data warehouses or databases. Data streams are continuous, arrive at high speed and change through time. Traditional data mining algorithms assume the presence of data in conventional storage, where data mining is performed centrally with the luxury of accessing the data multiple times, using powerful processors, and providing offline output with no time constraints. Such algorithms are not suitable for dynamic data streams. Stream data needs to be mined promptly, as it might not be feasible to store such a volume of data. In addition, streams reflect the live status of the environment generating them, so prompt analysis may provide early detection of faults, delays, performance measurements, trend analysis and other diagnostics. This thesis focuses on developing a data stream association rule mining algorithm for co-occurring events. The proposed algorithm mines association rules over data streams incrementally in a centralized setting. We are interested in association rules that meet a provided minimum confidence threshold and have a lift value greater than 1. We refer to such association rules as strong rules. Experiments on several datasets demonstrate that the proposed algorithm is efficient and effective in extracting association rules from data streams, with faster processing time and better memory management.
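The "strong rule" criterion in the abstract above (confidence at or above a minimum threshold and lift greater than 1) can be illustrated with the standard definitions: confidence(A -> C) = count(A and C) / count(A), and lift(A -> C) = confidence(A -> C) / support(C). This is a minimal batch sketch over a small list of event sets, not the thesis's incremental stream algorithm; the function names and the sample events are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (confidence, lift) of the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    count_a = sum(1 for t in transactions if antecedent <= t)
    count_c = sum(1 for t in transactions if consequent <= t)
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if count_a == 0 or count_c == 0:
        return 0.0, 0.0  # rule is undefined on this data; treat as not strong
    confidence = count_both / count_a
    lift = confidence / (count_c / n)
    return confidence, lift


def is_strong(transactions, antecedent, consequent, min_confidence):
    """Strong rule: confidence >= min_confidence and lift > 1."""
    confidence, lift = rule_metrics(transactions, antecedent, consequent)
    return confidence >= min_confidence and lift > 1


events = [
    {"alarm", "fault"},
    {"alarm", "fault"},
    {"alarm"},
    {"delay"},
    {"fault", "delay"},
]
confidence, lift = rule_metrics(events, {"alarm"}, {"fault"})
strong = is_strong(events, {"alarm"}, {"fault"}, min_confidence=0.6)
```

On this sample, the rule alarm -> fault has confidence 2/3 and lift 10/9, so it is strong at a 0.6 confidence threshold. The rule delay -> fault has lift below 1 and is rejected regardless of the threshold, which shows why the lift condition filters out rules whose antecedent does not actually raise the likelihood of the consequent.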