471 research outputs found
An efficient closed frequent itemset miner for the MOA stream mining system
Mining itemsets is a central task in data mining, both in the batch and the streaming paradigms. While robust, efficient, and well-tested implementations exist for batch mining, hardly any publicly available equivalent exists for the streaming scenario. The lack of an efficient, usable tool for the task hinders its use by practitioners and makes it difficult to assess new research in the area. To alleviate this situation, we review the algorithms described in the literature, and implement and evaluate the IncMine algorithm by Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams. Our implementation works on top of the MOA (Massive Online Analysis) stream mining framework to ease its use and integration with other stream mining tasks. We provide a PAC-style rigorous analysis of the quality of the output of IncMine as a function of its parameters; this type of analysis is rare in pattern mining algorithms. As a by-product, the analysis shows how one of the user-provided parameters in the original description can be removed entirely while retaining the performance guarantees. Finally, we experimentally confirm both on synthetic and real data the excellent performance of the algorithm, as reported in the original paper, and its ability to handle concept drift.Postprint (published version
New Spark solutions for distributed frequent itemset and association rule mining algorithms
Funding for open access publishing: Universidad de Gran-
ada/CBUA. The research reported in this paper was partially sup-
ported by the BIGDATAMED project, which has received funding
from the Andalusian Government (Junta de Andalucı ́a) under grant
agreement No P18-RT-1765, by Grants PID2021-123960OB-I00 and
Grant TED2021-129402B-C21 funded by Ministerio de Ciencia e
Innovacio ́n and, by ERDF A way of making Europe and by the
European Union NextGenerationEU. In addition, this work has been
partially supported by the Ministry of Universities through the EU-
funded Margarita Salas programme NextGenerationEU. Funding for
open access charge: Universidad de Granada/CBUAThe large amount of data generated every day makes necessary the re-implementation of new methods capable of handle with
massive data efficiently. This is the case of Association Rules, an unsupervised data mining tool capable of extracting information
in the form of IF-THEN patterns. Although several methods have been proposed for the extraction of frequent itemsets (previous
phase before mining association rules) in very large databases, the high computational cost and lack of memory remains a major
problem to be solved when processing large data. Therefore, the aim of this paper is three fold: (1) to review existent algorithms for
frequent itemset and association rule mining, (2)to develop new efficient frequent itemset Big Data algorithms using distributive
computation, as well as a new association rule mining algorithm in Spark, and (3) to compare the proposed algorithms with the
existent proposals varying the number of transactions and the number of items. To this purpose, we have used the Spark platform
which has been demonstrated to outperform existing distributive algorithmic implementations.Universidad de Granada/CBUAJunta de Andalucia
P18-RT-1765Ministry of Science and Innovation, Spain (MICINN)
Instituto de Salud Carlos III
Spanish Government
PID2021-123960OB-I00,
TED2021-129402B-C21ERDF A way of making EuropeEuropean Union NextGenerationEUMinistry of Universities through the E
A Novel Nodesets-Based Frequent Itemset Mining Algorithm for Big Data using MapReduce
Due to the rapid growth of data from different sources in organizations, the traditional tools and techniques that cannot handle such huge data are known as big data which is in a scalable fashion. Similarly, many existing frequent itemset mining algorithms have good performance but scalability problems as they cannot exploit parallel processing power available locally or in cloud infrastructure. Since big data and cloud ecosystem overcomes the barriers or limitations in computing resources, it is a natural choice to use distributed programming paradigms such as Map Reduce. In this paper, we propose a novel algorithm known as A Nodesets-based Fast and Scalable Frequent Itemset Mining (FSFIM) to extract frequent itemsets from Big Data. Here, Pre-Order Coding (POC) tree is used to represent data and improve speed in processing. Nodeset is the underlying data structure that is efficient in discovering frequent itemsets. FSFIM is found to be faster and more scalable in mining frequent itemsets. When compared with its predecessors such as Node-lists and N-lists, the Nodesets save half of the memory as they need only either pre-order or post-order coding. Cloudera\u27s Distribution of Hadoop (CDH), a MapReduce framework, is used for empirical study. A prototype application is built to evaluate the performance of the FSFIM. Experimental results revealed that FSFIM outperforms existing algorithms such as Mahout PFP, Mlib PFP, and Big FIM. FSFIM is more scalable and found to be an ideal candidate for real-time applications that mine frequent itemsets from Big Data
Algorithms for Extracting Frequent Episodes in the Process of Temporal Data Mining
An important aspect in the data mining process is the discovery of patterns having a great influence on the studied problem. The purpose of this paper is to study the frequent episodes data mining through the use of parallel pattern discovery algorithms. Parallel pattern discovery algorithms offer better performance and scalability, so they are of a great interest for the data mining research community. In the following, there will be highlighted some parallel and distributed frequent pattern mining algorithms on various platforms and it will also be presented a comparative study of their main features. The study takes into account the new possibilities that arise along with the emerging novel Compute Unified Device Architecture from the latest generation of graphics processing units. Based on their high performance, low cost and the increasing number of features offered, GPU processors are viable solutions for an optimal implementation of frequent pattern mining algorithmsFrequent Pattern Mining, Parallel Computing, Dynamic Load Balancing, Temporal Data Mining, CUDA, GPU, Fermi, Thread
- …