Search CORE

2 research outputs found

Scalable Mining of High-Utility Sequential Patterns With Three-Tier MapReduce Model

Author: Djenouri Youcef
Li Yuanfa
Lin Jerry Chun-Wei
Srivastava Gautam
Yu Philip S.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2021
Field of study

High-utility sequential pattern mining (HUSPM) is a hot research topic in recent decades since it combines both sequential and utility properties to reveal more information and knowledge rather than the traditional frequent itemset mining or sequential pattern mining. Several works of HUSPM have been presented but most of them are based on main memory to speed up mining performance. However, this assumption is not realistic and not suitable in large-scale environments since in real industry, the size of the collected data is very huge and it is impossible to fit the data into the main memory of a single machine. In this article, we first develop a parallel and distributed three-stage MapReduce model for mining high-utility sequential patterns based on large-scale databases. Two properties are then developed to hold the correctness and completeness of the discovered patterns in the developed framework. In addition, two data structures called sidset and utility-linked list are utilized in the developed framework to accelerate the computation for mining the required patterns. From the results, we can observe that the designed model has good performance in large-scale datasets in terms of runtime, memory, efficiency of the number of distributed nodes, and scalability compared to the serial HUSP-Span approach.acceptedVersio

SINTEF Open

NORA - Norwegian Open Research Archives

MapFIM+: Memory Aware Parallelized Frequent Itemset Mining In Very Large Datasets

Author: CC Aggarwal
J Dean
K Beedkar
K-C Duong
M. Al Hajj Hassan
M. Al Hajj Hassan
S Salah
S Salah
Publication venue: Springer Berlin / Heidelberg
Publication date: 28/08/2017
Field of study

International audienceMining frequent itemsets in large datasets has received much attention in recent years relying on MapReduce programming model. For instance, many efficient Frequent Itemset Mining (a.k.a. FIM) algorithms have been parallelized to MapReduce principle such as Parallel Apriori, Parallel FP-Growth and Dist-Eclat. However, most approaches focus on job partitioning and/or load balancing without considering the extensibility depending on required memory assumptions. Thus, a challenge in designing parallel FIM algorithms consists therefore in finding ways to guarantee that data structures used during the mining process always fit in the local memory of processing nodes during all computation steps.In this paper, we propose MapFIM+, a two-phase approach to frequent itemset mining in very large datasets benefiting both from a MapReduce based distributed Apriori method and local in-memory FIM methods. In our approach, MapReduce is first used to generate frequent itemsets until getting local memory-fitted prefix-projected databases, and an optimized local in-memory mining process is then launched to generate all remaining frequent itemsets from each prefix-projected database on individual processing nodes. Indeed, MapFIM+ improves our previous algorithm MapFIM by using an exact evaluation of prefix-projected database sizes during the MapReduce phase. This improvement makes MapFIM+ more efficient, especially for databases leading to huge candidate sets, by significantly reducing communication and disk I/O costs. Performance evaluation shows that MapFIM+ is more efficient and more extensible than existing MapReduce based frequent itemset mining approaches

Crossref

HAL Université de Tours