Frequent Itemset Mining for Big Data
Traditional data mining tools, developed to extract actionable knowledge from data, have proven inadequate for the huge volumes of data produced today.
Even the most popular algorithms for Frequent Itemset Mining, an exploratory data analysis technique used to discover frequent co-occurrences of items in a transactional dataset, are inefficient on larger and more complex data.
As a consequence, many parallel algorithms have been developed, based on modern frameworks able to leverage distributed computation in commodity clusters of machines (e.g., Apache Hadoop, Apache Spark). However, frequent itemset mining parallelization is far from trivial. The search-space exploration, on which all the techniques are based, is not easily partitionable. Hence, distributed frequent itemset mining is a challenging problem and an interesting research topic.
In this context, our main contributions are (i) an exhaustive theoretical and experimental analysis of the best-in-class approaches, whose outcomes and open issues motivated (ii) the development of a distributed high-dimensional frequent itemset miner, and (iii) a data mining framework that relies heavily on distributed frequent itemset mining to extract a specific type of itemsets.
The theoretical analysis highlights the challenges related to distributing and pre-partitioning the frequent itemset mining problem (i.e., the search-space exploration) and describes the most widely adopted distribution strategies.
The extensive experimental campaign, in turn, compares the expectations arising from the algorithmic choices against the actual performance of the algorithms. We ran more than 300 experiments to evaluate and discuss the performance of the algorithms across different real-life use cases and data distributions. The outcome of the review is that no algorithm is universally superior and that performance is heavily skewed by the data distribution.
Moreover, we identified a concrete gap in frequent pattern extraction for high-dimensional use cases. For this reason, we have developed our own distributed high-dimensional frequent itemset miner based on Apache Hadoop. The algorithm splits the search-space exploration into independent sub-tasks. However, since the exploration strongly benefits from full knowledge of the problem, we introduced an interleaved synchronization phase. The result is a trade-off between the benefits of a centralized state and the additional computational power afforded by parallelism. The experimental benchmarks, performed on real-life high-dimensional use cases, show the efficiency of the proposed approach in terms of execution time, load balancing, and resilience to memory issues.
Finally, the dissertation introduces a data mining framework in which distributed itemset mining is a fundamental component of the processing pipeline. The aim of the framework is the extraction of a new type of itemsets, called misleading generalized itemsets.
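To make the notion of frequent co-occurrences in a transactional dataset concrete, here is a minimal sketch of computing an itemset's support; the function name and the toy basket data are illustrative, not taken from the dissertation.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# toy transactional dataset (market baskets)
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]

print(support({"bread", "milk"}, baskets))  # 0.5 (appears in 2 of 4 baskets)
```

An itemset is called frequent when its support meets a user-chosen minimum threshold; frequent itemset miners enumerate all such itemsets efficiently instead of checking every candidate like this brute-force check does.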
Parallel Algorithm for Frequent Itemset Mining on Intel Many-core Systems
Frequent itemset mining leads to the discovery of associations and
correlations among items in large transactional databases. Apriori is a
classical frequent itemset mining algorithm, which employs iterative passes
over database combining with generation of candidate itemsets based on frequent
itemsets found at the previous iteration, and pruning of clearly infrequent
itemsets. The Dynamic Itemset Counting (DIC) algorithm is a variation of
Apriori, which tries to reduce the number of passes made over a transactional
database while keeping the number of itemsets counted in a pass relatively low.
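The level-wise pass structure described above can be sketched as follows; this is an illustrative plain-Python Apriori, not the paper's DIC variant, and the names (`apriori`, `min_count`) are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise Apriori: each pass counts candidate k-itemsets generated
    from the frequent (k-1)-itemsets of the previous pass, then prunes."""
    def count(cands):
        counts = {c: 0 for c in cands}
        for t in transactions:              # one full pass over the database
            for c in cands:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_count}

    frequent = count({frozenset({i}) for t in transactions for i in t})
    all_frequent, k = dict(frequent), 2
    while frequent:
        # join step: unions of frequent (k-1)-itemsets that form k-itemsets
        cands = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = count(cands)
        all_frequent.update(frequent)
        k += 1
    return all_frequent

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
print(apriori(baskets, min_count=2))
```

DIC departs from this scheme by starting to count new candidates part-way through a pass, which is how it reduces the total number of database passes.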
In this paper, we address the problem of accelerating DIC on the Intel Xeon Phi
many-core system for the case when the transactional database fits in main
memory. Intel Xeon Phi provides a large number of small compute cores with
vector processing units. The paper presents a parallel implementation of DIC
based on OpenMP technology and thread-level parallelism. We exploit the
bit-based internal layout for transactions and itemsets. This technique reduces
the memory space for storing the transactional database, simplifies the support
count via logical bitwise operation, and allows for vectorization of such a
step. Experimental evaluation on the platforms of the Intel Xeon CPU and the
Intel Xeon Phi coprocessor with large synthetic and real databases showed good
performance and scalability of the proposed algorithm.
Comment: Accepted for publication in Journal of Computing and Information Technology (http://cit.fer.hr)
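The bit-based vertical layout the paper exploits can be illustrated with plain Python integers as bitmasks: each item maps to a mask with one bit per transaction, and the support of an itemset is the popcount of the bitwise AND of its item masks. This is a sketch of the general technique, not the paper's vectorized Xeon Phi implementation; the function names are assumptions.

```python
from functools import reduce

def build_bitmasks(transactions):
    """Vertical bit layout: for each item, a bitmask whose bit j is set
    iff transaction j contains the item."""
    masks = {}
    for j, t in enumerate(transactions):
        for item in t:
            masks[item] = masks.get(item, 0) | (1 << j)
    return masks

def support_count(itemset, masks):
    """Support = popcount of the bitwise AND of the item masks."""
    acc = reduce(lambda a, b: a & b, (masks.get(i, 0) for i in itemset))
    return bin(acc).count("1")

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
masks = build_bitmasks(baskets)
print(support_count({"bread", "milk"}, masks))  # 2
```

Because the support count reduces to AND plus popcount over fixed-width words, it maps naturally onto the wide vector units of a many-core system, which is the property the paper's implementation exploits.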
Data distribution and performance optimization models for parallel data mining
Ankara: The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2013.
Thesis (Ph.D.) -- Bilkent University, 2013.
Includes bibliographical references, leaves 117-128.
We have embarked upon a multitude of approaches to improve the efficiency of
selected fundamental tasks in data mining. The present thesis is concerned with
improving the efficiency of parallel processing methods for large amounts of data.
We have devised new parallel frequent itemset mining algorithms that work on
both sparse and dense datasets, and 1-D and 2-D parallel algorithms for the
all-pairs similarity problem.
Two new parallel frequent itemset mining (FIM) algorithms, named NoClique
and NoClique2, parallelize our sequential vertical frequent itemset mining algorithm
named bitdrill, and use a method based on graph partitioning by vertex
separator (GPVS) to distribute and selectively replicate items. The method operates
on a graph where vertices correspond to frequent items and edges correspond
to frequent itemsets of size two. We show that partitioning this graph by a vertex
separator is sufficient to decide a distribution of the items such that the
sub-databases determined by the item distribution can be mined independently.
This distribution entails an amount of data replication, which may be reduced
by setting appropriate weights to vertices. The data distribution scheme is used
in the design of two new parallel frequent itemset mining algorithms. Both algorithms
replicate the items that correspond to the separator. NoClique replicates
the work induced by the separator and NoClique2 computes the same work collectively.
Computational load balancing and minimization of redundant or collective
work may be achieved by assigning appropriate load estimates to vertices. The
performance is compared to that of another parallelization which replicates all
items, and to the ParDCI algorithm. We introduce another parallel FIM method using a variation of item distribution
with selective item replication. We extend the GPVS model for parallel
FIM we have proposed earlier, by relaxing the condition of independent mining.
Instead of finding independently mined item sets, we may minimize the amount of
communication and partition the candidates in a fine-grained manner. We introduce
a hypergraph partitioning model of the parallel computation where vertices
correspond to candidates and hyperedges correspond to items. A load estimate is
assigned to each candidate with vertex weights, and item frequencies are given as
hyperedge weights. The model is shown to minimize data replication and balance
load accurately. We also introduce a re-partitioning model since we can generate
only so many levels of candidates at once, using fixed vertices to model previous
item distribution/replication. Experiments show that we improve over the higher
load imbalance of NoClique2 algorithm for the same problem instances at the
cost of additional parallel overhead.
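The separator argument behind the GPVS distribution can be sketched in a few lines: build the graph whose vertices are frequent items and whose edges are frequent 2-itemsets, remove a vertex separator, and the remaining connected components are the parts that can be mined independently (with the separator items replicated to each part). This is an illustrative sketch of the idea, not the thesis's bitdrill/NoClique implementation; all names here are assumptions.

```python
def components_after_separator(edges, separator):
    """Remove separator vertices from the graph of frequent 2-itemsets and
    return the connected components of what remains; each component induces
    a sub-database that can be mined independently."""
    adj = {}
    for u, v in edges:
        if u in separator or v in separator:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.add(u)
            stack.extend(adj[u] - seen)
        parts.append(comp)
    return parts

# frequent items a..e; edges are frequent 2-itemsets; {c} separates {a,b} from {d,e}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(components_after_separator(edges, {"c"}))
```

In the thesis's formulation the separator is found by a graph partitioner with vertex weights encoding load estimates, so that replication (the separator items) and load imbalance are both kept small.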
For the all-pairs similarity problem, we extend recent efficient sequential algorithms
to a parallel setting, and obtain document-wise and term-wise parallelizations
of a fast sequential algorithm, as well as an elegant combination of two
algorithms that yield a 2-D distribution of the data. Two effective algorithmic
optimizations for the term-wise case are reported that make the term-wise parallelization
feasible. These optimizations exploit local pruning and block processing
of a number of vectors, in order to decrease communication costs, the number of
candidates, and communication/computation imbalance. The correctness of local
pruning is proven. Also, a recursive term-wise parallelization is introduced. The
performance of the algorithms is shown to be favorable in extensive experiments,
as is the utility of the two major optimizations.
Özkural, Eray, Ph.D.
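For context on the all-pairs similarity problem the thesis parallelizes, here is a naive threshold-based formulation over sparse term-weight vectors; this baseline (without the thesis's pruning and 2-D distribution) is a sketch, and the names are assumptions.

```python
import math
from itertools import combinations

def all_pairs_above(vectors, threshold):
    """Return every pair of documents whose cosine similarity meets the
    threshold. Vectors are sparse dicts mapping term -> weight."""
    def cos(u, v):
        dot = sum(u[k] * v[k] for k in set(u) & set(v))
        norm = (math.sqrt(sum(x * x for x in u.values())) *
                math.sqrt(sum(x * x for x in v.values())))
        return dot / norm
    out = {}
    for i, j in combinations(range(len(vectors)), 2):
        s = cos(vectors[i], vectors[j])
        if s >= threshold:
            out[(i, j)] = s
    return out

docs = [{"data": 1, "mining": 1},
        {"data": 1, "mining": 1, "parallel": 1},
        {"graph": 1}]
print(all_pairs_above(docs, 0.5))  # only documents 0 and 1 are similar
```

Document-wise parallelization splits the outer loop over document pairs across processors, while term-wise parallelization splits the inverted index by term; the thesis's local-pruning and block-processing optimizations cut the candidates and communication this naive loop would generate.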
Observations on Factors Affecting Performance of MapReduce based Apriori on Hadoop Cluster
Designing fast and scalable algorithms for mining frequent itemsets has always
been a prominent and promising problem in data mining. Apriori is one of
the most broadly used and popular frequent itemset mining algorithms.
Designing efficient algorithms on the MapReduce framework to process and analyze
big datasets is an active area of research. In this paper, we have focused
on the performance of MapReduce based Apriori on homogeneous as well as on
heterogeneous Hadoop cluster. We have investigated a number of factors that
significantly affect the execution time of MapReduce based Apriori running on
homogeneous and heterogeneous Hadoop Cluster. Factors are specific to both
algorithmic and non-algorithmic improvements. Considered factors specific to
algorithmic improvements are filtered transactions and data structures.
Experimental results show how an appropriate data structure and the
filtered-transactions technique drastically reduce the execution time. The
non-algorithmic factors include speculative execution, nodes with poor
performance, data locality & distribution of data blocks, and parallelism
control with input split size. We have applied strategies against these factors
and fine tuned the relevant parameters in our particular application.
Experimental results show that if cluster specific parameters are taken care of
then there is a significant reduction in execution time. Also we have discussed
the issues regarding MapReduce implementation of Apriori which may
significantly influence the performance.
Comment: 8 pages, 8 figures, International Conference on Computing, Communication and Automation (ICCCA2016)
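The map/reduce structure of the counting pass and the filtered-transactions optimization can be sketched in plain Python (no Hadoop dependency); this is an illustrative model of the technique, and all function names are assumptions, not the paper's code.

```python
from collections import Counter

def map_count_items(transactions):
    """Map phase: emit an (item, 1) pair for every item occurrence."""
    for t in transactions:
        for item in t:
            yield item, 1

def reduce_counts(pairs, min_count):
    """Reduce phase: sum counts per item and keep only frequent items."""
    totals = Counter()
    for item, n in pairs:
        totals[item] += n
    return {item: n for item, n in totals.items() if n >= min_count}

def filter_transactions(transactions, frequent_items):
    """Filtered-transactions optimization: drop infrequent items (and any
    transactions left empty) before shipping data to the next pass."""
    return [ft for t in transactions
            if (ft := {i for i in t if i in frequent_items})]

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
frequent = reduce_counts(map_count_items(baskets), min_count=3)
filtered = filter_transactions(baskets, frequent)
print(frequent, filtered)
```

Shrinking the transactions between passes reduces the volume of intermediate data the cluster must shuffle, which is one reason the paper finds the technique cuts execution time so sharply.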
An efficient parallel method for mining frequent closed sequential patterns
Mining frequent closed sequential patterns (FCSPs) has attracted a great deal of research attention because it is an important task in sequence mining. Recently, many studies have focused on mining frequent closed sequential patterns because such patterns have proved to be more efficient and compact than frequent sequential patterns, and information can be fully extracted from them. In this paper, we propose an efficient parallel approach called parallel dynamic bit vector frequent closed sequential patterns (pDBV-FCSP), which uses a multi-core processor architecture to mine FCSPs from large databases. The pDBV-FCSP divides the search space to reduce the required storage space and performs closure checking of prefix sequences early to reduce the execution time of mining. This approach overcomes the problems of parallel mining such as communication overhead, synchronization, and data replication. It also addresses the load balancing of the workload between processors with a dynamic mechanism that re-distributes work when some processes run out of work, minimizing idle CPU time.
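The dynamic work re-distribution idea can be illustrated with a shared work queue from which idle workers pull the next search-space partition; this is a generic sketch of that load-balancing mechanism, not the pDBV-FCSP algorithm itself, and the names are assumptions.

```python
import queue
import threading

def mine_in_parallel(tasks, process, n_workers=4):
    """Dynamic load balancing: workers pull search-space partitions from a
    shared queue, so a worker that finishes early immediately picks up
    remaining work instead of sitting idle."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return                      # no work left: worker exits
            r = process(task)               # e.g. mine one search-space partition
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# toy usage: ten "partitions", three workers
print(sorted(mine_in_parallel(list(range(10)), lambda x: x * x, n_workers=3)))
```

With a multiprocessing pool the same pattern avoids Python's GIL for CPU-bound mining; the queue-based hand-off is what keeps per-processor idle time low when partitions have uneven cost.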