Mining of Frequent Item with BSW Chunking
Finding frequent patterns in transactional databases is considered one of the most important data mining problems, and Apriori is a classic algorithm both for this task and for association rule mining. The algorithm has limitations, however, which motivate this research. The increased availability of multicore processors is forcing us to redesign algorithms and applications to exploit the computational power of multiple cores, because finding frequent itemsets is expensive in terms of computing resources and CPU power. Parallel Apriori algorithms therefore have an advantage, since they parallelize the process of finding frequent itemsets, and parallel frequent itemset mining algorithms also address the problem of distributing the candidates among processors. Efficient algorithms for discovering frequent patterns are important in data mining research. Many algorithms for mining association rules, and variants of them, have been proposed on the basis of Apriori, but the traditional algorithms are not efficient. The objective of the Apriori algorithm is to find associations between different sets of data; this task is sometimes referred to as "market basket analysis". Each set of data contains a number of items and is called a transaction. The output of Apriori is a set of rules that tell us how often items are contained in sets of data. To find more valuable rules, our basic aim is to implement the Apriori algorithm with a multithreading approach that makes use of the system's hardware power. The improved algorithm is reasonable and effective, and can extract more valuable information.
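The level-wise search that Apriori performs can be illustrated with a minimal sequential sketch. It is not the multithreaded implementation the abstract describes; the function name `apriori`, the `min_support` parameter (an absolute transaction count) and the sample baskets are assumptions made for illustration only.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    Hypothetical helper: min_support is an absolute number of transactions.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: one pass over the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Made-up market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
frequent_itemsets = apriori(baskets, min_support=2)
```

The count step is the part a parallel variant would most naturally split across threads or processors, since each candidate's support can be counted independently of the others.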
A New Data Layout For Set Intersection on GPUs
Set intersection is the core in a variety of problems, e.g. frequent itemset
mining and sparse boolean matrix multiplication. It is well-known that large
speed gains can, for some computational problems, be obtained by using a
graphics processing unit (GPU) as a massively parallel computing device.
However, GPUs require highly regular control flow and memory access patterns,
and for this reason previous GPU methods for intersecting sets have used a
simple bitmap representation. This representation requires excessive space on
sparse data sets. In this paper we present a novel data layout, "BatMap", that
is particularly well suited for parallel processing, and is compact even for
sparse data.
Frequent itemset mining is one of the most important applications of set
intersection. As a case-study on the potential of BatMaps we focus on frequent
pair mining, which is a core special case of frequent itemset mining. The main
finding is that our method is able to achieve speedups over both Apriori and
FP-growth when the number of distinct items is large, and the density of the
problem instance is above 1%. Previous implementations of frequent itemset
mining on GPU have not been able to show speedups over the best single-threaded
implementations.
Comment: A version of this paper appears in Proceedings of IPDPS 201
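To make the link between set intersection and frequent pair mining concrete, here is a minimal CPU sketch of the simple bitmap representation that the abstract attributes to earlier GPU methods. It is not the BatMap layout, whose details the abstract does not give. Python integers stand in for bitmaps, and the names `item_bitmaps` and `frequent_pairs` and the sample data are assumptions.

```python
from itertools import combinations


def item_bitmaps(transactions):
    """Map each item to an int whose bit i is set when transaction i contains it."""
    bitmaps = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def frequent_pairs(transactions, min_support):
    """Return {(a, b): support} for item pairs in at least min_support transactions."""
    bitmaps = item_bitmaps(transactions)
    pairs = {}
    for a, b in combinations(sorted(bitmaps), 2):
        # Intersecting two transaction-id sets is a bitwise AND;
        # the support of the pair is the number of set bits.
        support = bin(bitmaps[a] & bitmaps[b]).count("1")
        if support >= min_support:
            pairs[(a, b)] = support
    return pairs


# Made-up transactions for illustration.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
pairs = frequent_pairs(transactions, min_support=2)
```

Each bitmap needs one bit per transaction whether or not the item occurs in it, which is the space overhead on sparse data that the abstract points out.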
Data distribution and performance optimization models for parallel data mining
Ankara : The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2013.
Thesis (Ph. D.) -- Bilkent University, 2013.
Includes bibliographical references leaves 117-128.
We have embarked upon a multitude of approaches to improve the efficiency of
selected fundamental tasks in data mining. The present thesis is concerned with
improving the efficiency of parallel processing methods for large amounts of data.
We have devised new parallel frequent itemset mining algorithms that work on
both sparse and dense datasets, and 1-D and 2-D parallel algorithms for the
all-pairs similarity problem.
Two new parallel frequent itemset mining (FIM) algorithms named NoClique
and NoClique2 parallelize our sequential vertical frequent itemset mining algorithm
named bitdrill, and use a method based on graph partitioning by vertex
separator (GPVS) to distribute and selectively replicate items. The method operates
on a graph where vertices correspond to frequent items and edges correspond
to frequent itemsets of size two. We show that partitioning this graph by a vertex
separator is sufficient to decide a distribution of the items such that the
sub-databases determined by the item distribution can be mined independently.
This distribution entails an amount of data replication, which may be reduced
by setting appropriate weights to vertices. The data distribution scheme is used
in the design of two new parallel frequent itemset mining algorithms. Both algorithms
replicate the items that correspond to the separator. NoClique replicates
the work induced by the separator and NoClique2 computes the same work collectively.
Computational load balancing and minimization of redundant or collective
work may be achieved by assigning appropriate load estimates to vertices. The
performance is compared to another parallelization that replicates all items, and
to the ParDCI algorithm.
We introduce another parallel FIM method using a variation of item distribution
with selective item replication. We extend the GPVS model for parallel
FIM we have proposed earlier, by relaxing the condition of independent mining.
Instead of finding independently mined item sets, we may minimize the amount of
communication and partition the candidates in a fine-grained manner. We introduce
a hypergraph partitioning model of the parallel computation where vertices
correspond to candidates and hyperedges correspond to items. A load estimate is
assigned to each candidate with vertex weights, and item frequencies are given as
hyperedge weights. The model is shown to minimize data replication and balance
load accurately. We also introduce a re-partitioning model since we can generate
only so many levels of candidates at once, using fixed vertices to model previous
item distribution/replication. Experiments show that we improve over the higher
load imbalance of the NoClique2 algorithm for the same problem instances at the
cost of additional parallel overhead.
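The item graph described earlier in this abstract, with frequent items as vertices and frequent itemsets of size two as edges, can be sketched as follows. This is only an illustration: the separator is picked by hand rather than computed by a GPVS partitioner, and the names `item_graph` and `parts_without` and the sample data are assumptions, not part of the thesis.

```python
from itertools import combinations


def item_graph(transactions, min_support):
    """Adjacency sets: vertices are frequent items, edges are frequent 2-itemsets."""
    item_counts, pair_counts = {}, {}
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    graph = {i: set() for i, c in item_counts.items() if c >= min_support}
    for (a, b), c in pair_counts.items():
        if c >= min_support:
            # A frequent pair implies both of its items are frequent,
            # so both endpoints are already vertices of the graph.
            graph[a].add(b)
            graph[b].add(a)
    return graph


def parts_without(graph, separator):
    """Connected components of the graph once the separator vertices are removed."""
    seen, parts = set(separator), []
    for start in sorted(graph):
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            part.add(v)
            stack.extend(graph[v] - seen)
        parts.append(part)
    return parts


# Made-up transactions in which item "s" co-occurs with everything else.
transactions = [
    {"a", "b", "s"}, {"a", "b", "s"},
    {"c", "d", "s"}, {"c", "d", "s"},
    {"a", "s"}, {"d", "s"},
]
graph = item_graph(transactions, min_support=2)
parts = parts_without(graph, separator={"s"})
```

In this toy case, removing the hand-picked separator `s` leaves two parts with no frequent pair between them. Replicating `s` to both parts, as the abstract says both algorithms do with separator items, gives each processor every item it needs for the itemsets that involve its own part.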
For the all-pairs similarity problem, we extend recent efficient sequential algorithms
to a parallel setting, and obtain document-wise and term-wise parallelizations
of a fast sequential algorithm, as well as an elegant combination of two
algorithms that yield a 2-D distribution of the data. Two effective algorithmic
optimizations for the term-wise case are reported that make the term-wise parallelization
feasible. These optimizations exploit local pruning and block processing
of a number of vectors, in order to decrease communication costs, the number of
candidates, and communication/computation imbalance. The correctness of local
pruning is proven. Also, a recursive term-wise parallelization is introduced. The
performance of the algorithms is shown to be favorable in extensive experiments,
as is the utility of the two major optimizations.
Özkural, Eray
Ph.D
Fast and Accurate Mining of Correlated Heavy Hitters
The problem of mining Correlated Heavy Hitters (CHH) from a two-dimensional
data stream has been introduced recently, and a deterministic algorithm based
on the use of the Misra--Gries algorithm has been proposed by Lahiri et al. to
solve it. In this paper we present a new counter-based algorithm for tracking
CHHs, formally prove its error bounds and correctness and show, through
extensive experimental results, that our algorithm outperforms the Misra--Gries
based algorithm with regard to accuracy and speed whilst requiring
asymptotically much less space.
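For orientation, the classic one-dimensional Misra-Gries summary, the building block named in the abstract, can be sketched as below. This is neither the CHH algorithm of Lahiri et al. nor the new counter-based algorithm of the paper; the function name `misra_gries`, the parameter `k` and the sample stream are assumptions.

```python
def misra_gries(stream, k):
    """Misra-Gries summary that keeps at most k - 1 counters.

    For a stream of n items, every item occurring more than n / k times
    is guaranteed to be present, and each stored count underestimates
    the true frequency by at most n / k.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all and drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Made-up stream of 11 items in which "a" occurs 6 times.
stream = list("abacabadaea")
summary = misra_gries(stream, k=3)
```

The stored counts are lower bounds on the true frequencies, so a second pass over the data, or an error-bound argument like the one the abstract mentions, is needed to turn the surviving candidates into confirmed heavy hitters.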