58 research outputs found

    MBA: Market Basket Analysis Using Frequent Pattern Mining Techniques

    Market Basket Analysis (MBA) is a data mining technique that uses frequent pattern mining algorithms to discover patterns of co-occurrence among items that are frequently purchased together. It is commonly used in retail and e-commerce businesses to generate association rules that describe the relationships between different items, and to make recommendations to customers based on their previous purchases. MBA is a powerful tool for identifying patterns of co-occurrence and generating insights that can improve sales and marketing strategies. Although numerous works have been carried out to reduce the computational cost of discovering frequent itemsets, the problem still needs further exploration and development. In this paper, we introduce an efficient bitwise-based data structure technique for mining frequent patterns in large-scale databases. The algorithm scans the original database only once, using bitwise data representations together with a vertical database layout, in contrast to the well-known Apriori and FP-Growth algorithms. The bitwise-based technique avoids multiple passes over the original database and hence minimizes execution time. Extensive experiments have been carried out to validate our technique, which outperforms Apriori, Eclat, FP-Growth, and H-mine in terms of execution time for Market Basket Analysis.
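
    The following minimal Python sketch illustrates only the general idea of vertical, bitwise support counting described above (one bitmap per item, itemset support via bitwise AND); it is not the paper's actual data structure, and the transaction data is invented for illustration.

```python
# Minimal sketch of vertical, bitwise support counting.  Each item gets
# one bitmap (a Python int); bit t is set if the item occurs in
# transaction t.  The support of an itemset is the popcount of the AND
# of its items' bitmaps.  Illustrative data; not the paper's structure.

from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Build the vertical layout in a single pass over the database.
bitmaps = {}
for t, items in enumerate(transactions):
    for item in items:
        bitmaps[item] = bitmaps.get(item, 0) | (1 << t)

def support(itemset):
    """Count transactions containing all items via bitwise AND."""
    bm = (1 << len(transactions)) - 1  # start with all transactions
    for item in itemset:
        bm &= bitmaps[item]
    return bin(bm).count("1")

# Enumerate frequent 1- and 2-itemsets with minimum support 2.
min_support = 2
for k in (1, 2):
    for itemset in combinations(sorted(bitmaps), k):
        if support(itemset) >= min_support:
            print(set(itemset), support(itemset))
```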

    Memory-Efficient Frequent-Itemset Mining

    Efficient discovery of frequent itemsets in large datasets is a key component of many data mining tasks. In-core algorithms---which operate entirely in main memory and avoid expensive disk accesses---and in particular the prefix tree-based algorithm FP-growth are generally among the most efficient of the available algorithms. Unfortunately, their excessive memory requirements render them inapplicable to large datasets with many distinct items and/or itemsets of high cardinality. To overcome this limitation, we propose two novel data structures---the CFP-tree and the CFP-array---which reduce memory consumption by about an order of magnitude. This allows us to process significantly larger datasets in main memory than was previously possible. Our data structures are based on structural modifications of the prefix tree that increase compressibility, an optimized physical representation, lightweight compression techniques, and intelligent node ordering and indexing. Experiments with both real-world and synthetic datasets show the effectiveness of our approach.
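
    For readers unfamiliar with prefix trees, the sketch below shows the standard FP-tree idea that the CFP-tree builds on: transactions sharing a prefix share nodes, which is what makes the structure compressible. The paper's structural modifications, physical layout, and compression techniques are not reproduced here; the data is invented.

```python
# Minimal prefix-tree (FP-tree style) sketch: transactions that share a
# prefix share nodes, which is the property the CFP-tree exploits and
# further compresses.  The paper's optimizations are not shown.

class Node:
    __slots__ = ("item", "count", "children")

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_tree(transactions, order):
    """Insert each transaction along a path; shared prefixes merge."""
    root = Node(None)
    for t in transactions:
        node = root
        # A fixed global item order makes shared prefixes line up.
        for item in sorted(t, key=order.index):
            child = node.children.get(item)
            if child is None:
                child = Node(item)
                node.children[item] = child
            child.count += 1
            node = child
    return root

order = ["a", "b", "c", "d"]
tree = build_tree([{"a", "b"}, {"a", "b", "c"}, {"a", "d"}], order)
print(tree.children["a"].count)  # -> 3: one shared node, not three
```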

    MR-Radix: a multi-relational data mining algorithm

    Abstract
    Background: Since the multi-relational approach has emerged as an alternative for analyzing structured data such as relational databases, allowing data mining to be applied to multiple tables directly and thus avoiding expensive join operations and semantic losses, this work proposes an algorithm with a multi-relational approach.
    Methods: Aiming to compare the performance of the traditional and multi-relational approaches for mining association rules, this paper presents an empirical study between PatriciaMine, a traditional algorithm, and its proposed multi-relational counterpart, MR-Radix.
    Results: This work showed the performance advantages of the multi-relational approach over multiple tables, which avoids the high cost of joining multiple tables and the associated semantic losses. The MR-Radix algorithm is faster than PatriciaMine, despite handling complex multi-relational patterns. Memory usage shows a more conservative growth curve for MR-Radix than for PatriciaMine: an increase in the number of frequent items does not cause a significant growth in the memory used by MR-Radix, as it does in PatriciaMine.
    Conclusion: The comparative study between PatriciaMine and MR-Radix confirmed the efficacy of the multi-relational approach in the data mining process, both in execution time and in memory usage. Moreover, unlike other algorithms of this approach, the proposed multi-relational algorithm is efficient for use in large relational databases. This project was financed by CAPES. We thank David R. M. Mercer for English language review and translation.
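
    As a hedged illustration of the multi-relational idea described above (mining across tables via keys rather than materializing a join), consider the following Python sketch. The tables, pattern, and names are all invented for illustration; this is not the MR-Radix algorithm.

```python
# Hypothetical sketch of the multi-relational idea: count how often a
# pattern spanning two tables holds by following foreign keys, without
# materializing a join.  Tables, pattern, and names are invented; this
# is not the MR-Radix algorithm.

customers = {1: {"city": "Rome"}, 2: {"city": "Turin"}, 3: {"city": "Rome"}}
orders = [  # (customer_id, item) rows of a second table
    (1, "bread"), (1, "milk"), (2, "bread"), (3, "milk"),
]

# Index the secondary table by its foreign key once.
items_by_customer = {}
for cid, item in orders:
    items_by_customer.setdefault(cid, set()).add(item)

# Support of the cross-table pattern {city = "Rome", item = "milk"},
# counted over the target table (customers), with no join performed.
support = sum(
    1
    for cid, row in customers.items()
    if row["city"] == "Rome" and "milk" in items_by_customer.get(cid, set())
)
print(support)  # -> 2
```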

    Frequent itemset mining on multiprocessor systems

    Frequent itemset mining is an important building block in many data mining applications like market basket analysis, recommendation, web mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow up to hundreds of gigabytes or even terabytes of data. Hence, efficient algorithms are required to process such large amounts of data. In recent years, many frequent-itemset mining algorithms have been proposed, which however (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures forces the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradation. Exploiting the available parallelism is further required to mine large datasets because the serial performance of processors has almost stopped increasing. Algorithms should therefore exploit the large number of available threads and also other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism.

    In this work, we tackle the high memory requirements of frequent itemset mining in two ways: we (1) compress the datasets being mined, because they must be kept in main memory during several mining invocations, and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show good compression performance on a wide variety of realistic datasets, reducing the size of the datasets by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. Since encoding and decoding are repeatedly required for loading and mining the datasets, we reduce their costs by providing parallel encodings that achieve high throughputs for both tasks. For a memory-efficient representation of the mining algorithms' intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the size of the intermediate data by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined.

    To cope with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit the available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Even single-threaded, our algorithms are often up to an order of magnitude faster than existing highly optimized algorithms, and they scale almost linearly on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms used for mining other types of itemsets.
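
    As an illustration of one of the building blocks named above, the sketch below intersects two sorted tid-lists with a two-pointer scan, the serial baseline for Eclat-style support counting. The parallel and vectorized variants developed in the thesis are not shown; the data is invented.

```python
# Serial baseline for one building block named above: intersecting two
# sorted tid-lists with a two-pointer scan, as used for support
# counting in Eclat-style algorithms.  The thesis's parallel and
# vectorized variants are not shown.

def intersect_sorted(a, b):
    """Two-pointer intersection of two sorted lists of transaction IDs."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

tids_bread = [0, 1, 2, 5, 7]
tids_milk = [0, 2, 3, 7, 8]
print(intersect_sorted(tids_bread, tids_milk))  # -> [0, 2, 7]
# support({bread, milk}) is the length of the intersection: 3
```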

    Comparison of deposition methods of ZnO thin film on flexible substrate

    This paper reports the effect of different deposition methods on the crystal quality and film thickness of ZnO nanostructures on a polyimide substrate. The ZnO films were deposited using spray pyrolysis, sol-gel, and RF sputtering. Each method yields a different ZnO thin-film nanostructure. The sol-gel method produces a nanoflower ZnO thin film with a thickness of 600 nm. It also gives the best piezoelectric electrical performance, 5.0 V at a frequency of 12 MHz, which is higher than the frequencies obtained by spray pyrolysis and RF sputtering.

    Implementation of the Patricia Tree Data Structure for an Autocomplete Search Box

    ABSTRACT: Autocomplete in a search box deals with very large amounts of data. Searching for a phrase or word in the database is therefore problematic: when every phrase must be traversed to produce results, and a connection between server and client is involved, the server's performance is overloaded. A special data retrieval method is therefore needed so that the process is lightweight and fast. One option is the Patricia tree data structure. The use of the Patricia tree is motivated by the fact that the search is performed on the initial prefix of the desired phrase, so a lookup does not need to traverse the whole tree, only the subtree whose leading characters match. The nodes of the Patricia tree can also be assigned weights, so that suggestions can be prioritized by weight. Testing showed that the Patricia tree responds to searches faster than a prefix tree (trie), the data structure used for comparison, and that building the tree with weights also gives better search accuracy. Keywords: patricia tree, trie, autocomplete, search box.
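
    As a hedged illustration of the data structure discussed above, the following Python sketch implements a small weighted radix (Patricia-style) tree with prefix search for autocomplete suggestions ranked by weight. It is a minimal sketch of the general technique, not the paper's implementation; all names and data are invented.

```python
# Minimal weighted radix/Patricia-style trie for autocomplete:
# single-child chains are stored as whole edge labels, and suggestions
# are ranked by a weight stored at terminal nodes.  Illustrative only;
# details differ from the paper's implementation.

import os

class RadixNode:
    def __init__(self):
        self.children = {}   # edge label -> RadixNode
        self.weight = None   # not None marks the end of a phrase

def insert(node, word, weight):
    for label, child in list(node.children.items()):
        p = os.path.commonprefix([label, word])
        if not p:
            continue
        if p == label:                      # descend along the edge
            return insert(child, word[len(p):], weight)
        mid = RadixNode()                   # split the edge at the prefix
        mid.children[label[len(p):]] = child
        del node.children[label]
        node.children[p] = mid
        return insert(mid, word[len(p):], weight)
    leaf = RadixNode()
    leaf.weight = weight
    node.children[word] = leaf

def collect(node, prefix, out):
    """Gather all (weight, phrase) pairs below a node."""
    if node.weight is not None:
        out.append((node.weight, prefix))
    for label, child in node.children.items():
        collect(child, prefix + label, out)

def suggest(root, query):
    """Return phrases starting with query, heaviest first."""
    node, acc = root, ""
    while query:
        for label, child in node.children.items():
            p = os.path.commonprefix([label, query])
            if p == query:                  # query ends inside this edge
                out = []
                collect(child, acc + label, out)
                return [w for _, w in sorted(out, reverse=True)]
            if p == label:                  # consume the edge, keep going
                node, acc, query = child, acc + label, query[len(p):]
                break
        else:
            return []
    out = []
    collect(node, acc, out)
    return [w for _, w in sorted(out, reverse=True)]

root = RadixNode()
for word, w in [("patricia", 3), ("patent", 5), ("path", 1)]:
    insert(root, word, w)
print(suggest(root, "pat"))  # -> ['patent', 'patricia', 'path']
```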

    Frequent Itemset Mining for Big Data

    Traditional data mining tools, developed to extract actionable knowledge from data, have proven inadequate to process the huge amounts of data produced nowadays. Even the most popular algorithms for Frequent Itemset Mining, an exploratory data analysis technique used to discover frequent item co-occurrences in a transactional dataset, are inefficient with larger and more complex data. As a consequence, many parallel algorithms have been developed, based on modern frameworks able to leverage distributed computation in commodity clusters of machines (e.g., Apache Hadoop, Apache Spark). However, frequent itemset mining parallelization is far from trivial: the search-space exploration, on which all the techniques are based, is not easily partitionable. Hence, distributed frequent itemset mining is a challenging problem and an interesting research topic. In this context, our main contributions consist of (i) an exhaustive theoretical and experimental analysis of the best-in-class approaches, whose outcomes and open issues motivated (ii) the development of a distributed high-dimensional frequent itemset miner, and (iii) a data mining framework which relies heavily on distributed frequent itemset mining for the extraction of a specific type of itemset.

    The theoretical analysis highlights the challenges related to the distribution and the preliminary partitioning of the frequent itemset mining problem (i.e., the search-space exploration), describing the most widely adopted distribution strategies. The extensive experimental campaign, instead, compares the expectations raised by the algorithmic choices against the actual performance of the algorithms. We ran more than 300 experiments in order to evaluate and discuss the performance of the algorithms with respect to different real-life use cases and data distributions. The outcome of the review is that no algorithm is universally superior and that performance is heavily affected by the data distribution. Moreover, we identified a concrete gap regarding frequent pattern extraction in high-dimensional use cases.

    For this reason, we have developed our own distributed high-dimensional frequent itemset miner based on Apache Hadoop. The algorithm splits the search-space exploration into independent sub-tasks. However, since the exploration strongly benefits from full knowledge of the problem, we introduced an interleaved synchronization phase. The result is a trade-off between the benefits of a centralized state and those of the additional computational power afforded by parallelism. The experimental benchmarks, performed on real-life high-dimensional use cases, show the efficiency of the proposed approach in terms of execution time, load balancing, and robustness to memory issues. Finally, the dissertation introduces a data mining framework in which distributed itemset mining is a fundamental component of the processing pipeline. The aim of the framework is the extraction of a new type of itemset, called misleading generalized itemsets.
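
    As a hedged illustration of distributed frequent itemset mining in practice (not the dissertation's Hadoop-based algorithm), the sketch below invokes the off-the-shelf FP-Growth implementation shipped with Apache Spark's MLlib; the dataset and thresholds are invented.

```python
# Invoking the off-the-shelf distributed FP-Growth in Apache Spark's
# MLlib.  This is NOT the dissertation's Hadoop-based algorithm; it
# only shows what a distributed FIM invocation looks like in practice.

from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fim-example").getOrCreate()

# Toy transactional dataset; minSupport/minConfidence are illustrative.
df = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b"]), (2, ["a", "c"]), (3, ["b"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)

model.freqItemsets.show()       # frequent itemsets with their counts
model.associationRules.show()   # rules derived from the itemsets
spark.stop()
```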