
    Efficient learning of large sets of locally optimal classification rules

    Conventional rule learning algorithms aim at finding a set of simple rules, where each rule covers as many examples as possible. In this paper, we argue that the rules found in this way may not be the optimal explanations for each of the examples they cover. Instead, we propose an efficient algorithm that aims at finding the best rule covering each training example, using a greedy optimization consisting of one specialization and one generalization loop. These locally optimal rules are collected and then filtered into a final rule set, which is much larger than the sets learned by conventional rule learning algorithms. A new example is classified by selecting the best among the rules that cover it. In our experiments on small to very large datasets, the approach's average classification accuracy is higher than that of state-of-the-art rule learning algorithms. Moreover, the algorithm is highly efficient and can inherently be run in parallel without affecting the learned rule set, and hence the classification accuracy. We thus believe that it closes an important gap for large-scale classification rule induction.
    Comment: article, 40 pages, Machine Learning journal (2023)
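
    To make the prediction step concrete, the following is a minimal sketch, assuming rules are conjunctions of attribute-value tests and that rule quality is estimated by something like training-set precision; the quality measure and all names here are illustrative assumptions, not the paper's exact method:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    conditions: dict      # attribute -> required value (a conjunction of tests)
    prediction: str       # class label the rule predicts
    quality: float = 0.0  # e.g. precision on the training data (assumed heuristic)

    def covers(self, example: dict) -> bool:
        """True if every condition of the rule holds for the example."""
        return all(example.get(a) == v for a, v in self.conditions.items())

def classify(example: dict, rules: list, default: str) -> str:
    """Classify by selecting the best rule among those that cover the example."""
    covering = [r for r in rules if r.covers(example)]
    if not covering:
        return default
    return max(covering, key=lambda r: r.quality).prediction

# Hypothetical toy rule set and query.
rules = [
    Rule({"outlook": "sunny", "humidity": "high"}, "no", quality=0.9),
    Rule({"outlook": "sunny"}, "yes", quality=0.6),
]
print(classify({"outlook": "sunny", "humidity": "high"}, rules, default="yes"))  # -> no
```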

    Frequent itemsets mining for big data

    Frequent Itemsets Mining (FIM) is a fundamental mining model and plays an important role in Data Mining. It has a vast range of application fields and can be employed as a key computation phase in many other mining models such as Association Rules, Correlations, Classifications, etc. Generally speaking, FIM counts the frequencies of co-occurring items, called itemsets, in records of distinct items from a transaction-oriented dataset, and discovers all frequent itemsets whose frequencies are not less than a given threshold, called the support threshold. Many serial algorithms with different approaches have addressed execution performance and memory consumption. However, in the current era of Big Data, serial algorithms are inadequate for the running-time and memory-scalability problems posed by very large-scale datasets. Big Data thus creates challenges for FIM, but also strong motivation for this work. To confront these challenges, we propose solutions following three methodological approaches: incremental mining, shared-memory parallelism, and distributed parallelism.

    First, we propose the IPPC tree and the IFIN algorithm, which provide incremental construction of the PPC tree and incremental mining for FIN, one of the state-of-the-art algorithms. IPPC tree construction is independent of the support threshold and of the order of items; therefore, a previously constructed tree can be extended with an additional dataset without rebuilding it from the old data. IFIN also performs incremental mining, in which some portions of the mining task are skipped when mining with different support threshold values. In the scenario of incremental data accumulation, and especially when mining an unchanged dataset at different threshold values, our experiments showed that IFIN was the most efficient in running time and memory consumption compared with the well-known algorithm FP-Growth and two state-of-the-art ones, FIN and PrePost+.

    Secondly, we propose a shared-memory parallel version of our incremental algorithm IFIN, named IFIN+. In this solution, most portions of the serial version were redesigned to increase efficiency and computational independence, so that they parallelize conveniently under the Work-Pool load-balancing model. As a result, IFIN+'s computational throughput and efficiency increase significantly compared with those of its serial version.

    Thirdly, we observed that for datasets comprising a large number of distinct items but only a small percentage of frequent items, the IPPC tree loses its running-time and memory advantages in tree construction compared with FIN and PrePost+. Therefore, an improved version of the IPPC tree, called IPPC+, was proposed to increase tree-construction performance. The main idea is to keep child nodes in a fixed order, for example item-name order, so that binary search can be applied to accelerate finding the child nodes to be merged with the items of a transaction.
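
    The IPPC+ idea of ordered children with binary-search lookup can be sketched as follows; the class and function names are illustrative only and do not reflect the dissertation's actual implementation:

```python
from bisect import bisect_left

class Node:
    """Prefix-tree node whose children are kept sorted by item name,
    so the child matching the next item of a transaction is found by binary search."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.child_items = []  # item names of children, kept sorted
        self.children = []     # child nodes, aligned with child_items

    def get_or_add_child(self, item):
        i = bisect_left(self.child_items, item)
        if i < len(self.child_items) and self.child_items[i] == item:
            return self.children[i]       # existing child: merge into it
        child = Node(item)
        self.child_items.insert(i, item)  # insertion keeps the sorted order
        self.children.insert(i, child)
        return child

def insert_transaction(root, transaction):
    """Insert one transaction (an iterable of items), counting along its path."""
    node = root
    for item in transaction:
        node = node.get_or_add_child(item)
        node.count += 1

# Toy usage: build a small tree from two transactions.
root = Node()
insert_transaction(root, ["a", "b", "d"])
insert_transaction(root, ["a", "c"])
```
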
    Fourthly, we apply our second and third solutions to the state-of-the-art algorithm PrePost+ so that it runs as the locally powerful algorithm inside our distributed parallel algorithm, named DP3 (Distributed PrePostPlus), which operates in a Master-Slaves model. The Slaves mine local frequent itemsets and send them, together with their support counts, to the Master for aggregation. When tremendous numbers of itemsets are transferred between the Slaves and the Master, the computational load at the Master would be extremely heavy without the support of our complete FPO tree (Frequent Patterns Organization), which provides optimal compactness for light data transfers and highly efficient aggregation with pruning ability. The processing phases of the Slaves and the Master are designed for memory scalability and shared-memory parallelism in the Work-Pool model, so as to utilize the computational power of multi-core CPUs. Load balance is also considered thoroughly in its different aspects for the best performance. We conducted experiments on both synthetic and real datasets, and the empirical results show that our algorithm far outperforms the well-known distributed algorithm PFP as well as three other recent high-performance algorithms: Dist-Eclat, BigFIM, and MapFIM.

    Lastly, with the same purpose as our FPO tree, we propose a bijective mapping (bijection) that maps numeric sets of fixed or variable size to numbers (mapping numbers) and can convert the numbers back to the corresponding numeric sets. The mapping is order-preserving, optimal in its use of the numeric space, and can be computed with very high efficiency. Several application cases are introduced to show empirically the method's advantage: a significant reduction of occupied memory and computation overhead compared with other representations of the data. Moreover, the mapping is inherently a (minimal) perfect hash function, so it inherits the applications of hash functions and can be employed as a potential tool in Big Data applications to address problems of memory consumption and performance.

    Author: Van Quoc Phuong Huynh. Universität Linz, Dissertation, 2019. OeBB (VLID: 438010)
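
    For the fixed-size case, one natural way to realize such an order-preserving, space-optimal bijection between numeric sets and numbers is the combinatorial number system; the sketch below illustrates that general idea only and is not the dissertation's actual construction (function names are hypothetical):

```python
from math import comb

def set_to_number(itemset):
    """Map a set of distinct non-negative integers to a unique number.

    Combinatorial number system: for the sorted set c_1 < c_2 < ... < c_k,
    the rank is sum(C(c_i, i)). The mapping is a bijection, wastes no part
    of the numeric range, and preserves colexicographic order of the sets.
    """
    items = sorted(itemset)
    return sum(comb(c, i + 1) for i, c in enumerate(items))

def number_to_set(number, k):
    """Invert set_to_number for a set of known size k."""
    items = []
    for i in range(k, 0, -1):
        c = i - 1
        while comb(c + 1, i) <= number:  # largest c with C(c, i) <= number
            c += 1
        number -= comb(c, i)
        items.append(c)
    return sorted(items)

# Round trip: the 3-itemset {2, 5, 9} maps to 96 and back.
n = set_to_number({2, 5, 9})
assert n == 96 and number_to_set(n, 3) == [2, 5, 9]
```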