75,619 research outputs found

    The Rough Set-Based Algorithm for Two Steps

    Previous research on mining association rules pays no attention to finding rules from imprecise data, and traditional data mining cannot solve the multi-policy-making problem. Furthermore, in this research, we incorporate association rules with rough sets and promote a new point of view in applications. The new approach, combined with rough set theory, can find association rules while handling uncertainty. First, we provide new algorithms modified from the Apriori algorithm and then give an illustrative example. Finally, we give some suggestions based on knowledge management as a reference for future research.

    Non-redundant sequential association rule mining based on closed sequential patterns

    In many applications, e.g., bioinformatics, web access traces, and system utilisation logs, the data is naturally in the form of sequences. People have taken great interest in analysing sequential data and finding the inherent characteristics or relationships within the data. Sequential association rule mining is one of the possible methods used to analyse this data. As conventional sequential association rule mining very often generates a huge number of association rules, many of which are redundant, it is desirable to find a way to eliminate those unnecessary rules. Because of the complexity and temporally ordered nature of sequential data, current research on sequential association rule mining is limited. Although several sequential association rule prediction models using either sequence constraints or temporal constraints have been proposed, none of them considered the redundancy problem in rule mining. The main contribution of this research is a non-redundant association rule mining method based on closed frequent sequences and minimal sequential generators. We also give a definition of non-redundant sequential rules, which are sequential rules with minimal antecedents but maximal consequents. A new algorithm called CSGM (closed sequential and generator mining) for generating closed sequences and minimal sequential generators is also introduced. A further experiment compares the performance of generating non-redundant sequential rules with that of generating full sequential rules; the performance of CSGM is also evaluated against other closed sequential pattern mining and generator mining algorithms. We also use the generated non-redundant sequential rules for query expansion in order to improve recommendations for infrequently purchased products.

    Computing iceberg concept lattices with Titanic

    We introduce the notion of iceberg concept lattices and show their use in knowledge discovery in databases. Iceberg lattices are a conceptual clustering method, which is well suited for analyzing very large databases. They also serve as a condensed representation of frequent itemsets, as a starting point for computing bases of association rules, and as a visualization method for association rules. Iceberg concept lattices are based on the theory of Formal Concept Analysis, a mathematical theory with applications in data analysis, information retrieval, and knowledge discovery. We present a new algorithm called TITANIC for computing (iceberg) concept lattices. It is based on data mining techniques with a level-wise approach. In fact, TITANIC can be used for a more general problem: computing arbitrary closure systems when the closure operator comes along with a so-called weight function. The use of weight functions for computing closure systems has not been discussed in the literature up to now. Applications providing such a weight function include association rule mining, functional dependencies in databases, conceptual clustering, and ontology engineering. The algorithm is experimentally evaluated and compared with Ganter's Next-Closure algorithm. The evaluation shows an important gain in efficiency, especially for weakly correlated data.
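
    The link between a closure operator and a weight function mentioned in the abstract above can be illustrated with a small sketch. It is not the TITANIC algorithm itself: it only shows, assuming support count is used as the weight, that an item belongs to the closure of an itemset exactly when adding it leaves the support unchanged. The names `support_count` and `closure` and the toy context are hypothetical.

```python
from typing import FrozenSet, Iterable, List


def support_count(itemset: FrozenSet[str], context: List[FrozenSet[str]]) -> int:
    """Number of objects (rows) whose attribute set contains `itemset`."""
    return sum(1 for row in context if itemset <= row)


def closure(itemset: FrozenSet[str],
            context: List[FrozenSet[str]],
            all_items: Iterable[str]) -> FrozenSet[str]:
    """Closure computed from support counts only (the 'weight').

    An item is in the closure if adding it to `itemset` does not
    reduce the support count.
    """
    base = support_count(itemset, context)
    return frozenset(
        item for item in all_items
        if support_count(itemset | {item}, context) == base
    )


# Toy formal context: each row is the attribute set of one object.
context = [frozenset("abc"), frozenset("ab"), frozenset("b")]
items = "abc"

closed_a = closure(frozenset("a"), context, items)
closed_empty = closure(frozenset(), context, items)
```

    In this toy context every object that has `a` also has `b`, so the closure of `{a}` is `{a, b}`; an iceberg lattice would keep only those closed sets whose support meets a minimum threshold.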

    Techniques for improving clustering and association rules mining from very large transactional databases

    Clustering and association rules mining are two core data mining tasks that have been actively studied by the data mining community for nearly two decades. Though many clustering and association rules mining algorithms have been developed, no algorithm is better than the others in all aspects, such as accuracy, efficiency, scalability, adaptability and memory usage. While more efficient and effective algorithms need to be developed for handling large-scale and complex stored datasets, emerging applications where data takes the form of streams pose new challenges for the data mining community. The existing techniques and algorithms for static stored databases cannot be applied to data streams directly. They need to be extended or modified, or new methods need to be developed to process data streams. In this thesis, algorithms have been developed for improving the efficiency and accuracy of clustering and association rules mining on very large, high dimensional, high cardinality, sparse transactional databases and data streams. A new similarity measure suitable for clustering transactional data is defined, and an incremental clustering algorithm, INCLUS, is proposed using this similarity measure. The algorithm scans the database only once and produces clusters based on the user’s expectations of similarities between transactions in a cluster, which are controlled by two user input parameters, a similarity threshold and a support threshold. Intensive testing has been performed to evaluate the effectiveness, efficiency, scalability and order insensitivity of the algorithm. To extend INCLUS to transactional data streams, an equal-width time window model and an elastic time window model are proposed that allow mining of clustering changes in evolving data streams. The minimal width of the window is determined by the minimum clustering granularity for a particular application.
    Two algorithms, CluStream_EQ and CluStream_EL, based on the equal-width window model and the elastic window model respectively, are developed by incorporating these models into INCLUS. Each algorithm consists of an online micro-clustering component and an offline macro-clustering component. The online component writes summary statistics of a data stream to disk, and the offline component uses those summaries and other user input to discover changes in a data stream. The effectiveness and scalability of the algorithms are evaluated by experiments. This thesis also looks into sampling techniques that can improve the efficiency of mining association rules in a very large transactional database. The sample size is derived based on the binomial distribution and the central limit theorem. The sample size used is smaller than that based on Chernoff bounds, but still provides the same approximation guarantees. The accuracy of the proposed sampling approach is theoretically analyzed and its effectiveness is experimentally evaluated on both dense and sparse datasets. Applications of stratified sampling for association rules mining are also explored in this thesis. The database is first partitioned into strata based on the length of transactions, and simple random sampling is then performed on each stratum. The total sample size is determined by a formula derived in this thesis, and the sample size for each stratum is proportional to the size of the stratum. The accuracy of transaction-size-based stratified sampling is experimentally compared with that of random sampling. The thesis concludes with a summary of significant contributions and some pointers for further work.
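
    The stratified sampling step described in the abstract above (partition by transaction length, then simple random sampling with proportional allocation) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the thesis's formula for the total sample size, so `total_sample_size` is simply passed in, and the function name, the `seed` parameter and the rounding rule are hypothetical choices.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Transaction = Tuple[str, ...]


def stratified_sample(transactions: Sequence[Transaction],
                      total_sample_size: int,
                      seed: int = 0) -> List[Transaction]:
    """Sample transactions stratum by stratum, where a stratum is the
    set of transactions of one length.

    Each stratum receives a share of `total_sample_size` proportional
    to its size. Shares are rounded, so the final sample size may
    differ slightly from `total_sample_size`.
    """
    rng = random.Random(seed)

    # Partition the database into strata keyed by transaction length.
    strata: Dict[int, List[Transaction]] = defaultdict(list)
    for t in transactions:
        strata[len(t)].append(t)

    n = len(transactions)
    sample: List[Transaction] = []
    for length in sorted(strata):
        members = strata[length]
        share = round(total_sample_size * len(members) / n)
        k = min(len(members), share)
        # Simple random sampling without replacement inside the stratum.
        sample.extend(rng.sample(members, k))
    return sample


# Toy database: six transactions of length 1, four of length 2.
db = [("a",), ("b",), ("c",), ("d",), ("e",), ("f",),
      ("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
picked = stratified_sample(db, total_sample_size=5)
```

    With this toy database the length-1 stratum holds 60% of the transactions and so contributes three of the five sampled transactions, and the length-2 stratum contributes two.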

    MRQAR: A generic MapReduce framework to discover quantitative association rules in big data problems

    Many algorithms have emerged in recent years to address the discovery of quantitative association rules from datasets. However, this task is becoming a challenge because the processing power of most existing techniques is not enough to handle the large amount of data generated nowadays. These vast amounts of data are known as Big Data. A number of previous studies have focused on mining boolean or nominal association rules from Big Data problems. Nevertheless, the data in real-world applications usually consist of quantitative values, and designing data mining algorithms able to extract quantitative association rules presents a challenge to workers in this research field. Although classical methods to discover boolean or nominal association rules can be found in the most well-known repositories of Big Data algorithms, such repositories do not provide methods to discover quantitative association rules. Indeed, no methodologies have been proposed in the literature without prior discretization in Big Data. Hence, this work proposes MRQAR, a new generic parallel framework to discover quantitative association rules in large amounts of data, designed following the MapReduce paradigm using Apache Spark. MRQAR performs incremental learning and is able to run any sequential quantitative association rule algorithm in Big Data problems without needing to redesign such algorithms. As a case study, we have integrated the multiobjective evolutionary algorithm MOPNAR into MRQAR to validate the generic MapReduce framework proposed in this work. The results obtained in the experimental study performed on five Big Data problems prove the capability of MRQAR to obtain a reduced set of high-quality rules in reasonable time. Ministerio de Economía y Competitividad TIN2017-89517-P; Ministerio de Economía y Competitividad TIN2014-55894-C2-1-R; Ministerio de Economía y Competitividad TIN2017-88209-C2-2-

    Mining Association Rules Events over Data Streams

    Data streams have gained considerable attention in the data analysis and data mining communities because of the emergence of new classes of applications, such as monitoring, supply chain execution, sensor networks, oilfield and pipeline operations, financial marketing and health data industries. Telecommunication advancements have provided us with easy access to stream data produced by various applications. Data in streams differ from static data stored in data warehouses or databases. Data streams are continuous, arrive at high speeds and change through time. Traditional data mining algorithms assume that data resides in conventional storage, where mining is performed centrally with the luxury of accessing the data multiple times, using powerful processors, and providing offline output with no time constraints. Such algorithms are not suitable for dynamic data streams. Stream data needs to be mined promptly, as it might not be feasible to store such a volume of data. In addition, streams reflect the live status of the environment generating them, so prompt analysis may provide early detection of faults, delays, performance measurements, trend analysis and other diagnostics. This thesis focuses on developing a data stream association rule mining algorithm among co-occurring events. The proposed algorithm mines association rules over data streams incrementally in a centralized setting. We are interested in association rules that meet a provided minimum confidence threshold and have a lift value greater than 1. We refer to such association rules as strong rules. Experiments on several datasets demonstrate that the proposed algorithm is efficient and effective in extracting association rules from data streams, with faster processing time and better memory management.
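
    The "strong rule" criterion stated in the abstract above (confidence at or above a minimum threshold and lift greater than 1) can be sketched over a static batch of transactions. This is an illustrative check using the standard definitions of confidence and lift, not the thesis's incremental stream algorithm; the function names and the event labels are hypothetical.

```python
from typing import FrozenSet, List


def count(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> int:
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def is_strong_rule(antecedent: FrozenSet[str],
                   consequent: FrozenSet[str],
                   transactions: List[FrozenSet[str]],
                   min_confidence: float) -> bool:
    """True if antecedent -> consequent has confidence >= min_confidence
    and lift > 1, using the standard definitions:

        confidence = supp(A and C) / supp(A)
        lift       = confidence / supp(C)
    """
    n = len(transactions)
    n_a = count(antecedent, transactions)
    n_c = count(consequent, transactions)
    if n == 0 or n_a == 0 or n_c == 0:
        return False
    n_ac = count(antecedent | consequent, transactions)
    confidence = n_ac / n_a
    lift = confidence / (n_c / n)
    return confidence >= min_confidence and lift > 1.0


# Toy batch of co-occurring events; "heartbeat" appears everywhere.
events = [
    frozenset({"pump_fault", "pressure_drop", "heartbeat"}),
    frozenset({"pump_fault", "pressure_drop", "heartbeat"}),
    frozenset({"pump_fault", "heartbeat"}),
    frozenset({"valve_open", "heartbeat"}),
]
fault = frozenset({"pump_fault"})
drop = frozenset({"pressure_drop"})
beat = frozenset({"heartbeat"})
```

    In this toy batch `pump_fault -> pressure_drop` has confidence 2/3 and lift 4/3, so it is strong for a threshold of 0.6. `pump_fault -> heartbeat` has confidence 1 but lift exactly 1, so the lift condition rejects it: the consequent occurs everywhere regardless of the antecedent.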

    Postdiffset Algorithm in Rare Pattern: An Implementation via Benchmark Case Study

    Frequent and infrequent itemset mining are trending data mining techniques. The Association Rule (AR) patterns generated help decision makers or business policy makers project the next intended items across a wide variety of applications. While frequent itemsets deal with items that are most purchased or used, infrequent items are those that occur infrequently, also called rare items. AR mining remains one of the most prominent areas in data mining; it aims to extract interesting correlations, patterns, associations or causal structures among sets of items in transaction databases or other data repositories. The database structures used in association rule mining algorithms are based on horizontal or vertical data formats. These two data formats have been widely discussed, with a few example algorithms for each. Approaches based on the horizontal format suffer from huge candidate generation and multiple database scans, which result in higher memory consumption. To overcome the issue, solutions based on the vertical format have been proposed. One of the established algorithms in the vertical data format is Eclat (Equivalence Class Transformation). Because of its "fast intersection", in this paper we analyze the fundamental Eclat and Eclat variants such as diffset and sortdiffset. In response to the vertical data format and as a continuation of the Eclat extensions, we propose the Postdiffset algorithm as a new member of the Eclat variants that uses the tidset format in the first looping and diffset in the later looping. In this paper, we present the performance of the Postdiffset algorithm prior to its implementation in mining infrequent or rare itemsets. Postdiffset outperforms diffset and sortdiffset by 23% and 84% on the mushroom dataset and by 94% and 99% on the retail dataset.
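
    The tidset and diffset formats named in the abstract above can be contrasted with a short sketch. It is not the Postdiffset algorithm, whose details the abstract does not give; it only shows, for a pair of items, that the support obtained by intersecting tidsets equals the support obtained by subtracting the diffset size. All function names and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def to_vertical(transactions: List[Set[str]]) -> Dict[str, Set[int]]:
    """Convert a horizontal database to vertical format:
    item -> set of transaction ids (tidset) containing it."""
    tidsets: Dict[str, Set[int]] = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def support_by_tidset(tids_x: Set[int], tids_y: Set[int]) -> int:
    """Support of {x, y} as the size of the tidset intersection."""
    return len(tids_x & tids_y)


def support_by_diffset(tids_x: Set[int], tids_y: Set[int]) -> int:
    """Support of {x, y} via the diffset d(xy) = t(x) - t(y):
    supp(xy) = supp(x) - |d(xy)|."""
    diffset = tids_x - tids_y
    return len(tids_x) - len(diffset)


db = [{"a", "b"}, {"a"}, {"a", "b", "c"}, {"b", "c"}]
vertical = to_vertical(db)
via_tidset = support_by_tidset(vertical["a"], vertical["b"])
via_diffset = support_by_diffset(vertical["a"], vertical["b"])
```

    Both routes give the same support. The diffset stores only the transaction ids that are lost when an item is added, which tends to be smaller than the tidset on dense data and larger on sparse data.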

    Discovery and Effective Use of Frequent Item-set Mining and Association Rules in Datasets

    The unprecedented rise in digitized data generation has led to an ever-expanding demand for sophisticated storage and analysis methods capable of handling vast amounts of complex data, much of which is stored within many databases. Owing to the large size of such databases, sophisticated analysis methods, such as data mining and machine learning, become necessary to extract useful insights regarding a given system under study. Frequent itemset mining and association rules mining represent two key approaches to mining knowledge stored in databases. However, handling large databases often leads to time-consuming calculations that require large amounts of memory. In this regard, the development of methods that enable faster, less laborious search or pattern discovery remains a central focus in the field of data mining. Such methods could aid in faster processing and knowledge extraction, enabling new breakthroughs in how knowledge is acquired from data and applied in real-world applications. However, real-world applications are often hindered by limitations inherent to currently available algorithms. For instance, many itemset mining algorithms are known to first store a given database as a tree structure in memory. However, such algorithms fail to provide a tight upper bound on the number of nodes that will be generated during the tree building process; accordingly, there are no upper bounds governing the amount of memory that is needed to generate such trees. As such, practical implementation of frequent itemset mining algorithms is often restricted by memory consumption. Despite the importance of memory consumption in the applicability of itemset mining, this factor has not drawn adequate attention from the data mining community and remains a key challenge in its application.
    In addition, the majority of algorithms widely used and studied to date are known to require multiple database scans, a factor which restricts their applicability for incremental mining applications. In this regard, the development of an algorithm capable of dynamically mining frequent patterns on the fly would open new pathways in data mining, enabling the application of itemset mining methods to new real-world applications, in addition to vastly improving current applications. In this thesis, different approaches are proposed in relation to the above-mentioned limitations currently hampering further progress in this significant area of data mining. First, an upper bound on the number of nodes of well-known tree structures in frequent itemset mining is presented. Second, aiming to overcome the memory consumption constraint, a memory-efficient method to store data processed by the frequent itemset mining algorithm is proposed, where instead of a tree, data is stored in a compact directed graph whose nodes represent items. Third, an algorithm is proposed to overcome costly database scans in the form of a novel SPFP-tree (single pass frequent pattern tree) algorithm. Lastly, approaches that allow frequent itemsets and association rules to be practically and effectively used in real-world applications are proposed. First, the quality and effectiveness of frequent itemset mining in solving a real-world facility management problem is examined. Second, with the aim of improving the quality of recommendations made to users, as well as overcoming the cold-start problem suffered by new users, a hybrid approach is proposed for the application of association rules in recommender systems.

    Enhancing the Performance of Mining High Utility Itemsets Based On Pattern Algorithm

    Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. An association in data mining indicates a logical dependency between various attributes of an entity. Association rule mining (ARM) is the process of mining past data for association rules. ARM only finds frequent itemsets, which do not necessarily yield a large profit. Utility mining focuses on discovering the itemsets with high sales profit. Here, utility is a measure of the profitability of items to the users. Utility mining of itemsets is an important task in the decision-making process of many applications, such as website clickstream analysis, cross marketing in retail stores and biomedical applications. The extraction of high utility itemsets from a large database involves the creation of new candidate itemsets with high utility. This affects the performance of the mining process in terms of execution time and space requirements. This paper aims to develop an efficient algorithm for mining high utility itemsets that reduces the number of candidate itemsets. A data structure named pattern tree is maintained to store the information about the high utility itemsets, so that the number of database scans can be reduced.
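
    The notion of itemset utility used in the abstract above can be made concrete with a small sketch. It assumes the commonly used formulation in which the utility of an itemset is the sum, over the transactions containing it, of purchased quantity times unit profit for each of its items; the abstract does not spell out its own definition, and the pattern tree structure is not reproduced here. The function name and the data are hypothetical.

```python
from typing import Dict, Iterable, List


def itemset_utility(itemset: Iterable[str],
                    transactions: List[Dict[str, int]],
                    unit_profit: Dict[str, float]) -> float:
    """Total utility of `itemset` over a quantitative database.

    Each transaction maps item -> purchased quantity. Only the
    transactions that contain every item of `itemset` contribute.
    """
    items = list(itemset)
    total = 0.0
    for quantities in transactions:
        if all(item in quantities for item in items):
            total += sum(quantities[item] * unit_profit[item] for item in items)
    return total


db = [
    {"a": 2, "b": 1},
    {"a": 1},
    {"a": 1, "b": 3, "c": 1},
]
profit = {"a": 5.0, "b": 2.0, "c": 10.0}

utility_a = itemset_utility(["a"], db, profit)
utility_ab = itemset_utility(["a", "b"], db, profit)
```

    In this toy database `{a, b}` has a higher utility than its subset `{a}`, even though it occurs in fewer transactions. Utility is therefore not anti-monotone the way support is, which is one reason candidate generation is costly in high utility itemset mining.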