49 research outputs found

    Mining frequent itemsets a perspective from operations research

    Get PDF
    Many papers on frequent itemsets have been published. Besides somecontests in this field were held. In the majority of the papers the focus ison speed. Ad hoc algorithms and datastructures were introduced. Inthis paper we put most of the algorithms in one framework, usingclassical Operations Research paradigms such as backtracking, depth-first andbreadth-first search, and branch-and-bound. Moreover we presentexperimental results where the different algorithms are implementedunder similar designs.data mining;operation research;Frequent itemsets

    Frequent itemset mining: technique to improve eclat based algorithm

    Get PDF
    In frequent itemset mining, the main challenge is to discover relationships between data in a transactional database or relational database. Various algorithms have been introduced to process frequent itemset. Eclat based algorithms are one of the prominent algorithm used for frequent itemset mining. Various researches have been conducted based on Eclat based algorithm such as Tidset, dEclat, Sortdiffset and Postdiffset. The algorithm has been improvised along the time. However, the utilization of physical memory and processing time become the main problem in this process. This paper reviews and presents a comparison of various Eclat based algorithms for frequent itemset mining and propose an enhancement technique of Eclat based algorithm to reduce processing time and memory usage. The experimental result shows some improvement in processing time and memory utilization in frequent itemset mining

    i-Eclat: performance enhancement of eclat via incremental approach in frequent itemset mining

    Get PDF
    One example of the state-of-the-art vertical rule mining technique is called equivalence class transformation (Eclat) algorithm. Neither horizontal nor vertical data format, both are still suffering from the huge memory consumption. In response to the promising results of mining in a higher volume of data from a vertical format, and taking consideration of dynamic transaction of data in a database, the research proposes a performance enhancement of Eclat algorithm that relies on incremental approach called an Incremental-Eclat (i-Eclat) algorithm. Motivated from the fast intersection in Eclat, this algorithm of performance enhancement adopts via my structured query language (MySQL) database management system (DBMS) as its platform. It serves as the association rule mining database engine in testing benchmark frequent itemset mining (FIMI) datasets from online repository. The MySQL DBMS is chosen in order to reduce the preprocessing stages of datasets. The experimental results indicate that the proposed algorithm outperforms the traditional Eclat with 17% both in chess and T10I4D100K, 69% in mushroom, 5% and 8% in pumsb_star and retail datasets. Thus, among five (5) dense and sparse datasets, the average performance of i-Eclat is concluded to be 23% better than Eclat

    New Spark solutions for distributed frequent itemset and association rule mining algorithms

    Get PDF
    Funding for open access publishing: Universidad de Gran- ada/CBUA. The research reported in this paper was partially sup- ported by the BIGDATAMED project, which has received funding from the Andalusian Government (Junta de Andalucı ́a) under grant agreement No P18-RT-1765, by Grants PID2021-123960OB-I00 and Grant TED2021-129402B-C21 funded by Ministerio de Ciencia e Innovacio ́n and, by ERDF A way of making Europe and by the European Union NextGenerationEU. In addition, this work has been partially supported by the Ministry of Universities through the EU- funded Margarita Salas programme NextGenerationEU. Funding for open access charge: Universidad de Granada/CBUAThe large amount of data generated every day makes necessary the re-implementation of new methods capable of handle with massive data efficiently. This is the case of Association Rules, an unsupervised data mining tool capable of extracting information in the form of IF-THEN patterns. Although several methods have been proposed for the extraction of frequent itemsets (previous phase before mining association rules) in very large databases, the high computational cost and lack of memory remains a major problem to be solved when processing large data. Therefore, the aim of this paper is three fold: (1) to review existent algorithms for frequent itemset and association rule mining, (2)to develop new efficient frequent itemset Big Data algorithms using distributive computation, as well as a new association rule mining algorithm in Spark, and (3) to compare the proposed algorithms with the existent proposals varying the number of transactions and the number of items. To this purpose, we have used the Spark platform which has been demonstrated to outperform existing distributive algorithmic implementations.Universidad de Granada/CBUAJunta de Andalucia P18-RT-1765Ministry of Science and Innovation, Spain (MICINN) Instituto de Salud Carlos III Spanish Government PID2021-123960OB-I00, TED2021-129402B-C21ERDF A way of making EuropeEuropean Union NextGenerationEUMinistry of Universities through the E

    Postdiffset Algorithm in Rare Pattern: An Implementation via Benchmark Case Study

    Get PDF
    Frequent and infrequent itemset mining are trending in data mining techniques. The pattern of Association Rule (AR) generated will help decision maker or business policy maker to project for the next intended items across a wide variety of applications. While frequent itemsets are dealing with items that are most purchased or used, infrequent items are those items that are infrequently occur or also called rare items. The AR mining still remains as one of the most prominent areas in data mining that aims to extract interesting correlations, patterns, association or casual structures among set of items in the transaction databases or other data repositories. The design of database structure in association rules mining algorithms are based upon horizontal or vertical data formats. These two data formats have been widely discussed by showing few examples of algorithm of each data formats. The efforts on horizontal format suffers in huge candidate generation and multiple database scans which resulting in higher memory consumptions. To overcome the issue, the solutions on vertical approaches are proposed. One of the established algorithms in vertical data format is Eclat.ECLAT or Equivalence Class Transformation algorithm is one example solution that lies in vertical database format. Because of its, fast intersection‟, in this paper, we analyze the fundamental Eclat and Eclatvariants such asdiffsetand sortdiffset. In response to vertical data format and as a continuity to Eclat extension, we propose a postdiffset algorithm as a new member in Eclat variants that use tidset format in the first looping and diffset in the later looping. In this paper, we present the performance of Postdiffset algorithm prior to implementation in mining of infrequent or rare itemset.Postdiffset algorithm outperforms 23% and 84% to diffset and sortdiffset in mushroom and 94% and 99% to diffset and sortdiffset in retail dataset

    Mining frequent itemsets a perspective from operations research

    Get PDF
    Many papers on frequent itemsets have been published. Besides some contests in this field were held. In the majority of the papers the focus is on speed. Ad hoc algorithms and datastructures were introduced. In this paper we put most of the algorithms in one framework, using classical Operations Research paradigms such as backtracking, depth-first and breadth-first search, and branch-and-bound. Moreover we present experimental results where the different algorithms are implemented under similar designs

    A STUDY ON EFFICIENT DATA MINING APPROACH ON COMPRESSED TRANSACTION

    Get PDF
    Data mining can be viewed as a result of the natural evolution of information technology. The spread of computing has led to an explosion in the volume of data to be stored on hard disks and sent over the Internet. This growth has led to a need for data compression, that is, the ability to reduce the amount of storage or Internet bandwidth required to handle the data. This paper analysis the various data mining approaches which is used to compress the original database into a smaller one and perform the data mining process for compressed transaction such as M2TQT,PINCER-SEARCH algorithm, APRIOR

    Machine Learning Approaches for Breast Cancer Survivability Prediction

    Get PDF
    Breast cancer is one of the leading causes of cancer death in women. If not diagnosed early, the 5-year survival rate of patients is just about 26\%. Furthermore, patients with similar phenotypes can respond differently to the same therapies, which means the therapies might not work well for some of them. Identifying biomarkers that can help predict a cancer class with high accuracy is at the heart of breast cancer studies because they are targets of the treatments and drug development. Genomics data have been shown to carry useful information for breast cancer diagnosis and prognosis, as well as uncovering the disease’s mechanism. Machine learning methods are powerful tools to find such information. Feature selection methods are often utilized in supervised learning and unsupervised learning tasks to deal with data containing a large number of features in which only a small portion of them are useful to the classification task. On the other hand, analyzing only one type of data, without reference to the existing knowledge about the disease and the therapies, might mislead the findings. Effective data integration approaches are necessary to uncover this complex disease. In this thesis, we apply and develop machine learning methods to identify meaningful biomarkers for breast cancer survivability prediction after a certain treatment. They include applying feature selection methods on gene-expression data to derived gene-signatures, where the initial genes are collected concerning the mechanism of some drugs used breast cancer therapies. We also propose a new feature selection method, named PAFS, and apply it to discover accurate biomarkers. In addition, it has been increasingly supported that, sub-network biomarkers are more robust and accurate than gene biomarkers. We proposed two network-based approaches to identify sub-network biomarkers for breast cancer survivability prediction after a treatment. They integrate gene-expression data with protein-protein interactions during the optimal sub-network searching process and use cancer-related genes and pathways to prioritize the extracted sub-networks. The sub-network search space is usually huge and many proteins interact with thousands of other proteins. Thus, we apply some heuristics to avoid generating and evaluating redundant sub-networks

    Benefits and limitations of data assimilation for discharge forecasting using an event-based rainfall–runoff model

    Get PDF
    Mediterranean catchments in southern France are threatened by potentially devastating fast floods which are difficult to anticipate. In order to improve the skill of rainfall-runoff models in predicting such flash floods, hydrologists use data assimilation techniques to provide real-time updates of the model using observational data. This approach seeks to reduce the uncertainties present in different components of the hydrological model (forcing, parameters or state variables) in order to minimize the error in simulated discharges. This article presents a data assimilation procedure, the best linear unbiased estimator (BLUE), used with the goal of improving the peak discharge predictions generated by an event-based hydrological model Soil Conservation Service lag and route (SCS-LR). For a given prediction date, selected model inputs are corrected by assimilating discharge data observed at the basin outlet. This study is conducted on the Lez Mediterranean basin in southern France. The key objectives of this article are (i) to select the parameter(s) which allow for the most efficient and reliable correction of the simulated discharges, (ii) to demonstrate the impact of the correction of the initial condition upon simulated discharges, and (iii) to identify and understand conditions in which this technique fails to improve the forecast skill. The correction of the initial moisture deficit of the soil reservoir proves to be the most efficient control parameter for adjusting the peak discharge. Using data assimilation, this correction leads to an average of 12% improvement in the flood peak magnitude forecast in 75% of cases. The investigation of the other 25% of cases points out a number of precautions for the appropriate use of this data assimilation procedure
    corecore