
    Heuristic Approaches for Generating Local Process Models through Log Projections

    Local Process Model (LPM) discovery is focused on mining a set of process models in which each model describes the behavior represented in the event log only partially, i.e., only subsets of the possible events are taken into account to create so-called local process models. Such smaller models often provide valuable insights into the behavior of the process, especially when no adequate and comprehensible single overall process model exists that can describe the traces of the process from start to end. The practical application of LPM discovery is, however, hindered by computational issues for logs with many activities (problems may already occur with more than 17 unique activities). In this paper, we explore three heuristics to discover subsets of activities that lead to useful log projections, with the goal of speeding up LPM discovery considerably while still finding high-quality LPMs. We found that a Markov clustering approach to creating projection sets yields the largest improvement in execution time, with the discovered LPMs still being better than those obtained with randomly generated activity sets of the same size. A second heuristic, based on log entropy, yields a more moderate speedup but enables the discovery of higher-quality LPMs. The third heuristic, based on relative information gain, shows unstable performance: for some data sets the speedup and LPM quality are higher than with the log-entropy-based method, while for other data sets there is no speedup at all.
    Comment: Paper accepted and to appear in the proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), special session on Process Mining, part of the Symposium Series on Computational Intelligence (SSCI).
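    As a rough illustration of the log-entropy heuristic, the sketch below scores fixed-size activity subsets by the Shannon entropy of the projected log and keeps the top-ranked projection sets. All function names and parameter values here are ours; the paper's exact entropy definition and search strategy are not reproduced.

    from collections import Counter
    from itertools import combinations
    from math import log2

    def project(log, activities):
        # keep only the events whose activity is in the chosen subset
        return [[a for a in trace if a in activities] for trace in log]

    def log_entropy(log):
        # Shannon entropy of the activity frequency distribution of a log
        counts = Counter(a for trace in log for a in trace)
        total = sum(counts.values())
        return -sum((c / total) * log2(c / total) for c in counts.values()) if total else 0.0

    def best_projection_sets(log, subset_size=3, top_k=5):
        # rank all fixed-size activity subsets by the entropy of their projection
        activities = {a for trace in log for a in trace}
        scored = [(log_entropy(project(log, set(s))), s)
                  for s in combinations(sorted(activities), subset_size)]
        return sorted(scored, key=lambda t: t[0], reverse=True)[:top_k]

    log = [["a", "b", "c", "d"], ["a", "c", "b", "d"], ["a", "b", "d"]]
    print(best_projection_sets(log))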

    Explicit probabilistic models for databases and networks

    Recent work in data mining and related areas has highlighted the importance of the statistical assessment of data mining results. Crucial to this endeavour is the choice of a non-trivial null model for the data, against which the found patterns can be contrasted. The most influential null models proposed so far are defined in terms of invariants of the null distribution. Such null models can be used by computation-intensive randomization approaches to estimate the statistical significance of data mining results. Here, we introduce a methodology to construct non-trivial probabilistic models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt models allow for the natural incorporation of prior information. Furthermore, they satisfy a number of desirable properties of previously introduced randomization approaches. Lastly, they have the benefit that they can be represented explicitly. We argue that our approach can be used for a variety of data types; for concreteness, we demonstrate it in particular for databases and networks.
    Comment: Submitted.
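    As a minimal sketch of what an explicitly representable MaxEnt model can look like, the code below (our assumption of a concrete constraint set, not the paper's implementation) fits a model for a binary database whose expected row and column sums match the observed ones, yielding an independent Bernoulli probability for each entry.

    import numpy as np

    def fit_maxent(D, steps=2000, lr=0.5):
        # fit Lagrange multipliers so expected row/column sums match the data
        n, m = D.shape
        r = np.zeros(n)                 # one multiplier per row constraint
        c = np.zeros(m)                 # one multiplier per column constraint
        for _ in range(steps):
            P = 1.0 / (1.0 + np.exp(-(r[:, None] + c[None, :])))
            r += lr * (D.sum(axis=1) - P.sum(axis=1)) / m
            c += lr * (D.sum(axis=0) - P.sum(axis=0)) / n
        return P                        # explicit entry-wise Bernoulli probabilities

    D = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1]], dtype=float)
    P = fit_maxent(D)
    print(P.round(2))
    print(P.sum(axis=1), D.sum(axis=1))  # expected vs. observed row sums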

    Outlier Detection Method on UCI Repository Dataset by Entropy Based Rough K-means

    Rough set theory handles uncertainty and incomplete information by means of two sets, the lower and upper approximations. In this paper, the clustering process is improved by adapting a preliminary centroid selection method to the rough K-means (RKM) algorithm. The entropy-based rough K-means (ERKM) method is developed by adding entropy-based preliminary centroid selection to RKM, and it is executed and validated using cluster validity indexes. An example shows that ERKM performs effectively thanks to its entropy-based selection of preliminary centroids. In addition, outlier detection is an important task in data mining: outliers are objects very different from the rest of the objects in a cluster. The entropy-based rough outlier factor (EROF) method is used to detect outliers effectively in the yeast dataset. An example shows that EROF detects outliers effectively on protein localisation sites and that the ERKM clustering algorithm performs effectively. Further, experimental readings show that the ERKM and EROF methods outperform the other methods.
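    A condensed sketch of rough K-means is given below. The entropy-based seeding rule is an illustrative stand-in of ours, since the paper's exact selection criterion and weights are not reproduced: points clearly closest to one centroid go to its lower approximation, near-equidistant points go to the upper approximations of the tied clusters, and centroids are updated as a weighted mix of the two.

    import numpy as np

    def rough_kmeans(X, k, w_low=0.7, w_up=0.3, eps=1.3, iters=20):
        # illustrative entropy-based seeding: prefer points whose neighbourhood
        # similarity distribution has high entropy (dense, ambiguous regions)
        d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
        sim = np.exp(-d / d.mean())
        p = sim / sim.sum(axis=1, keepdims=True)
        score = -(p * np.log(p)).sum(axis=1)
        C = X[np.argsort(score)[-k:]].copy()        # top-k entropy points as seeds
        for _ in range(iters):
            dist = np.linalg.norm(X[:, None] - C[None, :], axis=2)
            nearest = dist.argmin(axis=1)
            lower = [[] for _ in range(k)]
            upper = [[] for _ in range(k)]
            for i, j in enumerate(nearest):
                ties = np.where(dist[i] <= eps * dist[i, j])[0]
                if len(ties) == 1:
                    lower[j].append(X[i])           # unambiguous: lower approximation
                else:
                    for t in ties:
                        upper[t].append(X[i])       # ambiguous: upper approximations
            for j in range(k):
                lo = np.mean(lower[j], axis=0) if lower[j] else C[j]
                up = np.mean(upper[j], axis=0) if upper[j] else C[j]
                C[j] = w_low * lo + w_up * up       # weighted rough centroid update
        return C

    X = np.random.default_rng(0).normal(size=(60, 2))
    print(rough_kmeans(X, k=3))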

    CBTS: Correlation based transformation strategy for privacy preserving data mining

    Mining useful knowledge from corpora of data has become an important application in many fields. Data mining algorithms such as clustering and classification work on these data and provide crisp information for analysis. As these data become available through various channels in the public domain, privacy for the owners of the data is an increasing need. Although privacy can be provided by hiding sensitive data, doing so impairs the knowledge extraction of data mining algorithms, so an effective mechanism is required that provides privacy to the data without affecting the data mining results. Privacy concerns are a primary hindrance to quality data analysis; data mining algorithms, by contrast, focus on the mathematical rather than the private nature of the information. Therefore, instead of removing or encrypting sensitive data, we propose transformation strategies that retain the statistical, semantic, and heuristic nature of the data while masking the sensitive information. The proposed Correlation Based Transformation Strategy (CBTS) combines correlation analysis with data transformation techniques such as Singular Value Decomposition (SVD), Principal Component Analysis (PCA), and Non-negative Matrix Factorization (NNMF), providing the intended level of privacy preservation while still enabling data analysis. The outcome of CBTS is evaluated on standard datasets against popular data mining techniques with significant success, and information entropy is also accounted for.
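    Of the techniques named, the SVD component is the easiest to sketch. The snippet below is our simplification, not the full CBTS pipeline (which also involves correlation analysis, PCA, and NNMF): it publishes a low-rank reconstruction so that aggregate statistical structure survives approximately while individual sensitive values are distorted.

    import numpy as np

    def svd_perturb(X, rank):
        # replace X by its best rank-`rank` approximation (Eckart-Young)
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s[rank:] = 0.0                  # drop the small singular values
        return (U * s) @ Vt

    # synthetic data with low-rank latent structure plus noise
    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(50, 5))
    X_priv = svd_perturb(X, rank=2)
    # individual cells change, but the dominant correlation structure is retained
    print(np.abs(np.corrcoef(X.T) - np.corrcoef(X_priv.T)).max())
    print(np.abs(X - X_priv).mean())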

    Parametric entropy based Cluster Centroid Initialization for k-means clustering of various Image datasets

    One of the simplest and most widely employed algorithms for cluster analysis is the k-means algorithm. k-means has been used successfully in artificial intelligence, market segmentation, fraud detection, data mining, and psychology, to name only a few fields. The k-means algorithm, however, does not always yield the best-quality results; its performance depends heavily on the number of clusters supplied and on proper initialization of the cluster centroids, or seeds. In this paper, we analyze the performance of k-means on image data by employing parametric entropies in an entropy-based centroid initialization method, and we propose the best-fitting entropy measures for general image datasets. We use several entropies, including the Taneja, Kapur, Aczel-Daroczy, and Sharma-Mittal entropies. We observe that for different datasets, different entropies provide better results than the conventional methods. We have applied our proposed algorithm to the following datasets: Satellite, Toys, Fruits, Cars, Brain MRI, and Covid X-Ray.
    Comment: 6 pages, 2 tables, one algorithm. Accepted for publication in the IEEE International Conference on Signal Processing and Computer Vision (SPCV-2023).
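    To make the parametric-entropy idea concrete, here is a hedged sketch that uses the Sharma-Mittal entropy (one of the measures named above) to seed k-means from an image's intensity histogram. The equal-entropy-mass splitting rule and all parameter values are our simplification, not the paper's exact algorithm.

    import numpy as np

    def sharma_mittal(p, q=0.7, r=0.5):
        # Sharma-Mittal entropy; recovers Renyi as r -> 1 and Tsallis as r -> q
        p = p[p > 0]
        return ((p ** q).sum() ** ((1 - r) / (1 - q)) - 1) / (1 - r)

    def entropy_seeds(image, k, q=0.7, bins=256):
        # place seeds where the cumulative per-bin entropy contribution
        # crosses the midpoints of k equal segments of total entropy mass
        hist, edges = np.histogram(image, bins=bins, range=(0, 256))
        p = hist / hist.sum()
        contrib = np.where(p > 0, p ** q, 0.0)
        cum = np.cumsum(contrib) / contrib.sum()
        centers = (edges[:-1] + edges[1:]) / 2
        return [centers[np.searchsorted(cum, (j + 0.5) / k)] for j in range(k)]

    img = np.random.default_rng(2).integers(0, 256, size=(64, 64))
    hist = np.histogram(img, bins=256, range=(0, 256))[0]
    print(sharma_mittal(hist / hist.sum()))
    print(entropy_seeds(img, k=4))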

    Information mining of customers' preferences for product specification determination using big sales data

    Product competitiveness is strongly influenced by a product's design specifications. Retrieving information about customers' preferences is essential for determining these specifications during product design and development, and big sales data is an emerging resource for mining such preferences. In this work, information entropy is first used to quantify customers' preference information for product specifications. A method of information mining for estimating customers' preferences from big sales data is then developed. On this basis, a density-based clustering analysis is carried out on customers' preferences as a decision-support tool for determining and selecting product design specifications. A case study on determining electric bicycle specifications from big sales data is reported to illustrate and validate the proposed method.
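    A hedged reconstruction of the first two steps might look as follows. The sales counts are invented for illustration, and DBSCAN is one possible choice of density-based clustering; the paper's exact quantification and clustering parameters are not fixed here.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def preference_entropy(sales_counts):
        # Shannon entropy of the sales-share distribution over a spec's options;
        # low entropy means customers concentrate on few options (strong preference)
        p = np.asarray(sales_counts, dtype=float)
        p = p / p.sum()
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    # hypothetical e-bike sales counts per option of each specification
    specs = {
        "battery_Wh": [120, 940, 310],   # strong preference for the middle option
        "motor_W":    [800, 805, 790],   # near-uniform: weak preference
        "wheel_inch": [60, 60, 1500],    # very strong preference
    }
    print({name: round(preference_entropy(v), 3) for name, v in specs.items()})

    # group specifications with similar preference-share profiles
    profiles = np.array([np.asarray(v) / sum(v) for v in specs.values()])
    print(DBSCAN(eps=0.5, min_samples=1).fit_predict(profiles))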

    Mining of nutritional ingredients in food for disease analysis

    Suitable nutritional diets are widely recognized as important measures for preventing and controlling non-communicable diseases (NCDs). However, there is currently little research on the nutritional ingredients in food that benefit the rehabilitation of NCDs. In this paper, we analyze the relationship between nutritional ingredients and diseases in depth using data mining methods. First, we obtained more than 7,000 diseases and collected the recommended and taboo foods for each disease. Then, referring to the China Food Nutrition, we used noise intensity and information entropy to determine which nutritional ingredients can exert positive effects on diseases. Finally, we proposed an improved rough-set-based algorithm, named CVNDA_Red, to select the corresponding core ingredients from the positive nutritional ingredients. To the best of our knowledge, this is the first study in China to examine the relationship between nutritional ingredients in food and diseases through data mining based on rough set theory. Experiments on real-life data show that our data mining method improves performance compared with the traditional statistical approach, achieving a precision of 1.682. Additionally, for some common diseases such as diabetes, hypertension, and heart disease, our method correctly identifies the first two or three nutritional ingredients in food that can benefit the rehabilitation of those diseases. These experimental results demonstrate the effectiveness of applying data mining to the selection of nutritional ingredients in food for disease analysis.
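    The rough-set core selection can be illustrated with a toy decision table; the CVNDA_Red reduction rule itself is not reproduced here, and the table values are invented. An ingredient attribute belongs to the core if removing it shrinks the positive region of the decision (recommendation) attribute.

    from itertools import groupby

    def partition(rows, attrs):
        # equivalence classes of row indices under equality on the chosen attributes
        key = lambda r: tuple(r[a] for a in attrs)
        return [set(i for i, _ in grp) for _, grp in
                groupby(sorted(enumerate(rows), key=lambda t: key(t[1])),
                        key=lambda t: key(t[1]))]

    def positive_region(rows, attrs, decision):
        # rows whose equivalence class is consistent on the decision attribute
        pos = set()
        for block in partition(rows, attrs):
            if len({rows[i][decision] for i in block}) == 1:
                pos |= block
        return pos

    def core(rows, attrs, decision):
        full = positive_region(rows, attrs, decision)
        return [a for a in attrs
                if positive_region(rows, [b for b in attrs if b != a], decision) != full]

    # toy table: ingredient levels -> whether a food is recommended for a disease
    rows = [
        {"sugar": "high", "fiber": "low",  "rec": "no"},
        {"sugar": "low",  "fiber": "high", "rec": "yes"},
        {"sugar": "low",  "fiber": "low",  "rec": "no"},
    ]
    print(core(rows, ["sugar", "fiber"], "rec"))  # -> ['fiber']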