851 research outputs found

    Exploring Decomposition for Solving Pattern Mining Problems

    This article introduces an efficient pattern mining technique called Clustering-based Pattern Mining (CBPM). The technique discovers relevant patterns by using clustering to study the correlation between transactions in the transaction database. The set of transactions is first clustered so that highly correlated transactions are grouped together. The relevant patterns are then derived by applying a pattern mining algorithm to each cluster. We present two pattern mining algorithms, one applying an approximation-based strategy and the other an exact strategy. The approximation-based strategy takes into account only the clusters, whereas the exact strategy takes into account both the clusters and the items shared between clusters. To boost the performance of CBPM, a GPU-based implementation is investigated. To evaluate the CBPM framework, we perform extensive experiments on several pattern mining problems. The experimental results show that CBPM reduces both runtime and memory usage. In addition, CBPM based on the approximate strategy provides good accuracy, demonstrating its effectiveness and feasibility. Our GPU implementation achieves a speedup of up to 552× on a single GPU using big transaction databases.
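The cluster-then-mine idea behind the approximation-based strategy can be sketched as follows. This is only a minimal illustration under stated assumptions, not the CBPM implementation: the greedy Jaccard clustering, the brute-force frequent-itemset miner, and every function name and parameter (`cluster_transactions`, `frequent_itemsets`, `mine_per_cluster`, `threshold`, `min_count`, `max_size`) are hypothetical stand-ins chosen for brevity.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_transactions(transactions, threshold=0.3):
    """Greedy single-pass clustering (an assumed stand-in for the paper's
    clustering step): a transaction joins the first cluster whose seed
    transaction is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (seed, members)
    for t in transactions:
        t = frozenset(t)
        for seed, members in clusters:
            if jaccard(seed, t) >= threshold:
                members.append(t)
                break
        else:
            clusters.append((t, [t]))
    return [members for _, members in clusters]


def frequent_itemsets(transactions, min_count, max_size=3):
    """Brute-force miner: count every itemset of up to max_size items."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            itemset = frozenset(candidate)
            support = sum(1 for t in transactions if itemset <= t)
            if support >= min_count:
                result[itemset] = support
    return result


def mine_per_cluster(transactions, threshold=0.3, min_count=2):
    """Approximate strategy: mine each cluster independently and merge the
    results, ignoring items shared between clusters."""
    patterns = {}
    for members in cluster_transactions(transactions, threshold):
        for itemset, support in frequent_itemsets(members, min_count).items():
            patterns[itemset] = patterns.get(itemset, 0) + support
    return patterns
```

Because each cluster is mined in isolation, a pattern whose occurrences are spread thinly across several clusters can be missed or undercounted; that gap is what an exact strategy accounting for shared items would have to close.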

    Clustering for Data Reduction: A Divide and Conquer Approach

    We consider the problem of reducing a potentially very large dataset to a subset of representative prototypes. Rather than searching over the entire space of prototypes, we first roughly divide the data into balanced clusters using bisecting k-means and spectral cuts, and then find the prototypes for each cluster by affinity propagation. We apply our algorithm to text data, where it runs an order of magnitude faster than simply looking for prototypes on the entire dataset. Furthermore, our "divide and conquer" approach is actually more accurate on datasets that are well bisected, as the greedy decisions of affinity propagation are confined to classes of already similar items.

    Deterministic Modularity Optimization

    We study the community structure of networks. We have developed a scheme for maximizing the modularity Q based on mean field methods. Further, we have defined a simple family of random networks with community structure; we understand the behavior of these networks analytically. Using these networks, we show that the mean field methods perform better than previously known deterministic methods for optimization of Q.
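For orientation, the abstract does not define Q; assuming it refers to the standard Newman-Girvan modularity, the quantity being maximized can be written as below. The symbols are assumptions introduced here: A is the adjacency matrix, k_i the degree of node i, m the number of edges, c_i the community of node i, l_c the number of edges inside community c, and d_c the sum of degrees in c.

```latex
\[
Q \;=\; \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)
  \;=\; \sum_{c}\left[\frac{l_c}{m} - \left(\frac{d_c}{2m}\right)^{2}\right]
\]

% Worked example: two triangles joined by a single edge, one community per triangle.
% m = 7, and for each community l_c = 3, d_c = 2 + 2 + 3 = 7.
\[
Q \;=\; 2\left[\frac{3}{7} - \left(\frac{7}{14}\right)^{2}\right]
  \;=\; 2\left[\frac{3}{7} - \frac{1}{4}\right]
  \;=\; \frac{5}{14} \;\approx\; 0.357
\]
```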

    ParaKMeans: Implementation of a parallelized K-means algorithm suitable for general laboratory use

    Background: During the last decade, the use of microarrays to assess the transcriptome of many biological systems has generated an enormous amount of data. A common technique used to organize and analyze microarray data is cluster analysis. While many clustering algorithms have been developed, they all suffer a significant decrease in computational performance as the dataset being analyzed becomes very large. For example, clustering 10000 genes from an experiment containing 200 microarrays can be quite time-consuming and challenging on a desktop PC. One solution to the scalability problem of clustering algorithms is to distribute or parallelize the algorithm across multiple computers. Results: The software described in this paper is a high-performance multithreaded application that implements a parallelized version of the K-means clustering algorithm. Most parallel processing applications are not accessible to the general public and require specialized software libraries (e.g. MPI) and specialized hardware configurations. The parallel nature of the application comes from the use of a web service to perform the distance calculations and cluster assignments. Here we show that our parallel implementation provides significant performance gains over a wide range of datasets using as few as seven nodes. The software was written in C# and was designed in a modular fashion to provide flexibility in both deployment and the user interface. Conclusion: ParaKMeans was designed to provide the general scientific community with an easy and manageable client-server application that can be installed on a wide variety of Windows operating systems.
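The division of labor described above, in which the distance calculations and cluster assignments are farmed out while the centroid update stays central, can be sketched roughly as follows. This is a hedged illustration and not the ParaKMeans code: the original is a C# client calling a web service on several nodes, whereas here a Python thread pool stands in for the remote nodes, and the names `assign_chunk`, `parallel_kmeans`, `workers`, and `iterations` are hypothetical. In CPython, threads will not give a real speedup for this pure-Python arithmetic; the sketch only shows the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_chunk(chunk, centroids):
    """Worker task: index of the nearest centroid for every point in one chunk."""
    labels = []
    for p in chunk:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def parallel_kmeans(points, centroids, workers=4, iterations=10):
    """K-means where the assignment step is split across workers.

    points and centroids are lists of equal-length numeric tuples.
    """
    size = max(1, -(-len(points) // workers))  # ceiling division
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    labels = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iterations):
            # Each worker labels its own chunk; map preserves chunk order.
            parts = pool.map(assign_chunk, chunks, [centroids] * len(chunks))
            labels = [label for part in parts for label in part]
            # Central step: recompute each centroid as the mean of its members.
            new_centroids = []
            for k, old in enumerate(centroids):
                members = [p for p, label in zip(points, labels) if label == k]
                if members:
                    new_centroids.append(
                        tuple(sum(x) / len(members) for x in zip(*members))
                    )
                else:
                    new_centroids.append(old)  # keep an empty cluster's centroid
            if new_centroids == centroids:
                break  # converged
            centroids = new_centroids
    return centroids, labels
```

Only the centroids travel to the workers on each iteration and only the labels come back, which is why this step lends itself to a client-server split.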

    A novel clustering based method for characterizing household electricity consumption profiles

    A new methodology based on expert knowledge and data mining is proposed to obtain data-driven models that characterize household consumption profiles. These profiles help electricity marketers understand their customers' consumption, so that they can adjust their electricity purchases in the market and give customers recommendations for managing their consumption. The novelty of this work is a new procedure for determining an adequate number of clusters for a clustering task. The proposed methodology uses this procedure to build the models in two phases. In the first phase, clustering algorithms are used to group the data using different numbers of clusters. In the second phase, a new procedure (k-ISAC_TLP) is used to select the most appropriate number of clusters. This methodology allows the inclusion of domain information. In the case of household electricity consumption, where only groups with a significant number of observations are relevant as long as the error is small, it allows the use of metrics such as the mean absolute error and the number of observations (daily electricity consumption profiles). According to experts, the results achieved on two real datasets (from Spain and Ireland), each with millions of observations, support the methodology and reveal novel knowledge. In both cases, two and a half million observations were analyzed and around twenty electricity consumption profiles were detected. The methodology is easily extensible to problems in any domain where clustering algorithms are applicable. A software solution has been implemented and made freely available. Funding for open access charge: Universidad de Málaga/CBUA. The authors would like to thank the University College Dublin Library for access to the Irish Social Science Data Archive (ISSDA). This work was supported by Grant RTI2018-095097-B-I00 funded by MCIN (Spain), Grant CPP2021-008403 funded by MCIN/AEI/10.13039/501100011033 and by the "European Union NextGenerationEU/PRTR".

    Data Mining Techniques for Complex User-Generated Data

    The amount of collected information is growing continuously across many domains. Data mining techniques are powerful instruments for analyzing these large data collections effectively and extracting hidden, useful knowledge. A vast amount of User-Generated Data (UGD) is created every day, covering user behavior, user-generated content, user exploitation of available services, and user mobility in different domains. Some common critical issues arise in the UGD analysis process, such as the large dataset cardinality and dimensionality, the variable data distribution and inherent sparseness, and the heterogeneous data needed to model the different facets of the targeted domain. Consequently, extracting useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterized by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets from three example domains characterized by the above critical issues are considered as reference cases, i.e., the health care, social network, and urban environment domains. Experimental results show the effectiveness of the proposed approaches in discovering useful knowledge from different domains.

    Exploiting clustering algorithms in a multiple-level fashion: A comparative study in the medical care scenario

    Clustering real-world data is a challenging task, since many real-data collections are characterized by an inherent sparseness and variable distribution. An appealing domain that generates such data collections is the medical care scenario, where collected data include a large cardinality of patient records and a variety of medical treatments usually adopted for a given disease pathology. This paper proposes a two-phase data mining methodology to iteratively analyze different dataset portions and locally identify groups of objects with common properties. Discovered cohesive clusters are then analyzed using sequential patterns to characterize temporal relationships among data features. To support automatic classification of new data objects into one of the discovered groups, a classification model is created starting from the computed cluster set. A mobile application has also been designed and developed to visualize and update the data under analysis as well as to categorize new unlabeled records. A comparative study has been conducted on real datasets in the medical care scenario using diverse clustering algorithms. Results were compared in terms of cluster quality, execution time, classification performance and discovered sequential patterns. The experimental evaluation showed the effectiveness of MLC in discovering interesting knowledge items and in making them easy to exploit through a mobile application. Results have also been discussed from a medical perspective.