Exploring Decomposition for Solving Pattern Mining Problems

Author: Djenouri Youcef
Lin Jerry Chun-Wei
Nørvåg Kjetil
Ramampiaro Heri
Yu Philip S.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2021

This article introduces a highly efficient pattern mining technique called Clustering-based Pattern Mining (CBPM). This technique discovers relevant patterns by studying the correlation between transactions in the transaction database based on clustering techniques. The set of transactions is first clustered, such that highly correlated transactions are grouped together. Next, we derive the relevant patterns by applying a pattern mining algorithm to each cluster. We present two different pattern mining algorithms, one applying an approximation-based strategy and another based on an exact strategy. The approximation-based strategy takes into account only the clusters, whereas the exact strategy takes into account both clusters and shared items between clusters. To boost the performance of CBPM, a GPU-based implementation is investigated. To evaluate the CBPM framework, we perform extensive experiments on several pattern mining problems. The results from the experimental evaluation show that CBPM provides a reduction in both runtime and memory usage. Also, CBPM based on the approximate strategy provides good accuracy, demonstrating its effectiveness and feasibility. Our GPU implementation achieves a significant speedup of up to 552× on a single GPU using big transaction databases.
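The cluster-then-mine idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' CBPM implementation: the greedy Jaccard clustering, the brute-force miner, and every function name and parameter here (`cluster_transactions`, `frequent_itemsets`, `mine_per_cluster`, `threshold`, `min_support`) are assumptions chosen for illustration. It follows only the approximate strategy, mining each cluster on its own and ignoring items shared between clusters.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two item sets."""
    return len(a & b) / len(a | b) if a | b else 1.0


def cluster_transactions(transactions, threshold=0.3):
    """Greedy single-pass clustering (hypothetical stand-in for the paper's
    clustering step): a transaction joins the first cluster whose seed
    transaction is similar enough, otherwise it starts a new cluster."""
    clusters = []  # clusters[i][0] is the seed of cluster i
    for t in transactions:
        for c in clusters:
            if jaccard(t, c[0]) >= threshold:
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force frequent itemset mining using absolute support counts."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for cand in combinations(items, size):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                result[cand] = support
    return result


def mine_per_cluster(transactions, min_support, threshold=0.3):
    """Approximate strategy: mine each cluster independently, then merge."""
    patterns = {}
    for c in cluster_transactions(transactions, threshold):
        for itemset, support in frequent_itemsets(c, min_support).items():
            patterns[itemset] = patterns.get(itemset, 0) + support
    return patterns


db = [frozenset("ab"), frozenset("abc"), frozenset("ab"),
      frozenset("xy"), frozenset("xyz"), frozenset("xy")]
clusters = cluster_transactions(db)
patterns = mine_per_cluster(db, min_support=2)
```

Because each cluster is mined in isolation, a pattern whose occurrences are spread thinly over several clusters can be missed; that is the accuracy trade-off the abstract attributes to the approximate strategy.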

SINTEF Open

Clustering for Data Reduction: A Divide and Conquer Approach

Author: Andrews Nicholas O.
Fox Edward A.
Publication date: 01/01/2007

We consider the problem of reducing a potentially very large dataset to a subset of representative prototypes. Rather than searching over the entire space of prototypes, we first roughly divide the data into balanced clusters using bisecting k-means and spectral cuts, and then find the prototypes for each cluster by affinity propagation. We apply our algorithm to text data, where we perform an order of magnitude faster than simply looking for prototypes on the entire dataset. Furthermore, our "divide and conquer" approach actually performs more accurately on datasets which are well bisected, as the greedy decisions of affinity propagation are confined to classes of already similar items.
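A rough sketch of the divide-and-conquer structure described above, under stated assumptions: only bisecting k-means is used for the divide step (no spectral cuts), and a simple medoid replaces affinity propagation as the per-cluster prototype finder. The function names (`two_means`, `bisecting_kmeans`, `medoid`, `prototypes`) and the deterministic initialization are illustrative choices, not taken from the paper.

```python
import math


def two_means(points, iters=10):
    """Split points into two groups with 2-means.
    Deterministic init: the first point and the point farthest from it."""
    c0 = points[0]
    c1 = max(points, key=lambda p: math.dist(p, c0))
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            nearest = 0 if math.dist(p, c0) <= math.dist(p, c1) else 1
            groups[nearest].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all points identical
        # Recompute both centroids as coordinate-wise means.
        c0, c1 = (tuple(sum(x) / len(g) for x in zip(*g)) for g in groups)
    return groups


def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        left, right = two_means(largest)
        if not left or not right:  # cannot split any further
            clusters.append(largest)
            break
        clusters += [left, right]
    return clusters


def medoid(cluster):
    """Point with the smallest total distance to the rest of its cluster
    (stand-in for affinity propagation)."""
    return min(cluster, key=lambda p: sum(math.dist(p, q) for q in cluster))


def prototypes(points, k):
    """Divide with bisecting k-means, then pick one prototype per cluster."""
    return [medoid(c) for c in bisecting_kmeans(points, k)]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
protos = prototypes(points, 2)
```

The prototype search inside each cluster only compares points that are already close to each other, which is where the speedup over a search across the whole dataset comes from.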

Computer Science Technical Reports @Virginia Tech

CiteSeerX

FROM DNA MICRO-ARRAYS TO DISEASE CLASSIFICATION: AN UNSUPERVISED CLUSTERING APPROACH

Author: Boley
De Moor
Garatti
Golub
Golub
Hand
Jain
O'Connel
Savaresi
Selim
Publication venue: 'Elsevier BV'
Publication date: 01/01/2005

Crossref

Deterministic Modularity Optimization

Author: Albert
Barabási
Dorogovtsev
Fiedler
Fortunato
Guimerá
Guimerá
Guimerá
Kirkpatrick
L. K. Hansen
Newman
Newman
Newman
Newman
Newman
Newman
Palla
Peterson
Peterson
Pothen
Reichardt
S. Lehmann
Warner
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2007

We study community structure of networks. We have developed a scheme for maximizing the modularity Q based on mean field methods. Further, we have defined a simple family of random networks with community structure; we understand the behavior of these networks analytically. Using these networks, we show how the mean field methods display better performance than previously known deterministic methods for optimization of Q.
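For orientation, the quantity being maximized can be computed directly. The sketch below evaluates the standard Newman-Girvan modularity of a given partition, Q = Σ_c [e_c/m − (d_c/2m)²], where m is the number of edges, e_c the number of edges inside community c, and d_c the sum of the degrees of its nodes. It assumes a simple undirected graph without self-loops and does not implement the mean field optimization scheme from the paper; the function name and the example graph are illustrative.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of a simple undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    internal = defaultdict(int)    # e_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: total degree of the nodes in c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {n: "A" for n in range(6)}

q_split = modularity(edges, two_groups)   # 5/14, roughly 0.357
q_single = modularity(edges, one_group)   # 0.0
```

Placing every node in a single community always gives Q = 0, so a positive value for the two-triangle split indicates more internal edges than expected by chance.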

arXiv.org e-Print Archive

CiteSeerX

Crossref

EDP Sciences OAI-PMH repository (1.2.0)

Research Papers in Economics

CERN Document Server

Online Research Database In Technology

ParaKMeans: Implementation of a parallelized K-means algorithm suitable for general laboratory use

Author: Garge Nikhil
Kraj Piotr
McIndoe Richard A
Podolsky Robert
Sharma Ashok
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: During the last decade, the use of microarrays to assess the transcriptome of many biological systems has generated an enormous amount of data. A common technique used to organize and analyze microarray data is to perform cluster analysis. While many clustering algorithms have been developed, they all suffer a significant decrease in computational performance as the size of the dataset being analyzed becomes very large. For example, clustering 10000 genes from an experiment containing 200 microarrays can be quite time consuming and challenging on a desktop PC. One solution to the scalability problem of clustering algorithms is to distribute or parallelize the algorithm across multiple computers. Results: The software described in this paper is a high performance multithreaded application that implements a parallelized version of the K-means clustering algorithm. Most parallel processing applications are not accessible to the general public and require specialized software libraries (e.g. MPI) and specialized hardware configurations. The parallel nature of the application comes from the use of a web service to perform the distance calculations and cluster assignments. Here we show our parallel implementation provides significant performance gains over a wide range of datasets using as few as seven nodes. The software was written in C# and was designed in a modular fashion to provide both deployment flexibility and flexibility in the user interface. Conclusion: ParaKMeans was designed to provide the general scientific community with an easy and manageable client-server application that can be installed on a wide variety of Windows operating systems.
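The partitioning idea described above, farming out the distance calculations and cluster assignments while a coordinator updates the centroids, can be sketched as follows. This is not ParaKMeans itself: the original is a C# client-server application built on a web service, whereas this sketch uses a local Python thread pool as a stand-in for the compute nodes, and all names (`assign_chunk`, `parallel_kmeans`, `workers`) are hypothetical. It assumes a non-empty list of points and initial centroids given as tuples.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def assign_chunk(chunk, centroids):
    """Worker task: index of the nearest centroid for each point in the chunk."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in chunk]


def parallel_kmeans(points, centroids, workers=2, iters=10):
    """K-means where the assignment step is split across a pool of workers."""
    size = math.ceil(len(points) / workers)
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iters):
            # Assignment step: each worker labels its own chunk of points.
            parts = pool.map(assign_chunk, chunks, [centroids] * len(chunks))
            labels = [lab for part in parts for lab in part]
            # Update step on the coordinator: mean of each cluster's members.
            new_centroids = []
            for j, old in enumerate(centroids):
                members = [p for p, lab in zip(points, labels) if lab == j]
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                    if members else old)  # keep an empty cluster's centroid
            if new_centroids == centroids:
                break  # converged
            centroids = new_centroids
    return labels, centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
labels, centroids = parallel_kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

In standard CPython, threads do not speed up this CPU-bound step; the sketch only shows how the work is partitioned. A real deployment would send each chunk to a separate process or machine, as the paper does with its web service.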

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A novel clustering based method for characterizing household electricity consumption profiles

Author: Del-Campo-Ávila José
Mora-López Llanos
Rodríguez-Gómez Francisco
Publication venue: Elsevier
Publication date: 12/12/2023

A new methodology based on expert knowledge and data mining is proposed to obtain data-driven models that characterize household consumption profiles. These profiles are useful for electricity marketers to understand their customers’ consumption. They could then adjust their electricity purchases in the market and provide recommendations to their customers to manage their consumption. The novelty of this research work is a new procedure to determine an adequate number of clusters for a clustering task. The proposed methodology uses this procedure to build the models in two phases. In the first phase, clustering algorithms are used to group the data using different numbers of clusters. For the second phase, a new procedure (k-ISAC_TLP) is proposed and used to select the most appropriate number of clusters. This methodology allows the inclusion of domain information. In the case of household electricity consumption, where only groups with a significant number of observations are relevant as long as the error is small, it allows the use of metrics like the mean absolute error and the number of observations (daily electricity consumption profiles). According to experts, the results achieved in two real datasets (from Spain and Ireland), with millions of observations, support the methodology and reveal novel knowledge. In both cases, two and a half million observations have been analyzed and around twenty electricity consumption profiles have been detected. The methodology is easily extensible to problems of any domain where clustering algorithms are applicable. A software solution has been implemented and made freely available.
Funding for open access charge: Universidad de Málaga/CBUA. The authors would like to thank the University College Dublin Library for access to the Irish Social Science Data Archive (ISSDA). This work was supported by Grant RTI2018-095097-B-I00 funded by MCIN (Spain), Grant CPP2021-008403 funded by MCIN/AEI/10.13039/501100011033 and by the “European Union NextGenerationEU/PRTR”.

Repositorio Institucional Universidad de Málaga

Data Mining Techniques for Complex User-Generated Data

Author: XIAO XIN
Publication venue: country:Italy
Publication date: 01/01/2016

Nowadays, the amount of collected information is continuously growing in a variety of different domains. Data mining techniques are powerful instruments to effectively analyze these large data collections and extract hidden and useful knowledge. A vast amount of User-Generated Data (UGD) is being created every day, such as user behavior, user-generated content, user exploitation of available services and user mobility in different domains. Some common critical issues arise for the UGD analysis process, such as the large dataset cardinality and dimensionality, the variable data distribution and inherent sparseness, and the heterogeneous data needed to model the different facets of the targeted domain. Consequently, the extraction of useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterised by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets coming from three example domains characterized by the above critical issues are considered as reference cases, i.e., health care, social network, and urban environment domains. Experimental results show the effectiveness of the proposed approaches to discover useful knowledge from different domains.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Exploiting clustering algorithms in a multiple-level fashion: A comparative study in the medical care scenario

Author: CERQUITELLI TANIA
CHIUSANO SILVIA ANNA
XIAO XIN
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016

Clustering real-world data is a challenging task, since many real-data collections are characterized by an inherent sparseness and variable distribution. An appealing domain that generates such data collections is the medical care scenario, where collected data include a large cardinality of patient records and a variety of medical treatments usually adopted for a given disease pathology. This paper proposes a two-phase data mining methodology to iteratively analyze different dataset portions and locally identify groups of objects with common properties. Discovered cohesive clusters are then analyzed using sequential patterns to characterize temporal relationships among data features. To support the automatic classification of new data objects into one of the discovered groups, a classification model is created starting from the computed cluster set. A mobile application has also been designed and developed to visualize and update data under analysis as well as to categorize new unlabeled records. A comparative study has been conducted on real datasets in the medical care scenario using diverse clustering algorithms. Results were compared in terms of cluster quality, execution time, classification performance and discovered sequential patterns. The experimental evaluation showed the effectiveness of MLC to discover interesting knowledge items and to easily exploit them through a mobile application. Results have also been discussed from a medical perspective.

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)