Search CORE

229,895 research outputs found

Automatic Clustering with Single Optimal Solution

Author: Pavan K. Karteeka
Rao A. V. Dattatreya
Rao Allam Appa
Publication date: 01/10/2011

Determining the optimal number of clusters in a dataset is a challenging task. Though some methods are available, no algorithm produces a unique clustering solution. The paper proposes Automatic Merging for Single Optimal Solution (AMSOS), which aims to generate unique and nearly optimal clusters for a given dataset automatically. AMSOS iteratively merges the closest clusters, validating each merge with a cluster validity measure, to find a single, nearly optimal clustering of the given dataset. Experiments on both synthetic and real data have shown that the proposed algorithm finds a single, nearly optimal clustering structure in terms of number of clusters, compactness, and separation. Comment: 13 pages, 4 tables, 3 figures

arXiv.org e-Print Archive

International Institute for Science, Technology and Education (IISTE): E-Journals
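
The merge-and-validate loop described in the AMSOS abstract above can be illustrated with a minimal sketch. This is not the authors' algorithm: the starting point (one cluster per point), the centroid-based merge rule, and the validity score (smallest centroid separation divided by largest mean within-cluster spread) are all assumptions chosen to keep the example short, and the function names are hypothetical.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def validity(clusters):
    """Assumed validity score (higher is better): smallest distance between
    centroids divided by the largest mean point-to-centroid distance.
    Needs at least two clusters."""
    centers = [centroid(c) for c in clusters]
    separation = min(dist(a, b) for a, b in combinations(centers, 2))
    spread = max(
        sum(dist(p, ctr) for p in c) / len(c) for c, ctr in zip(clusters, centers)
    )
    # A partition made only of singletons has zero spread; treat it as invalid.
    return separation / spread if spread > 0 else float("-inf")


def merge_until_best(points, min_clusters=2):
    """Repeatedly merge the two clusters with the closest centroids and
    remember the partition with the best validity score seen so far."""
    clusters = [[p] for p in points]
    best, best_score = [list(c) for c in clusters], float("-inf")
    while len(clusters) > max(min_clusters, 2):
        centers = [centroid(c) for c in clusters]
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(centers[ij[0]], centers[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
        score = validity(clusters)
        if score > best_score:
            best, best_score = [list(c) for c in clusters], score
    return best, best_score


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best, score = merge_until_best(points)
print(len(best), "clusters selected")
```

On this toy input of two well-separated groups, the loop settles on two clusters. A real validity measure and a cheaper initial partition would replace the assumed ones.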

Determining the Optimal Number of Clusters with the Clustergram

Author: Aguirre Nathan D.
Davies Misty D.
Fluegemann Joseph K.
Publication date: 30/09/2011

Cluster analysis aids research in many different fields, from business to biology to aerospace. It consists of using statistical techniques to group objects in large sets of data into meaningful classes. However, this process of ordering data points presents much uncertainty because it involves several steps, many of which are subject to researcher judgment as well as inconsistencies depending on the specific data type and research goals. These steps include the method used to cluster the data, the variables on which the cluster analysis will be operating, the number of resulting clusters, and parts of the interpretation process. In most cases, the number of clusters must be guessed or estimated before employing the clustering method. Many remedies have been proposed, but none is unassailable and certainly not for all data types. Thus, the aim of current research for better techniques of determining the number of clusters is generally confined to demonstrating that the new technique excels other methods in performance for several disparate data types. Our research makes use of a new cluster-number-determination technique based on the clustergram: a graph that shows how the number of objects in the cluster and the cluster mean (the ordinate) change with the number of clusters (the abscissa). We use the features of the clustergram to make the best determination of the cluster-number

NASA Technical Reports Server

clues: An R Package for Nonparametric Clustering Based on Local Shrinking

Author: Fang Chang
Ross Lazarus
Ruben H. Zamar
Weiliang Qiu
Xiaogang Wang

Determining the optimal number of clusters appears to be a persistent and controversial issue in cluster analysis. Most existing R packages targeting clustering require the user to specify the number of clusters in advance. However, if this subjectively chosen number is far from optimal, clustering may produce seriously misleading results. In order to address this vexing problem, we develop the R package clues to automate and evaluate the selection of an optimal number of clusters, which is widely applicable in the field of clustering analysis. Package clues uses two main procedures, shrinking and partitioning, to estimate an optimal number of clusters by maximizing an index function, either the CH index or the Silhouette index, rather than relying on guessing a pre-specified number. Five agreement indices (Rand index, Hubert and Arabie's adjusted Rand index, Morey and Agresti's adjusted Rand index, Fowlkes and Mallows index and Jaccard index), which measure the degree of agreement between any two partitions, are also provided in clues. In addition to numerical evidence, clues also supplies a deeper insight into the partitioning process with trajectory plots.

Research Papers in Economics
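
The abstract above mentions choosing the number of clusters by maximizing the CH (Calinski-Harabasz) index. A minimal sketch of that selection step follows. It is written in plain Python rather than R, it does not use or reproduce the clues package, and the candidate partitions are hand-made toy data assumed for illustration. The index is the between-cluster sum of squares divided by (k - 1), over the within-cluster sum of squares divided by (n - k).

```python
def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def calinski_harabasz(clusters):
    """CH index of a partition given as a list of clusters (lists of points).
    Requires 1 < k < n and a non-zero within-cluster sum of squares."""
    points = [p for c in clusters for p in c]
    n, k = len(points), len(clusters)
    overall = centroid(points)
    centers = [centroid(c) for c in clusters]
    between = sum(len(c) * sq_dist(ctr, overall) for c, ctr in zip(clusters, centers))
    within = sum(sq_dist(p, ctr) for c, ctr in zip(clusters, centers) for p in c)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical candidate partitions of the same four 1-D points, keyed by k.
candidates = {
    2: [[(0.0,), (2.0,)], [(10.0,), (12.0,)]],
    3: [[(0.0,), (2.0,)], [(10.0,)], [(12.0,)]],
}
scores = {k: calinski_harabasz(c) for k, c in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

For this toy data the two-cluster partition scores 50.0 and the three-cluster partition 25.5, so k = 2 is selected.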

OptCluster : an R package for determining the optimal clustering algorithm and optimal number of clusters.

Author: Sekula Michael N.
Publication venue: ThinkIR: The University of Louisville's Institutional Repository
Publication date: 01/05/2015

Determining the best clustering algorithm and ideal number of clusters for a particular dataset is a fundamental difficulty in unsupervised clustering analysis. In biological research, data generated from Next Generation Sequencing technology and microarray gene expression data are becoming more and more common, so new tools and resources are needed to group such high dimensional data using clustering analysis. Different clustering algorithms can group data very differently. Therefore, there is a need to determine the best groupings in a given dataset using the most suitable clustering algorithm for that data. This paper presents the R package optCluster as an efficient way for users to evaluate up to ten clustering algorithms, ultimately determining the optimal algorithm and optimal number of clusters for a given set of data. The selected clustering algorithms are evaluated by as many as nine validation measures classified as “biological”, “internal”, or “stability”, and the final result is obtained through a weighted rank aggregation algorithm based on the calculated validation scores. Two examples using this package are presented, one with a microarray dataset and the other with an RNA-Seq dataset. These two examples highlight the capabilities of the optCluster package and demonstrate its usefulness as a tool in cluster analysis.

University of Louisville
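
The abstract above says the final choice comes from a weighted rank aggregation over validation scores but does not name the aggregation method. The sketch below substitutes a simple weighted Borda count to show the general idea. It is not the optCluster implementation, and the candidate labels, rankings, and weights are made up for illustration.

```python
def weighted_borda(rankings, weights):
    """Aggregate several best-to-worst rankings of the same candidates.
    Each candidate accumulates weight * position (0 = best); the lowest
    total wins. Ties are broken alphabetically for determinism."""
    totals = {}
    for ranking, weight in zip(rankings, weights):
        for position, candidate in enumerate(ranking):
            totals[candidate] = totals.get(candidate, 0.0) + weight * position
    return sorted(totals, key=lambda c: (totals[c], c))


# Hypothetical rankings of (algorithm, k) candidates by three validation measures.
rankings = [
    ["kmeans-3", "pam-3", "hierarchical-2"],
    ["pam-3", "kmeans-3", "hierarchical-2"],
    ["kmeans-3", "hierarchical-2", "pam-3"],
]
weights = [1.0, 1.0, 2.0]  # assumed importance of each measure
consensus = weighted_borda(rankings, weights)
print(consensus)
```

Here the weighted totals are 1 for kmeans-3, 5 for pam-3, and 6 for hierarchical-2, so the consensus ranking puts kmeans-3 first.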

K-MEANS CLUSTER COUNT OPTIMIZATION WITH SILHOUETTE INDEX VALIDATION AND DAVIES BOULDIN INDEX (CASE STUDY: COVERAGE OF PREGNANT WOMEN, CHILDBIRTH, AND POSTPARTUM HEALTH SERVICES IN INDONESIA IN 2020)

Author: Ispriyanti Dwi
Suryaningrum Fahlevi
Utami Iut Tri
Publication venue: Universitas Pattimura
Publication date: 11/06/2023

One cause of the increasing maternal mortality rate in Indonesia is the declining performance of maternal health services across Indonesian provinces. One way to address this decline is to determine in advance which provinces should be prioritized for services by grouping the 34 provinces of Indonesia. This study aims to obtain the best provincial grouping so that the right provinces can be prioritized. K-Means is a suitable method for grouping provinces because it is simple and easy to implement. Its disadvantage is that it is sensitive to the choice of the initial number of clusters, so Silhouette Index and Davies Bouldin Index validation is used to obtain the optimal number of clusters with stable and consistent results. This study used healthcare data for pregnant women, childbirth, and postpartum care, with K=2, 3, and 4 as the initial cluster numbers. Objects were grouped by similarity using Euclidean and Manhattan distances. The optimal number of clusters was K=2 using Manhattan distance, which gave the highest Silhouette Index value (0.658685) and the lowest Davies Bouldin Index value (0.3561214), meeting the criteria for the optimal clustering.

OJS UNPATTI Publication Center (Universitas Pattimura)
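
The validation step in the abstract above (pick the K with the highest Silhouette Index and the lowest Davies Bouldin Index, under Euclidean and Manhattan distance) can be sketched as follows. The data and candidate partitions are toy values assumed for illustration, not the study's provincial data, and no K-Means fitting is shown. Computing the Davies Bouldin Index with Manhattan distance around mean centroids is also an assumption of this sketch.

```python
from math import dist as euclidean


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def centroid(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def silhouette(clusters, metric):
    """Mean silhouette width; points in singleton clusters score 0."""
    scores = []
    for i, own in enumerate(clusters):
        for p in own:
            if len(own) == 1:
                scores.append(0.0)
                continue
            # metric(p, p) is 0, so summing over the whole cluster is safe.
            a = sum(metric(p, q) for q in own) / (len(own) - 1)
            b = min(
                sum(metric(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters) if j != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def davies_bouldin(clusters, metric):
    """Mean over clusters of the worst (scatter_i + scatter_j) / separation."""
    centers = [centroid(c) for c in clusters]
    scatter = [
        sum(metric(p, ctr) for p in c) / len(c) for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    worst = [
        max((scatter[i] + scatter[j]) / metric(centers[i], centers[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k


left = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Hypothetical candidate partitions keyed by K.
candidates = {
    2: [left, [(10, 0), (10, 1), (11, 0), (11, 1)]],
    3: [left, [(10, 0), (10, 1)], [(11, 0), (11, 1)]],
}
choice = {}
for name, metric in (("euclidean", euclidean), ("manhattan", manhattan)):
    by_sil = max(candidates, key=lambda k: silhouette(candidates[k], metric))
    by_db = min(candidates, key=lambda k: davies_bouldin(candidates[k], metric))
    choice[name] = (by_sil, by_db)
print(choice)
```

On this toy data both indices agree on K=2 under either distance. On real data the two indices or the two distances can disagree, which is why the study reports both.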

clues: An R Package for Nonparametric Clustering Based on Local Shrinking

Author: Chang Fang
Lazarus Ross
Qiu Weiliang
Wang Xiaogang
Zamar Ruben H.
Publication venue: Foundation for Open Access Statistics
Publication date: 01/02/2010

Determining the optimal number of clusters appears to be a persistent and controversial issue in cluster analysis. Most existing R packages targeting clustering require the user to specify the number of clusters in advance. However, if this subjectively chosen number is far from optimal, clustering may produce seriously misleading results. In order to address this vexing problem, we develop the R package clues to automate and evaluate the selection of an optimal number of clusters, which is widely applicable in the field of clustering analysis. Package clues uses two main procedures, shrinking and partitioning, to estimate an optimal number of clusters by maximizing an index function, either the CH index or the Silhouette index, rather than relying on guessing a pre-specified number. Five agreement indices (Rand index, Hubert and Arabie's adjusted Rand index, Morey and Agresti's adjusted Rand index, Fowlkes and Mallows index and Jaccard index), which measure the degree of agreement between any two partitions, are also provided in clues. In addition to numerical evidence, clues also supplies a deeper insight into the partitioning process with trajectory plots

Directory of Open Access Journals

Journal of Statistical Software
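
Three of the agreement indices named in the abstract above (Rand, Jaccard, and Fowlkes and Mallows) can be computed by counting pairs of objects that two partitions place together or apart. The sketch below uses plain Python rather than the clues package, and the label vectors are made-up examples.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_a, labels_b):
    """Count object pairs by whether each partition puts them in one cluster.
    Returns (both, only_a, only_b, neither)."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(labels_a, labels_b):
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    return (both + neither) / (both + only_a + only_b + neither)


def jaccard_index(labels_a, labels_b):
    # Undefined (division by zero) if neither partition groups any pair.
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    return both / (both + only_a + only_b)


def fowlkes_mallows(labels_a, labels_b):
    # Undefined if either partition consists only of singletons.
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    return both / sqrt((both + only_a) * (both + only_b))


a = [0, 0, 1, 1]  # hypothetical partition A
b = [0, 0, 1, 2]  # hypothetical partition B splits A's second cluster
print(rand_index(a, b), jaccard_index(a, b), fowlkes_mallows(a, b))
```

For these labels one pair is grouped by both partitions, one only by A, and four by neither, giving a Rand index of 5/6, a Jaccard index of 1/2, and a Fowlkes and Mallows index of 1/sqrt(2).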

CLUSTERIZATION OF REGION IN SOUTH SUMATERA BASED ON COVID-19 CASE DATA

Author: Eliyati Ning
Saragih Anita
Sukanda Dian Cahyawati
Publication venue: PATTIMURA UNIVERSITY
Publication date: 30/09/2023

Based on Covid-19 case data as of July 2022, South Sumatra Province ranked 15th highest out of 34 provinces in Indonesia, with 82,407 confirmed cases. This showed that the spread of Covid-19 in South Sumatra was still high. This study aimed to determine clusters of regions in South Sumatra based on Covid-19 case data. The regions were clustered using an agglomerative hierarchical method. The process began with standardizing the data, followed by calculating the similarity distance between objects, determining the optimal number of clusters using the Silhouette method, and finally performing the clustering analysis. The study found that the optimal number of clusters was two. The clustering process started with objects 2 and 4 because these two objects had the closest similarity distance. In conclusion, objects with the closest similarity distance (in one cluster) have the same data movement (fluctuation).

OJS UNPATTI Publication Center (Universitas Pattimura)

An Heterogeneous Population-Based Genetic Algorithm for Data Clustering

Author: Bedboudi Amina
Bouras Cherif
Kimour Mohamed Tahar
Publication venue: IAES Indonesia Section
Publication date: 01/09/2017

As a primary data mining method for knowledge discovery, clustering is a technique for classifying a dataset into groups of similar objects. K-means, the most popular method for data clustering, has the drawback of requiring the number of clusters and their initial centers to be provided by the user. In the literature, several methods have been proposed in the form of k-means variants, genetic algorithms, or combinations of the two for calculating the number of clusters and finding proper cluster centers. However, none of these solutions has provided satisfactory results, and determining the number of clusters and the initial centers is still the main challenge in clustering processes. In this paper we present an approach that automatically generates these parameters to achieve optimal clusters, using a modified genetic algorithm that operates on varied individual structures and uses a new crossover operator. Experimental results show that our modified genetic algorithm is a more efficient alternative to the existing approaches.

Indonesian Journal of Electrical Engineering and Informatics (IJEEI)