9 research outputs found

An Enhanced Initialization Method to Find an Initial Center for K-modes Clustering

Author: S. Saranya, Dr. P. Jayanthi
Publication venue: 'Auricle Technologies, Pvt., Ltd.'
Publication date: 26/02/2017

Data mining extracts information from large amounts of data, and clustering is used to group objects with similar characteristics. The K-means clustering algorithm is very efficient for large data sets of numerical values; however, it does not work well for real-world data sets in which most attributes are categorical. For such data, the K-modes algorithm is used in place of K-means. The existing system considers the initialization of K-modes clustering from the viewpoint of outlier detection, which prevents several initial cluster centers from being drawn from the same cluster. It uses the Initial_Distance and Initial_Entropy algorithms, which apply a new weighting formula to calculate the degree of outlierness of each object, so that the chosen initial cluster centers are guaranteed not to be outliers. To improve performance further, a new modified distance metric, the weighted matching distance, is used to calculate the distance between two objects during initialization. In addition, a data pre-processing method is used to improve the quality of the data. Experiments carried out on several data sets from the UCI repository demonstrate the effectiveness of the initialization method in the proposed algorithm.

International Journal on Recent and Innovation Trends in Computing and Communication

Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering

Author: Yiyong Xiao, Changhao Huang, Jiaoying Huang, Ikou Kaku, Yuchun Xu
Publication venue: 'Elsevier BV'
Publication date: 01/06/2019

The conventional k-modes algorithm and its variants have been used extensively for categorical data clustering. However, these algorithms have some drawbacks: they can become trapped in local optima and are sensitive to the initial clusters/modes. Our numerical experiments even showed that, for some special datasets, the k-modes algorithm could not identify the optimal clustering regardless of the selection of the initial centers. In this paper, we develop an integer linear programming (ILP) approach to k-modes clustering that is independent of the initial solution and can directly obtain optimal results for small datasets. We also develop a heuristic algorithm, known as IPO-ILP-VNS, that implements iterative partial optimization in the ILP approach within a variable neighborhood search framework, in order to find near-optimal results for medium-sized and large datasets within a controlled computing time. Experiments on 38 datasets, comprising 27 synthesized small datasets and 11 known benchmark datasets from the UCI site, were carried out to test the proposed ILP approach and the IPO-ILP-VNS algorithm. The proposed methods outperformed the conventional k-modes algorithm and other enhanced k-modes algorithms in the literature, and produced new, improved results for 9 of the UCI benchmark datasets.
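
For orientation, the optimization problem that k-modes methods target is usually written as below. This is the standard textbook formulation, stated here as an assumption about the starting point; the abstract does not give the paper's ILP linearization, so it is not reproduced. The symbols $n$ (objects), $m$ (attributes), $k$ (clusters), $w_{il}$ (assignment indicators) and $q_l$ (modes) are notation introduced for this sketch.

```latex
\begin{aligned}
\min_{W,\,Q}\quad & \sum_{l=1}^{k}\sum_{i=1}^{n} w_{il}\, d(x_i, q_l) \\
\text{s.t.}\quad  & \sum_{l=1}^{k} w_{il} = 1, \qquad i = 1,\dots,n, \\
                  & w_{il} \in \{0, 1\}, \qquad i = 1,\dots,n,\; l = 1,\dots,k, \\
\text{where}\quad & d(x_i, q_l) = \sum_{j=1}^{m} \delta(x_{ij}, q_{lj}),
\qquad
\delta(a, b) =
\begin{cases}
0 & \text{if } a = b, \\
1 & \text{otherwise.}
\end{cases}
\end{aligned}
```

Because both the assignments $w_{il}$ and the modes $q_l$ are decision variables, alternating heuristics can stall in local optima, which is the drawback the abstract describes.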

Crossref

Aston Publications Explorer

An Efficient $k$-modes Algorithm for Clustering Categorical Datasets

Author: Karin S. Dorman, Ranjan Maitra
Publication date: 01/01/2020

Mining clusters from data is an important endeavor in many applications. The $k$-means method is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but does not apply for categorical-valued observations. The $k$-modes method addresses this lacuna by replacing the Euclidean with the Hamming distance and the means with the modes in the $k$-means objective function. We provide a novel, computationally efficient implementation of $k$-modes, called OTQT. We prove that OTQT finds updates to improve the objective function that are undetectable to existing $k$-modes algorithms. Although slightly slower per iteration due to algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. Thus, we recommend OTQT as the preferred, default algorithm for $k$-modes optimization.

Comment: 16 pages, 10 figures, 5 tables
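
The substitution described in this abstract (Hamming distance in place of Euclidean distance, modes in place of means) can be illustrated with a minimal Python sketch of the conventional alternating k-modes iteration. This is not OTQT, whose update rule the abstract does not specify; the function names `hamming`, `column_modes` and `k_modes`, and the choice to pass initial modes explicitly, are assumptions made for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Attribute-wise mode of a non-empty list of records.

    Ties are broken by first occurrence, which keeps the result deterministic.
    """
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, init_modes, max_iter=100):
    """Conventional alternating k-modes (assumes max_iter >= 1).

    Returns (labels, modes, cost), where cost is the total Hamming
    distance from each record to the mode of its cluster.
    """
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode under Hamming distance
        # (ties go to the lowest cluster index).
        new_labels = [
            min(range(len(modes)), key=lambda j: hamming(x, modes[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes are too
        labels = new_labels
        # Update step: recompute each non-empty cluster's mode.
        for j in range(len(modes)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    cost = sum(hamming(x, modes[lab]) for x, lab in zip(data, labels))
    return labels, modes, cost


if __name__ == "__main__":
    records = [("a", "x"), ("a", "x"), ("a", "y"),
               ("b", "z"), ("b", "z"), ("c", "z")]
    print(k_modes(records, init_modes=[records[0], records[3]]))
```

Like k-means, this loop only guarantees a local optimum that depends on `init_modes`, which is why initialization and improved update rules are recurring themes in the entries on this page.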

arXiv.org e-Print Archive

Digital Repository @ Iowa State University (ISU)

Tutorial: Multivariate Classification for Vibrational Spectroscopy in Biological Samples

Author: A de Juan, A Sakudo, A Savitzky, A Tfayli, AA Bunaciu, AL Pomerantsev, B Lorenz, BK Alsberg, C Cortes, C Pasquini, C Quintelas, C Scotter, CBY Cordella, CC Chang, CD Brown, CLM Morais, D Ballabio, D Cozzolino, D Naumann, DB Hibbert, ECY Li-Chan, EL Callery, F Allegrini, F Jiang, F Marini, FA de Lima, FL Martin, G Theophilou, G Weber, H Jin, H Martens, HD Li, HJ Butler, I Pence, J Jacyna, J Jaumot, J Mandel, J McCall, J Schmitt, J Trevisan, JG Kelly, JH Qu, K De Gussem, K Fawagreh, KH Liland, L Nørgaard, LA Reisner, LE Rodriguez-Saona, LFS Siqueira, LK Bittner, M Baranska, M Ferrés, M Paraskevaidi, M Radovic, MB Seasholtz, MCD Santos, MF Buitrago, MJ Baker, MJ Warrens, MR de Almeida, N Prieto, NF Pérez, O Ibrahim, P Bassan, P Geladi, P Meksiarun, P Zarnowiec, PJ Rousseeuw, Q Hu, Q Yang, R Barnes, R Bro, R Jing, R Karoui, R Weiss, RAV Rossel, RG Brereton, RM Jarvis, RM Wallace, RW Kennard, S De Bruyne, S de Jong, S Jones, S Stöckel, S Wold, SA Strola, SFC Soares, SJ Dixon, T Cover, W Kiefer, W Wu, WFC Rocha, Y LeCun, YV Zontov, Z Movasaghi
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 17/06/2020

Vibrational spectroscopy techniques, such as Fourier-transform infrared (FTIR) and Raman spectroscopy, have been successful methods for studying the interaction of light with biological materials and facilitating novel cell biology analysis. Spectrochemical analysis is very attractive in disease screening and diagnosis, microbiological studies and forensic and environmental investigations because of its low cost, minimal sample preparation, non-destructive nature and substantially accurate results. However, there is now an urgent need for multivariate classification protocols that allow biologically derived spectrochemical data to be analyzed with accurate and reliable results. Multivariate classification comprises discriminant analysis and class-modeling techniques in which multiple spectral variables are analyzed in conjunction to distinguish unknown samples and assign them to pre-defined groups. The requirement for such protocols is demonstrated by the fact that applications of deep-learning algorithms to complex datasets are increasingly recognized as critical for extracting important information and visualizing it in a readily interpretable form. Here, we provide a tutorial for multivariate classification analysis of vibrational spectroscopy data (FTIR, Raman and near-IR), highlighting a series of critical steps such as preprocessing, data selection, feature extraction, classification and model validation. This is an essential step toward the construction of a practical spectrochemical analysis model for biological analysis in real-world applications, where fast, accurate and reliable classification models are fundamental.

CLoK

Crossref

On Parallelization of Categorical Data Clustering

Author: Bahareh Badiei
Publication date: 08/05/2023

We study parallelization of categorical data clustering algorithms on an MPI platform. Clustering such data has been a daunting task even for sequential algorithms, mainly due to the challenges in finding suitable similarity/distance measures. We propose a parallel version of the k-modes algorithm, called PV3, which maintains the same clustering quality as the sequential approach while achieving reasonable speed-ups. PV3 is programmed to ensure deterministic processing in a parallel environment. To produce better clustering results, we then develop an initialization method called the Revised Density Method (RDM), based on the notion of density. Additionally, we develop variants of RDM to further enhance its performance. We then study effective ways to parallelize RDM and its variants. To further exploit parallelism opportunities, we develop an Ensemble Parallelizing Process (EPP) framework. This framework can be used with any desired initialization/clustering algorithms at different levels of parallelism. Using our different RDM initialization techniques along with the PV3 algorithm in the EPP framework, we then build an RDM realization of EPP, called RDM EPP. The results of our numerous experiments on benchmark categorical datasets indicate that the quality metric of RDM EPP is among the top three of the sequential k-modes-based clustering algorithms. In terms of speed-up, the results indicate that RDM EPP is 7 times faster for some datasets, though much larger datasets are required for a more comprehensive scalability study of RDM EPP.
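
The idea of parallelizing k-modes while keeping results identical to a sequential run can be illustrated with a small Python sketch of a chunked assignment step. This is not PV3 and does not use MPI; it is a shared-memory illustration under stated assumptions, and the names `assign_chunk` and `parallel_assign` are hypothetical. Determinism comes from two choices: ties go to the lowest cluster index, and chunk results are concatenated in input order.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_chunk(chunk, modes):
    """Label each record in a chunk with the index of its nearest mode.

    Ties are broken toward the lowest cluster index, so the outcome
    does not depend on how the data were split.
    """
    return [
        min(range(len(modes)), key=lambda j: hamming(x, modes[j]))
        for x in chunk
    ]


def parallel_assign(data, modes, workers=4):
    """Split the records into contiguous chunks and label them concurrently.

    Executor.map yields results in the order the chunks were submitted,
    so the labels match a sequential pass exactly.
    """
    size = max(1, -(-len(data) // workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(assign_chunk, chunks, [modes] * len(chunks))
        return [label for part in parts for label in part]


if __name__ == "__main__":
    records = [("a", "x"), ("a", "x"), ("a", "y"),
               ("b", "z"), ("b", "z"), ("c", "z")]
    print(parallel_assign(records, [("a", "x"), ("b", "z")], workers=3))
```

Threads in CPython will not give the speed-ups reported in the abstract for this CPU-bound loop; the sketch only shows the data decomposition and the ordering discipline that keep a parallel assignment step deterministic.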

Concordia University Research Repository

Novel chemometric approaches towards handling biospectroscopy datasets

Author: Camilo De Lelis Medeiros-De-Morais

Chemometrics allows one to identify chemical patterns using spectrochemical information from biological materials, such as tissues and biofluids. This is of fundamental importance for overcoming limitations of traditional bioanalytical analysis, such as the need for laborious and extremely invasive procedures, high consumption of reagents, and expensive instrumentation. In biospectroscopy, a beam of light, usually in the infrared region, is projected onto the surface of a biological sample and, as a result, a chemical signature is generated containing the vibrational information of most of the molecules in that material. This can be performed in a single-spectrum or hyperspectral imaging fashion, where a spectrum is generated for each position (pixel) on the surface of a biological material segment, allowing spatial and spectrochemical information to be extracted simultaneously. As an advantage, these methodologies are non-destructive, relatively low-cost, and require minimal sample preparation. However, biospectroscopy generates large datasets containing complex spectrochemical signatures. These datasets are processed by computational tools in order to resolve their signal complexity and provide useful information for decision making, such as the identification of clustering patterns that distinguish disease samples from healthy control samples; differentiation of tumour grades; prediction of the categories of unknown samples; or identification of key molecular fragments (biomarkers) associated with the appearance of certain diseases, such as cancer. In this PhD thesis, new computational tools are developed in order to improve the processing of bio-spectrochemical data, providing better clinical outcomes for both spectral and hyperspectral datasets.

CLoK