The Minkowski central partition as a pointer to a suitable distance exponent and consensus partitioning
The Minkowski weighted K-means (MWK-means) is a recently developed clustering algorithm capable of computing feature weights. The cluster-specific weights in MWK-means follow the intuitive idea that a feature with low variance should have a greater weight than a feature with high variance. The final clustering found by this algorithm depends on the selection of the Minkowski distance exponent. This paper explores the possibility of using the central Minkowski partition in the ensemble of all Minkowski partitions for selecting an optimal value of the Minkowski exponent. The central Minkowski partition also appears to be a good consensus partition. Furthermore, we discovered some striking correlation results between the Minkowski profile, defined as a mapping of the Minkowski exponent values into the average similarity values of the optimal Minkowski partitions, and the Adjusted Rand Index vectors resulting from the comparison of the obtained partitions to the ground truth. Our findings were confirmed by a series of computational experiments involving synthetic Gaussian clusters and real-world data.
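The weight update at the core of MWK-means can be sketched as follows. This is a minimal illustration under our own naming (the function and variable names are not from the paper): each cluster-specific weight is inversely related to the within-cluster Minkowski dispersion of its feature, so low-variance features receive higher weights.

```python
import numpy as np

def mwk_feature_weights(X, labels, centroids, p):
    """Cluster-specific feature weights in the MWK-means spirit:
    low within-cluster dispersion -> high weight.
    A sketch, not the authors' reference implementation."""
    k, m = centroids.shape
    W = np.zeros((k, m))
    e = 1.0 / (p - 1.0)  # exponent used in the weight update
    for j in range(k):
        pts = X[labels == j]
        # per-feature Minkowski-p dispersion within cluster j
        D = (np.abs(pts - centroids[j]) ** p).sum(axis=0) + 1e-12
        # w_jv = 1 / sum_u (D_jv / D_ju)^(1/(p-1))
        W[j] = 1.0 / (D ** e * (D ** -e).sum())
    return W  # each row sums to 1
```

By construction the weights of each cluster sum to one, so a feature with a smaller dispersion inside a cluster necessarily receives a larger share of the weight there.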
Identifying meaningful clusters in malware data
Finding meaningful clusters in drive-by-download malware data is a particularly difficult task. Malware data tends to contain overlapping clusters with wide variations of cardinality. This happens because there can be considerable similarity between malware samples (some are even said to belong to the same family), and these tend to appear in bursts. Clustering algorithms are usually applied to normalised data sets. However, the process of normalisation aims at setting features with different range values to have a similar contribution to the clustering. It does not favour more meaningful features over those that are less meaningful, an effect one should perhaps expect of the data pre-processing stage.
In this paper we introduce a method that deals precisely with the problem above. This is an iterative data pre-processing method that helps increase the separation between clusters. It does so by calculating the within-cluster degree of relevance of each feature, and then using these degrees as data rescaling factors. By repeating this until convergence, our malware data was separated into clear clusters, leading to a higher average silhouette width.
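The loop described above can be sketched as follows. This is our own simplified variant of the idea, not the paper's exact method: the relevance measure, normalisation, and stopping rule here are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def iterative_feature_rescaling(X, k, rounds=10, seed=0):
    """Sketch of iterative pre-processing: cluster, derive per-feature
    relevance from within-cluster dispersion, rescale, repeat."""
    X = X.astype(float).copy()
    labels = None
    for _ in range(rounds):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # within-cluster absolute dispersion of each feature
        disp = np.zeros(X.shape[1])
        for j in range(k):
            pts = X[km.labels_ == j]
            disp += np.abs(pts - km.cluster_centers_[j]).sum(axis=0)
        w = 1.0 / (disp + 1e-12)      # relevant features disperse less
        w *= X.shape[1] / w.sum()     # keep the overall scale stable
        X *= w                        # relevance used as rescaling factor
        if labels is not None and np.array_equal(labels, km.labels_):
            break                     # partition has stabilised
        labels = km.labels_
    return labels, silhouette_score(X, labels)
```

Normalising the weights so their mean is one keeps the data from shrinking toward zero across rounds; only the *relative* scale of the features changes.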
On k-means iterations and Gaussian clusters
Nowadays, k-means remains arguably the most popular clustering algorithm (Jain, 2010; Vouros et al., 2021). Two of its main properties are simplicity and speed in practice. Here, our main claim is that the average number of iterations k-means takes to converge (τ¯) is in fact very informative. We find this to be particularly interesting because τ¯ is always known when applying k-means but has never been, to our knowledge, used in the data analysis process. By experimenting with Gaussian clusters, we show that τ¯ is related to the structure of a data set under study. Data sets containing Gaussian clusters have a much lower τ¯ than those containing uniformly random data. In fact, we go considerably further and demonstrate a pattern of inverse correlation between τ¯ and the clustering quality. We illustrate the importance of our findings through two practical applications. First, we describe the cases in which τ¯ can be effectively used to identify irrelevant features present in a given data set or to improve the results of existing feature selection algorithms. Second, we show that there is a strong relationship between τ¯ and the number of clusters in a data set, and that this relationship can be used to find the true number of clusters it contains.
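The quantity τ¯ is readily available from standard tooling. The comparison described above can be sketched like this (the data sets and parameters are illustrative, not those of the paper):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def mean_kmeans_iterations(X, k, restarts=20):
    """tau-bar: average number of Lloyd iterations over random starts."""
    return np.mean([
        KMeans(n_clusters=k, n_init=1, random_state=s).fit(X).n_iter_
        for s in range(restarts)
    ])

# Well-separated Gaussian clusters vs. uniformly random points
X_gauss, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.5,
                        random_state=0)
X_unif = np.random.default_rng(0).uniform(-10, 10, size=(500, 2))
# tau-bar tends to be noticeably lower for the Gaussian data
```

Note that `n_init=1` is used so that `n_iter_` reflects a single run from one random initialisation; averaging over explicit restarts then gives τ¯.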
Learning Feature Weights for Density-Based Clustering
K-Means is the most popular and widely used clustering algorithm, but it cannot recover non-spherical clusters. DBSCAN is arguably the most popular algorithm for recovering arbitrarily shaped clusters, which is why this density-based algorithm is of great interest when tackling such weaknesses. One issue of concern is that DBSCAN requires two parameters and cannot recover clusters of widely varying density. The problem at the heart of this thesis is that during the clustering process DBSCAN takes all the available features and treats them equally, regardless of their degree of relevance in the data set, which can have negative impacts.
This thesis addresses the above problems by laying the foundations of feature-weighted density-based clustering. Specifically, the thesis introduces a density-based clustering algorithm using reverse nearest neighbours, DBSCANR, which requires fewer parameters than DBSCAN for recovering clusters. DBSCANR is based on the insight that in real-world data sets the densities of the arbitrarily shaped clusters to be recovered can be very different from each other.
The thesis extends DBSCANR to what is referred to as weighted DBSCANR, W-DBSCANR, by exploiting a feature-weighting technique to give different levels of relevance to the features in a data set. The thesis extends W-DBSCANR further by using the Minkowski metric, so that the weights can be interpreted as feature re-scaling factors, yielding MW-DBSCANR. Experiments on both artificial and real-world data sets demonstrate the superiority of our methods over DBSCAN-type algorithms. These weighted algorithms considerably reduce the impact of irrelevant features while recovering arbitrarily shaped clusters of different densities in high-dimensional data sets.
Within this context, this thesis incorporates a popular algorithm, feature selection using feature similarity, FSFS, into both W-DBSCANR and MW-DBSCANR, to address the problem of feature selection. This unsupervised feature selection algorithm makes use of feature clustering and feature similarity to reduce the number of features in a data set. With a similar aim, exploiting the concept of feature similarity, the thesis introduces a method, density-based feature selection using feature similarity, DBFSFS, which takes density-based cluster structure into consideration when reducing the number of features in a data set. The thesis then applies the developed method to real-world high-dimensional gene expression data sets. DBFSFS improves clustering recovery while substantially reducing the number of features in high-dimensional, low-sample-size data sets.
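The interpretation of Minkowski weights as feature re-scaling factors suggests a simple way to experiment with the idea using off-the-shelf tools. The sketch below is our own illustration, not the thesis' MW-DBSCANR (in particular, the weights here are given rather than learned): scaling feature v by w_v^(1/p) makes the plain Minkowski-p distance on the rescaled data equal the weighted distance sum_v w_v |x_v - y_v|^p.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def weighted_dbscan(X, weights, eps, min_samples, p=2):
    """DBSCAN under a feature-weighted Minkowski-p distance,
    implemented by rescaling the data (illustrative sketch)."""
    Xs = X * np.power(np.asarray(weights, dtype=float), 1.0 / p)
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="minkowski", p=p).fit_predict(Xs)

# Down-weighting an irrelevant noisy feature lets DBSCAN see the
# two clusters hidden in the relevant one.
rng = np.random.default_rng(0)
relevant = np.concatenate([rng.normal(0, 0.1, 100),
                           rng.normal(10, 0.1, 100)])
noise = rng.uniform(0, 10, 200)
X = np.column_stack([relevant, noise])
labels = weighted_dbscan(X, weights=[1.0, 1e-4], eps=0.3, min_samples=5)
```

With the noisy feature down-weighted, its contribution to every pairwise distance becomes negligible, so the two dense groups along the relevant feature emerge as separate clusters.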
An Inferential Framework for Network Hypothesis Tests: With Applications to Biological Networks
The analysis of weighted co-expression gene sets is gaining momentum in systems biology. In addition to substantial research directed toward inferring co-expression networks on the basis of microarray/high-throughput sequencing data, inferential methods are being developed to compare gene networks across one or more phenotypes. Common gene set hypothesis testing procedures are mostly confined to comparing average gene/node transcription levels between one or more groups and make limited use of additional network features, e.g., edges induced by significant partial correlations. Ignoring the gene set architecture disregards relevant network topological comparisons and can result in familiar
LIPIcs, Volume 274, ESA 2023, Complete Volume