
    Relational visual cluster validity

    The assessment of cluster validity plays a very important role in cluster analysis. Most commonly used cluster validity methods are based on statistical hypothesis testing or on finding the best clustering scheme by computing a number of different cluster validity indices. A number of visual cluster validity methods have been produced to display the validity of clusters directly by mapping data into two- or three-dimensional space. However, these methods may lose too much information to correctly estimate the results of clustering algorithms. Although the visual cluster validity (VCV) method of Hathaway and Bezdek can successfully solve this problem, it can only be applied to object data, i.e. feature measurements. There are very few validity methods that can be used to analyze the validity of data for which only a similarity or dissimilarity relation exists – relational data. To tackle this problem, this paper presents a relational visual cluster validity (RVCV) method to assess the validity of clustering relational data. This is done by combining the results of the non-Euclidean relational fuzzy c-means (NERFCM) algorithm with a modification of the VCV method to produce a visual representation of cluster validity. RVCV can cluster complete and incomplete relational data and adds to visual cluster validity theory. Numeric examples using synthetic and real data are presented.
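    The visual idea underlying VCV (and RVCV) is that a dissimilarity matrix whose rows and columns are reordered by cluster membership should show dark diagonal blocks when the partition is valid. The following is a minimal sketch of that reordered-dissimilarity display in Python (using NumPy and Matplotlib); the function name and toy data are illustrative, and it does not reproduce the NERFCM step of the paper.

        # Reorder a relational (dissimilarity) matrix by cluster membership and
        # display it; compact dark diagonal blocks suggest a valid clustering.
        import numpy as np
        import matplotlib.pyplot as plt

        def reordered_dissimilarity_image(D, labels):
            """Group the rows/columns of D (n x n dissimilarities) by cluster."""
            order = np.argsort(labels, kind="stable")
            return D[np.ix_(order, order)]

        # toy relational data: pairwise distances between two separated groups
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (25, 2))])
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        labels = np.array([0] * 20 + [1] * 25)

        plt.imshow(reordered_dissimilarity_image(D, labels), cmap="gray")
        plt.title("Reordered dissimilarity image")
        plt.show()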

    A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm

    K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. In this paper, we first present an overview of these methods with an emphasis on their computational efficiency. We then compare eight commonly used linear time complexity initialization methods on a large and diverse collection of data sets using various performance criteria. Finally, we analyze the experimental results using non-parametric statistical tests and provide recommendations for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in fact strong alternatives to these methods. Comment: 17 pages, 1 figure, 7 tables.
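    As a concrete illustration of why seeding matters (not the paper's own benchmark code), scikit-learn's KMeans exposes both random and k-means++ initialization, so the effect of a single initialization on the final objective can be compared directly:

        # Compare one run of random seeding against one run of k-means++ seeding.
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs

        X, _ = make_blobs(n_samples=2000, centers=8, cluster_std=1.5, random_state=0)

        for init in ("random", "k-means++"):
            km = KMeans(n_clusters=8, init=init, n_init=1, random_state=0).fit(X)
            print(f"{init:>10}: inertia = {km.inertia_:.1f}, iterations = {km.n_iter_}")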

    Minimal Learning Machine: Theoretical Results and Clustering-Based Reference Point Selection

    The Minimal Learning Machine (MLM) is a nonlinear supervised approach based on learning a linear mapping between distance matrices computed in the input and output data spaces, where distances are calculated using a subset of points called reference points. Its simple formulation has attracted several recent works on extensions and applications. In this paper, we aim to address some open questions related to the MLM. First, we detail theoretical aspects that assure the interpolation and universal approximation capabilities of the MLM, which were previously only empirically verified. Second, we identify the task of selecting reference points as having major importance for the MLM's generalization capability. Several clustering-based methods for reference point selection in regression scenarios are then proposed and analyzed. Based on an extensive empirical evaluation, we conclude that the evaluated methods are both scalable and useful. Specifically, for a small number of reference points, the clustering-based methods outperformed the standard random selection of the original MLM formulation. Comment: 29 pages, accepted to JMLR.
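    For readers unfamiliar with the MLM, the compact sketch below shows the two ingredients mentioned above: a least-squares linear map between input- and output-space distance matrices, and an output estimate recovered by multilateration. Reference points are picked at random here, whereas the paper studies clustering-based selection; the function names and toy data are illustrative.

        import numpy as np
        from scipy.spatial.distance import cdist
        from scipy.optimize import minimize

        def mlm_fit(X, Y, n_ref=20, seed=0):
            rng = np.random.default_rng(seed)
            idx = rng.choice(len(X), n_ref, replace=False)   # random reference points
            Rx, Ry = X[idx], Y[idx]
            Dx, Dy = cdist(X, Rx), cdist(Y, Ry)              # input/output distance matrices
            B, *_ = np.linalg.lstsq(Dx, Dy, rcond=None)      # linear mapping between them
            return Rx, Ry, B

        def mlm_predict(x, Rx, Ry, B):
            d_hat = (cdist(x[None, :], Rx) @ B).ravel()      # estimated output-space distances
            obj = lambda y: np.sum((np.linalg.norm(y - Ry, axis=1) ** 2 - d_hat ** 2) ** 2)
            return minimize(obj, Ry.mean(axis=0)).x          # multilateration step

        # toy regression: y = sin(x)
        X = np.linspace(0, 6, 200)[:, None]
        Rx, Ry, B = mlm_fit(X, np.sin(X))
        print(mlm_predict(np.array([1.5]), Rx, Ry, B))       # close to sin(1.5)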

    Generalized Markov Chain Monte Carlo Initialization for Clustering Gaussian Mixtures Using K-means

    Gaussian mixtures are considered to be a good model of real-life data. Any clustering algorithm that can efficiently cluster such mixtures is expected to work well in practical applications dealing with real-life data. K-means is popular for such applications given its ease of implementation and scalability, yet it suffers from poor seeding. Moreover, if the Gaussian mixture has overlapping clusters, k-means is not able to separate them unless the initial conditions are good. K-means++ is a good seeding method but has high time complexity; it can be made fast by using Markov chain Monte Carlo sampling. This paper proposes a method that improves seed quality while retaining the speed of the sampling technique. The desired effects are demonstrated on several Gaussian mixtures.
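    The sampling idea referred to above can be sketched as follows: instead of the exact k-means++ pass that draws each new seed with probability proportional to its squared distance from the current seeds, a short Metropolis chain with a uniform proposal approximates that distribution. This is an illustrative K-MC^2-style sketch, not the paper's proposed method; the chain length is an arbitrary assumption.

        import numpy as np

        def mcmc_seeds(X, k, chain_len=200, seed=0):
            rng = np.random.default_rng(seed)
            centers = [X[rng.integers(len(X))]]              # first seed uniformly at random
            for _ in range(k - 1):
                x = X[rng.integers(len(X))]
                dx = min(np.sum((x - c) ** 2) for c in centers)
                for _ in range(chain_len):                   # Metropolis step, uniform proposal
                    y = X[rng.integers(len(X))]
                    dy = min(np.sum((y - c) ** 2) for c in centers)
                    if dx == 0 or rng.random() < dy / dx:    # accept with prob min(1, dy/dx)
                        x, dx = y, dy
                centers.append(x)
            return np.array(centers)

        X = np.vstack([np.random.default_rng(1).normal(m, 0.5, (300, 2)) for m in (0, 4, 8)])
        print(mcmc_seeds(X, 3))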

    An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations

    K-Means is one of the most used algorithms for data clustering and the usual clustering method for benchmarking. Despite its wide application it is well known that it suffers from a series of disadvantages, such as the positions of the initial clustering centres (centroids), which can greatly affect the clustering solution. Over the years many K-Means variations and initialisation techniques have been proposed, with different degrees of complexity. In this study we focus on common K-Means variations and deterministic initialisation techniques and show, first, that more sophisticated initialisation methods reduce or alleviate the need for complex K-Means variations and, second, that deterministic methods can achieve equivalent or better performance than stochastic methods. These conclusions are obtained through extensive benchmarking using different model data sets from various studies as well as clustering data sets.
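    One simple deterministic initialisation of the kind such studies compare is the maximin (farthest-point) rule: start from the point nearest the data mean and repeatedly add the point farthest from all centroids chosen so far. The sketch below is illustrative and not tied to the paper's exact protocol.

        import numpy as np

        def maximin_init(X, k):
            # first centroid: the point closest to the overall mean
            centroids = [X[np.argmin(np.linalg.norm(X - X.mean(axis=0), axis=1))]]
            for _ in range(k - 1):
                d = np.min(np.stack([np.linalg.norm(X - c, axis=1) for c in centroids]), axis=0)
                centroids.append(X[np.argmax(d)])            # farthest point from current centroids
            return np.array(centroids)

        # e.g. plug into scikit-learn: KMeans(n_clusters=8, init=maximin_init(X, 8), n_init=1)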

    Dark Quest. I. Fast and Accurate Emulation of Halo Clustering Statistics and Its Application to Galaxy Clustering

    We perform an ensemble of N-body simulations with 2048^3 particles for 101 flat wCDM cosmological models sampled based on a maximin-distance Sliced Latin Hypercube Design. By using the halo catalogs extracted at multiple redshifts in the range z = [0, 1.48], we develop Dark Emulator, which enables fast and accurate computations of the halo mass function, halo-matter cross-correlation, and halo auto-correlation as a function of halo mass, redshift, separation and cosmological model, based on principal component analysis and Gaussian process regression for the large-dimensional input and output data vectors. We assess the performance of the emulator using a validation set of N-body simulations that are not used in training the emulator. We show that, for typical halos hosting CMASS galaxies in the Sloan Digital Sky Survey, the emulator predicts the halo-matter cross-correlation, relevant for galaxy-galaxy weak lensing, with an accuracy better than 2%, and the halo auto-correlation, relevant for galaxy clustering, with an accuracy better than 4%. We give several demonstrations of the emulator. It can be used to study properties of halo mass density profiles such as the mass-concentration relation and splashback radius for different cosmologies. The emulator outputs can be combined with an analytical prescription of the halo-galaxy connection, such as the halo occupation distribution at the equation level, instead of using mock catalogs, to make accurate predictions of galaxy clustering statistics such as galaxy-galaxy weak lensing and the projected correlation function for any model within the wCDM cosmologies, in a few CPU seconds. Comment: 46 pages, 47 figures; version accepted for publication in ApJ.
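    The emulation strategy described above, dimensionality reduction of the simulation outputs followed by Gaussian process regression over the input parameters, can be illustrated with a toy sketch; the data, dimensions, and kernel defaults below are placeholders, not Dark Quest products.

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.gaussian_process import GaussianProcessRegressor

        rng = np.random.default_rng(0)
        params = rng.uniform(size=(101, 6))                   # e.g. 101 sampled parameter points
        outputs = np.sin(params @ rng.normal(size=(6, 50)))   # stand-in for binned statistics

        pca = PCA(n_components=5).fit(outputs)                # compress the output vectors
        coeffs = pca.transform(outputs)
        gps = [GaussianProcessRegressor().fit(params, coeffs[:, i]) for i in range(5)]

        def emulate(theta):
            """Predict the full output vector at a new parameter point theta."""
            c = np.array([gp.predict(theta[None, :])[0] for gp in gps])
            return pca.inverse_transform(c[None, :])[0]

        print(emulate(rng.uniform(size=6))[:5])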

    Maximin Designs for Computer Experiments.

    Decision processes are nowadays often facilitated by simulation tools. In the field of engineering, for example, such tools are used to simulate the behavior of products and processes. Simulation runs, however, are often very time-consuming, and, hence, the number of simulation runs allowed is limited in practice. The problem then is to determine which simulation runs to perform such that the maximal amount of information about the product or process is obtained. This problem is addressed in the first part of the thesis. It is proposed to use so-called maximin Latin hypercube designs and many new results for this class of designs are obtained. In the second part, the case of multiple interrelated simulation tools is considered and a framework to deal with such tools is introduced. Important steps in this framework are the construction and the use of coordination methods and of nested designs in order to control the dependencies present between the various simulation tools
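    A minimal way to see what a maximin Latin hypercube design optimises: among candidate Latin hypercube designs, prefer the one whose smallest pairwise distance between design points is largest. The random-search sketch below is purely illustrative; the constructions studied in the thesis are far more refined.

        import numpy as np
        from scipy.spatial.distance import pdist

        def random_lhd(n, d, rng):
            """Random Latin hypercube: one point per stratum in every dimension."""
            return (np.stack([rng.permutation(n) for _ in range(d)], axis=1) + 0.5) / n

        def maximin_lhd(n, d, tries=1000, seed=0):
            rng = np.random.default_rng(seed)
            designs = (random_lhd(n, d, rng) for _ in range(tries))
            return max(designs, key=lambda D: pdist(D).min())  # maximise the minimum distance

        print(maximin_lhd(10, 2))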