
    Relational visual cluster validity

    The assessment of cluster validity plays a very important role in cluster analysis. Most commonly used cluster validity methods are based on statistical hypothesis testing or on finding the best clustering scheme by computing a number of different cluster validity indices. A number of visual cluster validity methods have been produced to display the validity of clusters directly by mapping data into two- or three-dimensional space. However, these methods may lose too much information to correctly estimate the results of clustering algorithms. Although the visual cluster validity (VCV) method of Hathaway and Bezdek can successfully solve this problem, it can only be applied to object data, i.e. feature measurements. There are very few validity methods that can be used to analyze the validity of data for which only a similarity or dissimilarity relation exists – relational data. To tackle this problem, this paper presents a relational visual cluster validity (RVCV) method to assess the validity of clustering relational data. This is done by combining the results of the non-Euclidean relational fuzzy c-means (NERFCM) algorithm with a modification of the VCV method to produce a visual representation of cluster validity. RVCV can cluster complete and incomplete relational data and adds to visual cluster validity theory. Numeric examples using synthetic and real data are presented.
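    The visual idea underlying VCV (and RVCV) is that a dissimilarity matrix whose rows and columns are reordered by cluster membership should show dark diagonal blocks when the partition is valid. The following is a minimal sketch of that reordered-dissimilarity display in Python (using NumPy and Matplotlib); the function name and toy data are illustrative, and it does not reproduce the NERFCM step of the paper.

        # Reorder a relational (dissimilarity) matrix by cluster membership and
        # display it; compact dark diagonal blocks suggest a valid clustering.
        import numpy as np
        import matplotlib.pyplot as plt

        def reordered_dissimilarity_image(D, labels):
            """Group the rows/columns of D (n x n dissimilarities) by cluster."""
            order = np.argsort(labels, kind="stable")
            return D[np.ix_(order, order)]

        # toy relational data: pairwise distances between two separated groups
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (25, 2))])
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        labels = np.array([0] * 20 + [1] * 25)

        plt.imshow(reordered_dissimilarity_image(D, labels), cmap="gray")
        plt.title("Reordered dissimilarity image")
        plt.show()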

    A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm

    K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. In this paper, we first present an overview of these methods with an emphasis on their computational efficiency. We then compare eight commonly used linear time complexity initialization methods on a large and diverse collection of data sets using various performance criteria. Finally, we analyze the experimental results using non-parametric statistical tests and provide recommendations for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in fact strong alternatives to these methods. Comment: 17 pages, 1 figure, 7 tables.
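    As a concrete illustration of why seeding matters (not the paper's own benchmark code), scikit-learn's KMeans exposes both random and k-means++ initialization, so the effect of a single initialization on the final objective can be compared directly:

        # Compare one run of random seeding against one run of k-means++ seeding.
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs

        X, _ = make_blobs(n_samples=2000, centers=8, cluster_std=1.5, random_state=0)

        for init in ("random", "k-means++"):
            km = KMeans(n_clusters=8, init=init, n_init=1, random_state=0).fit(X)
            print(f"{init:>10}: inertia = {km.inertia_:.1f}, iterations = {km.n_iter_}")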

    Minimal Learning Machine: Theoretical Results and Clustering-Based Reference Point Selection

    The Minimal Learning Machine (MLM) is a nonlinear supervised approach based on learning a linear mapping between distance matrices computed in the input and output data spaces, where distances are calculated using a subset of points called reference points. Its simple formulation has attracted several recent works on extensions and applications. In this paper, we aim to address some open questions related to the MLM. First, we detail theoretical aspects that assure the interpolation and universal approximation capabilities of the MLM, which were previously only empirically verified. Second, we identify the task of selecting reference points as having major importance for the MLM's generalization capability. Several clustering-based methods for reference point selection in regression scenarios are then proposed and analyzed. Based on an extensive empirical evaluation, we conclude that the evaluated methods are both scalable and useful. Specifically, for a small number of reference points, the clustering-based methods outperformed the standard random selection of the original MLM formulation. Comment: 29 pages, accepted to JMLR.
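    For readers unfamiliar with the MLM, the compact sketch below shows the two ingredients mentioned above: a least-squares linear map between input- and output-space distance matrices, and an output estimate recovered by multilateration. Reference points are picked at random here, whereas the paper studies clustering-based selection; the function names and toy data are illustrative.

        import numpy as np
        from scipy.spatial.distance import cdist
        from scipy.optimize import minimize

        def mlm_fit(X, Y, n_ref=20, seed=0):
            rng = np.random.default_rng(seed)
            idx = rng.choice(len(X), n_ref, replace=False)   # random reference points
            Rx, Ry = X[idx], Y[idx]
            Dx, Dy = cdist(X, Rx), cdist(Y, Ry)              # input/output distance matrices
            B, *_ = np.linalg.lstsq(Dx, Dy, rcond=None)      # linear mapping between them
            return Rx, Ry, B

        def mlm_predict(x, Rx, Ry, B):
            d_hat = (cdist(x[None, :], Rx) @ B).ravel()      # estimated output-space distances
            obj = lambda y: np.sum((np.linalg.norm(y - Ry, axis=1) ** 2 - d_hat ** 2) ** 2)
            return minimize(obj, Ry.mean(axis=0)).x          # multilateration step

        # toy regression: y = sin(x)
        X = np.linspace(0, 6, 200)[:, None]
        Rx, Ry, B = mlm_fit(X, np.sin(X))
        print(mlm_predict(np.array([1.5]), Rx, Ry, B))       # close to sin(1.5)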

    Generalized Markov Chain Monte Carlo Initialization for Clustering Gaussian Mixtures Using K-means

    Gaussian mixtures are considered to be a good model of real-life data. Any clustering algorithm that can efficiently cluster such mixtures is expected to work well in practical applications dealing with real-life data. K-means is popular for such applications given its ease of implementation and scalability, yet it suffers from poor seeding. Moreover, if the Gaussian mixture has overlapping clusters, k-means is not able to separate them unless the initial conditions are good. K-means++ is a good seeding method but has high time complexity; it can be made fast by using Markov chain Monte Carlo sampling. This paper proposes a method that improves seed quality while retaining the speed of the sampling technique. The desired effects are demonstrated on several Gaussian mixtures.
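    The sampling idea referred to above can be sketched as follows: instead of the exact k-means++ pass that draws each new seed with probability proportional to its squared distance from the current seeds, a short Metropolis chain with a uniform proposal approximates that distribution. This is an illustrative K-MC^2-style sketch, not the paper's proposed method; the chain length is an arbitrary assumption.

        import numpy as np

        def mcmc_seeds(X, k, chain_len=200, seed=0):
            rng = np.random.default_rng(seed)
            centers = [X[rng.integers(len(X))]]              # first seed uniformly at random
            for _ in range(k - 1):
                x = X[rng.integers(len(X))]
                dx = min(np.sum((x - c) ** 2) for c in centers)
                for _ in range(chain_len):                   # Metropolis step, uniform proposal
                    y = X[rng.integers(len(X))]
                    dy = min(np.sum((y - c) ** 2) for c in centers)
                    if dx == 0 or rng.random() < dy / dx:    # accept with prob min(1, dy/dx)
                        x, dx = y, dy
                centers.append(x)
            return np.array(centers)

        X = np.vstack([np.random.default_rng(1).normal(m, 0.5, (300, 2)) for m in (0, 4, 8)])
        print(mcmc_seeds(X, 3))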

    An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations

    K-Means is one of the most used algorithms for data clustering and the usual clustering method for benchmarking. Despite its wide application it is well known that it suffers from a series of disadvantages, such as the positions of the initial clustering centres (centroids), which can greatly affect the clustering solution. Over the years many K-Means variations and initialisation techniques have been proposed, with different degrees of complexity. In this study we focus on common K-Means variations and deterministic initialisation techniques and show, first, that more sophisticated initialisation methods reduce or alleviate the need for complex K-Means variations and, second, that deterministic methods can achieve equivalent or better performance than stochastic methods. These conclusions are obtained through extensive benchmarking using different model data sets from various studies as well as clustering data sets.
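    One simple deterministic initialisation of the kind such studies compare is the maximin (farthest-point) rule: start from the point nearest the data mean and repeatedly add the point farthest from all centroids chosen so far. The sketch below is illustrative and not tied to the paper's exact protocol.

        import numpy as np

        def maximin_init(X, k):
            # first centroid: the point closest to the overall mean
            centroids = [X[np.argmin(np.linalg.norm(X - X.mean(axis=0), axis=1))]]
            for _ in range(k - 1):
                d = np.min(np.stack([np.linalg.norm(X - c, axis=1) for c in centroids]), axis=0)
                centroids.append(X[np.argmax(d)])            # farthest point from current centroids
            return np.array(centroids)

        # e.g. plug into scikit-learn: KMeans(n_clusters=8, init=maximin_init(X, 8), n_init=1)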

    Dark Quest. I. Fast and Accurate Emulation of Halo Clustering Statistics and Its Application to Galaxy Clustering

    We perform an ensemble of N-body simulations with 2048^3 particles for 101 flat wCDM cosmological models sampled based on a maximin-distance Sliced Latin Hypercube Design. By using the halo catalogs extracted at multiple redshifts in the range z = [0, 1.48], we develop Dark Emulator, which enables fast and accurate computations of the halo mass function, halo-matter cross-correlation, and halo auto-correlation as a function of halo mass, redshift, separation and cosmological model, based on principal component analysis and Gaussian process regression for the large-dimensional input and output data vectors. We assess the performance of the emulator using a validation set of N-body simulations that are not used in training the emulator. We show that, for typical halos hosting CMASS galaxies in the Sloan Digital Sky Survey, the emulator predicts the halo-matter cross-correlation, relevant for galaxy-galaxy weak lensing, with an accuracy better than 2%, and the halo auto-correlation, relevant for galaxy clustering, with an accuracy better than 4%. We give several demonstrations of the emulator. It can be used to study properties of halo mass density profiles such as the mass-concentration relation and splashback radius for different cosmologies. The emulator outputs can be combined with an analytical prescription of the halo-galaxy connection, such as the halo occupation distribution at the equation level, instead of using mock catalogs, to make accurate predictions of galaxy clustering statistics such as galaxy-galaxy weak lensing and the projected correlation function for any model within the wCDM cosmologies, in a few CPU seconds. Comment: 46 pages, 47 figures; version accepted for publication in ApJ.
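    The emulation strategy described above, dimensionality reduction of the simulation outputs followed by Gaussian process regression over the input parameters, can be illustrated with a toy sketch; the data, dimensions, and kernel defaults below are placeholders, not Dark Quest products.

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.gaussian_process import GaussianProcessRegressor

        rng = np.random.default_rng(0)
        params = rng.uniform(size=(101, 6))                   # e.g. 101 sampled parameter points
        outputs = np.sin(params @ rng.normal(size=(6, 50)))   # stand-in for binned statistics

        pca = PCA(n_components=5).fit(outputs)                # compress the output vectors
        coeffs = pca.transform(outputs)
        gps = [GaussianProcessRegressor().fit(params, coeffs[:, i]) for i in range(5)]

        def emulate(theta):
            """Predict the full output vector at a new parameter point theta."""
            c = np.array([gp.predict(theta[None, :])[0] for gp in gps])
            return pca.inverse_transform(c[None, :])[0]

        print(emulate(rng.uniform(size=6))[:5])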

    Maximin Designs for Computer Experiments.

    Decision processes are nowadays often facilitated by simulation tools. In the field of engineering, for example, such tools are used to simulate the behavior of products and processes. Simulation runs, however, are often very time-consuming, and, hence, the number of simulation runs allowed is limited in practice. The problem then is to determine which simulation runs to perform such that the maximal amount of information about the product or process is obtained. This problem is addressed in the first part of the thesis. It is proposed to use so-called maximin Latin hypercube designs and many new results for this class of designs are obtained. In the second part, the case of multiple interrelated simulation tools is considered and a framework to deal with such tools is introduced. Important steps in this framework are the construction and the use of coordination methods and of nested designs in order to control the dependencies present between the various simulation tools
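    A minimal way to see what a maximin Latin hypercube design optimises: among candidate Latin hypercube designs, prefer the one whose smallest pairwise distance between design points is largest. The random-search sketch below is purely illustrative; the constructions studied in the thesis are far more refined.

        import numpy as np
        from scipy.spatial.distance import pdist

        def random_lhd(n, d, rng):
            """Random Latin hypercube: one point per stratum in every dimension."""
            return (np.stack([rng.permutation(n) for _ in range(d)], axis=1) + 0.5) / n

        def maximin_lhd(n, d, tries=1000, seed=0):
            rng = np.random.default_rng(seed)
            designs = (random_lhd(n, d, rng) for _ in range(tries))
            return max(designs, key=lambda D: pdist(D).min())  # maximise the minimum distance

        print(maximin_lhd(10, 2))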