Fuzzy clustering of univariate and multivariate time series by genetic multiobjective optimization
Given a set of time series, it is of interest to discover subsets that share similar properties. For instance, this may be useful for identifying and estimating a single model that fits several time series at once, instead of performing the usual identification and estimation steps for each one. Moreover, time series in the same cluster are related with respect to the measures assumed for cluster analysis and are suitable for building multivariate time series models. Though many approaches to clustering time series exist, the most effective methods in this view rely on choosing features relevant to the problem at hand and seeking clusters according to their measurements, for instance the autoregressive coefficients, spectral measures, or the eigenvectors of the covariance matrix. Some new indexes based on goodness-of-fit criteria are proposed in this paper for fuzzy clustering of multivariate time series. A general-purpose fuzzy clustering algorithm may be used to estimate the proper cluster structure according to internal criteria of cluster validity. Such indexes are known to measure definite, often conflicting cluster properties: compactness or connectedness, for instance, or distribution, orientation, size, and shape. It is argued that multiobjective optimization supported by genetic algorithms is a most effective choice in such a difficult context. In this paper we use the Xie-Beni index and the c-means functional as objective functions to evaluate cluster validity in a multiobjective optimization framework. The concept of Pareto optimality in multiobjective genetic algorithms is used to evolve a set of potential solutions towards a set of optimal non-dominated solutions. Genetic algorithms are well suited for difficult optimization problems where the objective functions lack good mathematical properties such as continuity, differentiability, or convexity.
In addition, genetic algorithms, as population-based methods, may yield a complete Pareto front at each step of the iterative evolutionary procedure. The method is illustrated by means of a set of real data and an artificial multivariate time series data set.
Keywords: Fuzzy clustering, Internal criteria of cluster validity, Genetic algorithms, Multiobjective optimization, Time series, Pareto optimality
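The Xie-Beni index used as one of the objective functions above is the ratio of the fuzzy compactness of the partition to the minimal separation between cluster centres. A minimal NumPy sketch of that computation (the fuzzifier `m=2` and the matrix shapes are conventional choices, not taken from the paper):

```python
import numpy as np

def xie_beni(X, V, U, m=2.0):
    """Xie-Beni validity index: fuzzy compactness / (n * minimal centre separation).

    X : (n, d) data, V : (c, d) cluster centres, U : (c, n) fuzzy memberships.
    Lower values indicate compact, well-separated clusters.
    """
    n = X.shape[0]
    # compactness: membership-weighted squared distances of points to centres
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)      # (c, n)
    compactness = ((U ** m) * d2).sum()
    # separation: smallest squared distance between two distinct centres
    cd2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(cd2, np.inf)
    return compactness / (n * cd2.min())
```

In the multiobjective setting this value would be minimized jointly with the c-means functional, with Pareto dominance deciding which candidate partitions survive each generation.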
Comparative cluster labelling involving external text sources
Giving clear, straightforward names to the individual groups produced by clustering is essential to making the results usable. This is especially so when clustering is the actual outcome of the analysis and not just a tool for data preparation. In this case, the underlying concept of each cluster is what makes the result meaningful and useful. However, a cluster comes alive in the investigator's mind only once it can be defined or described in words. The method introduced in this paper aims to facilitate and partly automate this verbal characterisation process. An external text database is joined to the clustered objects, adding new, previously unused features to the data set. Clusters are then described by labels produced by text mining analytics. The validity of the clustering can be characterised by the shape of the resulting word cloud.
On-line evolving fuzzy clustering
In this paper, a novel on-line evolving fuzzy clustering method, called EFCM, is presented that extends the evolving clustering method (ECM) of Kasabov and Song (2002). Since it is an on-line algorithm, the fuzzy membership matrix of the data is updated whenever an existing cluster expands or a new cluster is formed. EFCM does not need the number of clusters to be pre-defined. The algorithm is tested on several benchmark data sets, such as Iris, Wine, Glass, E. coli, Yeast, and Italian olive oils. EFCM attains a lower objective function value than both ECM and fuzzy c-means, and it is significantly faster (by several orders of magnitude) than the off-line batch-mode clustering algorithms. A methodology is also proposed for using the Xie-Beni cluster validity measure to optimize the number of clusters. © 2007 IEEE
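The core on-line behaviour described above (clusters expand to absorb nearby points or a new cluster is created) can be illustrated with a minimal one-pass sketch. This is a simplified assumption in the spirit of ECM/EFCM, not the published update rule: the threshold `dthr` and the running-mean centre update stand in for the paper's radius and membership logic.

```python
import numpy as np

def online_cluster(stream, dthr=1.0):
    """One-pass clustering sketch: each point either joins (and shifts)
    the nearest existing centre, if within dthr, or seeds a new cluster.
    A simplified illustration, not the EFCM algorithm itself."""
    centres, counts = [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if centres:
            d = [np.linalg.norm(x - c) for c in centres]
            i = int(np.argmin(d))
            if d[i] <= dthr:
                counts[i] += 1
                # incremental running-mean update of the winning centre
                centres[i] = centres[i] + (x - centres[i]) / counts[i]
                continue
        centres.append(x)   # no centre close enough: create a new cluster
        counts.append(1)
    return np.array(centres)
```

In the full method, a fuzzy membership matrix over these evolving centres would be refreshed after every expansion or creation event.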
Clustering in relational data and ontologies
Ph.D. dissertation, University of Missouri--Columbia, 2010. Dissertation advisor: Dr. James M. Keller. The entire thesis text is included in the research.pdf file; the official abstract appears in the short.pdf file; a non-technical public abstract appears in the public.pdf file. This dissertation studies the problem of clustering objects represented by relational data. This is a pertinent problem, as many real-world data sets can only be represented by relational data, for which object-based clustering algorithms are not designed. Relational data are encountered in many fields, including biology, management, industrial engineering, and the social sciences. Unlike numerical object data, which represent an object by a set of feature values (e.g. height, weight, shoe size), relational object data are the numerical values of (dis)similarity between objects. For this reason, conventional cluster analysis methods such as k-means and fuzzy c-means cannot be used directly with relational data. I focus on three main problems of cluster analysis of relational data: (i) tendency prior to clustering -- how many clusters are there?; (ii) partitioning of objects -- which objects belong to which cluster?; and (iii) validity of the resultant clusters -- are the partitions "good"? Analyses included in this dissertation prove that the Visual Assessment of cluster Tendency (VAT) algorithm has a direct relation to single-linkage hierarchical clustering and Dunn's cluster validity index. These analyses are important to the development of two novel clustering algorithms, CLODD (CLustering in Ordered Dissimilarity Data) and ReSL (Rectangular Single-Linkage clustering). Last, this dissertation addresses clustering in ontologies; examples include the Gene Ontology, the MeSH ontology, patient medical records, and web documents.
I apply an extension to the Self-Organizing Map (SOM) to produce a new algorithm, the OSOM (Ontological Self-Organizing Map). OSOM provides visualization and linguistic summarization of ontology-based data. Includes bibliographical references.
Segmentation of Colour Images by Modified Mountain Clustering
Segmentation of colour images is an important issue in various machine vision and image processing applications. Though clustering techniques have been in vogue for many years, they have not been very effective because of problems such as selecting the number of clusters. This problem is tackled here by coupling a validity measure with a new clustering technique. The method treats each point in the data set, which is the map of all possible colour combinations in the given image, as a potential cluster centre and estimates its potential with respect to the other data elements. First, the point with the maximum potential is taken as a cluster centre, and its effect is then removed from the other points of the data set. This procedure is repeated to determine further cluster centres. At the same time, the compactness and the minimum separation amongst all the cluster centres are computed, along with the validity function as the ratio of these quantities. The validity function can be used to choose the number of clusters. The technique is compared with the fuzzy c-means technique, and results are shown for a sample colour image.
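The potential-based centre selection described above follows the mountain/subtractive-clustering scheme: every point accumulates potential from its neighbours, the highest-potential point becomes a centre, and its influence is subtracted before the next selection. A minimal sketch, where the neighbourhood radii `ra` and `rb` are conventional assumed parameters:

```python
import numpy as np

def subtractive_centres(X, n_centres, ra=1.0, rb=1.5):
    """Sketch of subtractive (mountain) clustering centre selection.

    Each point's potential is the sum of Gaussian kernels over all points;
    after a centre is picked, nearby potential is suppressed so the next
    pick lands in a different dense region. ra/rb are assumed radii."""
    alpha, beta = 4.0 / ra**2, 4.0 / rb**2
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # pairwise sq. dist.
    P = np.exp(-alpha * d2).sum(axis=1)                       # potential per point
    centres = []
    for _ in range(n_centres):
        k = int(np.argmax(P))
        centres.append(X[k])
        P = P - P[k] * np.exp(-beta * d2[:, k])               # suppress neighbours
    return np.array(centres)
```

Running the loop for increasing `n_centres` and tracking the compactness-to-separation ratio described in the abstract would then identify the number of clusters.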
Clustering performance analysis using a new correlation-based cluster validity index
There are various cluster validity indices used for evaluating clustering
results. One of the main objectives of using these indices is to seek the
optimal unknown number of clusters. Some indices work well for clusters with
different densities, sizes, and shapes. Yet, one shared weakness of those
validity indices is that they often provide only one optimal number of
clusters. That number is unknown in real-world problems, and there might be
more than one possible option. We develop a new cluster validity index based on
a correlation between an actual distance between a pair of data points and a
centroid distance of clusters that the two points occupy. Our proposed index
consistently yields several local peaks, overcoming the previously stated
weakness. Several experiments in different scenarios, including UCI real-world
data sets, have been conducted to compare the proposed validity index with
several well-known ones. An R package related to this new index called NCvalid
is available at https://github.com/nwiroonsri/NCvalid.
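The core idea of the correlation-based index can be sketched directly: for every pair of points, compare their actual distance with the distance between the centroids of the clusters they occupy, and measure the Pearson correlation between the two. This is an illustration of the idea only, not the NCvalid implementation:

```python
import numpy as np

def correlation_index(X, labels):
    """Pearson correlation between pairwise point distances and the
    distances between the centroids of the clusters each pair occupies.
    High correlation suggests the partition reflects the geometry.
    A sketch of the concept, not the NCvalid package's index."""
    k = int(labels.max()) + 1
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    i, j = np.triu_indices(len(X), 1)                 # all unordered pairs
    point_d = np.linalg.norm(X[i] - X[j], axis=1)     # actual pair distance
    centre_d = np.linalg.norm(centroids[labels[i]] - centroids[labels[j]], axis=1)
    return np.corrcoef(point_d, centre_d)[0, 1]
```

Evaluating this score across a range of candidate cluster counts is what produces the several local peaks the abstract refers to, each peak marking a plausible number of clusters.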
Selecting the Number of Clusters with a Stability Trade-off: an Internal Validation Criterion
Model selection is a major challenge in non-parametric clustering. There is
no universally admitted way to evaluate clustering results for the obvious
reason that there is no ground truth against which results could be tested, as
in supervised learning. The difficulty of finding a universal evaluation criterion
is a direct consequence of the fundamentally ill-defined objective of
clustering. In this perspective, clustering stability has emerged as a natural
and model-agnostic principle: an algorithm should find stable structures in the
data. If data sets are repeatedly sampled from the same underlying
distribution, an algorithm should find similar partitions. However, it turns
out that stability alone is not a well-suited tool to determine the number of
clusters. For instance, it is unable to detect if the number of clusters is too
small. We propose a new principle for clustering validation: a good clustering
should be stable, and within each cluster, there should exist no stable
partition. This principle leads to a novel internal clustering validity
criterion based on between-cluster and within-cluster stability, overcoming
limitations of previous stability-based methods. We empirically show the
superior ability of additive noise to discover structures, compared with
sampling-based perturbation. We demonstrate the effectiveness of our method for
selecting the number of clusters through a large number of experiments and
compare it with existing evaluation methods.
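The additive-noise stability idea can be sketched as follows: cluster the data once, recluster several noisy copies, and score how often pairs of points stay together. Everything here is an assumption for illustration (a toy k-means with farthest-point initialisation, pair-counting Rand agreement, Gaussian noise level `sigma`), not the paper's criterion, which additionally tests for stable structure *within* each cluster:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Toy Lloyd's k-means with deterministic farthest-point initialisation."""
    rng = np.random.default_rng(seed)
    C = [X[rng.integers(len(X))]]
    while len(C) < k:                      # next centre = point farthest from all
        d = np.min([((X - c) ** 2).sum(1) for c in C], axis=0)
        C.append(X[int(np.argmax(d))])
    C = np.array(C)
    for _ in range(iters):
        lab = ((X[:, None, :] - C[None, :, :]) ** 2).sum(2).argmin(1)
        C = np.array([X[lab == j].mean(0) if (lab == j).any() else C[j]
                      for j in range(k)])
    return lab

def rand_agreement(a, b):
    """Pair-counting (Rand) agreement between two labelings."""
    i, j = np.triu_indices(len(a), 1)
    return ((a[i] == a[j]) == (b[i] == b[j])).mean()

def noise_stability(X, k, reps=10, sigma=0.1, seed=0):
    """Mean agreement between the base partition and partitions of
    additively perturbed copies of the data. Values near 1 mean the
    k-cluster structure survives the noise."""
    rng = np.random.default_rng(seed)
    base = kmeans(X, k)
    scores = [rand_agreement(base, kmeans(X + rng.normal(0, sigma, X.shape),
                                          k, seed=r + 1))
              for r in range(reps)]
    return float(np.mean(scores))
```

Under the paper's principle, a good `k` would combine a high between-cluster stability like this with the absence of any stable sub-partition inside each cluster.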