Selecting the Number of Clusters with a Stability Trade-off: an Internal Validation Criterion
Model selection is a major challenge in non-parametric clustering. There is
no universally accepted way to evaluate clustering results, for the obvious
reason that there is no ground truth against which results could be tested, as
in supervised learning. The difficulty to find a universal evaluation criterion
is a direct consequence of the fundamentally ill-defined objective of
clustering. In this perspective, clustering stability has emerged as a natural
and model-agnostic principle: an algorithm should find stable structures in the
data. If data sets are repeatedly sampled from the same underlying
distribution, an algorithm should find similar partitions. However, it turns
out that stability alone is not a well-suited tool for determining the number of clusters. For instance, it cannot detect whether the number of clusters is too
small. We propose a new principle for clustering validation: a good clustering
should be stable, and within each cluster, there should exist no stable
partition. This principle leads to a novel internal clustering validity
criterion based on between-cluster and within-cluster stability, overcoming
limitations of previous stability-based methods. We empirically show that additive-noise perturbation discovers structure better than sampling-based perturbation. We demonstrate the effectiveness of our method for selecting the number of clusters through a large number of experiments and compare it with existing evaluation methods.
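To make the stability principle concrete, here is a minimal sketch under illustrative assumptions: additive Gaussian noise as the perturbation, k-means as the clustering algorithm, the adjusted Rand index as the agreement measure, and an arbitrary 0.9 stability threshold. None of these are necessarily the authors' exact choices.

```python
# Minimal sketch of stability-based selection of k (assumed settings, not the
# authors' exact procedure): a clustering is accepted if it is stable under
# additive noise and no cluster admits a stable two-way split.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_score(X, k, n_perturbations=20, noise_scale=0.05, seed=0):
    """Average pairwise ARI of k-means labelings under additive noise."""
    rng = np.random.default_rng(seed)
    labelings = []
    for _ in range(n_perturbations):
        X_noisy = X + rng.normal(scale=noise_scale * X.std(axis=0), size=X.shape)
        labelings.append(KMeans(n_clusters=k, n_init=10).fit_predict(X_noisy))
    scores = [adjusted_rand_score(a, b)
              for i, a in enumerate(labelings) for b in labelings[i + 1:]]
    return float(np.mean(scores))

def select_k(X, k_range=range(2, 10), threshold=0.9):
    """Return a k that is stable between clusters while each cluster admits
    no stable partition (the largest such k if several pass)."""
    best_k = None
    for k in k_range:
        if stability_score(X, k) < threshold:       # between-cluster stability
            continue
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        within_stable = any(
            stability_score(X[labels == c], 2) > threshold
            for c in range(k) if (labels == c).sum() > 10
        )                                            # within-cluster stability
        if not within_stable:
            best_k = k
    return best_k
```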
Benchmarking in cluster analysis: A white paper
To achieve scientific progress in terms of building a cumulative body of
knowledge, careful attention to benchmarking is of the utmost importance. This
means that proposals of new methods of data pre-processing, new data-analytic
techniques, and new methods of output post-processing, should be extensively
and carefully compared with existing alternatives, and that existing methods
should be subjected to neutral comparison studies. To date, benchmarking and
recommendations for benchmarking have been frequently seen in the context of
supervised learning. Unfortunately, there has been a dearth of guidelines for
benchmarking in an unsupervised setting, with the area of clustering as an
important subdomain. To address this problem, this paper discusses the theoretical and conceptual underpinnings of benchmarking in the field of cluster analysis, by means of simulated as well as empirical data. Subsequently, it deals with the practicalities of how to address benchmarking questions in clustering and makes foundational recommendations.
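In the spirit of these recommendations, a neutral comparison study can be organized as a plain loop over datasets and methods under one shared metric. The sketch below assumes scikit-learn; the simulated datasets, the two methods, and the adjusted Rand index are placeholder choices, not the paper's prescribed protocol.

```python
# Illustrative skeleton of a neutral clustering benchmark: every method sees
# the same datasets and is scored with the same metric. All choices here are
# placeholders for the paper's recommendations.
from sklearn.datasets import make_blobs, make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

datasets = {
    "blobs": make_blobs(n_samples=300, centers=3, random_state=0),
    "moons": make_moons(n_samples=300, noise=0.05, random_state=0),
}
methods = {
    "k-means": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
    "ward": lambda k: AgglomerativeClustering(n_clusters=k),
}

for data_name, (X, y_true) in datasets.items():
    k = len(set(y_true))
    for method_name, make_model in methods.items():
        labels = make_model(k).fit_predict(X)
        ari = adjusted_rand_score(y_true, labels)
        print(f"{data_name:6s} {method_name:8s} ARI={ari:.3f}")
```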
Integrating Articulatory Features into HMM-based Parametric Speech Synthesis
This paper presents an investigation of ways to integrate articulatory features into Hidden Markov Model (HMM)-based parametric speech synthesis, primarily with the aim of improving the performance of acoustic parameter generation. The joint distribution of acoustic and articulatory features is estimated during training and is then used for parameter generation at synthesis time in conjunction with a maximum-likelihood criterion. Different model structures are explored to allow the articulatory features to influence acoustic modeling: model clustering, state synchrony and cross-stream feature dependency. The results of objective evaluation show that the accuracy of acoustic parameter prediction can be improved when shared clustering and asynchronous-state model structures are adopted for combined acoustic and articulatory features. More significantly, our experiments demonstrate that modeling the dependency between these two feature streams can make speech synthesis more flexible. The characteristics of synthetic speech can be easily controlled by modifying generated articulatory features as part of the process of acoustic parameter generation.
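The cross-stream dependency behind this flexibility can be illustrated with a conditional Gaussian: if acoustic and articulatory features share a joint Gaussian per state, editing the articulatory trajectory shifts the acoustic mean through the standard conditioning formula. The sketch below is a simplified illustration that ignores the delta-feature constraints of full maximum-likelihood parameter generation; the function and variable names are hypothetical.

```python
# Simplified illustration: condition acoustic features on (possibly edited)
# articulatory features under a joint Gaussian. Delta-feature constraints of
# full ML parameter generation are ignored; names are hypothetical.
import numpy as np

def condition_acoustic_on_articulatory(mu, Sigma, d_ac, x_art):
    """Conditional mean/covariance of the first d_ac (acoustic) dimensions,
    given the remaining (articulatory) dimensions fixed to x_art."""
    mu_a, mu_r = mu[:d_ac], mu[d_ac:]
    S_aa = Sigma[:d_ac, :d_ac]
    S_ar = Sigma[:d_ac, d_ac:]
    S_rr = Sigma[d_ac:, d_ac:]
    gain = S_ar @ np.linalg.inv(S_rr)        # cross-stream dependency term
    mu_cond = mu_a + gain @ (x_art - mu_r)   # acoustics track edited articulation
    Sigma_cond = S_aa - gain @ S_ar.T
    return mu_cond, Sigma_cond
```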
Clustering as an example of optimizing arbitrarily chosen objective functions
This paper is a reflection upon the common practice of solving various types of learning problems by optimizing arbitrarily chosen criteria in the hope that they are well correlated with the criterion actually used to assess the results. We investigate this issue using clustering as an example. We first propose a unified view of clustering as an optimization problem, stemming from the belief that typical design choices in clustering, such as the number of clusters or the similarity measure, can be, and often are, suboptimal with respect to the clustering quality measures later used for algorithm comparison and ranking. To illustrate this point, we propose a generalized clustering framework and provide a proof of concept using standard benchmark datasets and two popular clustering methods for comparison.
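The mismatch the paper reflects on can be reproduced in a few lines: select the number of clusters by optimizing one internal criterion, then rank the same candidates with another. The sketch below assumes scikit-learn and uses the silhouette and Calinski-Harabasz scores purely as illustrative stand-ins for the quality measures the paper has in mind.

```python
# Illustrative mismatch between the optimized criterion and the assessment
# criterion: the k chosen by silhouette need not be the k favored by the
# Calinski-Harabasz index.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

candidates = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    candidates[k] = (silhouette_score(X, labels),
                     calinski_harabasz_score(X, labels))

best_by_silhouette = max(candidates, key=lambda k: candidates[k][0])
best_by_ch = max(candidates, key=lambda k: candidates[k][1])
print(best_by_silhouette, best_by_ch)  # the two criteria need not agree
```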
Evidence Transfer for Improving Clustering Tasks Using External Categorical Evidence
In this paper we introduce evidence transfer for clustering, a deep learning
method that can incrementally manipulate the latent representations of an
autoencoder, according to external categorical evidence, in order to improve a
clustering outcome. We define evidence transfer as the process by which the categorical outcome of an external, auxiliary task is exploited to improve a primary task, in this case representation learning for clustering. Our proposed
method makes no assumptions regarding the categorical evidence presented, nor
the structure of the latent space. We compare our method against a baseline by performing k-means clustering before and after its deployment. Experiments with three different kinds of evidence show that our method effectively manipulates the latent representations when provided with real corresponding evidence, while remaining robust when presented with low-quality evidence.
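A rough sketch of the evidence-transfer idea, assuming PyTorch: fine-tune an autoencoder's latent space with an auxiliary loss that pushes it to predict the external categorical evidence. The architecture, the loss weight alpha, and the auxiliary head are hypothetical illustrations rather than the authors' configuration; clustering quality would then be compared by running k-means on the latents before and after this fine-tuning, as the abstract describes.

```python
# Hypothetical sketch of evidence transfer: reconstruction loss plus an
# auxiliary term tying the latent space to external categorical evidence.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in, d_latent):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                     nn.Linear(128, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(),
                                     nn.Linear(128, d_in))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def evidence_transfer_step(model, head, x, evidence, opt, alpha=0.1):
    """One fine-tuning step: reconstruction loss plus an auxiliary term that
    nudges the latent space to predict the external categorical evidence."""
    z, x_hat = model(x)
    loss = (nn.functional.mse_loss(x_hat, x)
            + alpha * nn.functional.cross_entropy(head(z), evidence))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage: head = nn.Linear(d_latent, n_evidence_classes); comparing k-means on
# the encoder's latents before vs. after fine-tuning gives the evaluation.
```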