Incremental Clustering: The Case for Extra Clusters
The explosion in the amount of data available for analysis often necessitates
a transition from batch to incremental clustering methods, which process one
element at a time and typically store only a small subset of the data. In this
paper, we initiate the formal analysis of incremental clustering methods
focusing on the types of cluster structure that they are able to detect. We
find that the incremental setting is strictly weaker than the batch model,
proving that a fundamental class of cluster structures that can readily be
detected in the batch setting is impossible to identify using any incremental
method. Furthermore, we show how the limitations of incremental clustering can
be overcome by allowing additional clusters.
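To make the incremental setting concrete, here is a minimal sketch of sequential (MacQueen-style) k-means, a well-known method that sees one element at a time and stores only k centers and their counts. It is an illustration only and not the method analysed in the paper; the function name `sequential_kmeans` and the choice to seed the centers with the first k points are assumptions made for this example.

```python
from typing import Iterable, List, Sequence


def sequential_kmeans(stream: Iterable[Sequence[float]], k: int) -> List[List[float]]:
    """Process points one at a time, keeping only k centers and k counts.

    Assumption for this sketch: the first k points of the stream seed the centers.
    """
    centers: List[List[float]] = []
    counts: List[int] = []
    for x in stream:
        if len(centers) < k:
            # Seeding phase: the point itself becomes a new center.
            centers.append([float(v) for v in x])
            counts.append(1)
            continue
        # Find the nearest center by squared Euclidean distance.
        j = min(
            range(k),
            key=lambda i: sum((c - v) ** 2 for c, v in zip(centers[i], x)),
        )
        # Move that center toward the point by a running-mean step of 1/count.
        counts[j] += 1
        step = 1.0 / counts[j]
        centers[j] = [c + step * (v - c) for c, v in zip(centers[j], x)]
    return centers
```

For example, `sequential_kmeans([(0.0,), (10.0,), (1.0,), (11.0,)], k=2)` ends with centers at 0.5 and 10.5. The result depends on the order of the stream, which is one reason incremental methods can miss structure that a batch method would find.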
Axioms for graph clustering quality functions
We investigate properties that intuitively ought to be satisfied by graph
clustering quality functions, that is, functions that assign a score to a
clustering of a graph. Graph clustering, also known as network community
detection, is often performed by optimizing such a function. Two axioms
tailored for graph clustering quality functions are introduced, and the four
axioms introduced in previous work on distance based clustering are
reformulated and generalized for the graph setting. We show that modularity, a
standard quality function for graph clustering, does not satisfy all of these
six properties. This motivates the derivation of a new family of quality
functions, adaptive scale modularity, which does satisfy the proposed axioms.
Adaptive scale modularity has two parameters, which give greater flexibility in
the kinds of clusterings that can be found. Standard graph clustering quality
functions, such as normalized cut and unnormalized cut, are obtained as special
cases of adaptive scale modularity.
In general, the results of our investigation indicate that the considered
axiomatic framework covers existing `good' quality functions for graph
clustering, and can be used to derive an interesting new family of quality
functions.
Comment: 23 pages. Full text and sources available on:
http://www.cs.ru.nl/~T.vanLaarhoven/graph-clustering-axioms-2014
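As a concrete instance of a graph clustering quality function, the sketch below computes standard (Newman-Girvan) modularity for an undirected, unweighted graph. It does not implement adaptive scale modularity, whose definition is not given in the abstract; the function name `modularity` and the input conventions (an edge list plus a node-to-cluster mapping, no self-loops) are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(
    edges: Iterable[Tuple[Hashable, Hashable]],
    cluster_of: Dict[Hashable, Hashable],
) -> float:
    """Standard modularity Q = sum over clusters c of e_c/m - (d_c/(2m))^2.

    e_c is the number of edges inside cluster c, d_c is the total degree of
    its nodes, and m is the number of edges. Assumes no self-loops.
    """
    edge_list = list(edges)
    m = len(edge_list)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")
    inside = defaultdict(int)  # e_c: edges with both endpoints in cluster c
    degree = defaultdict(int)  # d_c: sum of node degrees in cluster c
    for u, v in edge_list:
        cu, cv = cluster_of[u], cluster_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

On two triangles joined by a single edge, splitting the graph into the two triangles scores 5/14, while placing all six nodes in one cluster scores 0.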
What are the true clusters?
Constructivist philosophy and Hasok Chang's active scientific realism are
used to argue that the idea of "truth" in cluster analysis depends on the
context and the clustering aims. Different characteristics of clusterings are
required in different situations. Researchers should be explicit about the
requirements and the idea of "true clusters" on which their research is based, because
clustering becomes scientific not through uniqueness but through transparent
and open communication. The idea of "natural kinds" is a human construct, but
it highlights the human experience that the reality outside the observer's
control seems to make certain distinctions between categories inevitable.
Various desirable characteristics of clusterings and various approaches to
define a context-dependent truth are listed, and I discuss what impact these
ideas can have on the comparison of clustering methods, and the choice of a
clustering method and related decisions in practice.
Towards Theoretical Foundations of Clustering
Clustering is a central unsupervised learning task with a wide variety of applications. Unlike in supervised learning, different clustering algorithms may yield dramatically different outputs for the same input sets. As such, the choice of algorithm is crucial. When selecting a clustering algorithm, users tend to focus on cost-related considerations, such as running times, software purchasing costs, etc. Yet differences concerning the output of the algorithms are a more primal consideration. We propose an approach for selecting clustering algorithms based on differences in their input-output behaviour. This approach relies on identifying significant properties of clustering algorithms and classifying algorithms based on the properties that they satisfy.
We begin with Kleinberg's impossibility result, which relies on concise abstract properties that are well-suited for our approach. Kleinberg showed that three specific properties cannot be satisfied by the same algorithm. We illustrate that the impossibility result is a consequence of the formalism used, proving that these properties can be formulated without leading to inconsistency in the context of clustering quality measures or algorithms whose input requires the number of clusters.
Combining Kleinberg's properties with newly proposed ones, we provide an extensive property-based classification of common clustering paradigms. We use some of these properties to provide a novel characterization of the class of linkage-based algorithms. That is, we distil a small set of properties that uniquely identify this family of algorithms.
Lastly, we investigate how the output of algorithms is affected by the addition of small, potentially adversarial, sets of points. We prove that given clusterable input, the output of k-means is robust to the addition of a small number of data points. On the other hand, clusterings produced by many well-known methods, including linkage-based techniques, can be changed radically by adding a small number of elements.
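The sensitivity of linkage-based methods to a few added points can be seen in a toy example. The sketch below is a naive single-linkage procedure on one-dimensional data that merges clusters until k remain; it is an illustration under assumed inputs, not the construction or proof used in the thesis, and the name `single_linkage` is hypothetical.

```python
from typing import List, Sequence


def single_linkage(points: Sequence[float], k: int) -> List[List[float]]:
    """Naive single-linkage clustering of 1-D points, stopped at k clusters."""
    clusters: List[List[float]] = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)
```

With `k=2`, the points `[0, 1, 2, 10, 11, 12]` split into `[0, 1, 2]` and `[10, 11, 12]`. Adding the single point `100` yields `[0, 1, 2, 10, 11, 12]` and `[100]`, so one extra element merges the two original groups.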