Cluster Analyses of a Target Data Set from the IFCS Cluster Benchmark Data Repository: Introduction to the Special Issue
After a brief introduction to benchmarking in data analysis in general and in cluster analysis in particular, we describe the setup of the IFCS Cluster Benchmark Data Repository along with two challenges connected with it. The first of these challenges called for data sets to be contributed to the repository; the second one pertained to cluster analyses of the winning data set of the first challenge. Subsequently, we introduce the winning data set of the first challenge together with relevant meta-data. We conclude with a brief description of the organization of the present special issue, which comprises reports of analyses that have been submitted as contributions to the second challenge.
Onset of an outline map to get a hold on the wildwood of clustering methods
The domain of cluster analysis is a meeting point for a very rich
multidisciplinary encounter, with cluster-analytic methods being studied and
developed in discrete mathematics, numerical analysis, statistics, data
analysis and data science, and computer science (including machine learning,
data mining, and knowledge discovery), to name but a few. The other side of the
coin, however, is that the domain suffers from a major accessibility problem as
well as from the fact that it is rife with division across many pretty isolated
islands. As a way out, the present paper offers an outline map for the
clustering domain as a whole, which takes the form of an overarching conceptual
framework and a common language. With this framework we wish to contribute to
structuring the domain, to characterizing methods that have often been
developed and studied in quite different contexts, to identifying links between
them, and to introducing a frame of reference for optimally setting up cluster
analyses in data-analytic practice. Comment: 33 pages, 4 figures
Bayesian hierarchical classes analysis
Hierarchical classes models are models for N-way N-mode data that represent the association among the N modes and simultaneously yield, for each mode, a hierarchical classification of its elements. In this paper we present a stochastic extension of the hierarchical classes model for two-way two-mode binary data. In line with the original model, the new probabilistic extension still represents both the association among the two modes and the hierarchical classifications. A fully Bayesian method for fitting the new model is presented and evaluated in a simulation study. Furthermore, we propose tools for model selection and model checking based on Bayes factors and posterior predictive checks. We illustrate the advantages of the new approach with applications in the domain of the psychology of choice and psychiatric diagnosis.
Joint mapping of genes and conditions via multidimensional unfolding analysis
Background: Microarray compendia profile the expression of genes in a number of experimental conditions. Such data compendia are useful not only to group genes and conditions based on their similarity in overall expression over profiles but also to gain information on more subtle relations between genes and conditions. Getting a clear visual overview of all these patterns in a single easy-to-grasp representation is a useful preliminary analysis step: We propose to use for this purpose an advanced exploratory method, called multidimensional unfolding.
Results: We present a novel algorithm for multidimensional unfolding that overcomes both general problems and problems that are specific for the analysis of gene expression data sets. Applying the algorithm to two publicly available microarray compendia illustrates its power as a tool for exploratory data analysis: The unfolding analysis of a first data set resulted in a two-dimensional representation which clearly reveals temporal regulation patterns for the genes and a meaningful structure for the time points, while the analysis of a second data set showed the algorithm's ability to go beyond a mere identification of those genes that discriminate between different patient or tissue types.
Conclusion: Multidimensional unfolding offers a useful tool for preliminary explorations of microarray data: By relying on an easy-to-grasp low-dimensional geometric framework, relations among genes, among conditions and between genes and conditions are simultaneously represented in an accessible way which may reveal interesting patterns in the data. An additional advantage of the method is that it can be applied to the raw data without necessitating the choice of suitable genewise transformations of the data.
Benchmarking in cluster analysis: A white paper
To achieve scientific progress in terms of building a cumulative body of
knowledge, careful attention to benchmarking is of the utmost importance. This
means that proposals of new methods of data pre-processing, new data-analytic
techniques, and new methods of output post-processing, should be extensively
and carefully compared with existing alternatives, and that existing methods
should be subjected to neutral comparison studies. To date, benchmarking and
recommendations for benchmarking have been frequently seen in the context of
supervised learning. Unfortunately, there has been a dearth of guidelines for
benchmarking in an unsupervised setting, with the area of clustering as an
important subdomain. To address this problem, discussion is given to the
theoretical conceptual underpinnings of benchmarking in the field of cluster
analysis by means of simulated as well as empirical data. Subsequently, the
practicalities of how to address benchmarking questions in clustering are dealt
with, and foundational recommendations are made.