Clustering of categorical variables around latent variables
In the framework of clustering, the usual aim is to cluster observations rather than variables. However, the issue of variable clustering clearly arises in dimension reduction, variable selection, and some case studies (sensory analysis, biochemistry, marketing, etc.). Clustering of variables is then studied as a way to arrange variables into homogeneous clusters, thereby organizing data into meaningful structures. Once the variables are clustered into groups such that each variable is similar to the other variables in its cluster, a subset of variables can be selected. Several specific methods have been developed for the clustering of numerical variables, but far fewer methods have been proposed for categorical variables. In this paper we extend the criterion used by Vigneau and Qannari (2003) in their Clustering around Latent Variables approach for numerical variables to the case of categorical data. The homogeneity criterion of a cluster of categorical variables is defined as the sum of the correlation ratios between the categorical variables and a latent variable, which in this case is a numerical variable. We show that the latent variable maximizing the homogeneity of a cluster can be obtained with Multiple Correspondence Analysis. Different algorithms for the clustering of categorical variables are proposed: an iterative relocation algorithm, and agglomerative and divisive hierarchical clustering. The proposed methodology is illustrated by a real data application to the satisfaction of pleasure craft operators.

Keywords: clustering of categorical variables, correlation ratio, iterative relocation algorithm, hierarchical clustering
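The homogeneity criterion above lends itself to a short illustration. The sketch below is an illustrative assumption, not the authors' code: it computes the classical correlation ratio η² between a categorical variable and a numeric latent variable, and sums these ratios over a cluster. In the method itself, the latent variable would be the first component from Multiple Correspondence Analysis.

```python
from collections import defaultdict

def correlation_ratio(categories, y):
    """Correlation ratio eta^2 between a categorical variable and a numeric one:
    between-group sum of squares over total sum of squares (y not constant)."""
    mean = sum(y) / len(y)
    groups = defaultdict(list)
    for c, v in zip(categories, y):
        groups[c].append(v)
    ss_between = sum(len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values())
    ss_total = sum((v - mean) ** 2 for v in y)
    return ss_between / ss_total

def cluster_homogeneity(cat_vars, latent):
    """Homogeneity of a cluster: sum of correlation ratios to the latent variable."""
    return sum(correlation_ratio(x, latent) for x in cat_vars)
```

When a categorical variable perfectly determines the latent variable, its correlation ratio is 1; when it is unrelated, the ratio is 0.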
COMPACT REPRESENTATIONS OF UNCERTAINTY IN CLUSTERING
Flat clustering and hierarchical clustering are two fundamental tasks, often used to discover meaningful structures in data, such as subtypes of cancer, phylogenetic relationships, taxonomies of concepts, and cascades of particle decays in particle physics. When multiple clusterings of the data are possible, it is useful to represent uncertainty in clustering through various probabilistic quantities, such as the distribution over partitions or tree structures, and the marginal probabilities of subpartitions or subtrees.
Many compact representations exist for structured prediction problems, enabling the efficient computation of probability distributions, e.g., a trellis structure and corresponding Forward-Backward algorithm for Markov models that model sequences. However, no such representation has been proposed for either flat or hierarchical clustering models. In this thesis, we present our work developing data structures and algorithms for computing probability distributions over flat and hierarchical clusterings, as well as for finding maximum a posteriori (MAP) flat and hierarchical clusterings, and various marginal probabilities, as given by a wide range of energy-based clustering models.
First, we describe a trellis structure that compactly represents distributions over flat or hierarchical clusterings. We also describe related data structures that represent approximate distributions. We then present algorithms that, using these structures, allow us to compute the partition function, MAP clustering, and the marginal probabilities of a cluster (and sub-hierarchy, in the case of hierarchical clustering) exactly. We also show how these and related algorithms can be used to approximate these values, and analyze the time and space complexity of our proposed methods. We demonstrate the utility of our approaches using various synthetic data of interest as well as in two real world applications, namely particle physics at the Large Hadron Collider at CERN and in cancer genomics. We conclude with a brief discussion of future work.
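The exact computation over flat clusterings can be illustrated by a memoized recursion over subsets, the core idea behind a cluster trellis: the partition function of a set sums, over every cluster containing a fixed element, that cluster's energy times the partition function of the remainder. The sketch below is a simplified illustration (the multiplicative `energy` callback is a hypothetical stand-in for an energy-based clustering model), not the thesis implementation.

```python
from functools import lru_cache
from itertools import combinations

def partition_function(elements, energy):
    """Exact partition function over all flat clusterings:
    Z(S) = sum over clusters C of S containing the first element of S
    of energy(C) * Z(S minus C), memoized over frozensets."""
    @lru_cache(maxsize=None)
    def Z(S):
        if not S:
            return 1.0
        items = sorted(S)
        first, rest = items[0], items[1:]
        total = 0.0
        # every cluster containing `first` = {first} plus any subset of the rest
        for k in range(len(rest) + 1):
            for combo in combinations(rest, k):
                cluster = frozenset((first,) + combo)
                total += energy(cluster) * Z(S - cluster)
        return total
    return Z(frozenset(elements))
```

With energy identically 1 the recursion counts set partitions, so Bell numbers (5 for three elements, 15 for four) give a sanity check.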
A soft hierarchical algorithm for the clustering of multiple bioactive chemical compounds
Most of the clustering methods used in the clustering of chemical structures, such as Ward's, Group Average, K-means and Jarvis-Patrick, are known as hard or crisp because they partition a dataset into strictly disjoint subsets, and are thus not suitable for clustering chemical structures that exhibit more than one activity. Fuzzy clustering algorithms such as fuzzy c-means provide an inherent mechanism for the clustering of overlapping structures (objects), but this potential of fuzzy methods, which comes from their fuzzy membership functions, has not been utilized effectively. In this work a fuzzy hierarchical algorithm is developed which provides a mechanism not only to benefit from the fuzzy clustering process but also to take advantage of the multiple membership values of fuzzy clustering. The algorithm divides every cluster whose size exceeds a pre-determined threshold into two sub-clusters, based on the membership values of each structure. A structure is assigned to one cluster if its membership in that cluster is very high, or to both clusters if its memberships in the two are very similar. The performance of the algorithm is evaluated on two benchmark datasets and a large dataset of compound structures derived from the MDL MDDR database. The results of the algorithm show significant improvement in comparison to a similar implementation of the hard c-means algorithm.
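The splitting rule can be sketched as follows. The membership computation is standard fuzzy c-means (shown here in one dimension with fixed centers), and the `high`/`similar` thresholds are hypothetical illustrations rather than the paper's tuned values.

```python
def fcm_memberships(points, centers, m=2.0):
    """Fuzzy c-means membership of each (1-D) point in each fixed center."""
    memberships = []
    for x in points:
        d = [abs(x - c) for c in centers]
        if any(di == 0.0 for di in d):  # point coincides with a center
            memberships.append([1.0 if di == 0.0 else 0.0 for di in d])
            continue
        memberships.append([
            1.0 / sum((di / dj) ** (2.0 / (m - 1.0)) for dj in d) for di in d
        ])
    return memberships

def soft_split(memberships, high=0.7, similar=0.2):
    """Divide a cluster in two: a structure goes to one child when its
    membership there is high, and to BOTH when its memberships are similar."""
    left, right = [], []
    for i, (u1, u2) in enumerate(memberships):
        if u1 >= high or abs(u1 - u2) <= similar:
            left.append(i)
        if u2 >= high or abs(u1 - u2) <= similar:
            right.append(i)
    return left, right
```

A structure equidistant from both sub-cluster centers gets memberships of 0.5 each and is assigned to both children, which is exactly the behaviour a hard partition cannot express.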
Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems
To comprehend the hierarchical organization of large integrated systems, we introduce the hierarchical map equation, which reveals multilevel structures in networks. In this information-theoretic approach, we exploit the duality between compression and pattern detection; by compressing a description of a random walker as a proxy for real flow on a network, we find regularities in the network that induce this system-wide flow. Finding the shortest multilevel description of the random walker therefore gives us the best hierarchical clustering of the network, the optimal number of levels and modular partition at each level, with respect to the dynamics on the network. With a novel search algorithm, we extract and illustrate the rich multilevel organization of several large social and biological networks. For example, from the global air traffic network we uncover countries and continents, and from the pattern of scientific communication we reveal more than 100 scientific fields organized in four major disciplines: life sciences, physical sciences, ecology and earth sciences, and social sciences. In general, we find shallow hierarchical structures in globally interconnected systems, such as neural networks, and rich multilevel organizations in systems with highly separated regions, such as road networks.

Comment: 11 pages, 5 figures. For associated code, see http://www.tp.umu.se/~rosvall/code.htm
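The two-level map equation underlying this approach can be computed directly from a partition, given each node's stationary visit rate and each module's exit rate; the hierarchical version applies the same codebook logic recursively. The sketch below uses illustrative data structures and is not the authors' implementation.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * log2(p) for p in probs if p > 0)

def map_equation(node_rates, exit_rates, modules):
    """Two-level map equation L(M) for a partition of nodes into modules.

    node_rates: dict node -> stationary visit rate p_i (rates sum to 1)
    exit_rates: dict module -> rate q_m at which the walk exits that module
    modules:    dict module -> list of member nodes
    """
    q = sum(exit_rates.values())
    # index codebook: used at rate q to name the module being entered
    index_term = q * entropy([exit_rates[m] / q for m in modules]) if q > 0 else 0.0
    # one module codebook per module: names member nodes plus the exit event
    module_term = 0.0
    for m, nodes in modules.items():
        usage = exit_rates[m] + sum(node_rates[i] for i in nodes)
        rates = [exit_rates[m]] + [node_rates[i] for i in nodes]
        module_term += usage * entropy([r / usage for r in rates])
    return index_term + module_term
```

With all nodes in one module and no exits, the description length reduces to the plain entropy of the visit rates, e.g. 2 bits for four equally visited nodes.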
Bayesian cluster detection via adjacency modelling
Disease mapping aims to estimate the spatial pattern in disease risk across an area, identifying units which have elevated disease risk. Existing methods use Bayesian hierarchical models with spatially smooth conditional autoregressive priors to estimate risk, but these methods are unable to identify the geographical extent of spatially contiguous high-risk clusters of areal units. Our proposed solution to this problem is a two-stage approach, which produces a set of potential cluster structures for the data and then chooses the optimal structure via a Bayesian hierarchical model. The first stage uses a spatially adjusted hierarchical agglomerative clustering algorithm. The second stage fits a Poisson log-linear model to the data to estimate the optimal cluster structure and the spatial pattern in disease risk. The methodology was applied to a study of chronic obstructive pulmonary disease (COPD) in local authorities in England, where a number of high-risk clusters were identified.
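The first stage, agglomerative clustering restricted to merges between spatially adjacent units, can be sketched as follows. The linkage used here (distance between cluster mean values) and the data structures are simplifying assumptions for illustration, not the paper's exact algorithm.

```python
def adjacency_constrained_clustering(values, adjacency, n_clusters):
    """Agglomerative clustering that only merges spatially adjacent clusters.

    values:    dict unit -> numeric value (e.g. an observed disease rate)
    adjacency: dict unit -> set of neighbouring units
    At each step, merge the adjacent pair of clusters whose mean values
    are closest, until n_clusters remain.
    """
    clusters = {u: {u} for u in values}
    mean = lambda c: sum(values[u] for u in c) / len(c)
    while len(clusters) > n_clusters:
        best = None
        for a in clusters:
            for b in clusters:
                if a >= b:
                    continue
                # two clusters are adjacent if any of their units are neighbours
                touching = any(v in adjacency[u]
                               for u in clusters[a] for v in clusters[b])
                if not touching:
                    continue
                d = abs(mean(clusters[a]) - mean(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best is None:
            break  # no two remaining clusters are adjacent
        _, a, b = best
        clusters[a] |= clusters.pop(b)
    return list(clusters.values())
```

On a line of four units with risks (1, 1, 10, 10), the constraint yields the two contiguous clusters {1, 2} and {3, 4}, never a merge across the risk step-change.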
Identifying Clusters in Bayesian Disease Mapping
Disease mapping is the field of spatial epidemiology interested in estimating the spatial pattern in disease risk across areal units. One aim is to identify units exhibiting elevated disease risks, so that public health interventions can be made. Bayesian hierarchical models with a spatially smooth conditional autoregressive prior are used for this purpose, but they cannot identify the spatial extent of high-risk clusters. Therefore we propose a two-stage solution to this problem, with the first stage being a spatially adjusted hierarchical agglomerative clustering algorithm. This algorithm is applied to data prior to the study period, and produces potential cluster structures for the disease data. The second stage fits a separate Poisson log-linear model to the study data for each cluster structure, which allows for step-changes in risk where two clusters meet. The most appropriate cluster structure is chosen by model comparison techniques, specifically by minimising the Deviance Information Criterion. The efficacy of the methodology is established by a simulation study, and is illustrated by a study of respiratory disease risk in Glasgow, Scotland.
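The model-choice step can be illustrated by a direct computation of the Deviance Information Criterion, DIC = D̄ + pD with pD = D̄ − D(θ̄). The function names and inputs below are hypothetical simplifications; in practice the deviance samples would come from an MCMC run for each candidate cluster structure.

```python
def dic(deviance_samples, deviance_at_mean):
    """Deviance Information Criterion: DIC = Dbar + pD, pD = Dbar - D(theta_bar).

    deviance_samples: posterior samples of the deviance D(theta)
    deviance_at_mean: deviance evaluated at the posterior mean of theta
    """
    d_bar = sum(deviance_samples) / len(deviance_samples)
    p_d = d_bar - deviance_at_mean  # effective number of parameters
    return d_bar + p_d

def best_cluster_structure(candidates):
    """Choose the structure minimising DIC; candidates maps a structure's
    name to (deviance_samples, deviance_at_posterior_mean)."""
    return min(candidates, key=lambda name: dic(*candidates[name]))
```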
Topological Hierarchies and Decomposition: From Clustering to Persistence
Hierarchical clustering is a class of algorithms commonly used in exploratory data analysis (EDA) and supervised learning. However, these algorithms suffer from several drawbacks, including the difficulty of interpreting the resulting dendrogram, arbitrariness in the choice of cut to obtain a flat clustering, and the lack of an obvious way of comparing individual clusters. In this dissertation, we develop the notion of a topological hierarchy on recursively-defined subsets of a metric space. We look to the field of topological data analysis (TDA) for the mathematical background to associate topological structures such as simplicial complexes and maps of covers to clusters in a hierarchy. Our main results include the definition of a novel hierarchical algorithm for constructing a topological hierarchy, and an implementation of the MAPPER algorithm and our topological hierarchies in pure Python code as well as a web app dashboard for exploratory data analysis. We show that the algorithm scales well to high-dimensional data due to the use of dimensionality reduction in most TDA methods, and analyze the worst-case time complexity of MAPPER and our hierarchical decomposition algorithm. Finally, we give a use case for exploratory data analysis with our techniques.
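A minimal one-dimensional sketch of the MAPPER construction may help fix ideas: cover the filter's range with overlapping intervals, cluster each preimage (here by splitting at large gaps), and connect clusters that share points. The parameters and clustering rule are illustrative assumptions; real implementations cluster arbitrary-dimensional preimages with a user-chosen clusterer.

```python
def mapper(points, filter_fn, n_intervals=2, overlap=0.25, gap=1.0):
    """Minimal 1-D MAPPER sketch: overlapping interval cover of the filter
    range, gap-based clustering of each preimage, and the nerve of the
    resulting clusters (an edge wherever two clusters share a point)."""
    f = [filter_fn(p) for p in points]
    lo, hi = min(f), max(f)
    length = (hi - lo) / n_intervals
    nodes = []
    for i in range(n_intervals):
        a = lo + i * length - overlap * length
        b = lo + (i + 1) * length + overlap * length
        idx = [j for j in range(len(points)) if a <= f[j] <= b]
        members = sorted(idx, key=lambda j: points[j])
        cluster = []
        for j in members:
            # start a new cluster when consecutive points are farther than `gap`
            if cluster and points[j] - points[cluster[-1]] > gap:
                nodes.append(frozenset(cluster))
                cluster = []
            cluster.append(j)
        if cluster:
            nodes.append(frozenset(cluster))
    edges = {(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]}
    return nodes, edges
```

On evenly spread points the overlapping intervals produce linked clusters, while two well-separated blobs yield two disconnected nodes.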