Voting-Based Consensus of Data Partitions
Over the past few years, there has been a renewed interest in the consensus
problem for ensembles of partitions. Recent work is primarily motivated by the
developments in the area of combining multiple supervised learners. Unlike the
consensus of supervised classifications, the consensus of data partitions is a
challenging problem due to the lack of globally defined cluster labels and to
the inherent difficulty of data clustering as an unsupervised learning problem.
Moreover, the true number of clusters may be unknown. A fundamental goal of
consensus methods for partitions is to obtain an optimal summary of an ensemble
and to discover a cluster structure with accuracy and robustness exceeding those
of the individual ensemble partitions.
The quality of the consensus partitions highly depends on the ensemble
generation mechanism and on the suitability of the consensus method for
combining the generated ensemble. Typically, consensus methods derive an
ensemble representation that is used as the basis for extracting the consensus
partition. Most ensemble representations circumvent the labeling problem. On
the other hand, voting-based methods establish direct parallels with consensus
methods for supervised classifications, by seeking an optimal relabeling of the
ensemble partitions and deriving an ensemble representation consisting of a
central aggregated partition. An important element of the voting-based
aggregation problem is the pairwise relabeling of an ensemble partition with
respect to a representative partition of the ensemble, which is referred to here
as the voting problem. The voting problem is commonly formulated as a weighted
bipartite matching problem.
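The relabeling step described above can be illustrated with a minimal sketch. It assumes hard partitions with the same number of clusters `k`, and it replaces a proper weighted bipartite matching solver with an exhaustive search over label permutations, which is only practical for small `k`. The function names are hypothetical and not taken from the dissertation.

```python
from itertools import permutations

def contingency(reference, partition, k):
    """Count objects shared by each (partition cluster, reference cluster) pair."""
    counts = [[0] * k for _ in range(k)]
    for ref_label, part_label in zip(reference, partition):
        counts[part_label][ref_label] += 1
    return counts

def relabel_by_matching(reference, partition, k):
    """Relabel `partition` so that its agreement with `reference` is maximal.

    Tries all k! one-to-one label mappings (assumption: k is small).
    """
    counts = contingency(reference, partition, k)
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(counts[c][perm[c]] for c in range(k)),
    )
    return [best[label] for label in partition]

reference = [0, 0, 0, 1, 1, 2, 2]
partition = [2, 2, 2, 0, 0, 1, 0]
print(relabel_by_matching(reference, partition, 3))
```

After relabeling, the cluster labels of the two partitions refer to the same groups wherever possible, so per-object votes can be accumulated across the ensemble.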
In this dissertation, a general theoretical framework for the voting problem as
a multi-response regression problem is proposed. The problem is formulated as
seeking to estimate the uncertainties associated with the assignments of the
objects to the representative clusters, given their assignments to the clusters
of an ensemble partition. A new voting scheme, referred to as cumulative voting,
is derived as a special instance of the proposed regression formulation
corresponding to fitting a linear model by least squares estimation. The
proposed formulation reveals the close relationships between the underlying loss
functions of the cumulative voting and bipartite matching schemes. A useful
feature of the proposed framework is that it can be applied to model substantial
variability between partitions, such as a variable number of clusters.
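One way to picture the least-squares reading of cumulative voting is the following sketch. It assumes the ensemble partition is hard (each object has one label) and the reference partition is soft (each object has a row of memberships). Under those assumptions, regressing the reference memberships on the cluster indicators by least squares reduces to averaging the reference memberships within each ensemble cluster. The function name and data layout are illustrative assumptions, not the dissertation's notation.

```python
def cumulative_vote(reference_soft, partition, k_part):
    """Least-squares fit of reference memberships on hard cluster indicators.

    reference_soft: list of rows, one per object, memberships over reference clusters.
    partition: hard labels in range(k_part), one per object.
    Returns (weights, relabeled): weights[l] is the mean reference membership of
    cluster l; relabeled[i] is the fitted soft membership of object i.
    """
    k_ref = len(reference_soft[0])
    sizes = [0] * k_part
    sums = [[0.0] * k_ref for _ in range(k_part)]
    for row, label in zip(reference_soft, partition):
        sizes[label] += 1
        for q in range(k_ref):
            sums[label][q] += row[q]
    # With indicator regressors, the normal equations give per-cluster means.
    weights = [
        [s / sizes[l] if sizes[l] else 0.0 for s in sums[l]]
        for l in range(k_part)
    ]
    relabeled = [weights[label] for label in partition]
    return weights, relabeled

reference_soft = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
partition = [0, 1, 1, 2]  # three clusters voting on two reference clusters
weights, relabeled = cumulative_vote(reference_soft, partition, 3)
print(weights)
```

Note that the ensemble partition here has three clusters while the reference has two, which is the kind of variability a one-to-one matching cannot express directly.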
A general aggregation algorithm with variants corresponding to
cumulative voting and bipartite matching is applied and a simulation-based
analysis is presented to compare the suitability of each scheme to different
ensemble generation mechanisms. The bipartite matching is found to be more
suitable than cumulative voting for a particular generation model, whereby each
ensemble partition is generated as a noisy permutation of an underlying
labeling, according to a probability of error. For ensembles with a variable
number of clusters, it is proposed that the aggregated partition be viewed as an
estimated distributional representation of the ensemble, on the basis of which,
a criterion may be defined to seek an optimally compressed consensus partition.
The properties and features of the proposed cumulative voting scheme are
studied. In particular, the relationship between cumulative voting and the
well-known co-association matrix is highlighted. Furthermore, an adaptive
aggregation algorithm that is suited for the cumulative voting scheme is
proposed. The algorithm aims at selecting the initial reference partition and
the aggregation sequence of the ensemble partitions such that the loss of mutual
information associated with the aggregated partition is minimized. In order to
subsequently extract the final consensus partition, an efficient agglomerative
algorithm is developed. The algorithm merges the aggregated clusters such that
the maximum amount of information is preserved. Furthermore, it allows the
optimal number of consensus clusters to be estimated.
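For readers unfamiliar with the co-association matrix mentioned above, here is a minimal sketch of how it is usually built from an ensemble of hard partitions: entry (i, j) is the fraction of partitions in which objects i and j share a cluster. The function name is hypothetical, and the sketch does not reproduce the relationship to cumulative voting derived in the dissertation.

```python
def co_association(ensemble):
    """Fraction of partitions in which each pair of objects is co-clustered.

    ensemble: list of hard label lists, all of the same length.
    """
    n = len(ensemble[0])
    counts = [[0] * n for _ in range(n)]
    for labels in ensemble:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(ensemble)
    return [[c / m for c in row] for row in counts]

ensemble = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],  # same grouping as the first partition, different labels
]
matrix = co_association(ensemble)
print(matrix[0][1], matrix[2][3], matrix[0][3])
```

Because only co-membership is counted, the matrix is unaffected by how each partition names its clusters, which is why this representation sidesteps the labeling problem.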
An empirical study using several artificial and real-world datasets demonstrates
that the proposed cumulative voting scheme leads to discovering substantially
more accurate consensus partitions compared to bipartite matching, in the case
of ensembles with a relatively large or a variable number of clusters. Compared
to other recent consensus methods, the proposed method is found to be comparable
with or better than the best performing methods. Moreover, accurate estimates of
the true number of clusters are often achieved using cumulative voting, whereas
consistently poor estimates are achieved based on bipartite matching. The
empirical evidence demonstrates that the bipartite matching scheme is not
suitable for these types of ensembles.
Clustering based on weighted ensemble
Clustering is an ill-posed problem, and it has been proven that no algorithm
can satisfy all the assumptions about good clustering. This is why numerous
clustering algorithms exist, based on various theories and approaches, one of
them being the well-known Kohonen’s self-organizing map (SOM). Unfortunately,
after training the SOM there is no explicitly obtained information about clusters in
the underlying data, so another technique for grouping SOM units has to be applied
afterwards. In the thesis, a contribution towards a two-level clustering of the SOM
is presented, employing principles of Gravitational Law. The proposed algorithm for
gravitational clustering of the SOM (gSOM) is capable of discovering complex cluster
shapes, not only limited to the spherical ones, and is able to automatically determine
the number of clusters. Experimental comparison with other clustering techniques is
conducted on synthetic and real-world data. We show that gSOM achieves promising
results especially on gene-expression data.
As no clustering algorithm can solve all problems, it is very beneficial
to analyse the data using multiple partitions, that is, an ensemble of
partitions. Cluster-ensemble methods have emerged recently as an effective approach
to stabilize and boost the performance of the single-clustering algorithms. Basically,
data clustering with an ensemble involves two steps: generation of the ensemble with
single-clustering methods and the combination of the obtained solutions to produce a
final consensus partition of the data. To ease the consensus step, the weighted cluster
ensemble was proposed, which assesses the relevance of ensemble members. One
way to achieve this is to employ internal cluster validity indices to perform partition
relevance analysis (PRA). Our contribution here is two-fold: first, we propose a novel
cluster validity index DNs that extends Dunn’s index and is based on the shortest
paths between the data points considering the Gabriel graph on the data; second, we
propose an enhancement to the weighted cluster ensemble approach by introducing a
reduction step after the ensemble partitions have been assessed. The developed
partition relevance analysis with the reduction step (PRAr) yields promising results
when plugged into three consensus functions based on the evidence accumulation
principle.
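As background for the validity index mentioned above, the classic Dunn's index can be sketched as follows: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. This sketch uses plain Euclidean distances and is not the DNs index, which the abstract describes as using shortest paths on the Gabriel graph. It assumes at least two clusters and at least one cluster containing two distinct points.

```python
from itertools import combinations
from math import dist

def dunn_index(points, labels):
    """Classic Dunn's index with Euclidean distances (higher is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    groups = list(clusters.values())
    # Largest within-cluster distance (cluster diameter).
    diameter = max(dist(a, b) for g in groups for a, b in combinations(g, 2))
    # Smallest distance between points of two different clusters.
    separation = min(
        dist(a, b)
        for g1, g2 in combinations(groups, 2)
        for a in g1
        for b in g2
    )
    return separation / diameter

points = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(dunn_index(points, [0, 0, 1, 1]))  # compact, well-separated clusters
print(dunn_index(points, [0, 1, 0, 1]))  # stretched, interleaved clusters
```

An internal index of this kind needs only the data and the labels, which is what makes it usable for scoring ensemble members when no ground truth is available.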
In the thesis we address all the major stages of data clustering: data generation, data
analysis using single-clustering algorithms, cluster validity using internal and external
indices, and finally the cluster ensemble approach with the focus on the weighted variants.
All the contributions are compared to the state-of-the-art methods using datasets
from various problem domains. Results are positive and encourage the inclusion of
the proposed algorithms in the machine-learning practitioner’s toolbox.
A Method to Improve the Analysis of Cluster Ensembles
Clustering is fundamental to understanding the structure of data. In the past decade the cluster ensemble problem has been introduced, which combines a set of partitions (an ensemble) of the data to obtain a single consensus solution that outperforms all the ensemble members. However, there is disagreement about which ensemble characteristics are best for obtaining good performance: some authors have suggested that highly different partitions within the ensemble are beneficial for the final performance, whereas others have stated that medium diversity among them is better. While there are several measures to quantify the diversity, a better method to analyze the best ensemble characteristics is necessary. This paper introduces a new ensemble generation strategy and a method to make slight changes in its structure. Experimental results on six datasets suggest that this is an important step towards a more systematic approach to analyzing the impact of the ensemble characteristics on the overall consensus performance.
Affiliation: Pividori, Milton Damián. Universidad Tecnologica Nacional. Facultad Regional Santa Fe. Centro de Investigacion y Desarrollo de Ingenieria en Sistemas de Informacion; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
Affiliation: Stegmayer, Georgina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina. Universidad Tecnologica Nacional. Facultad Regional Santa Fe. Centro de Investigacion y Desarrollo de Ingenieria en Sistemas de Informacion; Argentina
Affiliation: Milone, Diego Humberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentina
Combining Multiple Clusterings via Crowd Agreement Estimation and Multi-Granularity Link Analysis
The clustering ensemble technique aims to combine multiple clusterings into a
probably better and more robust clustering and has received increasing
attention in recent years. Existing clustering ensemble approaches have two
main limitations. Firstly, many approaches lack the
ability to weight the base clusterings without access to the original data and
can be affected significantly by low-quality, or even ill, clusterings.
Secondly, they generally focus on the instance level or cluster level in the
ensemble system and fail to integrate multi-granularity cues into a unified
model. To address these two limitations, this paper proposes to solve the
clustering ensemble problem via crowd agreement estimation and
multi-granularity link analysis. We present the normalized crowd agreement
index (NCAI) to evaluate the quality of base clusterings in an unsupervised
manner and thus weight the base clusterings in accordance with their clustering
validity. To explore the relationship between clusters, the source-aware
connected triple (SACT) similarity is introduced with regard to their common
neighbors and the source reliability. Based on NCAI and multi-granularity
information collected among base clusterings, clusters, and data instances, we
further propose two novel consensus functions, termed weighted evidence
accumulation clustering (WEAC) and graph partitioning with multi-granularity
link analysis (GP-MGLA) respectively. The experiments are conducted on eight
real-world datasets. The experimental results demonstrate the effectiveness and
robustness of the proposed methods.
Comment: The MATLAB source code of this work is available at:
https://www.researchgate.net/publication/28197031
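The idea of weighting base clusterings during evidence accumulation can be sketched as below. This is an illustration only: the weights are hypothetical placeholders standing in for a quality score such as NCAI, whose formula is not given in the abstract, and the function is not the authors' WEAC implementation.

```python
def weighted_co_association(ensemble, weights):
    """Co-association matrix in which each base clustering votes with its weight.

    ensemble: list of hard label lists, all of the same length.
    weights: one non-negative weight per base clustering (hypothetical scores).
    """
    n = len(ensemble[0])
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, weight in zip(ensemble, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += weight
    # Normalize so that entries lie in [0, 1].
    return [[value / total for value in row] for row in matrix]

ensemble = [[0, 0, 1, 1], [0, 1, 1, 1]]
weights = [3.0, 1.0]  # assumed: the first clustering is judged more reliable
matrix = weighted_co_association(ensemble, weights)
print(matrix[0][1], matrix[1][2], matrix[2][3])
```

A final clustering could then be extracted from this matrix, for example by hierarchical agglomerative clustering on one minus the matrix entries, so that low-weight clusterings have less influence on the consensus.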
Paradigm of tunable clustering using binarization of consensus partition matrices (Bi-CoPaM) for gene discovery
Copyright © 2013 Abu-Jamous et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Clustering analysis has a growing role in the study of co-expressed genes for gene discovery. Conventional binary and fuzzy clustering do not embrace the biological reality that some genes may be irrelevant for a problem and not be assigned to a cluster, while other genes may participate in several biological functions and should simultaneously belong to multiple clusters. Also, these algorithms cannot generate tight clusters that focus on their cores or wide clusters that overlap and contain all possibly relevant genes. In this paper, a new clustering paradigm is proposed. In this paradigm, all three eventualities of a gene being exclusively assigned to a single cluster, being assigned to multiple clusters, and being not assigned to any cluster are possible. These possibilities are realised through the primary novelty of the introduction of tunable binarization techniques. Results from multiple clustering experiments are aggregated to generate one fuzzy consensus partition matrix (CoPaM), which is then binarized to obtain the final binary partitions. This is referred to as Binarization of Consensus Partition Matrices (Bi-CoPaM). The method has been tested with a set of synthetic datasets and a set of five real yeast cell-cycle datasets. The results demonstrate its validity in generating relevant tight, wide, and complementary clusters that can meet requirements of different gene discovery studies.
National Institute for Health Research
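The binarization step can be illustrated with a minimal sketch. It assumes a fuzzy consensus matrix stored with one row per gene and one column per cluster, and it uses a simple value-threshold rule chosen for illustration; the paper's own tunable binarization techniques may differ. The function name is hypothetical.

```python
def binarize_by_threshold(copam, threshold):
    """Assign a gene to every cluster whose consensus membership reaches `threshold`.

    copam: rows are genes, columns are clusters, entries are fuzzy memberships.
    A gene may end up in no cluster, exactly one cluster, or several clusters.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in copam]

copam = [
    [0.9, 0.1],  # gene strongly associated with the first cluster
    [0.5, 0.5],  # gene split evenly between both clusters
    [0.6, 0.4],  # gene weakly associated with the first cluster
]
print(binarize_by_threshold(copam, 0.5))  # wider clusters, overlap allowed
print(binarize_by_threshold(copam, 0.7))  # tighter clusters, some genes unassigned
```

Raising the threshold tightens the clusters around their cores and leaves ambiguous genes unassigned, while lowering it widens the clusters and lets them overlap.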
Yeast gene CMR1/YDL156W is consistently co-expressed with genes participating in DNA-metabolic processes in a variety of stringent clustering experiments
© 2013 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/3.0/, which permits unrestricted use, provided the original author and source are credited.
The binarization of consensus partition matrices (Bi-CoPaM) method has, among its unique features, the ability to perform ensemble clustering over the same set of genes from multiple microarray datasets by using various clustering methods in order to generate tunable tight clusters. Therefore, we have applied the Bi-CoPaM method to the most synchronized 500 cell-cycle-regulated yeast genes from different microarray datasets to produce four tight, specific and exclusive clusters of co-expressed genes. We found that 19 genes formed the tightest of the four clusters, and this included the gene CMR1/YDL156W, which was an uncharacterized gene at the time of our investigations. Two very recent proteomic and biochemical studies have independently revealed many facets of CMR1 protein, although the precise functions of the protein remain to be elucidated. Our computational results complement these biological results and add more evidence to their recent findings of CMR1 as potentially participating in many of the DNA-metabolism processes such as replication, repair and transcription. Interestingly, our results demonstrate the close co-expressions of CMR1 and the replication protein A (RPA), the cohesion complex and the DNA polymerases α, δ and ɛ, as well as suggest functional relationships between CMR1 and the respective proteins. In addition, the analysis provides further substantial evidence that the expression of the CMR1 gene could be regulated by the MBF complex. In summary, the application of a novel analytic technique in large biological datasets has provided supporting evidence for a gene of previously unknown function, further hypotheses to test, and a more general demonstration of the value of sophisticated methods to explore new large datasets now so readily generated in biological experiments.
National Institute for Health Research