Comparing high dimensional partitions, with the Coclustering Adjusted Rand Index
We consider the simultaneous clustering of rows and columns of a matrix and
more particularly the ability to measure the agreement between two
co-clustering partitions. The new criterion we developed is based on the
Adjusted Rand Index and is called the Co-clustering Adjusted Rand Index (CARI).
We also suggest improvements to existing criteria such as the
Classification Error which counts the proportion of misclassified cells and the
Extended Normalized Mutual Information criterion which is a generalization of
the criterion based on mutual information in the case of classic
classifications. We study these criteria with regard to some desired properties
deriving from the co-clustering context. Experiments on simulated and real
observed data are proposed to compare the behavior of these criteria. Comment: 52 pages
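For readers unfamiliar with the Adjusted Rand Index on which CARI is based, the following is a minimal sketch of the standard index between two partitions. The helper `cell_labels`, which labels each matrix cell by its (row cluster, column cluster) pair, is a hypothetical illustration of how a co-clustering can be treated as a partition of cells; it is an assumption made here and not necessarily how the paper computes CARI.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Standard Adjusted Rand Index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree
    # Pairs of items placed together in both partitions (contingency table cells).
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(count, 2) for count in joint.values())
    # Pairs placed together within each partition separately (table margins).
    sum_a = sum(comb(count, 2) for count in Counter(labels_a).values())
    sum_b = sum(comb(count, 2) for count in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single cluster
    return (sum_joint - expected) / (max_index - expected)


def cell_labels(row_labels, col_labels):
    """Hypothetical helper: label every cell by its (row cluster, column cluster)."""
    return [(r, c) for r in row_labels for c in col_labels]


if __name__ == "__main__":
    # Two co-clusterings of a 3 x 2 matrix that differ only by cluster renaming.
    first = cell_labels([0, 0, 1], [0, 1])
    second = cell_labels([1, 1, 0], [1, 0])
    print(adjusted_rand_index(first, second))
```

Enumerating every cell is quadratic in the matrix dimensions, so this form is only meant to make the definition concrete on small examples.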
Analysis of a Gibbs sampler method for model based clustering of gene expression data
Over the last decade, a large variety of clustering algorithms have been
developed to detect coregulatory relationships among genes from microarray gene
expression data. Model based clustering approaches have emerged as
statistically well grounded methods, but the properties of these algorithms
when applied to large-scale data sets are not always well understood. An
in-depth analysis can reveal important insights about the performance of the
algorithm, the expected quality of the output clusters, and the possibilities
for extracting more relevant information out of a particular data set. We have
extended an existing algorithm for model based clustering of genes to
simultaneously cluster genes and conditions, and used three large compendia of
gene expression data for S. cerevisiae to analyze its properties. The algorithm
uses a Bayesian approach and a Gibbs sampling procedure to iteratively update
the cluster assignment of each gene and condition. For large-scale data sets,
the posterior distribution is strongly peaked on a limited number of
equiprobable clusterings. A GO annotation analysis shows that these local
maxima are all biologically equally significant, and that simultaneously
clustering genes and conditions performs better than only clustering genes and
assuming independent conditions. A collection of distinct equivalent
clusterings can be summarized as a weighted graph on the set of genes, from
which we extract fuzzy, overlapping clusters using a graph spectral method. The
cores of these fuzzy clusters contain tight sets of strongly coexpressed genes,
while the overlaps exhibit relations between genes showing only partial
coexpression. Comment: 8 pages, 7 figures
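One way to picture the weighted graph mentioned in this abstract is sketched below. The abstract does not say how edge weights are defined, so the choice here (the fraction of sampled clusterings in which two genes share a cluster) is an assumption, and the function name `coassociation_weights` is hypothetical. The graph spectral step that extracts fuzzy clusters is not shown.

```python
from itertools import combinations


def coassociation_weights(clusterings):
    """Edge weights for a gene graph built from several clusterings.

    `clusterings` is a list of label sequences, one label per gene, all of the
    same length. The weight of edge (i, j) is the fraction of clusterings in
    which genes i and j share a cluster (an assumed definition).
    """
    if not clusterings:
        raise ValueError("at least one clustering is required")
    n_items = len(clusterings[0])
    counts = {}
    for labels in clusterings:
        if len(labels) != n_items:
            raise ValueError("all clusterings must label the same genes")
        for i, j in combinations(range(n_items), 2):
            if labels[i] == labels[j]:
                counts[(i, j)] = counts.get((i, j), 0) + 1
    # Pairs that never co-occur are omitted, which keeps the graph sparse.
    return {pair: count / len(clusterings) for pair, count in counts.items()}


if __name__ == "__main__":
    samples = [[0, 0, 1], [0, 1, 1]]
    print(coassociation_weights(samples))
```

Under this definition, gene pairs with weight close to 1 correspond to tightly coexpressed cores, and intermediate weights correspond to genes that are grouped together only in some of the clusterings.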
Probabilistic Clustering of Sequences: Inferring new bacterial regulons by comparative genomics
Genome wide comparisons between enteric bacteria yield large sets of
conserved putative regulatory sites on a gene by gene basis that need to be
clustered into regulons. Using the assumption that regulatory sites can be
represented as samples from weight matrices we derive a unique probability
distribution for assignments of sites into clusters. Our algorithm, 'PROCSE'
(probabilistic clustering of sequences), uses Monte-Carlo sampling of this
distribution to partition and align thousands of short DNA sequences into
clusters. The algorithm internally determines the number of clusters from the
data, and assigns significance to the resulting clusters. We place theoretical
limits on the ability of any algorithm to correctly cluster sequences drawn
from weight matrices (WMs) when these WMs are unknown. Our analysis suggests
that the set of all putative sites for a single genome (e.g. E. coli) is
largely inadequate for clustering. When sites from different genomes are
combined and all the homologous sites from the various species are used as a
block, clustering becomes feasible. We predict 50-100 new regulons as well as
many new members of existing regulons, potentially doubling the number of known
regulatory sites in E. coli. Comment: 27 pages including 9 figures and 3 tables
Study of the spatio-temporal correlations of mobile calls in France
In this article we present an application that analyzes a large database from the telecommunications sector. The problem is to segment a territory and characterize the resulting zones through the mobile phone behavior of their inhabitants. For this we have a network of calls between antennas, built over a five-month period across the whole of France. We propose an analysis in two phases. The first groups together the emitting antennas whose calls are similarly distributed over the receiving antennas, and vice versa. Projecting these groups of antennas onto a map of France makes it possible to visualize the correlations between the geography of the territory and the telephone behavior of its inhabitants. The second phase splits the year into periods between which a change is observed in the distributions of calls leaving the groups of antennas. The temporal evolution of mobile users' behavior can thus be characterized in each zone of the country.
Recent Developments in Document Clustering
This report aims to give a brief overview of the current state of document clustering research and present recent developments in a well-organized manner. Clustering algorithms are considered with two hypothetical scenarios in mind: online query clustering with tight efficiency constraints, and offline clustering with an emphasis on accuracy. A comparative analysis of the algorithms is performed along with a table summarizing important properties, and open problems as well as directions for future research are discussed.