Analysis of a Gibbs sampler method for model based clustering of gene expression data
Over the last decade, a large variety of clustering algorithms have been
developed to detect coregulatory relationships among genes from microarray gene
expression data. Model based clustering approaches have emerged as
statistically well grounded methods, but the properties of these algorithms
when applied to large-scale data sets are not always well understood. An
in-depth analysis can reveal important insights about the performance of the
algorithm, the expected quality of the output clusters, and the possibilities
for extracting more relevant information out of a particular data set. We have
extended an existing algorithm for model based clustering of genes to
simultaneously cluster genes and conditions, and used three large compendia of
gene expression data for S. cerevisiae to analyze its properties. The algorithm
uses a Bayesian approach and a Gibbs sampling procedure to iteratively update
the cluster assignment of each gene and condition. For large-scale data sets,
the posterior distribution is strongly peaked on a limited number of
equiprobable clusterings. A GO annotation analysis shows that these local
maxima are all biologically equally significant, and that simultaneously
clustering genes and conditions performs better than only clustering genes and
assuming independent conditions. A collection of distinct equivalent
clusterings can be summarized as a weighted graph on the set of genes, from
which we extract fuzzy, overlapping clusters using a graph spectral method. The
cores of these fuzzy clusters contain tight sets of strongly coexpressed genes,
while the overlaps exhibit relations between genes showing only partial
coexpression.
Comment: 8 pages, 7 figures
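The step of summarizing several equivalent clusterings as a weighted graph on the genes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `coclustering_graph`, the uniform default weights, and the toy label vectors are all assumptions, and the spectral extraction of fuzzy clusters is not shown.

```python
import numpy as np

def coclustering_graph(clusterings, weights=None):
    """Return an n x n matrix whose (i, j) entry is the weighted fraction
    of clusterings that place items i and j in the same cluster."""
    labels = np.asarray(clusterings)            # shape: (n_clusterings, n_items)
    if weights is None:
        weights = np.ones(labels.shape[0])      # assumption: equiprobable clusterings
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # same[k, i, j] is True when clustering k groups items i and j together
    same = labels[:, :, None] == labels[:, None, :]
    graph = np.tensordot(weights, same.astype(float), axes=1)
    np.fill_diagonal(graph, 0.0)                # drop self-loops
    return graph

# Hypothetical toy input: three clusterings of four genes.
clusterings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
graph = coclustering_graph(clusterings)
```

Edge weights near 1 mark pairs that are grouped together in almost every clustering, while intermediate weights mark the partial relations that overlapping clusters are meant to capture.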
Silhouette + Attraction: A Simple and Effective Method for Text Clustering
This article presents silhouette-attraction (Sil-Att), a simple and effective method for text clustering based on two main concepts: the silhouette coefficient and the idea of attraction. Combining both principles yields a general technique that can be used either as a boosting method, which improves the results of other clustering algorithms, or as an independent clustering algorithm. The experimental work shows that Sil-Att obtains high-quality results on text corpora with very different characteristics. Furthermore, its stable performance on all the considered corpora indicates that it is a very robust method. This is a notable advantage of Sil-Att over the other algorithms used in the experiments, whose performance depends heavily on specific characteristics of the corpora being considered.
This research work has been partially funded by UNSL, CONICET (Argentina), the DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research project, and the WIQ-EI IRSES project (grant no. 269180) within the FP 7 Marie Curie People Framework on Web Information Quality Evaluation Initiative. The work of the third author was also done in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
Errecalde, M.; Cagnina, L.; Rosso, P. (2015). Silhouette + Attraction: A Simple and Effective Method for Text Clustering. Natural Language Engineering. 1-40. https://doi.org/10.1017/S1351324915000273
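The silhouette coefficient on which the method builds can be sketched as follows. This covers only the standard per-point silhouette (Rousseeuw, 1987), not the attraction step or the authors' algorithm. The Euclidean distance, the function name `silhouette_scores`, and the toy points are assumptions; text clustering would typically use a document similarity measure instead. The sketch also assumes at least two clusters and no exact duplicate points.

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-point silhouette s(i) = (b - a) / max(a, b), where a is the mean
    distance to the point's own cluster and b is the mean distance to the
    nearest other cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances (assumption: Euclidean geometry).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue                      # singleton cluster: s(i) = 0 by convention
        # dist[i, i] is 0, so dividing by (size - 1) excludes the point itself
        a = dist[i, own].sum() / (own.sum() - 1)
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical toy data: two well-separated pairs of points.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
scores = silhouette_scores(points, [0, 0, 1, 1])
```

Values close to 1 indicate points that sit well inside their cluster, and negative values indicate points that are closer on average to another cluster.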
A Bayesian alternative to mutual information for the hierarchical clustering of dependent random variables
The use of mutual information as a similarity measure in agglomerative
hierarchical clustering (AHC) raises an important issue: some correction needs
to be applied for the dimensionality of variables. In this work, we formulate
the decision of merging dependent multivariate normal variables in an AHC
procedure as a Bayesian model comparison. We found that the Bayesian
formulation naturally shrinks the empirical covariance matrix towards a matrix
set a priori (e.g., the identity), provides an automated stopping rule, and
corrects for dimensionality using a term that scales up the measure as a
function of the dimensionality of the variables. Also, the resulting log Bayes
factor is asymptotically proportional to the plug-in estimate of mutual
information, with an additive correction for dimensionality in agreement with
the Bayesian information criterion. We investigated the behavior of these
Bayesian alternatives (in exact and asymptotic forms) to mutual information on
simulated and real data. An encouraging result was first derived on
simulations: the hierarchical clustering based on the log Bayes factor
outperformed off-the-shelf clustering techniques as well as raw and normalized
mutual information in terms of classification accuracy. On a toy example, we
found that the Bayesian approaches led to results that were similar to those of
mutual information clustering techniques, with the advantage of an automated
thresholding. On real functional magnetic resonance imaging (fMRI) datasets
measuring brain activity, the Bayesian approaches identified clusters consistent with the
established outcome of standard procedures. On this application, normalized
mutual information had a highly atypical behavior, in the sense that it
systematically favored very large clusters. These initial experiments suggest
that the proposed Bayesian alternatives to mutual information are a useful new
tool for hierarchical clustering.
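The asymptotic relation described above (plug-in mutual information with an additive, BIC-style correction for dimensionality) can be sketched as follows. This is not the paper's exact Bayes factor, which the abstract does not give. The penalty term assumes that the dependent model has d_X * d_Y more free covariance parameters than the independent one, and the function names and simulated data are hypothetical.

```python
import numpy as np

def gaussian_mutual_information(X, Y):
    """Plug-in mutual information (in nats) between two blocks of variables,
    assuming joint multivariate normality."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    dx = X.shape[1]
    # Maximum-likelihood covariance of the stacked variables (bias=True).
    cov = np.cov(np.hstack([X, Y]), rowvar=False, bias=True)
    _, logdet_x = np.linalg.slogdet(cov[:dx, :dx])
    _, logdet_y = np.linalg.slogdet(cov[dx:, dx:])
    _, logdet_xy = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_y - logdet_xy)

def bic_merge_score(X, Y):
    """BIC-style approximation of the log Bayes factor for 'dependent' versus
    'independent': N * MI minus a penalty for the extra d_X * d_Y parameters."""
    n, dx = np.asarray(X).shape
    dy = np.asarray(Y).shape[1]
    return n * gaussian_mutual_information(X, Y) - 0.5 * dx * dy * np.log(n)

# Hypothetical simulated data: one dependent block and one independent block.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
Y_dependent = X + 0.1 * rng.normal(size=(500, 2))
Y_independent = rng.normal(size=(500, 2))
score_dependent = bic_merge_score(X, Y_dependent)
score_independent = bic_merge_score(X, Y_independent)
```

Under this approximation a positive score favors merging the two blocks and a negative score favors keeping them apart, which is one way to read the automated stopping rule mentioned in the abstract.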
Hierarchically Clustered Representation Learning
The joint optimization of representation learning and clustering in the
embedding space has experienced a breakthrough in recent years. In spite of the
advance, clustering with representation learning has been limited to flat-level
categories, which often involves cohesive clustering with a focus on instance
relations. To overcome the limitations of flat clustering, we introduce
hierarchically-clustered representation learning (HCRL), which simultaneously
optimizes representation learning and hierarchical clustering in the embedding
space. Compared with the few prior works, HCRL is the first attempt to consider the
generation of deep embeddings from every component of the hierarchy, not just
leaf components. In addition to obtaining hierarchically clustered embeddings,
we can reconstruct data at the various abstraction levels, infer the intrinsic
hierarchical structure, and learn the level-proportion features. We conducted
evaluations with image and text domains, and our quantitative analyses showed
competent likelihoods and the best accuracies compared with the baselines.
Comment: 10 pages, 7 figures, under review as a conference paper
Feedback-Driven Data Clustering
The acquisition and analysis of data have become common yet critical tasks in many areas of the modern economy and research. Unfortunately, the ever-increasing scale of datasets has long outgrown the capacities and abilities humans can muster to extract information from them and gain new knowledge. For this reason, research areas like data mining and knowledge discovery steadily gain importance. The algorithms they provide for the extraction of knowledge are mandatory prerequisites that enable people to analyze large amounts of information. Among the approaches offered by these areas, clustering is one of the most fundamental. By finding groups of similar objects inside the data, it aims to identify meaningful structures that constitute new knowledge. Clustering results are also often used as input for other analysis techniques like classification or forecasting.
As clustering extracts new and unknown knowledge, it has no access to any form of ground truth. For this reason, clustering results have a hypothetical character and must be interpreted with respect to the application domain. This makes clustering very challenging and leads to an extensive and diverse landscape of available algorithms. Most of these are expert tools that are tailored to a single narrowly defined application scenario. Over the years, this specialization has become a major trend that arose to counter the inherent uncertainty of clustering by including as many domain specifics as possible in the algorithms. While customized methods often improve result quality, they become more and more complicated to handle and lose versatility. This creates a dilemma, especially for amateur users, whose numbers are increasing as clustering is applied in more and more domains. While an abundance of tools is offered, guidance is severely lacking, and users are left alone with critical tasks like algorithm selection, parameter configuration, and the interpretation and adjustment of results.
This thesis aims to solve this dilemma by structuring and integrating the necessary steps of clustering into a guided and feedback-driven process. In doing so, users are provided with a default modus operandi for the application of clustering. Two main components constitute the core of said process: the algorithm management and the visual-interactive interface. Algorithm management handles all aspects of actual clustering creation and the involved methods. It employs a modular approach for algorithm description that allows users to understand, design, and compare clustering techniques with the help of building blocks. In addition, algorithm management offers facilities for the integration of multiple clusterings of the same dataset into an improved solution. New approaches based on ensemble clustering not only allow the utilization of different clustering techniques, but also ease their application by acting as an abstraction layer that unifies individual parameters. Finally, this component provides a multi-level interface that structures all available control options and provides the docking points for user interaction.
The visual-interactive interface supports users during result interpretation and adjustment. For this, the defining characteristics of a clustering are communicated via a hybrid visualization. In contrast to traditional data-driven visualizations that tend to become overloaded and unusable with increasing volume/dimensionality of data, this novel approach communicates the abstract aspects of cluster composition and relations between clusters. This aspect orientation allows the use of easy-to-understand visual components and makes the visualization immune to scale related effects of the underlying data. This visual communication is attuned to a compact and universally valid set of high-level feedback that allows the modification of clustering results. Instead of technical parameters that indirectly cause changes in the whole clustering by influencing its creation process, users can employ simple commands like merge or split to directly adjust clusters.
The orchestrated cooperation of these two main components creates a modus operandi in which clusterings are no longer created and discarded as a whole until a satisfying result is obtained. Instead, users apply the feedback-driven process to iteratively refine an initial solution. Performance and usability of the proposed approach were evaluated with a user study. Its results show that the feedback-driven process enabled amateur users to easily create satisfying clustering results even from different and suboptimal starting situations.
Nonparametric Hierarchical Clustering of Functional Data
In this paper, we deal with the problem of curve clustering. We propose a
nonparametric method which partitions the curves into clusters and discretizes
the dimensions of the curve points into intervals. The cross-product of these
partitions forms a data-grid which is obtained using a Bayesian model selection
approach while making no assumptions regarding the curves. Finally, a
post-processing technique, aiming at reducing the number of clusters in order
to improve the interpretability of the clustering, is proposed. It consists in
optimally merging the clusters step by step, which corresponds to an
agglomerative hierarchical classification whose dissimilarity measure is the
variation of the criterion. Interestingly, this measure is none other than the
sum of the Kullback-Leibler divergences between cluster distributions before
and after the merges. The practical interest of the approach for functional
data exploratory analysis is presented and compared with an alternative
approach on an artificial and a real world data set.
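The idea of a merge dissimilarity built from Kullback-Leibler divergences can be sketched as follows. This is an illustration under stated assumptions, not the criterion from the paper: each cluster is represented here by a histogram of counts over the cells of a grid, and each divergence is weighted by the cluster size. Both choices are hypothetical, as are the function names `kl_divergence` and `merge_cost`.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def merge_cost(counts_a, counts_b):
    """Size-weighted sum of the KL divergences from each cluster's distribution
    (before the merge) to the distribution of the merged cluster (after it)."""
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    n_a, n_b = counts_a.sum(), counts_b.sum()
    merged = (counts_a + counts_b) / (n_a + n_b)
    return (n_a * kl_divergence(counts_a / n_a, merged)
            + n_b * kl_divergence(counts_b / n_b, merged))

# Hypothetical histograms over three grid cells.
similar_cost = merge_cost([10, 20, 30], [5, 10, 15])     # same shape, different size
different_cost = merge_cost([30, 5, 0], [0, 5, 30])      # nearly opposite shapes
```

An agglomerative procedure in this spirit would repeatedly merge the pair of clusters with the smallest cost, so clusters with nearly identical distributions are merged first.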