Many different relative clustering validity criteria exist that are very useful as quantitative measures for assessing the quality of data partitions. These criteria are endowed with particular features that may make each of them more suitable for specific classes of problems. Nevertheless, the performance of each criterion is usually unknown a priori by the user. Hence, choosing a specific criterion is not a trivial task. A possible approach to circumvent this drawback consists of combining different relative criteria in order to obtain more robust evaluations. However, this approach has so far been applied in an ad hoc fashion only; its real potential is actually not well understood. In this paper, we present an extensive study on the combination of relative criteria considering both synthetic and real datasets. The experiments involved 28 criteria and 4 different combination strategies applied to a varied collection of data partitions produced by 5 clustering algorithms. In total, 427,680 partitions of 972 synthetic datasets and 14,000 partitions of a collection of 400 image datasets were considered. Based on the results, we discuss the shortcomings and possible benefits of combining different relative criteria into a committee.
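
As a rough illustration of what combining relative criteria into a committee can look like, the sketch below aggregates several criteria by mean rank. Mean-rank aggregation is an assumption chosen for simplicity and is not necessarily one of the four strategies studied in the paper; the criterion names and scores are made up, and every criterion is assumed to be oriented so that higher is better.

```python
def ranks(scores):
    """Rank partitions by score, 1 = best (highest); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next partition has the same score.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = mean_rank
        pos = end + 1
    return result


def committee_choice(criteria_scores):
    """Return (index of the partition with the lowest mean rank, all mean ranks)."""
    per_criterion = [ranks(s) for s in criteria_scores.values()]
    n = len(per_criterion[0])
    mean_ranks = [sum(r[i] for r in per_criterion) / len(per_criterion) for i in range(n)]
    return min(range(n), key=lambda i: mean_ranks[i]), mean_ranks


# Hypothetical scores of three candidate partitions under three criteria.
criteria_scores = {
    "criterion_1": [0.62, 0.71, 0.40],
    "criterion_2": [0.55, 0.50, 0.20],
    "criterion_3": [0.30, 0.45, 0.10],
}
best, mean_ranks = committee_choice(criteria_scores)
print(best, mean_ranks)
```

Working on ranks rather than raw scores sidesteps the different scales of the criteria. A criterion that is minimized would need its scores negated before ranking.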

ResearchOnline at James Cook University

Comparing Correlation Coefficients as Dissimilarity Measures for Cancer Classification in Gene Expression Data

Author: Pablo A. Jaskowiak
Ricardo J. G. B. Campello
Publication date: 01/01/2011

Abstract. An important analysis performed in gene expression data is sample classification, e.g., the classification of different types or subtypes of cancer. Different classifiers have been employed for this challenging task, among which the k-Nearest Neighbors (kNN) classifier stands out for being at the same time very simple and highly flexible in terms of discriminatory power. Although the choice of a dissimilarity measure is essential to kNN, little effort has been undertaken to evaluate how this choice affects its performance in cancer classification. To this end, we compare seven correlation coefficients for cancer classification using kNN. Our comparison suggests that a recently introduced correlation may perform better than commonly used measures. We also show that correlation coefficients rarely considered can provide competitive results when compared to widely used dissimilarity measures.
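
To make the role of the dissimilarity measure in kNN concrete, here is a minimal sketch that plugs a correlation-based dissimilarity into a majority-vote kNN classifier. It assumes the common conversion d = 1 - r with the Pearson coefficient; the abstract does not state which conversion or which seven coefficients the authors used, and the expression profiles and labels below are toy values made up for illustration.

```python
from collections import Counter
from math import sqrt


def pearson_dissimilarity(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # constant profile: treat the correlation as 0 (assumption)
    return 1.0 - cov / (norm_x * norm_y)


def knn_predict(train_profiles, train_labels, query, k=3):
    """Majority vote among the k training profiles least dissimilar to query."""
    ranked = sorted(
        zip(train_profiles, train_labels),
        key=lambda pair: pearson_dissimilarity(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy profiles (made up for illustration, not taken from the paper).
train_profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 6.2, 7.9],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.1, 3.9, 2.0],
]
train_labels = ["A", "A", "B", "B"]
prediction = knn_predict(train_profiles, train_labels, [10.0, 20.0, 31.0, 39.0], k=3)
print(prediction)
```

Swapping `pearson_dissimilarity` for another function with the same signature is all that is needed to compare coefficients under the same classifier.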

CiteSeerX

Evaluating correlation coefficients for clustering gene expression profiles of cancer

Author: Campello Ricardo J.G.B.
Costa Ivan G.
Jaskowiak Pablo A.
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2012

Cluster analysis is usually the first step adopted to unveil information from gene expression data. One of its common applications is the clustering of cancer samples, associated with the detection of previously unknown cancer subtypes. Although guidelines have been established concerning the choice of appropriate clustering algorithms, little attention has been given to the subject of proximity measures. Whereas the Pearson correlation coefficient appears as the de facto proximity measure in this scenario, no comprehensive study analyzing other correlation coefficients as alternatives to it has been conducted. Considering such facts, we evaluated five correlation coefficients (along with Euclidean distance) regarding the clustering of cancer samples. Our evaluation was conducted on 35 publicly available datasets covering both (i) intrinsic separation ability and (ii) clustering predictive ability of the correlation coefficients. Our results support that correlation coefficients rarely considered in the gene expression literature may provide competitive results to more generally employed ones. Finally, we show that a recently introduced measure arises as a promising alternative to the commonly employed Pearson, providing competitive and even superior results to it.

Changes in selective affinity following transdetermination in imaginal disc cells of Drosophila melanogaster

Author: Campello Ricardo J.G.B.
Costa Ivan G.
Jaskowiak Pablo A.
Publication venue: Elsevier
Publication date: 01/11/1966

Cell recombinates of e and y; mwh wing disc cells cultured in vivo can differentiate allotypic as well as autotypic structures. The same is also true for foreleg homonomic recombinates. As the culture time and the regenerative growth increase, the frequency of transdetermination also rises. Under these conditions e cells transdetermine more easily than y; mwh ones, and foreleg cells more frequently than wing cells. On the other hand, transdetermination into leg or wing allotypic structures is more frequent than into head structures. In heteronomic combinations of foreleg and wing disc cells, wing transdetermined cells (deriving from foreleg) can move out and integrate into chimeric patterns within wing conservative territories. The reverse is also true for leg transdetermined cells. From considerations of the size and completeness of the transdetermined territories, it is assumed that the transdetermination can occur in an entire blastema rather than in single cells. The nature of the transdetermination phenomenon and its bearing on cell-specific differentiation and on cell-specific affinities is discussed.

Publikationsserver der RWTH Aachen University

Digital.CSIC

Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis

Author: Ivan G. Costa
Pablo A. Jaskowiak
Ricardo J.G.B. Campello
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref

On the selection of appropriate distances for gene expression data clustering

Author: Campello Ricardo J. G. B.
Costa Ivan G.
Jaskowiak Pablo A.
Publication venue: BioMed Central
Publication date: 01/01/2014

Background: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure its results can help researchers to gain insights and formulate new hypotheses about biological data from microarrays. Given different settings of microarray experiments, clustering proves itself as a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. In order to obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance is going to be employed between data objects is probably one of the most difficult decisions. Results and conclusions: We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression, i.e., microarray data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated in a number of different scenarios including clustering of cancer tissues and genes from short time-series expression data, the two main clustering applications in gene expression. Our results support that the selection of an appropriate distance depends on the scenario at hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Springer - Publisher Connector

PubMed Central

RCAAP - Repositório Científico de Acesso Aberto de Portugal

Universidade de São Paulo

A comparative study on the use of correlation coefficients for redundant feature elimination

Author: Campello Ricardo J.G.B.
Covões Thiago F.
Hruschka Eduardo R.
Jaskowiak Pablo A.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2010

Simplified Silhouette Filter (SSF) is a recently introduced feature selection method that automatically estimates the number of features to be selected. To do so, a sampling strategy is combined with a clustering algorithm that seeks clusters of correlated (potentially redundant) features. It is well known that the choice of a similarity measure may have a great impact on clustering results. As a consequence, in this application scenario, this choice may have a great impact on the feature subset to be selected. In this paper we study six correlation coefficients as similarity measures in the clustering stage of SSF, thus giving rise to several variants of the original method. The obtained results show that, in particular scenarios, some correlation measures select fewer features than others, while providing accurate classifiers.

Density-based clustering validation

Author: Campello Ricardo J.G.B.
Jaskowiak Pablo A.
Moulavi Davoud
Sander Jörg
Zimek Arthur
Publication venue: SIAM
Publication date: 01/01/2014

One of the most challenging aspects of clustering is validation, which is the objective and quantitative assessment of clustering results. A number of different relative validity criteria have been proposed for the validation of globular clusters. Not all data, however, are composed of globular clusters. Density-based clustering algorithms seek partitions with high density areas of points (clusters, not necessarily globular) separated by low density areas, possibly containing noise objects. In these cases relative validity indices proposed for globular cluster validation may fail. In this paper we propose a relative validation index for density-based, arbitrarily shaped clusters. The index assesses clustering quality based on the relative density connection between pairs of objects. Our index is formulated on the basis of a new kernel density function, which is used to compute the density of objects and to evaluate the within- and between-cluster density connectedness of clustering results. Experiments on synthetic and real world data show the effectiveness of our approach for the evaluation and selection of clustering algorithms and their respective appropriate parameters.
