
    On clustering stability

    JEL Classification: C100; C150; C380
    This work evaluates the stability of clustering solutions, specifically of crisp clusterings (partitions). We define stability as the concordance of clusterings across several samples. To evaluate stability, we use a weighted cross-validation procedure whose result is summarized by the values of simple and paired agreement indices. To remove the share of agreement due to chance from these values, we propose a new method, IADJUST, that relies on simulated cross-classification tables. This contribution makes it feasible to correct any index of agreement. The stability experiments rely on 540 simulated data sets, with the number of clusters, their balance, and their overlap as design factors. Six real data sets with a priori known clusters are also considered. The experiments illustrate the precision and pertinence of the IADJUST procedure and reveal the distribution of the indices under the hypothesis of agreement by chance. We therefore recommend that the use of adjusted indices become common practice when addressing stability. We then compare the stability of two clustering algorithms and conclude that Expectation-Maximization (EM) results are more stable than K-means results on unbalanced data sets. Finally, we explore the relationship between stability and the external validity of a clustering solution. When the results of all experimental scenarios are considered together, there is a strong correlation between stability and external validity. However, within a specific experimental scenario (that is, when a practical clustering task is considered), we find no relationship between stability and agreement with ground truth.
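The idea of correcting an agreement index for chance by simulation can be sketched as follows. This is only an illustration and not the IADJUST procedure itself: the abstract does not say how the cross-classification tables are simulated, so the sketch assumes they are generated by permuting one label vector (which keeps both partitions' cluster sizes fixed) and that the index has a maximum of 1. The names `rand_index` and `simulated_adjustment` are hypothetical.

```python
import random

def rand_index(a, b):
    """Simple Rand index: share of object pairs on which two partitions agree."""
    n = len(a)
    agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            # A pair agrees if it is together in both partitions or apart in both.
            if (a[i] == a[j]) == (b[i] == b[j]):
                agree += 1
    return agree / (n * (n - 1) / 2)

def simulated_adjustment(a, b, index=rand_index, n_sim=500, seed=0):
    """Chance-adjust `index` using simulated tables (assumed scheme: label permutation)."""
    rng = random.Random(seed)
    observed = index(a, b)
    shuffled = list(b)
    chance_values = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # keeps cluster sizes, breaks any association with `a`
        chance_values.append(index(a, shuffled))
    expected = sum(chance_values) / n_sim
    # Usual correction form (observed - expected) / (max - expected), assuming max = 1.
    # Undefined when expected == 1, e.g. when both partitions have a single cluster.
    return (observed - expected) / (1.0 - expected)
```

Because the index is passed in as a function, the same simulation can in principle be reused for any agreement index, which is the property the abstract emphasizes. The list `chance_values` also gives an empirical picture of the index's distribution under agreement by chance.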

    Clustering stability and ground truth: numerical experiments

    Stability has been considered an important property for evaluating clustering solutions. Nevertheless, there are no conclusive studies on the relationship between this property and the capacity to recover the clusters inherent to the data ("ground truth"). This study focuses on this relationship, using experiments on synthetic data generated under diverse scenarios (controlling relevant factors) and experiments on real data sets. Stability is evaluated using a weighted cross-validation procedure. Indices of agreement (corrected for agreement by chance) are used to assess both stability and external validity. The results reveal a perspective not previously mentioned in the literature. Despite the clear relationship between stability and external validity when a broad range of scenarios is considered, the within-scenario conclusions deserve special attention: faced with a specific clustering problem (as we are in practice), there is no significant relationship between clustering stability and the ability to recover data clusters.

    The categories, frequencies, and stability of idiosyncratic eye-movement patterns to faces

    The spatial pattern of eye-movements to faces considered typical for neurologically healthy individuals is a roughly T-shaped distribution over the internal facial features, with peak fixation density tending toward the left eye (observer's perspective). However, recent studies indicate that striking deviations from this classic pattern are common within the population and are highly stable over time. The classic pattern actually reflects the average of these various idiosyncratic eye-movement patterns across individuals. The natural categories and respective frequencies of different types of idiosyncratic eye-movement patterns have not been specifically investigated before, so here we analyzed the spatial patterns of eye-movements for 48 participants to estimate the frequency of different kinds of individual eye-movement patterns to faces in the normal healthy population. Four natural clusters were discovered, such that approximately 25% of our participants' fixation density peaks clustered over the left eye region (observer's perspective), 23% over the right eye region, 31% over the nasion/bridge region of the nose, and 20% over the region spanning the nose, philtrum, and upper lips. We did not find any relationship between particular idiosyncratic eye-movement patterns and recognition performance. Individuals' eye-movement patterns early in a trial were more stereotyped than later ones, and idiosyncratic fixation patterns evolved with time into a trial. Finally, while face inversion strongly modulated eye-movement patterns, individual patterns did not become less distinct for inverted compared to upright faces. Group-averaged fixation patterns do not represent individual patterns well, so exploration of such individual patterns is of value for future studies of visual cognition.

    Model-free reconstruction of neuronal network connectivity from calcium imaging signals

    A systematic assessment of global neural network connectivity through direct electrophysiological assays has remained technically unfeasible even in dissociated neuronal cultures. We introduce an improved algorithmic approach based on Transfer Entropy to reconstruct approximations to network structural connectivities from network activity monitored through calcium fluorescence imaging. Based on information theory, our method requires no prior assumptions on the statistics of neuronal firing and neuronal connections. The performance of our algorithm is benchmarked on surrogate time-series of calcium fluorescence generated by the simulated dynamics of a network with known ground-truth topology. We find that the effective network topology revealed by Transfer Entropy depends qualitatively on the time-dependent dynamic state of the network (e.g., bursting or non-bursting). We thus demonstrate how conditioning with respect to the global mean activity improves the performance of our method. [...] Compared to other reconstruction strategies such as cross-correlation or Granger Causality methods, our method based on improved Transfer Entropy is remarkably more accurate. In particular, it provides a good reconstruction of the network clustering coefficient, making it possible to discriminate between weakly and strongly clustered topologies, whereas an approach based on cross-correlations would invariably detect artificially high levels of clustering. Finally, we present the applicability of our method to real recordings of in vitro cortical cultures. We demonstrate that these networks are characterized by an elevated level of clustering compared to a random graph (although not extreme) and by a markedly non-local connectivity.
    Comment: 54 pages, 8 figures (+9 supplementary figures), 1 table; submitted for publication
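For readers unfamiliar with Transfer Entropy, a minimal plug-in estimate for discrete time series can be sketched as below. This is the basic textbook quantity with a history length of 1, not the improved estimator of the abstract: it does not handle calcium fluorescence signals or the conditioning on global mean activity, and the function name `transfer_entropy` is hypothetical.

```python
from collections import Counter
from math import log2

def transfer_entropy(x, y):
    """Plug-in estimate (in bits) of transfer entropy from x to y.

    Assumes x and y are equal-length sequences of discrete symbols and
    uses one past step of each series (history length 1).
    """
    n = len(y) - 1
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))   # (y_next, y_now, x_now)
    pairs_yx = Counter(zip(y[:-1], x[:-1]))         # (y_now, x_now)
    pairs_yy = Counter(zip(y[1:], y[:-1]))          # (y_next, y_now)
    singles = Counter(y[:-1])                       # y_now
    te = 0.0
    for (y1, y0, x0), count in triples.items():
        p_given_both = count / pairs_yx[(y0, x0)]         # p(y_next | y_now, x_now)
        p_given_self = pairs_yy[(y1, y0)] / singles[y0]   # p(y_next | y_now)
        te += (count / n) * log2(p_given_both / p_given_self)
    return te
```

The measure is directional: if `y` simply copies the previous value of a random binary `x`, `transfer_entropy(x, y)` is close to 1 bit while `transfer_entropy(y, x)` is close to 0, which is what makes it usable for inferring directed connections.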

    Gene Expression Profiles from Formalin Fixed Paraffin Embedded Breast Cancer Tissue Are Largely Comparable to Fresh Frozen Matched Tissue

    BACKGROUND AND METHODS: Formalin Fixed Paraffin Embedded (FFPE) samples represent a valuable resource for cancer research. However, the discovery and development of new cancer biomarkers often requires fresh frozen (FF) samples. Recently, the Whole Genome (WG) DASL (cDNA-mediated Annealing, Selection, extension and Ligation) assay was specifically developed to profile FFPE tissue. However, a thorough comparison of data generated from FFPE RNA and FF RNA using this platform is lacking. To this end we profiled, in duplicate, 20 FFPE tissues and 20 matched FF tissues and evaluated the concordance of the DASL results from FFPE and matched FF material. METHODOLOGY AND PRINCIPAL FINDINGS: We show that after proper normalization, all FFPE and FF pairs exhibit a high level of similarity (Pearson correlation >0.7), significantly larger than the similarity between non-paired samples. Interestingly, the probes showing the highest correlation had a higher percentage G/C content and were enriched for cell cycle genes. Predictions of gene expression signatures developed on frozen material (Intrinsic subtype, Genomic Grade Index, 70 gene signature) showed a high level of concordance between FFPE and FF matched pairs. Interestingly, predictions based on a 60 gene DASL list (best match with the 70 gene signature) showed very high concordance with the MammaPrint® results. CONCLUSIONS AND SIGNIFICANCE: We demonstrate that data generated from FFPE material with the DASL assay, if properly processed, are comparable to data extracted from the FF counterpart. Specifically, gene expression profiles for a known set of prognostic genes for a specific disease are highly comparable between the two conditions. This opens up the possibility of using both FFPE and FF material in gene expression analyses, leading to a vast increase in the potential resources available for cancer research.

    An efficient k-means-type algorithm for clustering datasets with incomplete records

    The k-means algorithm is arguably the most popular nonparametric clustering method but cannot generally be applied to datasets with incomplete records. The usual practice then is to either impute missing values under an assumed missing-completely-at-random mechanism or to ignore the incomplete records, and apply the algorithm on the resulting dataset. We develop an efficient version of the k-means algorithm that allows for clustering in the presence of incomplete records. Our extension is called k_m-means and reduces to the k-means algorithm when all records are complete. We also provide initialization strategies for our algorithm and methods to estimate the number of groups in the dataset. Illustrations and simulations demonstrate the efficacy of our approach in a variety of settings and patterns of missing data. Our methods are also applied to the analysis of activation images obtained from a functional Magnetic Resonance Imaging experiment.
    Comment: 21 pages, 12 figures, 3 tables, in press, Statistical Analysis and Data Mining: The ASA Data Science Journal, 201
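One way to cluster incomplete records without imputation can be sketched with Lloyd-style iterations that use only the observed coordinates. The abstract does not describe the internals of k_m-means, so this is an assumed partial-distance variant for illustration, not the authors' algorithm; missing values are encoded as `None`, and the names `partial_sq_dist` and `kmeans_incomplete` are hypothetical.

```python
def partial_sq_dist(record, center):
    """Squared Euclidean distance over the coordinates observed in `record`."""
    return sum((v - c) ** 2 for v, c in zip(record, center) if v is not None)

def kmeans_incomplete(data, centers, n_iter=20):
    """Lloyd-style iterations that skip missing (None) coordinates.

    `centers` holds complete initial centers; initialization strategies
    are outside the scope of this sketch.
    """
    centers = [list(c) for c in centers]
    labels = [0] * len(data)
    for _ in range(n_iter):
        # Assignment step: nearest center, judged on observed coordinates only.
        labels = [
            min(range(len(centers)), key=lambda k: partial_sq_dist(r, centers[k]))
            for r in data
        ]
        # Update step: each coordinate becomes the mean of the values
        # actually observed for that coordinate within the cluster.
        for k in range(len(centers)):
            for j in range(len(centers[k])):
                vals = [r[j] for r, lab in zip(data, labels)
                        if lab == k and r[j] is not None]
                if vals:  # keep the previous value if nothing was observed
                    centers[k][j] = sum(vals) / len(vals)
    return labels, centers
```

When no record contains `None`, both steps coincide with ordinary k-means, mirroring the reduction property stated in the abstract. A short usage example:

```python
data = [[0.0, 0.1], [0.2, None], [None, 0.0],
        [5.0, 5.1], [None, 4.9], [5.2, None]]
labels, centers = kmeans_incomplete(data, centers=[[0.0, 0.0], [5.0, 5.0]])
```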