
    Comparing and Contrasting Clustering Analysis Methods: K-means and Vector in Partition

    This paper examines the similarities and differences between two methods of exploratory cluster analysis, K-means and Vector in Partition. K-means, the traditional clustering approach, has limitations when clustering complex datasets, specifically datasets whose variables are multidimensional vectors. This is the gap the Vector in Partition (VIP) algorithm aims to fill. As a novel approach for clustering multidimensional datasets of both continuous and categorical data, the VIP algorithm has preliminary results that support its ability to correctly cluster simulated datasets of the genetic factors, gene expression, DNA methylation, and single nucleotide polymorphisms. After explaining both the K-means algorithm and the VIP algorithm, the paper presents an example of simulated genetic data containing variables with multidimensional vectors, analyzed with both algorithms. The results are then summarized using accuracy, sensitivity, and specificity, highlighting the benefits and limitations of each clustering method.

    Transductive-Inductive Cluster Approximation Via Multivariate Chebyshev Inequality

    Approximating an adequate number of clusters in multidimensional data is an open area of research, given a level of compromise made on the quality of acceptable results. The manuscript addresses the issue by formulating a transductive-inductive learning algorithm which uses the multivariate Chebyshev inequality. Considering the clustering problem in imaging, theoretical proofs for a particular level of compromise are derived to show the convergence of the reconstruction error to a finite value with increasing (a) number of unseen examples and (b) number of clusters, respectively. Upper bounds for these error rates are also proved. Non-parametric estimates of these errors from a random sample of sequences empirically point to a stable number of clusters. Lastly, a generalization of the algorithm can be applied to multidimensional data sets from different fields. Comment: 16 pages, 5 figures

    Clustering multivariate spatial data based on local measures of spatial autocorrelation.

    Interest in clustering spatial data is growing in several areas, from local economic development to epidemiology, from remote sensing data to environmental analyses. However, methods and procedures to address this problem are still lacking. Local measures of spatial autocorrelation aim at identifying patterns of spatial dependence within the study region. Mapping these measures provides the basic building block for identifying spatial clusters of units. While this may work satisfactorily in the univariate case, most real problems are multidimensional in nature. Thus, we need a clustering method based on both the multivariate data information and the spatial distribution of units. In this paper we propose a procedure for exploring and discovering patterns of spatial clustering. We discuss an implementation of the popular partitioning algorithm known as K-means which incorporates the spatial structure of the data through the use of local measures of spatial autocorrelation. An example based on a set of variables related to the labour market of the Italian region Umbria is presented and discussed in depth.

    Data reduction for spectral clustering to analyze high throughput flow cytometry data

    Background: Recent biological discoveries have shown that clustering large datasets is essential for better understanding biology in many areas. Spectral clustering in particular has proven to be a powerful tool amenable to many applications. However, it cannot be directly applied to large datasets due to time and memory limitations. To address this issue, we have modified spectral clustering by adding an information-preserving sampling procedure and applying a post-processing stage. We call this entire algorithm SamSPECTRAL. Results: We tested our algorithm on flow cytometry data as an example of large, multidimensional data containing potentially hundreds of thousands of data points (i.e., events in flow cytometry, typically corresponding to cells). Compared to two state-of-the-art model-based flow cytometry clustering methods, SamSPECTRAL demonstrates significant advantages in proper identification of populations with non-elliptical shapes, low-density populations close to dense ones, minor subpopulations of a major population, and rare populations. Conclusions: This work is the first successful attempt to apply spectral methodology to flow cytometry data. An implementation of our algorithm as an R package is freely available through BioConductor. © 2010 Zare et al; licensee BioMed Central Ltd.