
57,503 research outputs found

Comparing and Contrasting Clustering Analysis Methods: K-means and Vector in Partition

Author: Sobral Lauren
Publication venue: University of Memphis Digital Commons
Publication date: 27/11/2018

This paper delves into the similarities and differences between two methods of exploratory cluster analysis, K-means and Vector in Partition. Known as the traditional clustering approach, K-means does have some limitations when dealing with clustering complex datasets, specifically datasets with variables of multidimensional vectors. This is the gap the Vector in Partition (VIP) algorithm aims to fill. As a novel approach for clustering multidimensional datasets of both continuous and categorical data, the VIP algorithm has preliminary results that support its ability to correctly cluster simulated datasets of the genetic factors, gene expression, DNA methylation, and single nucleotide polymorphisms. After explaining both the K-means algorithm and the VIP algorithm, an example will be presented of simulated genetic data containing variables with multidimensional vectors that will be analyzed with both algorithms. The results will then be summarized using accuracy, sensitivity, and specificity while highlighting the benefits and limitations of each clustering method.
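As a rough illustration of the K-means half of this comparison, the sketch below runs Lloyd's algorithm on a tiny made-up two-group dataset and scores the result with accuracy, sensitivity, and specificity. It is not the paper's code: the data, the function names `kmeans` and `binary_metrics`, and the deterministic farthest-first initialisation are all assumptions made for this example, and the VIP algorithm is not reproduced because the abstract does not specify it.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100):
    """Lloyd's algorithm with a deterministic farthest-first start (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        # Assignment step: index of the nearest center for every point.
        new_labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def binary_metrics(truth, pred):
    """Return (accuracy, sensitivity, specificity); assumes both classes occur in truth."""
    # Cluster ids are arbitrary, so flip them if that agrees better with the truth.
    if sum(t == p for t, p in zip(truth, pred)) < len(truth) / 2:
        pred = [1 - p for p in pred]
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(truth, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))
    return (tp + tn) / len(truth), tp / (tp + fn), tn / (tn + fp)


# Made-up toy data: two well-separated groups with known membership.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
truth = [0, 0, 0, 1, 1, 1]
labels, centers = kmeans(points, k=2)
accuracy, sensitivity, specificity = binary_metrics(truth, labels)
print(accuracy, sensitivity, specificity)
```

Sensitivity and specificity only make sense once cluster ids have been matched to the true classes, which is why `binary_metrics` flips the labels when that improves agreement.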

University of Memphis Digital Commons

Transductive-Inductive Cluster Approximation Via Multivariate Chebyshev Inequality

Author: Sinha Shriprakash
Publication date: 19/06/2012

Approximating an adequate number of clusters in multidimensional data is an open area of research, given a level of compromise made on the quality of acceptable results. The manuscript addresses the issue by formulating a transductive-inductive learning algorithm which uses the multivariate Chebyshev inequality. Considering the clustering problem in imaging, theoretical proofs for a particular level of compromise are derived to show the convergence of the reconstruction error to a finite value with increasing (a) number of unseen examples and (b) number of clusters, respectively. Upper bounds for these error rates are also proved. Non-parametric estimates of these errors from a random sample of sequences empirically point to a stable number of clusters. Lastly, the generalization of the algorithm can be applied to multidimensional data sets from different fields.

Comment: 16 pages, 5 figures
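For orientation, one standard form of the multivariate Chebyshev inequality is given below. The abstract does not say which variant the manuscript uses, so the notation (a random vector $X \in \mathbb{R}^d$ with mean $\mu$ and positive-definite covariance $\Sigma$, and a threshold $t > 0$) is an assumption of this note.

```latex
\[
  \Pr\!\left[ (X - \mu)^{\top} \Sigma^{-1} (X - \mu) \ge t^{2} \right]
  \;\le\; \frac{d}{t^{2}}
\]
```

This follows from Markov's inequality, because the quadratic form on the left has expectation $d$.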

arXiv.org e-Print Archive

CiteSeerX


The BioDICE Taverna plugin for clustering and visualization of biological data: a workflow for molecular compounds exploration

Author: A Fiannaca
A Truszkowski
A Ultsch
Alfonso Urso
Antonino Fiannaca
C Borgelt
CA Goble
D Digles
G Di Fatta
Giuseppe Di Fatta
HE Pence
J Hastings
K Wolstencroft
M Hall
Massimo La Rosa
N Belacel
P Ertl
Riccardo Rizzo
S Jupp
S Riniker
Salvatore Gaglio
T Kohonen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

Background: In many experimental pipelines, clustering of multidimensional biological datasets is used to detect hidden structures in unlabelled input data. Taverna is a popular workflow management system that is used to design and execute scientific workflows and aid in silico experimentation. The availability of fast unsupervised methods for clustering and visualization in the Taverna platform is important to support data-driven scientific discovery in complex and explorative bioinformatics applications. Results: This work presents a Taverna plugin, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering of high-dimensional biological data and provides a nonlinear, topology-preserving projection for the visualization of the input data and their similarities. The core algorithm in the BioDICE plugin is Fast Learning Self Organizing Map (FLSOM), which is an improved variant of the Self Organizing Map (SOM) algorithm. The plugin generates an interactive 2D map that allows the visual exploration of multidimensional data and the identification of groups of similar objects. The effectiveness of the plugin is demonstrated on a case study related to chemical compounds. Conclusions: The number and variety of available tools and its extensibility have made Taverna a popular choice for the development of scientific data workflows. This work presents a novel plugin, BioDICE, which adds a data-driven knowledge discovery component to Taverna. BioDICE provides an effective and powerful clustering tool, which can be adopted for the explorative analysis of biological datasets.
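To make the idea of a topology-preserving 2D map concrete, here is a minimal sketch of a plain Self Organizing Map. It is the textbook SOM, not FLSOM and not the BioDICE implementation. The grid size, the linear decay schedules, the toy data, and the names `train_som` and `best_matching_unit` are all assumptions made for this example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid coordinate of the unit whose weight vector is closest to x."""
    return min(weights, key=lambda u: sum((a - b) ** 2 for a, b in zip(x, weights[u])))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns a dict {(row, col): weight vector}."""
    rng = random.Random(seed)
    sigma0 = max(rows, cols) / 2
    # Initialise every unit with a randomly chosen data point.
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # visit the data in random order
            frac = step / total_steps
            lr = lr0 * (1 - frac)                    # learning rate decays linearly
            sigma = max(sigma0 * (1 - frac), 0.5)    # neighbourhood radius shrinks
            bmu = best_matching_unit(weights, x)
            for unit, w in weights.items():
                # Gaussian neighbourhood measured on the grid, not in data space.
                grid_d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * sigma * sigma))
                for i in range(len(w)):
                    w[i] += lr * h * (x[i] - w[i])  # pull the unit toward x
            step += 1
    return weights


# Made-up toy data: two groups of 2-D points.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
som = train_som(data, rows=2, cols=3)
for point in data:
    print(point, "->", best_matching_unit(som, point))
```

Each input is mapped to a grid cell. Because neighbouring cells are updated together, inputs that are close in data space tend to land on nearby cells, and the 2D visualization relies on that property.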

Central Archive at the University of Reading

Crossref

Springer - Publisher Connector

PubMed Central

Archivio istituzionale della ricerca - Università di Palermo

Clustering multivariate spatial data based on local measures of spatial autocorrelation.

Author: Luca Scrucca

A growing interest in clustering spatial data is emerging in several areas, from local economic development to epidemiology, from remote sensing data to environmental analyses. However, methods and procedures to address this problem are still lacking. Local measures of spatial autocorrelation aim at identifying patterns of spatial dependence within the study region. Mapping these measures provides the basic building block for identifying spatial clusters of units. While this may work satisfactorily in the univariate case, most real problems have a multidimensional nature. Thus, we need a clustering method based on both the multivariate data information and the spatial distribution of units. In this paper we propose a procedure for exploring and discovering patterns of spatial clustering. We discuss an implementation of the popular partitioning algorithm known as K-means which incorporates the spatial structure of the data through the use of local measures of spatial autocorrelation. An example based on a set of variables related to the labour market of the Italian region Umbria is presented and discussed in depth.
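The abstract does not name the local measure it uses. As one common example, the sketch below computes local Moran's I for a single variable. The choice of statistic, the row-standardised adjacency weights, the toy values, and the name `local_morans_i` are all assumptions of this example rather than details taken from the paper.

```python
def local_morans_i(values, weights):
    """Local Moran's I for each unit.

    values  -- one observation per spatial unit
    weights -- square matrix; weights[i][j] is the spatial weight of unit j for unit i
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]       # deviations from the mean
    m2 = sum(d * d for d in z) / n       # second moment of the deviations
    return [
        (z[i] / m2) * sum(weights[i][j] * z[j] for j in range(n) if j != i)
        for i in range(n)
    ]


# Made-up toy data: six units on a line; each unit's neighbours are the adjacent units.
values = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
n = len(values)
weights = [[0.0] * n for _ in range(n)]
for i in range(n):
    neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
    for j in neighbours:
        weights[i][j] = 1.0 / len(neighbours)  # row-standardised

local_i = local_morans_i(values, weights)
print(local_i)
```

Units surrounded by similar values get a positive statistic, and units on the boundary between the two groups get a value near zero. One plausible way to bring spatial structure into K-means, offered here only as an assumption, is to append such per-variable statistics to each unit's feature vector before clustering.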

Research Papers in Economics

Data reduction for spectral clustering to analyze high throughput flow cytometry data

Author: Brinkman Ryan R.
Gupta Arvind
Shooshtari Parisa
Zare Habil
Publication venue: Scholarship@Western
Publication date: 28/07/2010

Background: Recent biological discoveries have shown that clustering large datasets is essential for better understanding biology in many areas. Spectral clustering in particular has proven to be a powerful tool amenable to many applications. However, it cannot be directly applied to large datasets due to time and memory limitations. To address this issue, we have modified spectral clustering by adding an information-preserving sampling procedure and applying a post-processing stage. We call this entire algorithm SamSPECTRAL. Results: We tested our algorithm on flow cytometry data as an example of large, multidimensional data containing potentially hundreds of thousands of data points (i.e., events in flow cytometry, typically corresponding to cells). Compared to two state-of-the-art model-based flow cytometry clustering methods, SamSPECTRAL demonstrates significant advantages in proper identification of populations with non-elliptical shapes, low-density populations close to dense ones, minor subpopulations of a major population, and rare populations. Conclusions: This work is the first successful attempt to apply spectral methodology on flow cytometry data. An implementation of our algorithm as an R package is freely available through BioConductor. © 2010 Zare et al; licensee BioMed Central Ltd.

Scholarship@Western