SOTXTSTREAM: Density-based self-organizing clustering of text streams
A streaming data clustering algorithm is presented building upon the density-based self-organizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets.
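The idea of a local, nearest-neighbor-based density determination can be illustrated with a minimal Python sketch. This is not the SOSTREAM implementation: the function name `knn_radius`, the parameter `k`, and the sample centres are assumptions chosen for illustration. The sketch only shows that a radius derived from the k-th nearest neighbor adapts to regions of different density.

```python
import math

def knn_radius(centres, index, k):
    """Distance from centres[index] to its k-th nearest other centre.

    Hypothetical helper: a small radius indicates a dense neighbourhood,
    a large radius a sparse one.
    """
    dists = sorted(
        math.dist(centres[index], c)
        for i, c in enumerate(centres)
        if i != index
    )
    return dists[k - 1]

# Assumed toy data: a tight group near the origin and a looser group near x = 10.
centres = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
           (10.0, 0.0), (14.0, 0.0), (10.0, 3.0)]

print(knn_radius(centres, 0, k=2))  # 1.0 (dense region)
print(knn_radius(centres, 3, k=2))  # 4.0 (sparse region)
```

A single global radius would have to be either too large for the tight group or too small for the loose one; a per-point radius avoids that trade-off.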
Cluster Evaluation of Density Based Subspace Clustering
Clustering real-world data is often hampered by the curse of dimensionality,
because real-world data often consist of many dimensions. Multidimensional data
clustering evaluation can be done through a density-based approach. Density-based
approaches follow the paradigm introduced by DBSCAN clustering. In this
approach, the density of each object's neighbourhood is calculated with respect
to MinPoints. Clusters change in accordance with changes in the density of each
object's neighbourhood. The neighbours of each object are typically determined
using a distance function, for example the Euclidean distance. In this paper the
SUBCLU, FIRES and INSCY methods are applied to clustering 6x1595 dimension
synthetic datasets. IO Entropy, F1 Measure, coverage, accuracy and time
consumption are used as performance evaluation parameters. The evaluation
results show that the SUBCLU method requires considerable time to process
subspace clustering; however, its coverage value is better. Meanwhile, the INSCY
method is more accurate than the other two methods, although its computation
time is longer.
Comment: 6 pages, 15 figures
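The DBSCAN-style density test described in the abstract above (count each object's neighbours under the Euclidean distance and compare against MinPoints) can be sketched in Python as follows. The names `neighbour_counts`, `core_points`, `eps`, and `min_points`, and the sample points, are assumptions for illustration; this is not code from SUBCLU, FIRES or INSCY.

```python
import math

def neighbour_counts(points, eps):
    """For each point, count the points (itself included) within distance eps."""
    return [
        sum(1 for q in points if math.dist(p, q) <= eps)
        for p in points
    ]

def core_points(points, eps, min_points):
    """Indices of points whose eps-neighbourhood holds at least min_points points."""
    counts = neighbour_counts(points, eps)
    return [i for i, n in enumerate(counts) if n >= min_points]

# Assumed toy data: three nearby points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]

print(neighbour_counts(points, eps=0.5))           # [3, 3, 3, 1]
print(core_points(points, eps=0.5, min_points=3))  # [0, 1, 2]
```

The quadratic neighbour scan is kept for clarity; in many dimensions the Euclidean distances themselves become less informative, which is the motivation for searching in subspaces.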
Dynamic feature selection for clustering high dimensional data streams
Open access article. Change in a data stream can occur at the concept level and at the feature level. Change at the feature level can occur if new, additional features appear in the stream or if the importance and relevance of a feature changes as the stream progresses. This type of change has not received as much attention as concept-level change. Furthermore, a lot of the methods proposed for clustering streams (density-based, graph-based, and grid-based) rely on some form of distance as a similarity metric and this is problematic in high-dimensional data where the curse of dimensionality renders distance measurements and any concept of “density” difficult. To address these two challenges we propose combining them and framing the problem as a feature selection problem, specifically a dynamic feature selection problem. We propose a dynamic feature mask for clustering high dimensional data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly; previously unimportant features are unmasked and features which lose relevance become masked. The proposed method is algorithm-independent and can be used with any of the existing density-based clustering algorithms which typically do not have a mechanism for dealing with feature drift and struggle with high-dimensional data. We evaluate the proposed method on four density-based clustering algorithms across four high-dimensional streams; two text streams and two image streams. In each case, the proposed dynamic feature mask improves clustering performance and reduces the processing time required by the underlying algorithm. Furthermore, change at the feature level can be observed and tracked.
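A dynamic feature mask of the kind described above can be sketched in Python. The abstract does not say how feature importance is measured, so this sketch assumes running variance as a stand-in importance score; the class `FeatureMask`, the function `masked_distance`, and the `threshold` parameter are hypothetical names, not the authors' method.

```python
import math

class FeatureMask:
    """Hypothetical mask: a feature stays unmasked while its running
    sample variance is at least `threshold` (an assumed importance score)."""

    def __init__(self, n_features, threshold):
        self.n = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features  # running sums of squared deviations
        self.threshold = threshold

    def update(self, x):
        """Welford's online update of the per-feature mean and variance."""
        self.n += 1
        for j, v in enumerate(x):
            delta = v - self.mean[j]
            self.mean[j] += delta / self.n
            self.m2[j] += delta * (v - self.mean[j])

    def mask(self):
        """True marks an unmasked (relevant) feature."""
        if self.n < 2:
            return [True] * len(self.mean)
        return [m2 / (self.n - 1) >= self.threshold for m2 in self.m2]

def masked_distance(a, b, mask):
    """Euclidean distance computed along unmasked features only."""
    return math.sqrt(sum((x - y) ** 2
                         for x, y, keep in zip(a, b, mask) if keep))

fm = FeatureMask(n_features=2, threshold=0.5)
for x in [(1.0, 5.0), (2.0, 5.0), (3.0, 5.0)]:  # second feature is constant
    fm.update(x)

print(fm.mask())                                            # [True, False]
print(masked_distance((1.0, 5.0), (4.0, 9.0), fm.mask()))   # 3.0
```

Because the mask is recomputed from running statistics, a feature that starts to vary later in the stream would become unmasked again, and any distance-based clustering algorithm could call `masked_distance` in place of its usual metric.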
Generalized density clustering
We study generalized density-based clustering in which sharply defined
clusters such as clusters on lower-dimensional manifolds are allowed. We show
that accurate clustering is possible even in high dimensions. We propose two
data-based methods for choosing the bandwidth and we study the stability
properties of density clusters. We show that a simple graph-based algorithm
successfully approximates the high density clusters.
Comment: Published in at http://dx.doi.org/10.1214/10-AOS797 the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
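One simple graph-based way to approximate high-density clusters is to keep the points whose estimated density exceeds a level and then take connected components of a neighbourhood graph over them. The Python sketch below illustrates that general idea under stated assumptions: a uniform kernel (a neighbour count within bandwidth `h`), a count threshold `min_count`, and the function name `density_clusters` are all choices made here, not necessarily the algorithm analysed in the paper.

```python
import math

def density_clusters(points, h, min_count):
    """Label points in high-density regions by connected component.

    Density is approximated by the number of points within distance h.
    Points with fewer than min_count such neighbours are left unlabelled.
    Kept points closer than h are joined by an edge.
    """
    n = len(points)
    counts = [sum(1 for q in points if math.dist(p, q) <= h) for p in points]
    keep = [i for i in range(n) if counts[i] >= min_count]

    labels = {}
    cluster = 0
    for start in keep:
        if start in labels:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:  # depth-first search over the neighbourhood graph
            i = stack.pop()
            for j in keep:
                if j not in labels and math.dist(points[i], points[j]) <= h:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels

# Assumed toy data on a line: two groups and one isolated point at 3.0.
points = [(0.0,), (0.5,), (1.0,), (5.0,), (5.5,), (6.0,), (3.0,)]

print(density_clusters(points, h=0.6, min_count=2))
# {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

The bandwidth `h` controls both the density estimate and the graph connectivity, which is why choosing it from the data matters.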