    On discovery of extremely low-dimensional clusters using semi-supervised projected clustering

    Recent studies suggest that projected clusters with extremely low dimensionality exist in many real datasets. A number of projected clustering algorithms have been proposed in the past several years, but few can identify clusters with dimensionality lower than 10% of the total number of dimensions, which are commonly found in some real datasets such as gene expression profiles. In this paper we propose a new algorithm that can accurately identify projected clusters with relevant dimensions as few as 5% of the total number of dimensions. It makes use of a robust objective function that combines object clustering and dimension selection into a single optimization problem. The algorithm can also utilize domain knowledge in the form of labeled objects and labeled dimensions to improve its clustering accuracy. We believe this is the first semi-supervised projected clustering algorithm. Both theoretical analysis and experimental results show that by using a small amount of input knowledge, possibly covering only a portion of the underlying classes, the new algorithm can be further improved to accurately detect clusters with only 1% of the dimensions being relevant. The algorithm is also useful in getting a target set of clusters when there are multiple possible groupings of the objects. © 2005 IEEE.

    Divisive clustering of high dimensional data streams

    Clustering streaming data is gaining importance as automatic data acquisition technologies are deployed in diverse applications. We propose a fully incremental projected divisive clustering method for high-dimensional data streams that is motivated by high density clustering. The method is capable of identifying clusters in arbitrary subspaces, estimating the number of clusters, and detecting changes in the data distribution which necessitate a revision of the model. The empirical evaluation of the proposed method on numerous real and simulated datasets shows that it is scalable in dimension and number of clusters, is robust to noisy and irrelevant features, and is capable of handling a variety of types of non-stationarity.

    The Clustering Characteristics of HI-Selected Galaxies from the 40% ALFALFA Survey

    The 40% Arecibo Legacy Fast ALFA (ALFALFA) survey catalog (\alpha.40) of approximately 10,150 HI-selected galaxies is used to analyze the clustering properties of gas-rich galaxies. By employing the Landy-Szalay estimator and a full covariance analysis for the two-point galaxy-galaxy correlation function, we obtain the real-space correlation function and model it as a power law, \xi(r) = (r/r_0)^{-\gamma}, on scales less than 10 h^{-1} Mpc. As the largest sample of blindly HI-selected galaxies to date, \alpha.40 provides detailed understanding of the clustering of this population. We find \gamma = 1.51 \pm 0.09 and r_0 = 3.3^{+0.3}_{-0.2} h^{-1} Mpc, reinforcing the understanding that gas-rich galaxies represent the most weakly clustered galaxy population known; we also observe a departure from a pure power law shape at intermediate scales, as predicted in \Lambda CDM halo occupation distribution models. Furthermore, we measure the bias parameter for the \alpha.40 galaxy sample and find that HI galaxies are severely antibiased on small scales, but only weakly antibiased on large scales. The robust measurement of the correlation function for gas-rich galaxies obtained via the \alpha.40 sample constrains models of the distribution of HI in simulated galaxies, and will be employed to better understand the role of gas in environmentally-dependent galaxy evolution.
    Comment: 30 pages, 10 figures, accepted by Ap