
    Efficient clustering of web-derived data sets

    Many data sets derived from the web are large, high-dimensional, sparse, and have a Zipfian distribution of both classes and features. On such data sets, current scalable clustering methods such as streaming clustering suffer from fragmentation, where large classes are incorrectly divided into many smaller clusters, and computational efficiency drops significantly. We present a new clustering algorithm based on connected components that addresses these issues and so works well on web-type data.
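The connected-components idea above can be sketched in a few lines: link every pair of points whose distance falls below a threshold, then read clusters off the components of the resulting graph. This is a toy illustration with an assumed Euclidean metric, hypothetical data, and a naive O(n^2) pair scan, not the paper's web-scale algorithm.

```python
import math

def component_clustering(points, threshold):
    """Link points closer than `threshold`, then label each connected
    component of the resulting graph as one cluster (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # naive O(n^2) pair scan -- fine for a demo, not for web-scale data
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < threshold:
                parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]

labels = component_clustering([(0, 0), (0.1, 0), (5, 5), (5.1, 5)], threshold=1.0)
# two well-separated pairs -> two clusters
```

The threshold plays the role of the similarity cutoff; on sparse web data one would replace the all-pairs scan with a sparse neighbor search.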

    Comparing large covariance matrices under weak conditions on the dependence structure and its application to gene clustering

    Comparing large covariance matrices has important applications in modern genomics, where scientists are often interested in understanding whether relationships (e.g., dependencies or co-regulations) among a large number of genes vary between different biological states. We propose a computationally fast procedure for testing the equality of two large covariance matrices when the dimensions of the covariance matrices are much larger than the sample sizes. A distinguishing feature of the new procedure is that it imposes no structural assumptions on the unknown covariance matrices; hence the test is robust to the various complex dependence structures that frequently arise in genomics. We prove that the proposed procedure is asymptotically valid under weak moment conditions. As an interesting application, we derive a new gene clustering algorithm which shares the same desirable property of avoiding restrictive structural assumptions for high-dimensional genomic data. Using an asthma gene expression dataset, we illustrate how the new test helps compare the covariance matrices of the genes across different gene sets/pathways between the disease group and the control group, and how the gene clustering algorithm provides new insights into the way gene clustering patterns differ between the two groups. The proposed methods are implemented in the R package HDtest, available on CRAN. Comment: the original title, dating back to May 2015, was "Bootstrap Tests on High Dimensional Covariance Matrices with Applications to Understanding Gene Clustering".
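As a rough illustration of what comparing two large covariance matrices involves, the sketch below computes two sample covariance matrices and their largest entrywise difference. The statistic, the toy data, and the absence of standardization or bootstrap calibration are all simplifying assumptions; the paper's actual test is considerably more refined.

```python
def cov(X):
    # unbiased sample covariance; rows are observations, columns features
    n, p = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    return [[sum((row[a] - means[a]) * (row[b] - means[b]) for row in X) / (n - 1)
             for b in range(p)] for a in range(p)]

def max_cov_diff(X, Y):
    # largest entrywise gap between the two sample covariance matrices
    Sx, Sy = cov(X), cov(Y)
    p = len(Sx)
    return max(abs(Sx[a][b] - Sy[a][b]) for a in range(p) for b in range(p))

# Toy "two biological states": correlated vs. anti-correlated features
X = [[1, 1], [2, 2], [3, 3], [4, 4]]
Y = [[1, -1], [2, -2], [3, -3], [4, -4]]
t_diff = max_cov_diff(X, Y)   # large: dependence structures differ
t_same = max_cov_diff(X, X)   # zero: identical structure
```

A real test would standardize each entry by its estimated variance and calibrate the null distribution, e.g. by bootstrap, as the paper does.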

    Estimation of intrinsic dimension via clustering

    The problem of estimating the intrinsic dimension of a set of points in high-dimensional space is a critical issue for a wide range of disciplines, including genomics, finance, and networking. Current estimation techniques are dependent on either the ambient or the intrinsic dimension in terms of computational complexity, which may cause these methods to become intractable for large data sets. In this paper, we present a clustering-based methodology that exploits the inherent self-similarity of data to efficiently estimate the intrinsic dimension of a set of points. When the data satisfy a specified general clustering condition, we prove that the estimated dimension approaches the true Hausdorff dimension. Experiments show that the clustering-based approach allows for more efficient and accurate intrinsic dimension estimation compared with all prior techniques, even when the data do not conform to an obvious self-similarity structure. Finally, we present empirical results which show that the clustering-based estimation allows for a natural partitioning of the data points that lie on separate manifolds of varying intrinsic dimension.
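One classical way to exploit self-similarity for dimension estimation, in the same spirit as the abstract, is the correlation-dimension estimate: count point pairs within radius r and use the scaling C(r) ~ r^D. The sketch below (toy data on a line embedded in 3-D, hand-picked radii) is an assumption-laden stand-in for the paper's clustering-based estimator, not its method.

```python
import math

def correlation_dimension(points, r1, r2):
    """Estimate intrinsic dimension from the scaling of pair counts:
    C(r) ~ r^D, so D ~ log(C(r2)/C(r1)) / log(r2/r1)."""
    def pair_count(r):
        return sum(
            1
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if math.dist(points[i], points[j]) < r
        )
    return math.log(pair_count(r2) / pair_count(r1)) / math.log(r2 / r1)

# A curve (a 1-D manifold) embedded in 3-D ambient space
curve = [(t, 2 * t, 3 * t) for t in (i / 200 for i in range(200))]
d_hat = correlation_dimension(curve, r1=0.1, r2=0.2)
# d_hat lands near 1, the manifold's intrinsic dimension,
# even though the ambient dimension is 3
```

Note the O(n^2) pair count is exactly the kind of ambient-independent but size-dependent cost the paper's clustering approach aims to reduce.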

    Information‐Theoretic Clustering and Algorithms

    Clustering is the task of partitioning objects into clusters, on the basis of certain criteria, so that objects in the same cluster are similar. Many clustering methods have been proposed over the past several decades. Since clustering results depend on both the criterion and the algorithm, selecting them appropriately is an essential problem. Recently, large collections of user behavior logs and text documents have become common; these are often represented as high-dimensional, sparse vectors. This chapter introduces information-theoretic clustering (ITC), which is appropriate and useful for analyzing such high-dimensional data, from both the theoretical and the experimental side. Theoretically, we present the criterion, generative models, and novel algorithms. Experimentally, we show the effectiveness and usefulness of ITC for text analysis as an important example.
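A minimal flavor of information-theoretic clustering, assuming a simple KL-divergence variant of k-means on normalized term-count vectors; the chapter's actual criterion, models, and algorithms may well differ, and the k = 2 farthest-point initialization here is a toy choice.

```python
import math

def kl(p, q):
    # KL divergence between two smoothed probability vectors
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def normalize(v, eps=1e-9):
    # turn raw counts into a smoothed probability vector
    s = sum(v) + eps * len(v)
    return [(x + eps) / s for x in v]

def kl_kmeans2(docs, iters=10):
    """KL-divergence k-means with k = 2 and a deterministic
    farthest-point initialization (toy version)."""
    probs = [normalize(d) for d in docs]
    centroids = [probs[0], max(probs, key=lambda p: kl(p, probs[0]))]
    labels = [0] * len(probs)
    for _ in range(iters):
        labels = [min((0, 1), key=lambda c: kl(p, centroids[c])) for p in probs]
        for c in (0, 1):
            members = [p for p, l in zip(probs, labels) if l == c]
            if members:
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return labels

# Toy term-count vectors: two documents per topic
docs = [[5, 4, 0, 0], [4, 5, 1, 0], [0, 0, 5, 4], [0, 1, 4, 5]]
labels = kl_kmeans2(docs)
```

Because KL divergence compares distributions rather than raw coordinates, this style of criterion handles sparse count vectors more naturally than Euclidean k-means.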

    Visual Cluster Separation Using High-Dimensional Sharpened Dimensionality Reduction

    Applying dimensionality reduction (DR) to large, high-dimensional data sets can be challenging when distinguishing the underlying high-dimensional data clusters in a 2D projection for exploratory analysis. We address this problem by first sharpening the clusters in the original high-dimensional data prior to the DR step using Local Gradient Clustering (LGC). We then project the sharpened data from the high-dimensional space to 2D with a user-selected DR method. The sharpening step helps this method preserve cluster separation in the resulting 2D projection. With our method, end users can label each distinct cluster to further analyze an otherwise unlabeled data set. Our `High-Dimensional Sharpened DR' (HD-SDR) method, tested on both synthetic and real-world data sets, benefits DR methods that suffer from poor cluster separation, yielding better visual cluster separation than the same DR methods without sharpening. Our method achieves good quality (measured by quality metrics) and scales computationally well to large high-dimensional data. To illustrate its concrete applications, we further apply HD-SDR to a recent astronomical catalog. Comment: this paper has been accepted for Information Visualization. Copyright may be transferred without notice, after which this version may no longer be accessible.
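The sharpening step can be imitated with a crude mean-shift-style update: move every point part-way toward the mean of its k nearest neighbors for a few iterations, so each cluster contracts before any DR method is applied. The parameters and data below are illustrative assumptions, not the paper's LGC.

```python
import math

def sharpen(points, k=3, step=0.5, iters=5):
    """Pull each point part-way toward the mean of its k nearest
    neighbors (the point itself counts as one of them)."""
    pts = [list(p) for p in points]
    for _ in range(iters):
        new = []
        for p in pts:
            nbrs = sorted(pts, key=lambda q: math.dist(p, q))[:k]
            mean = [sum(c) / k for c in zip(*nbrs)]
            new.append([a + step * (m - a) for a, m in zip(p, mean)])
        pts = new
    return pts

def spread(pts):
    # total distance of points from their centroid
    c = [sum(col) / len(pts) for col in zip(*pts)]
    return sum(math.dist(p, c) for p in pts)

cloud = [(0, 0, 0), (0.4, 0, 0), (0, 0.4, 0),
         (5, 5, 5), (5.4, 5, 5), (5, 5.4, 5)]
sharp = sharpen(cloud)
# each 3-point group contracts toward its own center; groups stay apart
```

After sharpening, any user-selected DR method (PCA, t-SNE, UMAP, ...) projects the contracted data, and the 2D clusters separate more cleanly.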

    Analysis of Mass Based and Density Based Clustering Techniques on Numerical Datasets

    Clustering techniques are adopted by data mining tools across a range of applications. They provide several algorithms that can assess a large data set based on specific parameters and group related points. This paper gives a comparative analysis of density-based and mass-based clustering algorithms. DBSCAN [15] is the base algorithm for density-based clustering techniques. One advantage of these techniques is that they do not require the number of clusters to be given a priori, and they can detect clusters of different shapes and sizes in large amounts of data containing noise and outliers. OPTICS [14], on the other hand, does not produce a clustering of a data set explicitly, but instead creates an augmented ordering of the database representing its density-based clustering structure. Mass-based clustering [22] uses mass estimation, an alternative to density estimation, and likewise relies on core regions and noise points as parameters. We analyze the algorithms in terms of the parameters essential for creating meaningful clusters. All the algorithms are tested on numerical data sets, for low- as well as high-dimensional data. Keywords: mass-based (DEMassDBSCAN), DBSCAN, OPTICS
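For reference, the core DBSCAN logic that this comparison builds on fits in a short sketch: points with at least min_pts neighbors within eps are core points, clusters grow by expanding from cores, and unreachable points are labeled -1 (noise). The eps/min_pts values and data here are illustrative.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: eps-neighborhoods, core-point expansion, noise = -1."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # noise (may be reclaimed as border)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point: label, don't expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # only core points keep expanding
                queue.extend(nbrs)
    return labels

pts = [(0, 0), (0.2, 0), (0, 0.2), (0.2, 0.2),
       (5, 5), (5.2, 5), (5, 5.2), (5.2, 5.2), (10, 10)]
labels = dbscan(pts, eps=0.5, min_pts=3)
# two dense squares become clusters 0 and 1; the isolated point is noise
```

DEMassDBSCAN keeps the same core/noise vocabulary but replaces the eps-ball density estimate with mass estimation.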

    Weighted-covariance factor fuzzy C-means clustering

    In this paper, we propose a factor-weighted fuzzy c-means clustering algorithm. Based on the inverse of a covariance factor, which assesses the collinearity between the centers and samples, this factor also takes into account the compactness of the samples within clusters. The proposed algorithm can classify both spherical and non-spherical structural clusters, in contrast to the classical fuzzy c-means algorithm, which is only suited to spherical structural clusters. Compared with other algorithms designed for non-spherical structural clusters, such as the Gustafson-Kessel, Gath-Geva, or adaptive Mahalanobis distance-based fuzzy c-means algorithms, the proposed algorithm gives better numerical results on artificial and well-known real data sets. Moreover, it can be used on high-dimensional data, unlike other algorithms that require computing the determinants of large matrices. An application to mid-infrared spectra acquired on maize roots and aerial parts of Miscanthus, for the classification of vegetal biomass, shows that this algorithm can successfully be applied to high-dimensional data.
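For context, the classical fuzzy c-means baseline that the proposed factor-weighted variant extends can be sketched as follows (fuzzifier m = 2, deterministic initial centers, toy data; this is the standard algorithm, not the paper's weighted-covariance version).

```python
import math

def fuzzy_cmeans(points, init_centers, m=2.0, iters=30):
    """Classical fuzzy c-means: alternate membership and center updates."""
    centers = [list(c) for c in init_centers]
    k, dim = len(centers), len(points[0])
    U = []
    for _ in range(iters):
        # membership degrees: u_i = 1 / sum_j (d_i / d_j)^(2/(m-1))
        U = []
        for p in points:
            d = [max(math.dist(p, c), 1e-12) for c in centers]
            U.append([1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0))
                                for j in range(k))
                      for i in range(k)])
        # center update, weighted by u^m
        for i in range(k):
            w = [u[i] ** m for u in U]
            centers[i] = [sum(wn * p[a] for wn, p in zip(w, points)) / sum(w)
                          for a in range(dim)]
    return centers, U

pts = [(0, 0), (0.5, 0), (0, 0.5), (5, 5), (5.5, 5), (5, 5.5)]
centers, U = fuzzy_cmeans(pts, init_centers=[pts[0], pts[-1]])
# memberships are fuzzy but strongly favor the nearest blob
```

The Euclidean distance here is exactly what limits the classical algorithm to spherical clusters; the paper's contribution replaces it with a covariance-factor-weighted measure.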

    An Efficient Approach to Clustering in Large Multimedia Databases with Noise

    Several clustering algorithms can be applied to clustering in large multimedia databases. The effectiveness and efficiency of the existing algorithms, however, is somewhat limited, since clustering in multimedia databases requires clustering high-dimensional feature vectors and since multimedia databases often contain large amounts of noise. In this paper, we therefore introduce a new algorithm for clustering in large multimedia databases called DENCLUE (DENsity-based CLUstEring). The basic idea of our new approach is to model the overall point density analytically as the sum of the influence functions of the data points. Clusters can then be identified by determining density attractors, and clusters of arbitrary shape can be easily described by a simple equation based on the overall density function. The advantages of our new approach are that (1) it has a firm mathematical basis, (2) it has good clustering properties for data sets with large amounts of noise, (3) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets, and (4) it is significantly faster than existing algorithms. To demonstrate the effectiveness and efficiency of DENCLUE, we perform a series of experiments on a number of different data sets from CAD and molecular biology. A comparison with DBSCAN shows the superiority of our new approach.
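The density-attractor idea can be sketched directly: model the density as a sum of Gaussian influence functions and hill-climb each point until it reaches its attractor; points sharing an attractor share a cluster. The mean-shift-style climbing step, the bandwidth, and the data below are illustrative assumptions, not DENCLUE's exact procedure.

```python
import math

def density(x, data, h):
    # DENCLUE-style density: sum of Gaussian influence functions
    return sum(math.exp(-math.dist(x, p) ** 2 / (2 * h * h)) for p in data)

def climb(x, data, h, steps=50):
    # mean-shift-style hill climb toward a density attractor
    x = list(x)
    for _ in range(steps):
        w = [math.exp(-math.dist(x, p) ** 2 / (2 * h * h)) for p in data]
        total = sum(w)
        x = [sum(wi * p[a] for wi, p in zip(w, data)) / total
             for a in range(len(x))]
    return x

data = [(0, 0), (0.3, 0), (0, 0.3), (4, 4), (4.3, 4), (4, 4.3)]
attractors = [climb(p, data, h=0.5) for p in data]
# points in the same blob climb to the same density attractor
```

Noise handling in DENCLUE follows naturally: points whose attractor has density below a threshold are discarded as noise.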