594 research outputs found

    Feature Selection For High-Dimensional Clustering

    Full text link
    We present a nonparametric method for selecting informative features in high-dimensional clustering problems. We start with a screening step that uses a test for multimodality. Then we apply kernel density estimation and mode clustering to the selected features. The output of the method consists of a list of relevant features and cluster assignments. We provide explicit bounds on the error rate of the resulting clustering. In addition, we provide the first error bounds on mode-based clustering. Comment: 11 pages, 2 figures
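
    A minimal sketch of the screen-then-cluster pipeline the abstract describes. The concrete choices here are assumptions, not the paper's: Hartigan's dip test (via the third-party `diptest` package) stands in for the multimodality screen, and scikit-learn's MeanShift stands in for kernel-density mode clustering.

    ```python
    import numpy as np
    from sklearn.cluster import MeanShift
    from diptest import diptest  # third-party: pip install diptest

    def screen_and_cluster(X, alpha=0.05):
        """Keep features whose marginals look multimodal, then mode-cluster."""
        # Screening: dip-test p-value per feature; a small p rejects unimodality.
        pvals = np.array([diptest(X[:, j])[1] for j in range(X.shape[1])])
        selected = np.flatnonzero(pvals < alpha)
        # Mode clustering on the selected features only
        # (MeanShift ascends the kernel density estimate to its modes).
        labels = MeanShift().fit_predict(X[:, selected])
        return selected, labels
    ```

    The screening step is what keeps the method tractable: mode clustering runs only on the features that show evidence of cluster structure, not on all dimensions.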

    High-dimensional clustering

    Get PDF
    High-dimensional (HD) data sets are now common, driven largely by technological advances: automated variable acquisition, cheaper data storage, and more powerful standard computers for fast data management. Every field is affected by this general inflation in the number of variables; only the meaning of "high" is domain dependent. In marketing this number can be of order 10^2, in microarray gene expression between 10^2 and 10^4, in text mining 10^3 or more, and of order 10^6 for single nucleotide polymorphism (SNP) data. Far more variables can sometimes be involved, as is typically the case with discretized curves, for instance curves coming from temporal sequences. Such a technological revolution has a huge impact on other scientific fields, societal as well as mathematical. In particular, high-dimensional data management poses new challenges to statisticians, since standard (low-dimensional) data analysis methods do not apply directly to high-dimensional data sets. The reason is twofold, and the two causes are sometimes linked: combinatorial difficulties, and a disastrous increase in estimator variance. Data analysis methods are essential for providing a synthetic view of a data set, enabling data summarization and exploration for later decision making. This need is even more acute in the high-dimensional setting: on the one hand, the large number of variables suggests that the data convey a great deal of information, but on the other hand that information may be hidden behind its sheer volume.

    High-dimensional Clustering onto Hamiltonian Cycle

    Full text link
    Clustering aims to group unlabelled samples based on their similarities, and it has become a significant tool for the analysis of high-dimensional data. However, most clustering methods merely generate pseudo labels and thus are unable to simultaneously present the similarities between different clusters and the outliers. This paper proposes a new framework called High-dimensional Clustering onto Hamiltonian Cycle (HCHC) to solve the above problems. First, HCHC combines global structure with local structure in one objective function for deep clustering, refining the labels into relative probabilities, in order to mine the similarities between different clusters while keeping the local structure within each cluster. Then, the anchors of the different clusters are sorted along the optimal Hamiltonian cycle generated by the cluster similarities and mapped onto the circumference of a circle. Finally, a sample with a higher probability of belonging to a cluster is mapped closer to the corresponding anchor. In this way, the framework makes three aspects visible simultaneously: clusters (formed by samples with high probabilities), cluster similarities (represented as circular distances), and outliers (recognized as dots far away from all clusters). The experiments illustrate the superiority of HCHC.
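
    A minimal sketch of the circular layout the abstract describes, not the authors' code: given per-sample cluster probabilities `P` and a cluster-similarity matrix `S` (both hypothetical inputs here), anchors are ordered by a brute-force Hamiltonian cycle over `S`, and each sample is pulled toward its most probable cluster's anchor in proportion to that probability.

    ```python
    import itertools
    import numpy as np

    def circular_layout(P, S):
        """P: (n, k) cluster probabilities; S: (k, k) cluster similarities."""
        k = S.shape[0]
        # Brute-force the Hamiltonian cycle over clusters that maximizes the
        # total similarity of adjacent anchors (feasible only for small k).
        best = max(itertools.permutations(range(1, k)),
                   key=lambda p: sum(S[a, b]
                                     for a, b in zip((0,) + p, p + (0,))))
        order = (0,) + best
        # Place anchors in cycle order on the unit circle.
        angle = {c: 2 * np.pi * i / k for i, c in enumerate(order)}
        anchors = np.array([[np.cos(angle[c]), np.sin(angle[c])]
                            for c in range(k)])
        top = P.argmax(axis=1)           # each sample's most probable cluster
        radius = P.max(axis=1)[:, None]  # high probability -> close to anchor
        return radius * anchors[top]     # low-confidence points drift inward
    ```

    This reproduces the three visual effects at once: confident samples cluster at their anchor on the circle, similar clusters sit on adjacent arcs, and outliers with diffuse probabilities fall toward the center, far from every anchor.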

    High Dimensional Clustering with r-nets

    Full text link
    Clustering, a fundamental task in data science and machine learning, groups a set of objects in such a way that objects in the same cluster are closer to each other than to those in other clusters. In this paper, we consider a well-known structure, the so-called $r$-nets, which rigorously captures the properties of clustering. We devise algorithms that improve the run-time of approximating $r$-nets in high-dimensional spaces with $\ell_1$ and $\ell_2$ metrics from $\tilde{O}(dn^{2-\Theta(\sqrt{\epsilon})})$ to $\tilde{O}(dn + n^{2-\alpha})$, where $\alpha = \Omega(\epsilon^{1/3}/\log(1/\epsilon))$. These algorithms are also used to improve a framework that provides approximate solutions to other high-dimensional distance problems. Using this framework, several important related problems can also be solved efficiently, e.g., $(1+\epsilon)$-approximate $k$th-nearest neighbor distance, $(4+\epsilon)$-approximate Min-Max clustering, and $(4+\epsilon)$-approximate $k$-center clustering. In addition, we build an algorithm that $(1+\epsilon)$-approximates greedy permutations in time $\tilde{O}((dn + n^{2-\alpha}) \cdot \log\Phi)$, where $\Phi$ is the spread of the input. This algorithm is used to $(2+\epsilon)$-approximate $k$-center with the same time complexity. Comment: Accepted by AAAI 2019
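
    To make the object concrete: an $r$-net is a subset $N$ of the points such that every point lies within distance $r$ of some net point (covering) while net points are pairwise more than $r$ apart (packing). A minimal sketch of the exact greedy baseline, whose $O(dn^2)$ cost is what the paper's approximation algorithms improve on; this is the textbook construction, not the paper's algorithm:

    ```python
    import numpy as np

    def greedy_r_net(X, r):
        """Exact greedy r-net under the l2 metric, O(d * n^2) time."""
        net = []                                   # indices of net points
        for i, x in enumerate(X):
            # x joins the net only if no existing net point already covers it
            if all(np.linalg.norm(x - X[j]) > r for j in net):
                net.append(i)
        return net
    ```

    Covering holds because any point not added was within $r$ of an earlier net point, and packing holds because a point is only added when it is more than $r$ from every net point chosen so far.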
