Feature Selection For High-Dimensional Clustering
We present a nonparametric method for selecting informative features in
high-dimensional clustering problems. We start with a screening step that uses
a test for multimodality. Then we apply kernel density estimation and mode
clustering to the selected features. The output of the method consists of a
list of relevant features and cluster assignments. We provide explicit bounds
on the error rate of the resulting clustering. In addition, we provide the
first error bounds on mode-based clustering.
Comment: 11 pages, 2 figures
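The pipeline described above (multimodality screening, then kernel density estimation and mode clustering on the surviving features) can be sketched compactly. This is a hypothetical illustration, not the paper's method: a naive count of KDE local maxima stands in for the multimodality test, and a one-dimensional mean-shift stands in for the full mode clustering; the function names and the bandwidth `bw` are assumptions.

```python
import numpy as np

def kde_1d(x, grid, bw):
    # Gaussian kernel density estimate of feature x evaluated on a grid.
    diffs = (grid[:, None] - x[None, :]) / bw
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(x) * bw * np.sqrt(2 * np.pi))

def count_modes(density):
    # Count strict local maxima of the estimated density.
    interior = density[1:-1]
    return int(np.sum((interior > density[:-2]) & (interior > density[2:])))

def select_multimodal_features(X, bw=0.5, grid_size=200):
    # Screening step: keep features whose 1-D density has more than one mode.
    selected = []
    for j in range(X.shape[1]):
        x = X[:, j]
        grid = np.linspace(x.min() - 3 * bw, x.max() + 3 * bw, grid_size)
        if count_modes(kde_1d(x, grid, bw)) > 1:
            selected.append(j)
    return selected

def mode_cluster_1d(x, bw=0.5, steps=50):
    # Mean-shift style mode clustering: each point ascends the KDE until it
    # reaches a density mode; points sharing a mode share a cluster label.
    y = x.copy()
    for _ in range(steps):
        w = np.exp(-0.5 * ((y[:, None] - x[None, :]) / bw) ** 2)
        y = (w * x[None, :]).sum(axis=1) / w.sum(axis=1)
    _, labels = np.unique(np.round(y, 2), return_inverse=True)
    return labels
```

On a toy matrix with one bimodal column and one unimodal noise column, the screen keeps only the bimodal column, and the mean-shift step then recovers its two modes as two clusters.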
High-dimensional clustering
High-dimensional (HD) data sets are now common, mostly for technological reasons: automated variable acquisition, cheaper data storage, and more powerful standard computers for fast data management. All fields are affected by this general inflation in the number of variables; only the definition of ``high'' is domain dependent. In marketing this number can be of order 10e2, in microarray gene expression between 10e2 and 10e4, in text mining 10e3 or more, and of order 10e6 for single nucleotide polymorphism (SNP) data. Note also that sometimes many more variables can be involved, as is typically the case with discretized curves, for instance curves coming from temporal sequences. Such a technological revolution has a huge impact on other scientific fields, societal as well as mathematical ones. In particular, high-dimensional data management brings new challenges to statisticians, since standard (low-dimensional) data analysis methods struggle to apply directly to the new (high-dimensional) data sets. The reason is twofold, and the two causes are sometimes linked: combinatorial difficulties, and a disastrously large increase in estimator variance. Data analysis methods are essential for providing a synthetic view of data sets, allowing data summarization and exploration, for instance in support of later decision making. This need is even more acute in the high-dimensional setting: on the one hand, the large number of variables suggests that a lot of information is conveyed by the data; on the other hand, that information may be hidden behind its volume.
High-dimensional Clustering onto Hamiltonian Cycle
Clustering aims to group unlabelled samples based on their similarities. It
has become a significant tool for the analysis of high-dimensional data.
However, most of the clustering methods merely generate pseudo labels and thus
are unable to simultaneously present the similarities between different
clusters and outliers. This paper proposes a new framework called
High-dimensional Clustering onto Hamiltonian Cycle (HCHC) to solve the above
problems. First, HCHC combines the global structure with the local structure in
one objective function for deep clustering, refining the labels into relative
probabilities, so as to mine the similarities between different clusters while
preserving the local structure within each cluster. Then, the anchors of different
clusters are sorted on the optimal Hamiltonian cycle generated by the cluster
similarities and mapped on the circumference of a circle. Finally, a sample
with a higher probability of a cluster will be mapped closer to the
corresponding anchor. In this way, our framework allows us to appreciate three
aspects visually and simultaneously - clusters (formed by samples with high
probabilities), cluster similarities (represented as circular distances), and
outliers (recognized as dots far away from all clusters). The experiments
illustrate the superiority of HCHC.
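The anchor-ordering and circular-mapping steps can be illustrated in isolation from the deep-clustering objective. This is a hypothetical sketch: a brute-force search over permutations stands in for computing the optimal Hamiltonian cycle (feasible only for a small number of clusters), and the placement rule is one simple reading of "higher probability maps closer to the anchor"; all names are assumptions.

```python
import itertools
import math

def best_cycle(sim):
    # Brute-force Hamiltonian cycle over cluster anchors that maximizes the
    # summed similarity of adjacent clusters. Exponential in the number of
    # clusters K, so only usable for small K.
    K = len(sim)
    best, best_score = None, -math.inf
    for perm in itertools.permutations(range(1, K)):
        order = (0,) + perm
        score = sum(sim[order[i]][order[(i + 1) % K]] for i in range(K))
        if score > best_score:
            best, best_score = order, score
    return best

def anchor_angles(order):
    # Spread the cycle-ordered anchors evenly on the circumference, so that
    # similar clusters end up at nearby angles.
    K = len(order)
    return {c: 2 * math.pi * i / K for i, c in enumerate(order)}

def place_sample(probs, angles, radius=1.0):
    # A sample sits on the ray of its most likely cluster; higher confidence
    # pulls it closer to that cluster's anchor on the rim, while low-confidence
    # samples (potential outliers) stay near the center, far from all anchors.
    c = max(range(len(probs)), key=lambda k: probs[k])
    r = radius * probs[c]
    return (r * math.cos(angles[c]), r * math.sin(angles[c]))
```

With a 4-cluster similarity matrix whose strong affinities form a ring (0-1, 1-2, 2-3, 3-0), the brute-force search recovers that ring as the cycle, and a sample with probability 0.9 for cluster 0 lands 90% of the way out along cluster 0's ray.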
High Dimensional Clustering with r-nets
Clustering, a fundamental task in data science and machine learning, groups a
set of objects in such a way that objects in the same cluster are closer to
each other than to those in other clusters. In this paper, we consider a
well-known structure, so-called r-nets, which rigorously captures the
properties of clustering. We devise algorithms that improve the run-time of
approximating r-nets in high-dimensional spaces with ℓ1 and ℓ2 metrics
from Õ(dn^{2-Θ(√ε)}) to Õ(dn + n^{2-α}), where α = Ω(ε^{1/3}/log(1/ε)).
These algorithms are also used to improve a framework that provides approximate
solutions to other high-dimensional distance problems. Using this framework,
several important related problems can also be solved efficiently, e.g.,
(1+ε)-approximate kth-nearest neighbor distance, (4+ε)-approximate Min-Max
clustering, and (4+ε)-approximate k-center clustering. In addition, we build an
algorithm that (1+ε)-approximates greedy permutations in Õ((dn + n^{2-α}) log Φ)
time, where Φ is the spread of the input. This algorithm is used to
(2+ε)-approximate k-center with the same time complexity.
Comment: Accepted by AAAI201
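For reference, the object itself is easy to construct naively: an r-net is a subset of the points that is both an r-covering (every point lies within distance r of some net point) and an r-packing (net points are pairwise more than r apart). A quadratic-time greedy pass builds one exactly; the contribution of the work above is doing this far faster, approximately, in high dimensions. A minimal sketch under the Euclidean metric, with a hypothetical function name:

```python
import math

def greedy_r_net(points, r):
    # Classic greedy construction: scan the points and add one to the net
    # iff it is farther than r from every current net point. The result
    # covers all points within r (covering) and its members are pairwise
    # more than r apart (packing). Runs in O(n^2) distance computations.
    net = []
    for p in points:
        if all(math.dist(p, c) > r for c in net):
            net.append(p)
    return net
```

On a small planar example with two tight pairs and one isolated point, the greedy pass keeps one representative per pair plus the isolated point, and every input point ends up within r of some net point.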