Feature Selection For High-Dimensional Clustering
We present a nonparametric method for selecting informative features in
high-dimensional clustering problems. We start with a screening step that uses
a test for multimodality. Then we apply kernel density estimation and mode
clustering to the selected features. The output of the method consists of a
list of relevant features and cluster assignments. We provide explicit bounds
on the error rate of the resulting clustering. In addition, we provide the
first error bounds on mode-based clustering.
Comment: 11 pages, 2 figures
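The pipeline described above (multimodality screening, then kernel density estimation and mode clustering on the surviving features) can be sketched compactly. This is a hypothetical illustration, not the paper's method: a naive count of KDE local maxima stands in for the multimodality test, and a one-dimensional mean-shift stands in for the full mode clustering; the function names and the bandwidth `bw` are assumptions.

```python
import numpy as np

def kde_1d(x, grid, bw):
    # Gaussian kernel density estimate of feature x evaluated on a grid.
    diffs = (grid[:, None] - x[None, :]) / bw
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(x) * bw * np.sqrt(2 * np.pi))

def count_modes(density):
    # Count strict local maxima of the estimated density.
    interior = density[1:-1]
    return int(np.sum((interior > density[:-2]) & (interior > density[2:])))

def select_multimodal_features(X, bw=0.5, grid_size=200):
    # Screening step: keep features whose 1-D density has more than one mode.
    selected = []
    for j in range(X.shape[1]):
        x = X[:, j]
        grid = np.linspace(x.min() - 3 * bw, x.max() + 3 * bw, grid_size)
        if count_modes(kde_1d(x, grid, bw)) > 1:
            selected.append(j)
    return selected

def mode_cluster_1d(x, bw=0.5, steps=50):
    # Mean-shift style mode clustering: each point ascends the KDE until it
    # reaches a density mode; points sharing a mode share a cluster label.
    y = x.copy()
    for _ in range(steps):
        w = np.exp(-0.5 * ((y[:, None] - x[None, :]) / bw) ** 2)
        y = (w * x[None, :]).sum(axis=1) / w.sum(axis=1)
    _, labels = np.unique(np.round(y, 2), return_inverse=True)
    return labels
```

On a toy matrix with one bimodal column and one unimodal noise column, the screen keeps only the bimodal column, and the mean-shift step then recovers its two modes as two clusters.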
High-dimensional clustering
High-dimensional (HD) data sets are now common, mostly for technological reasons: automated variable acquisition, cheaper data storage, and more powerful standard computers for fast data management. All fields are affected by this general inflation in the number of variables; only the definition of ``high'' is domain dependent. In marketing this number can be of order 10e2, in microarray gene expression between 10e2 and 10e4, in text mining 10e3 or more, and of order 10e6 for single nucleotide polymorphism (SNP) data. Note also that sometimes many more variables can be involved, as is typically the case with discretized curves, for instance curves coming from temporal sequences. Such a technological revolution has a huge impact on other scientific fields, societal as well as mathematical ones. In particular, high-dimensional data management brings new challenges to statisticians, since standard (low-dimensional) data analysis methods struggle to apply directly to the new (high-dimensional) data sets. The reason is twofold, and the two causes are sometimes linked: combinatorial difficulties, and a disastrously large increase in estimator variance. Data analysis methods are essential for providing a synthetic view of data sets, allowing data summarization and exploration, for instance in support of later decision making. This need is even more acute in the high-dimensional setting: on the one hand, the large number of variables suggests that a lot of information is conveyed by the data; on the other hand, that information may be hidden behind its volume.
High-dimensional Clustering onto Hamiltonian Cycle
Clustering aims to group unlabelled samples based on their similarities. It
has become a significant tool for the analysis of high-dimensional data.
However, most of the clustering methods merely generate pseudo labels and thus
are unable to simultaneously present the similarities between different
clusters and outliers. This paper proposes a new framework called
High-dimensional Clustering onto Hamiltonian Cycle (HCHC) to solve the above
problems. First, HCHC combines the global structure with the local structure in
one objective function for deep clustering, refining the labels into relative
probabilities, so as to mine the similarities between different clusters while
preserving the local structure within each cluster. Then, the anchors of different
clusters are sorted on the optimal Hamiltonian cycle generated by the cluster
similarities and mapped on the circumference of a circle. Finally, a sample
with a higher probability of a cluster will be mapped closer to the
corresponding anchor. In this way, our framework allows us to appreciate three
aspects visually and simultaneously - clusters (formed by samples with high
probabilities), cluster similarities (represented as circular distances), and
outliers (recognized as dots far away from all clusters). The experiments
illustrate the superiority of HCHC.
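The anchor-ordering and circular-mapping steps can be illustrated in isolation from the deep-clustering objective. This is a hypothetical sketch: a brute-force search over permutations stands in for computing the optimal Hamiltonian cycle (feasible only for a small number of clusters), and the placement rule is one simple reading of "higher probability maps closer to the anchor"; all names are assumptions.

```python
import itertools
import math

def best_cycle(sim):
    # Brute-force Hamiltonian cycle over cluster anchors that maximizes the
    # summed similarity of adjacent clusters. Exponential in the number of
    # clusters K, so only usable for small K.
    K = len(sim)
    best, best_score = None, -math.inf
    for perm in itertools.permutations(range(1, K)):
        order = (0,) + perm
        score = sum(sim[order[i]][order[(i + 1) % K]] for i in range(K))
        if score > best_score:
            best, best_score = order, score
    return best

def anchor_angles(order):
    # Spread the cycle-ordered anchors evenly on the circumference, so that
    # similar clusters end up at nearby angles.
    K = len(order)
    return {c: 2 * math.pi * i / K for i, c in enumerate(order)}

def place_sample(probs, angles, radius=1.0):
    # A sample sits on the ray of its most likely cluster; higher confidence
    # pulls it closer to that cluster's anchor on the rim, while low-confidence
    # samples (potential outliers) stay near the center, far from all anchors.
    c = max(range(len(probs)), key=lambda k: probs[k])
    r = radius * probs[c]
    return (r * math.cos(angles[c]), r * math.sin(angles[c]))
```

With a 4-cluster similarity matrix whose strong affinities form a ring (0-1, 1-2, 2-3, 3-0), the brute-force search recovers that ring as the cycle, and a sample with probability 0.9 for cluster 0 lands 90% of the way out along cluster 0's ray.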
High Dimensional Clustering with r-nets
Clustering, a fundamental task in data science and machine learning, groups a
set of objects in such a way that objects in the same cluster are closer to
each other than to those in other clusters. In this paper, we consider a
well-known structure, so-called r-nets, which rigorously captures the
properties of clustering. We devise algorithms that improve the run-time of
approximating r-nets in high-dimensional spaces with ℓ1 and ℓ2 metrics
from Õ(dn^{2-Θ(√ε)}) to Õ(dn + n^{2-α}), where α = Ω(ε^{1/3}/log(1/ε)).
These algorithms are also used to improve a framework that provides approximate
solutions to other high-dimensional distance problems. Using this framework,
several important related problems can also be solved efficiently, e.g.,
(1+ε)-approximate kth-nearest neighbor distance, (4+ε)-approximate Min-Max
clustering, and (4+ε)-approximate k-center clustering. In addition, we build an
algorithm that (1+ε)-approximates greedy permutations in Õ((dn + n^{2-α}) log Φ)
time, where Φ is the spread of the input. This algorithm is used to
(2+ε)-approximate k-center with the same time complexity.
Comment: Accepted by AAAI201
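For reference, the object itself is easy to construct naively: an r-net is a subset of the points that is both an r-covering (every point lies within distance r of some net point) and an r-packing (net points are pairwise more than r apart). A quadratic-time greedy pass builds one exactly; the contribution of the work above is doing this far faster, approximately, in high dimensions. A minimal sketch under the Euclidean metric, with a hypothetical function name:

```python
import math

def greedy_r_net(points, r):
    # Classic greedy construction: scan the points and add one to the net
    # iff it is farther than r from every current net point. The result
    # covers all points within r (covering) and its members are pairwise
    # more than r apart (packing). Runs in O(n^2) distance computations.
    net = []
    for p in points:
        if all(math.dist(p, c) > r for c in net):
            net.append(p)
    return net
```

On a small planar example with two tight pairs and one isolated point, the greedy pass keeps one representative per pair plus the isolated point, and every input point ends up within r of some net point.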