2,840 research outputs found
Toward a generic representation of random variables for machine learning
This paper presents a pre-processing step and a distance which improve the
performance of machine learning algorithms working on independent and
identically distributed stochastic processes. We introduce a novel
non-parametric approach to represent random variables which splits apart
dependency and distribution without losing any information. We also propound an
associated metric leveraging this representation and its statistical estimate.
Besides experiments on synthetic datasets, the benefits of our contribution are
illustrated through the example of clustering financial time series, for
instance prices from the credit default swaps market. Results are available on
the website www.datagrapple.com and an IPython Notebook tutorial is available
at www.datagrapple.com/Tech for reproducible research.
Comment: submitted to Pattern Recognition Letters
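The core idea, splitting a sample into a dependence part (rank transform) and a distribution part (sorted values), can be sketched as follows. The normalizations and the convex-combination weights below are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def representation(x):
    """Split a sample into a dependence part (normalized ranks, an
    empirical copula transform) and a distribution part (sorted values,
    the empirical quantile function). Together they lose no information:
    the original sample can be rebuilt from them."""
    n = len(x)
    ranks = np.argsort(np.argsort(x)) / (n - 1)  # rank transform in [0, 1]
    margins = np.sort(x)                         # empirical margins
    return ranks, margins

def distance(x, y, theta=0.5):
    """Convex combination of a dependence distance (between rank
    transforms) and a distribution distance (between sorted values).
    theta trades off the two views; the weighting and squared-error
    form here are assumptions made for illustration."""
    rx, mx = representation(np.asarray(x, float))
    ry, my = representation(np.asarray(y, float))
    d_dep = np.mean((rx - ry) ** 2)   # dependence mismatch
    d_dist = np.mean((mx - my) ** 2)  # marginal mismatch
    return theta * d_dep + (1 - theta) * d_dist
```

Because the ranks index into the sorted values, the pair (ranks, margins) is a lossless recoding of the sample, which is the sense in which the representation "splits apart dependency and distribution without losing any information".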
A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series
We present in this paper an empirical framework motivated by the
practitioner's point of view on stability. The goal is both to assess
clustering validity and to yield market insights by providing, through the
data perturbations we propose, a multi-view of the assets' clustering
behaviour. The perturbation framework is illustrated on an extensive credit
default swap time series database available online at www.datagrapple.com.
Comment: Accepted at ICMLA 201
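One perturbation from such a multi-view family can be sketched as follows: add noise to the returns, recluster, and score agreement with the reference partition. The correlation distance, average linkage, and noise model below are common practitioner choices assumed for illustration, not necessarily the paper's exact setup:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

def cluster_assets(returns, k):
    """Cluster assets from an (n_assets, n_obs) return matrix using the
    standard correlation distance sqrt((1 - corr) / 2) and average
    linkage."""
    corr = np.corrcoef(returns)
    dist = np.sqrt(0.5 * np.clip(1.0 - corr, 0.0, 2.0))
    iu = np.triu_indices_from(dist, k=1)          # condensed form for scipy
    Z = linkage(dist[iu], method="average")
    return fcluster(Z, t=k, criterion="maxclust")

def stability_under_noise(returns, k, n_trials=20, scale=0.1):
    """Perturb the returns with Gaussian noise, recluster, and report the
    mean adjusted Rand index against the unperturbed clustering
    (1 = identical partitions)."""
    ref = cluster_assets(returns, k)
    sigma = returns.std()
    scores = []
    for _ in range(n_trials):
        noisy = returns + rng.normal(0.0, scale * sigma, returns.shape)
        scores.append(adjusted_rand_score(ref, cluster_assets(noisy, k)))
    return float(np.mean(scores))
```

Repeating this for several perturbation types (noise levels, time-window shifts, asset resampling) yields the multi-view picture of clustering behaviour the abstract describes.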
Center-based Clustering under Perturbation Stability
Clustering under most popular objective functions is NP-hard, even to
approximate well, and so unlikely to be efficiently solvable in the worst case.
Recently, Bilu and Linial \cite{Bilu09} suggested an approach aimed at
bypassing this computational barrier by using properties of instances one might
hope to hold in practice. In particular, they argue that instances in practice
should be stable to small perturbations in the metric space and give an
efficient algorithm for clustering instances of the Max-Cut problem that are
stable to perturbations of a certain size. In addition, they conjecture that
instances stable to as little as O(1) perturbations should be solvable in
polynomial time. In this paper we prove that this conjecture is true for any
center-based clustering objective (such as k-median, k-means, and
k-center). Specifically, we show we can efficiently find the optimal
clustering assuming only stability to factor-3 perturbations of the underlying
metric in spaces without Steiner points, and stability to a larger constant
factor of perturbations for general metrics. In particular, we show for such
instances that the popular Single-Linkage algorithm combined with dynamic
programming will find the optimal clustering. We also present NP-hardness
results under a weaker but related condition.
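The Single-Linkage-plus-dynamic-programming scheme can be sketched as below: build the single-linkage merge tree, then run a DP over the tree to find the pruning into k subtrees of minimum k-median cost. On perturbation-stable instances every optimal cluster is a subtree of this tree, which is what makes the DP exact; on arbitrary data this is only a heuristic, and this exact-search formulation is an illustrative reconstruction rather than the paper's pseudocode:

```python
import numpy as np
from functools import lru_cache
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

def single_linkage_dp(points, k):
    """Optimal k-median pruning of the single-linkage tree.
    Returns (total cost, cluster labels)."""
    pts = np.asarray(points, float)
    n = len(pts)
    D = squareform(pdist(pts))
    Z = linkage(pdist(pts), method="single")
    # internal node n+i merges children Z[i, 0] and Z[i, 1]
    children = {n + i: (int(a), int(b)) for i, (a, b, *_) in enumerate(Z)}
    root = n + len(Z) - 1

    def leaves(v):
        if v < n:
            return (v,)
        a, b = children[v]
        return leaves(a) + leaves(b)

    leafset = {v: leaves(v) for v in range(2 * n - 1)}

    def cluster_cost(idx):
        # k-median cost of one cluster: distances to its best single center
        return min(D[c, list(idx)].sum() for c in idx)

    @lru_cache(maxsize=None)
    def best(v, j):
        # minimum cost of partitioning the leaves under v into j subtrees
        if j == 1:
            return (cluster_cost(leafset[v]), (v,))
        if v < n:
            return (float("inf"), ())   # a leaf cannot split further
        a, b = children[v]
        out = (float("inf"), ())
        for ja in range(1, j):
            ca, na = best(a, ja)
            cb, nb = best(b, j - ja)
            if ca + cb < out[0]:
                out = (ca + cb, na + nb)
        return out

    total, nodes = best(root, k)
    labels = np.empty(n, dtype=int)
    for lab, v in enumerate(nodes):
        labels[list(leafset[v])] = lab
    return total, labels
```

The DP runs in polynomial time because the single-linkage tree has only 2n - 1 nodes, so the search is over tree prunings rather than all partitions.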
Accelerated Spectral Clustering Using Graph Filtering Of Random Signals
We build upon recent advances in graph signal processing to propose a faster
spectral clustering algorithm. Indeed, classical spectral clustering is based
on the computation of the first k eigenvectors of the similarity matrix's
Laplacian, whose computation cost, even for sparse matrices, becomes
prohibitive for large datasets. We show that we can estimate the spectral
clustering distance matrix without computing these eigenvectors, by graph
filtering of random signals. We also take advantage of the stochasticity of
these random vectors to estimate the number of clusters k. We compare our
method to classical spectral clustering on synthetic data, and show that it
reaches equal performance while being faster by a factor of at least two for
large datasets.
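The estimator can be sketched as follows: low-pass filter a few random signals on the graph and use the filtered signals as features, so distances between rows approximate spectral-embedding distances in a Johnson-Lindenstrauss sense. For clarity the ideal low-pass filter is applied exactly via an eigendecomposition here, which is what the method avoids; the actual speedup comes from a polynomial approximation of this filter:

```python
import numpy as np

rng = np.random.default_rng(0)

def filtered_random_features(W, k, d):
    """Given an adjacency matrix W, return an n x d feature matrix whose
    row distances estimate spectral clustering distances. Illustrative
    sketch: the ideal filter is applied exactly, so this version still
    pays for an eigendecomposition."""
    deg = W.sum(axis=1)
    L = np.diag(deg) - W                        # combinatorial Laplacian
    lam, U = np.linalg.eigh(L)
    lam_k = lam[k - 1]                          # cutoff at the k-th frequency
    h = (lam <= lam_k + 1e-12).astype(float)    # ideal low-pass response
    R = rng.normal(size=(len(W), d)) / np.sqrt(d)  # d random signals
    return U @ (h[:, None] * (U.T @ R))         # filtered signals H R
```

Since the filter projects onto the span of the first k eigenvectors, the rows of the result are random projections of the spectral embedding, and for moderate d their pairwise distances concentrate around the true embedding distances.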
Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
A challenging problem in estimating high-dimensional graphical models is to
choose the regularization parameter in a data-dependent way. The standard
techniques include K-fold cross-validation (K-CV), Akaike information
criterion (AIC), and Bayesian information criterion (BIC). Though these methods
work well for low-dimensional problems, they are not suitable in high
dimensional settings. In this paper, we present StARS: a new stability-based
method for choosing the regularization parameter in high dimensional inference
for undirected graphs. The method has a clear interpretation: we use the least
amount of regularization that simultaneously makes a graph sparse and
replicable under random sampling. This interpretation requires essentially no
conditions. Under mild conditions, we show that StARS is partially sparsistent
in terms of graph estimation: i.e. with high probability, all the true edges
will be included in the selected model even when the graph size diverges with
the sample size. Empirically, the performance of StARS is compared with the
state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on
both synthetic data and a real microarray dataset. StARS outperforms all these
competing procedures.
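The StARS recipe can be sketched as below: fit a graphical lasso on many subsamples at each regularization level, measure how unstable every edge is across subsamples, and keep the least regularization whose monotonized instability stays under a threshold beta. This uses sklearn's `GraphicalLasso` as the estimator; the edge-detection threshold and the fallback when nothing is stable are implementation assumptions:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

def stars_select(X, alphas, n_subsamples=20, beta=0.05):
    """StARS-style selection of the graphical lasso penalty.
    b = 10 * sqrt(n) is the subsample size suggested in the paper."""
    n, p = X.shape
    b = min(n, int(10 * np.sqrt(n)))
    alphas_sorted = sorted(alphas, reverse=True)   # strong -> weak penalty
    instability = []
    for alpha in alphas_sorted:
        freq = np.zeros((p, p))
        for _ in range(n_subsamples):
            idx = rng.choice(n, size=b, replace=False)
            prec = GraphicalLasso(alpha=alpha).fit(X[idx]).precision_
            freq += (np.abs(prec) > 1e-8) & ~np.eye(p, dtype=bool)
        theta = freq / n_subsamples        # per-edge selection frequency
        xi = 2 * theta * (1 - theta)       # per-edge instability
        iu = np.triu_indices(p, 1)
        instability.append(xi[iu].mean())
    # monotonize: instability may only grow as the penalty weakens
    bar = np.maximum.accumulate(instability)
    ok = [a for a, d in zip(alphas_sorted, bar) if d <= beta]
    # least regularization that is still replicable; if none qualifies,
    # fall back to the strongest penalty (an assumption of this sketch)
    return ok[-1] if ok else alphas_sorted[0]
```

The instability 2·theta·(1 - theta) per edge is maximal when an edge appears in half the subsamples and vanishes when it always or never appears, matching the "sparse and replicable under random sampling" interpretation.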
What are the true clusters?
Constructivist philosophy and Hasok Chang's active scientific realism are
used to argue that the idea of "truth" in cluster analysis depends on the
context and the clustering aims. Different characteristics of clusterings are
required in different situations. Researchers should be explicit about the
requirements and the idea of "true clusters" on which their research is based, because
clustering becomes scientific not through uniqueness but through transparent
and open communication. The idea of "natural kinds" is a human construct, but
it highlights the human experience that the reality outside the observer's
control seems to make certain distinctions between categories inevitable.
Various desirable characteristics of clusterings and various approaches to
define a context-dependent truth are listed, and I discuss what impact these
ideas can have on the comparison of clustering methods, the choice of a
clustering method, and related decisions in practice.
Finding True Clusters: On the Importance of Simplicity in Science
Parametric and dimensional simplicity are not indicators of truth, but the
methodological principle that urges us to pay attention to such notions of
simplicity is truth-conducive. The truths we are looking for are specific
geometrical shapes, and we know which algorithm can find which shape, provided
that we pay attention to parametric and dimensional simplicity.
Clustering Stability: An Overview
A popular method for selecting the number of clusters is based on stability
arguments: one chooses the number of clusters such that the corresponding
clustering results are "most stable". In recent years, a series of papers has
analyzed the behavior of this method from a theoretical point of view. However,
the results are very technical and difficult to interpret for non-experts. In
this paper we give a high-level overview about the existing literature on
clustering stability. In addition to presenting the results in a slightly
informal but accessible way, we relate them to each other and discuss their
different implications.
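The stability argument for choosing the number of clusters can be sketched as follows: cluster many random subsamples, compare each pair of clusterings on the points they share, and pick the k whose clusterings agree most. The subsampling fraction, k-means as the base clusterer, and the adjusted Rand index as the agreement score are illustrative choices, and, as the overview stresses, raw stability can mislead (small k is often trivially stable), so scores should only be compared across non-trivial candidates:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

def stability_score(X, k, n_pairs=10, frac=0.8):
    """Mean pairwise agreement (adjusted Rand index, 1 = identical) of
    k-means clusterings computed on random subsamples, evaluated on the
    points the two subsamples share."""
    n = len(X)
    m = int(frac * n)
    scores = []
    for _ in range(n_pairs):
        i1 = rng.choice(n, size=m, replace=False)
        i2 = rng.choice(n, size=m, replace=False)
        l1 = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[i1])
        l2 = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[i2])
        shared = np.intersect1d(i1, i2)
        pos1 = {v: t for t, v in enumerate(i1)}
        pos2 = {v: t for t, v in enumerate(i2)}
        a = [l1[pos1[s]] for s in shared]
        b = [l2[pos2[s]] for s in shared]
        scores.append(adjusted_rand_score(a, b))
    return float(np.mean(scores))
```

On data with k well-separated groups, the correct k typically scores near 1 while an overspecified k splits groups differently on each subsample and scores lower, which is the behaviour the theoretical literature surveyed here tries to characterize precisely.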