2,840 research outputs found
Toward a generic representation of random variables for machine learning
This paper presents a pre-processing step and a distance which improve the
performance of machine learning algorithms working on independent and
identically distributed stochastic processes. We introduce a novel
non-parametric approach to represent random variables which splits apart
dependency and distribution without losing any information. We also propound an
associated metric leveraging this representation and its statistical estimate.
Besides experiments on synthetic datasets, the benefits of our contribution are
illustrated through the example of clustering financial time series, for
instance prices from the credit default swaps market. Results are available on
the website www.datagrapple.com and an IPython Notebook tutorial is available
at www.datagrapple.com/Tech for reproducible research.
Comment: submitted to Pattern Recognition Letters
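The core idea, splitting a sample into a dependence part (rank transform) and a distribution part (sorted values), can be sketched as follows. The normalizations and the convex-combination weights below are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def representation(x):
    """Split a sample into a dependence part (normalized ranks, an
    empirical copula transform) and a distribution part (sorted values,
    the empirical quantile function). Together they lose no information:
    the original sample can be rebuilt from them."""
    n = len(x)
    ranks = np.argsort(np.argsort(x)) / (n - 1)  # rank transform in [0, 1]
    margins = np.sort(x)                         # empirical margins
    return ranks, margins

def distance(x, y, theta=0.5):
    """Convex combination of a dependence distance (between rank
    transforms) and a distribution distance (between sorted values).
    theta trades off the two views; the weighting and squared-error
    form here are assumptions made for illustration."""
    rx, mx = representation(np.asarray(x, float))
    ry, my = representation(np.asarray(y, float))
    d_dep = np.mean((rx - ry) ** 2)   # dependence mismatch
    d_dist = np.mean((mx - my) ** 2)  # marginal mismatch
    return theta * d_dep + (1 - theta) * d_dist
```

Because the ranks index into the sorted values, the pair (ranks, margins) is a lossless recoding of the sample, which is the sense in which the representation "splits apart dependency and distribution without losing any information".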
A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series
We present in this paper an empirical framework motivated by the
practitioner's point of view on stability. The goal is both to assess
clustering validity and to yield market insights by providing, through the
data perturbations we propose, a multi-view of the assets' clustering
behaviour. The perturbation framework is illustrated on an extensive credit
default swap time series database available online at www.datagrapple.com.
Comment: Accepted at ICMLA 201
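One perturbation from such a multi-view family can be sketched as follows: add noise to the returns, recluster, and score agreement with the reference partition. The correlation distance, average linkage, and noise model below are common practitioner choices assumed for illustration, not necessarily the paper's exact setup:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

def cluster_assets(returns, k):
    """Cluster assets from an (n_assets, n_obs) return matrix using the
    standard correlation distance sqrt((1 - corr) / 2) and average
    linkage."""
    corr = np.corrcoef(returns)
    dist = np.sqrt(0.5 * np.clip(1.0 - corr, 0.0, 2.0))
    iu = np.triu_indices_from(dist, k=1)          # condensed form for scipy
    Z = linkage(dist[iu], method="average")
    return fcluster(Z, t=k, criterion="maxclust")

def stability_under_noise(returns, k, n_trials=20, scale=0.1):
    """Perturb the returns with Gaussian noise, recluster, and report the
    mean adjusted Rand index against the unperturbed clustering
    (1 = identical partitions)."""
    ref = cluster_assets(returns, k)
    sigma = returns.std()
    scores = []
    for _ in range(n_trials):
        noisy = returns + rng.normal(0.0, scale * sigma, returns.shape)
        scores.append(adjusted_rand_score(ref, cluster_assets(noisy, k)))
    return float(np.mean(scores))
```

Repeating this for several perturbation types (noise levels, time-window shifts, asset resampling) yields the multi-view picture of clustering behaviour the abstract describes.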
Center-based Clustering under Perturbation Stability
Clustering under most popular objective functions is NP-hard, even to
approximate well, and so unlikely to be efficiently solvable in the worst case.
Recently, Bilu and Linial \cite{Bilu09} suggested an approach aimed at
bypassing this computational barrier by using properties of instances one might
hope to hold in practice. In particular, they argue that instances in practice
should be stable to small perturbations in the metric space and give an
efficient algorithm for clustering instances of the Max-Cut problem that are
stable to perturbations of a certain size. In addition, they conjecture that
instances stable to as little as O(1) perturbations should be solvable in
polynomial time. In this paper we prove that this conjecture is true for any
center-based clustering objective (such as k-median, k-means, and
k-center). Specifically, we show we can efficiently find the optimal
clustering assuming only stability to factor-3 perturbations of the underlying
metric in spaces without Steiner points, and stability to a larger constant
factor of perturbations for general metrics. In particular, we show for such
instances that the popular Single-Linkage algorithm combined with dynamic
programming will find the optimal clustering. We also present NP-hardness
results under a weaker but related condition.
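The Single-Linkage-plus-dynamic-programming scheme can be sketched as below: build the single-linkage merge tree, then run a DP over the tree to find the pruning into k subtrees of minimum k-median cost. On perturbation-stable instances every optimal cluster is a subtree of this tree, which is what makes the DP exact; on arbitrary data this is only a heuristic, and this exact-search formulation is an illustrative reconstruction rather than the paper's pseudocode:

```python
import numpy as np
from functools import lru_cache
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

def single_linkage_dp(points, k):
    """Optimal k-median pruning of the single-linkage tree.
    Returns (total cost, cluster labels)."""
    pts = np.asarray(points, float)
    n = len(pts)
    D = squareform(pdist(pts))
    Z = linkage(pdist(pts), method="single")
    # internal node n+i merges children Z[i, 0] and Z[i, 1]
    children = {n + i: (int(a), int(b)) for i, (a, b, *_) in enumerate(Z)}
    root = n + len(Z) - 1

    def leaves(v):
        if v < n:
            return (v,)
        a, b = children[v]
        return leaves(a) + leaves(b)

    leafset = {v: leaves(v) for v in range(2 * n - 1)}

    def cluster_cost(idx):
        # k-median cost of one cluster: distances to its best single center
        return min(D[c, list(idx)].sum() for c in idx)

    @lru_cache(maxsize=None)
    def best(v, j):
        # minimum cost of partitioning the leaves under v into j subtrees
        if j == 1:
            return (cluster_cost(leafset[v]), (v,))
        if v < n:
            return (float("inf"), ())   # a leaf cannot split further
        a, b = children[v]
        out = (float("inf"), ())
        for ja in range(1, j):
            ca, na = best(a, ja)
            cb, nb = best(b, j - ja)
            if ca + cb < out[0]:
                out = (ca + cb, na + nb)
        return out

    total, nodes = best(root, k)
    labels = np.empty(n, dtype=int)
    for lab, v in enumerate(nodes):
        labels[list(leafset[v])] = lab
    return total, labels
```

The DP runs in polynomial time because the single-linkage tree has only 2n - 1 nodes, so the search is over tree prunings rather than all partitions.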
Accelerated Spectral Clustering Using Graph Filtering Of Random Signals
We build upon recent advances in graph signal processing to propose a faster
spectral clustering algorithm. Indeed, classical spectral clustering is based
on the computation of the first k eigenvectors of the similarity matrix's
Laplacian, whose computation cost, even for sparse matrices, becomes
prohibitive for large datasets. We show that we can estimate the spectral
clustering distance matrix without computing these eigenvectors, by graph
filtering of random signals. We also take advantage of the stochasticity of
these random vectors to estimate the number of clusters k. We compare our
method to classical spectral clustering on synthetic data, and show that it
reaches equal performance while being faster by a factor of at least two for
large datasets.
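The estimator can be sketched as follows: low-pass filter a few random signals on the graph and use the filtered signals as features, so distances between rows approximate spectral-embedding distances in a Johnson-Lindenstrauss sense. For clarity the ideal low-pass filter is applied exactly via an eigendecomposition here, which is what the method avoids; the actual speedup comes from a polynomial approximation of this filter:

```python
import numpy as np

rng = np.random.default_rng(0)

def filtered_random_features(W, k, d):
    """Given an adjacency matrix W, return an n x d feature matrix whose
    row distances estimate spectral clustering distances. Illustrative
    sketch: the ideal filter is applied exactly, so this version still
    pays for an eigendecomposition."""
    deg = W.sum(axis=1)
    L = np.diag(deg) - W                        # combinatorial Laplacian
    lam, U = np.linalg.eigh(L)
    lam_k = lam[k - 1]                          # cutoff at the k-th frequency
    h = (lam <= lam_k + 1e-12).astype(float)    # ideal low-pass response
    R = rng.normal(size=(len(W), d)) / np.sqrt(d)  # d random signals
    return U @ (h[:, None] * (U.T @ R))         # filtered signals H R
```

Since the filter projects onto the span of the first k eigenvectors, the rows of the result are random projections of the spectral embedding, and for moderate d their pairwise distances concentrate around the true embedding distances.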
Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
A challenging problem in estimating high-dimensional graphical models is to
choose the regularization parameter in a data-dependent way. The standard
techniques include K-fold cross-validation (K-CV), Akaike information
criterion (AIC), and Bayesian information criterion (BIC). Though these methods
work well for low-dimensional problems, they are not suitable in high
dimensional settings. In this paper, we present StARS: a new stability-based
method for choosing the regularization parameter in high dimensional inference
for undirected graphs. The method has a clear interpretation: we use the least
amount of regularization that simultaneously makes a graph sparse and
replicable under random sampling. This interpretation requires essentially no
conditions. Under mild conditions, we show that StARS is partially sparsistent
in terms of graph estimation: i.e. with high probability, all the true edges
will be included in the selected model even when the graph size diverges with
the sample size. Empirically, the performance of StARS is compared with the
state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on
both synthetic data and a real microarray dataset. StARS outperforms all these
competing procedures.
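The StARS recipe can be sketched as below: fit a graphical lasso on many subsamples at each regularization level, measure how unstable every edge is across subsamples, and keep the least regularization whose monotonized instability stays under a threshold beta. This uses sklearn's `GraphicalLasso` as the estimator; the edge-detection threshold and the fallback when nothing is stable are implementation assumptions:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

def stars_select(X, alphas, n_subsamples=20, beta=0.05):
    """StARS-style selection of the graphical lasso penalty.
    b = 10 * sqrt(n) is the subsample size suggested in the paper."""
    n, p = X.shape
    b = min(n, int(10 * np.sqrt(n)))
    alphas_sorted = sorted(alphas, reverse=True)   # strong -> weak penalty
    instability = []
    for alpha in alphas_sorted:
        freq = np.zeros((p, p))
        for _ in range(n_subsamples):
            idx = rng.choice(n, size=b, replace=False)
            prec = GraphicalLasso(alpha=alpha).fit(X[idx]).precision_
            freq += (np.abs(prec) > 1e-8) & ~np.eye(p, dtype=bool)
        theta = freq / n_subsamples        # per-edge selection frequency
        xi = 2 * theta * (1 - theta)       # per-edge instability
        iu = np.triu_indices(p, 1)
        instability.append(xi[iu].mean())
    # monotonize: instability may only grow as the penalty weakens
    bar = np.maximum.accumulate(instability)
    ok = [a for a, d in zip(alphas_sorted, bar) if d <= beta]
    # least regularization that is still replicable; if none qualifies,
    # fall back to the strongest penalty (an assumption of this sketch)
    return ok[-1] if ok else alphas_sorted[0]
```

The instability 2·theta·(1 - theta) per edge is maximal when an edge appears in half the subsamples and vanishes when it always or never appears, matching the "sparse and replicable under random sampling" interpretation.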
What are the true clusters?
Constructivist philosophy and Hasok Chang's active scientific realism are
used to argue that the idea of "truth" in cluster analysis depends on the
context and the clustering aims. Different characteristics of clusterings are
required in different situations. Researchers should be explicit about the
requirements and the idea of "true clusters" on which their research is based, because
clustering becomes scientific not through uniqueness but through transparent
and open communication. The idea of "natural kinds" is a human construct, but
it highlights the human experience that the reality outside the observer's
control seems to make certain distinctions between categories inevitable.
Various desirable characteristics of clusterings and various approaches to
define a context-dependent truth are listed, and I discuss what impact these
ideas can have on the comparison of clustering methods, the choice of a
clustering method, and related decisions in practice.
Finding True Clusters: On the Importance of Simplicity in Science
Parametric and dimensional simplicity are not indicators of truth, but the
methodological principle that urges us to pay attention to such notions of
simplicity is truth-conducive. The truths we are looking for are specific
geometrical shapes, and we know which algorithm can find which shape, provided
that we pay attention to parametric and dimensional simplicity.
Clustering Stability: An Overview
A popular method for selecting the number of clusters is based on stability
arguments: one chooses the number of clusters such that the corresponding
clustering results are "most stable". In recent years, a series of papers has
analyzed the behavior of this method from a theoretical point of view. However,
the results are very technical and difficult to interpret for non-experts. In
this paper we give a high-level overview about the existing literature on
clustering stability. In addition to presenting the results in a slightly
informal but accessible way, we relate them to each other and discuss their
different implications.
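The stability argument for choosing the number of clusters can be sketched as follows: cluster many random subsamples, compare each pair of clusterings on the points they share, and pick the k whose clusterings agree most. The subsampling fraction, k-means as the base clusterer, and the adjusted Rand index as the agreement score are illustrative choices, and, as the overview stresses, raw stability can mislead (small k is often trivially stable), so scores should only be compared across non-trivial candidates:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

def stability_score(X, k, n_pairs=10, frac=0.8):
    """Mean pairwise agreement (adjusted Rand index, 1 = identical) of
    k-means clusterings computed on random subsamples, evaluated on the
    points the two subsamples share."""
    n = len(X)
    m = int(frac * n)
    scores = []
    for _ in range(n_pairs):
        i1 = rng.choice(n, size=m, replace=False)
        i2 = rng.choice(n, size=m, replace=False)
        l1 = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[i1])
        l2 = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[i2])
        shared = np.intersect1d(i1, i2)
        pos1 = {v: t for t, v in enumerate(i1)}
        pos2 = {v: t for t, v in enumerate(i2)}
        a = [l1[pos1[s]] for s in shared]
        b = [l2[pos2[s]] for s in shared]
        scores.append(adjusted_rand_score(a, b))
    return float(np.mean(scores))
```

On data with k well-separated groups, the correct k typically scores near 1 while an overspecified k splits groups differently on each subsample and scores lower, which is the behaviour the theoretical literature surveyed here tries to characterize precisely.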