70 research outputs found

Two-level histograms for dealing with outliers and heavy tail distributions

Author: Boullé Marc
Publication date: 09/06/2023

Histograms are among the most popular methods used in exploratory analysis to summarize univariate distributions. In particular, irregular histograms are good non-parametric density estimators that require very few parameters: the number of bins with their lengths and frequencies. Many approaches have been proposed in the literature to infer these parameters, either assuming hypotheses about the underlying data distributions or exploiting a model selection approach. In this paper, we focus on the G-Enum histogram method, which exploits the Minimum Description Length (MDL) principle to build histograms without any user parameter and achieves state-of-the-art performance with respect to accuracy, parsimony, and computation time. We investigate the limits of this method in the case of outliers or heavy-tailed distributions. We suggest a two-level heuristic to deal with such cases. The first level exploits a logarithmic transformation of the data to split the data set into a list of data subsets with a controlled range of values. The second level builds a sub-histogram for each data subset and aggregates them to obtain a complete histogram. Extensive experiments show the benefits of the approach. Comment: 30 pages, 47 figures

arXiv.org e-Print Archive
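
The two-level idea summarized in the abstract above can be illustrated with a minimal Python sketch. This is not the G-Enum method or the authors' heuristic: the function name `two_level_histogram`, the grouping by sign and order of magnitude, and the equal-width sub-histograms are all assumptions chosen only to show how a logarithmic split keeps each subset's range under control before per-subset bins are aggregated.

```python
import math
from collections import defaultdict


def two_level_histogram(values, bins_per_subset=4):
    """Return a sorted list of (lower, upper, count) bins.

    Level 1 (assumed rule): group values by sign and floor(log10(|x|)),
    so each subset spans at most one order of magnitude.
    Level 2 (stand-in for G-Enum): build an equal-width sub-histogram
    per subset, then aggregate all bins into one list.
    """
    subsets = defaultdict(list)
    for x in values:
        if x == 0:
            key = (0, 0)  # zero has no logarithm; keep it in its own subset
        else:
            key = (1 if x > 0 else -1, math.floor(math.log10(abs(x))))
        subsets[key].append(x)

    bins = []
    for subset in subsets.values():
        lo, hi = min(subset), max(subset)
        if lo == hi:
            # Degenerate subset (e.g. a lone outlier): one zero-width bin.
            bins.append((lo, hi, len(subset)))
            continue
        width = (hi - lo) / bins_per_subset
        counts = [0] * bins_per_subset
        for x in subset:
            # Clamp so that the maximum value falls in the last bin.
            idx = min(int((x - lo) / width), bins_per_subset - 1)
            counts[idx] += 1
        for i, count in enumerate(counts):
            bins.append((lo + i * width, lo + (i + 1) * width, count))
    return sorted(bins)


if __name__ == "__main__":
    for lower, upper, count in two_level_histogram([1.0, 2.0, 3.0, 5.0, 9.0, 1e6]):
        print(f"[{lower:g}, {upper:g}]: {count}")
```

In this toy input the single large value ends up in its own subset, so it does not stretch the bins that describe the bulk of the data.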

A Triclustering Approach for Time Evolving Graphs

Author: Boullé Marc
Guigourès Romain
Rossi Fabrice
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 10/12/2012

This paper introduces a novel technique to track structures in time-evolving graphs. The method is based on a parameter-free approach for three-dimensional co-clustering of the source vertices, the target vertices and the time. All these features are simultaneously segmented in order to build time segments and clusters of vertices whose edge distributions are similar and evolve in the same way over the time segments. The main novelty of this approach is that the time segments are directly inferred from the evolution of the edge distribution between the vertices, so the user does not need to make an a priori discretization. Experiments conducted on a synthetic dataset illustrate the good behaviour of the technique, and a study of a real-life dataset shows the potential of the proposed approach for exploratory data analysis.

arXiv.org e-Print Archive

CiteSeerX

Crossref

HAL-Paris1

Discovering Patterns in Time-Varying Graphs: A Triclustering Approach

Author: Boullé Marc
Guigourès Romain
Rossi Fabrice
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/09/2018

This paper introduces a novel technique to track structures in time-varying graphs. The method uses a maximum a posteriori approach to fit a three-dimensional co-clustering of the source vertices, the destination vertices and the time to the data under study, in a way that does not require any hyper-parameter tuning. The three dimensions are simultaneously segmented in order to build clusters of source vertices, destination vertices and time segments where the edge distributions across clusters of vertices follow the same evolution over the time segments. The main novelty of this approach is that the time segments are directly inferred from the evolution of the edge distribution between the vertices, so the user does not need to make any a priori quantization. Experiments conducted on artificial data illustrate the good behavior of the technique, and a study of a real-life data set shows the potential of the proposed approach for exploratory data analysis.

HAL-Paris1

Triclustering pour la détection de structures temporelles dans les graphes

Author: Boullé Marc
Guigourès Romain
Rossi Fabrice
Publication venue: HAL CCSD
Publication date: 17/10/2012

This paper introduces a novel technique to track structures in time-evolving graphs. The method is based on a parameter-free approach for three-dimensional co-clustering of the source vertices, the target vertices and the time. All these features are simultaneously segmented in order to build time segments and clusters of vertices whose edge distributions are similar and evolve in the same way over the time segments. The main novelty of this approach is that the time segments are directly inferred from the evolution of the edge distribution between the vertices, so the user does not need to make an a priori discretization. Experiments conducted on a synthetic dataset illustrate the good behaviour of the technique, and a study of a real-life dataset shows the potential of the proposed approach for exploratory data analysis.

HAL-Paris1

Predicting Dangerous Seismic Events in Coal Mines under Distribution Drift

Author: Marc Boullé
Publication venue: 'Polish Information Processing Society PTI'
Publication date: 01/10/2016

Crossref

Directory of Open Access Journals

Étude des corrélations spatio-temporelles des appels mobiles en France

Author: Boullé Marc
Guigourès Romain
Rossi Fabrice
Publication venue: HAL CCSD
Publication date: 29/01/2013

In this article, we present an application that analyses a large database from the telecommunications sector. The problem is to segment a territory and to characterize the resulting zones through the mobile phone behaviour of their inhabitants. To this end, we use a network of inter-antenna calls built over a five-month period covering the whole of France. We propose a two-phase analysis. The first phase groups together the emitting antennas whose calls are similarly distributed over the receiving antennas, and vice versa. Projecting these antenna groups onto a map of France makes it possible to visualize the correlations between the geography of the territory and the telephone behaviour of its inhabitants. The second phase divides the year into periods between which a change is observed in the distributions of calls originating from the antenna groups. This makes it possible to characterize how the behaviour of mobile users evolves over time in each zone of the country.

HAL-Paris1

Segmentation géographique par étude d'un journal d'appels téléphoniques

Author: Boullé Marc
Guigourès Romain
Rossi Fabrice
Publication venue: HAL CCSD
Publication date: 19/10/2011

This article addresses geographic segmentation through the study of a call log aggregated by city. Instead of directly clustering the nodes, we propose to co-cluster the edges, defined as two-dimensional instances described by two variables: the source node and the target node. Once the optimal segmentation is obtained, the clusters are merged successively so as to degrade the clustering model as little as possible. Experiments were conducted on a call log from the Belgian telecommunications operator Mobistar.

HAL-Paris1

Clustering hiérarchique non paramétrique de données fonctionnelles

Author: Boullé Marc
Guigourès Romain
Rossi Fabrice
Publication venue: HAL CCSD
Publication date: 31/01/2012

This article addresses the clustering of curves. We propose a non-parametric method that segments the curves into clusters and discretizes into intervals the continuous variables describing the points of the curve. The Cartesian product of these partitions forms a data grid, which is inferred using a Bayesian model selection approach that makes no assumption about the curves. Finally, we propose a post-processing technique that reduces the number of clusters in order to improve their interpretability. It consists of successively and optimally merging the clusters, which amounts to an agglomerative hierarchical clustering whose dissimilarity measure is the variation of the criterion. Interestingly, this measure is in fact a weighted sum of Kullback-Leibler divergences between the cluster distributions before and after merging. The value of the approach for the exploratory analysis of functional data is illustrated on an artificial data set and a real one.

HAL-Paris1
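
The post-processing step described in the last abstract, merging clusters under a dissimilarity that is a weighted sum of Kullback-Leibler divergences, can be sketched in Python as follows. This is a hedged illustration, not the authors' criterion: the names `kl`, `merge_cost` and `greedy_merge` are hypothetical, clusters are represented as plain count vectors over shared intervals, and weighting each divergence by the cluster's total count is an assumption.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_cost(a, b):
    """Cost of merging two clusters given as count vectors.

    Assumed form: each cluster's divergence from the merged distribution,
    weighted by that cluster's total count.
    """
    na, nb = sum(a), sum(b)
    merged = [x + y for x, y in zip(a, b)]
    pm = [x / (na + nb) for x in merged]
    pa = [x / na for x in a]
    pb = [x / nb for x in b]
    return na * kl(pa, pm) + nb * kl(pb, pm)


def greedy_merge(clusters, target):
    """Merge the cheapest pair repeatedly until `target` clusters remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > target:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        merged = [x + y for x, y in zip(clusters[i], clusters[j])]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


if __name__ == "__main__":
    # Two clusters with nearly identical distributions and one very different.
    print(greedy_merge([[10, 0], [9, 1], [0, 10]], target=2))
```

With these toy counts, the two clusters whose distributions are closest are merged first, which is the behaviour an agglomerative scheme of this kind is meant to produce.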