Search CORE

165 research outputs found

Dynamic Clustering of Histogram Data Based on Adaptive Squared Wasserstein Distances

Author: Ahmad
Antonio Irpino
Bock
Calinski
Calo
Celeux
Chan
Chen
Clark
Cuesta-Albertos
De Carvalho
De Carvalho
De Carvalho
De Carvalho
De Souza
Deng
Diday
Diday
Francisco de A.T. De Carvalho
Friedman
Frigui
Gibbs
Huang
Hubert
Irpino
Irpino
Jain
Jing
Johnson
Levina
Mallows
Milligan
Rosanna Verde
Rubner
Rüshendorff
Terada
Tsai
Verde
Verde
Verde
Villani
Vrac
Xu
Publication venue: 'Elsevier BV'
Publication date: 07/10/2011
Field of study

This paper deals with clustering methods based on adaptive distances for histogram data using a dynamic clustering algorithm. Histogram data describes individuals in terms of empirical distributions. These kind of data can be considered as complex descriptions of phenomena observed on complex objects: images, groups of individuals, spatial or temporal variant data, results of queries, environmental data, and so on. The Wasserstein distance is used to compare two histograms. The Wasserstein distance between histograms is constituted by two components: the first based on the means, and the second, to internal dispersions (standard deviation, skewness, kurtosis, and so on) of the histograms. To cluster sets of histogram data, we propose to use Dynamic Clustering Algorithm, (based on adaptive squared Wasserstein distances) that is a k-means-like algorithm for clustering a set of individuals into

K

classes that are apriori fixed. The main aim of this research is to provide a tool for clustering histograms, emphasizing the different contributions of the histogram variables, and their components, to the definition of the clusters. We demonstrate that this can be achieved using adaptive distances. Two kind of adaptive distances are considered: the first takes into account the variability of each component of each descriptor for the whole set of individuals; the second takes into account the variability of each component of each descriptor in each cluster. We furnish interpretative tools of the obtained partition based on an extension of the classical measures (indexes) to the use of adaptive distances in the clustering criterion function. Applications on synthetic and real-world data corroborate the proposed procedure

arXiv.org e-Print Archive

Crossref

Basic statistics for probabilistic symbolic variables: a novel metric-based approach

Author: Irpino Antonio
Verde Rosanna
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 10/12/2013
Field of study

In data mining, it is usually to describe a set of individuals using some summaries (means, standard deviations, histograms, confidence intervals) that generalize individual descriptions into a typology description. In this case, data can be described by several values. In this paper, we propose an approach for computing basic statics for such data, and, in particular, for data described by numerical multi-valued variables (interval, histograms, discrete multi-valued descriptions). We propose to treat all numerical multi-valued variables as distributional data, i.e. as individuals described by distributions. To obtain new basic statistics for measuring the variability and the association between such variables, we extend the classic measure of inertia, calculated with the Euclidean distance, using the squared Wasserstein distance defined between probability measures. The distance is a generalization of the Wasserstein distance, that is a distance between quantile functions of two distributions. Some properties of such a distance are shown. Among them, we prove the Huygens theorem of decomposition of the inertia. We show the use of the Wasserstein distance and of the basic statistics presenting a k-means like clustering algorithm, for the clustering of a set of data described by modal numerical variables (distributional variables), on a real data set. Keywords: Wasserstein distance, inertia, dependence, distributional data, modal variables.Comment: 19 pages, 3 figure

arXiv.org e-Print Archive

Archivio Istituzionale della Ricerca - Università degli Studi della Campania "Luigi Vanvitelli"

Measure based metrics for aggregated data

Author: Rayward-Smith V. J.
Publication venue: 'IOS Press'
Publication date: 01/01/2011
Field of study

Aggregated data arises commonly from surveys and censuses where groups of individuals are studied as coherent entities. The aggregated data can take many forms including sets, intervals, distributions and histograms. The data analyst needs to measure the similarity between such aggregated data items and a range of metrics are reported in the literature to achieve this (e.g. the Jaccard metric for sets and the Wasserstein metric for histograms). In this paper, a unifying theory based on measure theory is developed that establishes not only that known metrics are essentially similar but also suggests new metrics

University of East Anglia digital repository

Archetypes for histogram-valued data

Author: Santelli Francesco
Publication venue
Publication date: 01/12/2018
Field of study

Il principale sviluppo innovativo del lavoro è quello di propone una estensione dell'analisi archetipale per dati ad istogramma. Per quanto concerne l'impianto metodologico nell'approccio all'analisi di dati ad istogramma, che sono di natura complessa, il presente lavora utilizza le intuizioni della "Symbolic Data Analysis" (SDA) e le relazioni intrinseche tra dati valutati ad intervallo e dati valutati ad istogramma. Dopo aver discusso la tecnica sviluppata in ambiente Matlab, il suo funzionamento e le sue proprietà su di un esempio di comodo, tale tecnica viene proposta, nella sezione applicativa, come strumento per effettuare una analisi di tipo "benchmarking" quantitativo. Nello specifico, si propongono i principali risultati ottenuti da una applicazione degli archetipi per dati ad istogramma ad un caso di benchmarking interno del sistema scolastico, utilizzando dati provenienti dal test INVALSI relativi all'anno scolastico 2015/2016. In questo contesto l'unità di analisi è considerata essere la singola scuola, definita operativamente attraverso le distribuzioni dei punteggi dei propri alunni valutate, congiuntamente, sotto forma di oggetti simbolici ad istogramma

Università degli Studi di Napoli Federico Il Open Archive

Multilevel mixed-type data analysis for validating partitions of scrapie isolates

Author: Rocco Giorgia
Publication venue
Publication date: 25/09/2017
Field of study

The dissertation arises from a joint study with the Department of Food Safety and Veterinary Public Health of the Istituto Superiore di Sanità. The aim is to investigate and validate the existence of distinct strains of the scrapie disease taking into account the availability of a priori benchmark partition formulated by researchers. Scrapie of small ruminants is caused by prions, which are unconventional infectious agents of proteinaceous nature a ecting humans and animals. Due to the absence of nucleic acids, which precludes direct analysis of strain variation by molecular methods, the presence of di erent sheep scrapie strains is usually investigated by bioassay in laboratory rodents. Data are collected by an experimental study on scrapie conducted at the Istituto Superiore di Sanità by experimental transmission of scrapie isolates to bank voles. We aim to discuss the validation of a given partition in a statistical classification framework using a multi-step procedure. Firstly, we use unsupervised classification to see how alternative clustering results match researchers’ understanding of the heterogeneity of the isolates. We discuss whether and how clustering results can be eventually exploited to extend the preliminary partition elicited by researchers. Then we motivate the subsequent partition validation based on the predictive performance of several supervised classifiers. Our data-driven approach contains two main methodological original contributions. We advocate the use of partition validation measures to investigate a given benchmark partition: firstly we discuss the issue of how the data can be used to evaluate a preliminary benchmark partition and eventually modify it with statistical results to find a conclusive partition that could be used as a “gold standard” in future studies. Moreover, collected data have a multilevel structure and for each lower-level unit, mixed-type data are available. Each step in the procedure is then adapted to deal with multilevel mixed-type data. We extend distance-based clustering algorithms to deal with multilevel mixed-type data. Whereas in supervised classification we propose a two-step approach to classify the higher-level units starting from the lower-level observations. In this framework, we also need to define an ad-hoc cross validation algorithm

Archivio della ricerca- Università di Roma La Sapienza