165 research outputs found

    Dynamic Clustering of Histogram Data Based on Adaptive Squared Wasserstein Distances

    Full text link
    This paper deals with clustering methods based on adaptive distances for histogram data using a dynamic clustering algorithm. Histogram data describes individuals in terms of empirical distributions. These kind of data can be considered as complex descriptions of phenomena observed on complex objects: images, groups of individuals, spatial or temporal variant data, results of queries, environmental data, and so on. The Wasserstein distance is used to compare two histograms. The Wasserstein distance between histograms is constituted by two components: the first based on the means, and the second, to internal dispersions (standard deviation, skewness, kurtosis, and so on) of the histograms. To cluster sets of histogram data, we propose to use Dynamic Clustering Algorithm, (based on adaptive squared Wasserstein distances) that is a k-means-like algorithm for clustering a set of individuals into KK classes that are apriori fixed. The main aim of this research is to provide a tool for clustering histograms, emphasizing the different contributions of the histogram variables, and their components, to the definition of the clusters. We demonstrate that this can be achieved using adaptive distances. Two kind of adaptive distances are considered: the first takes into account the variability of each component of each descriptor for the whole set of individuals; the second takes into account the variability of each component of each descriptor in each cluster. We furnish interpretative tools of the obtained partition based on an extension of the classical measures (indexes) to the use of adaptive distances in the clustering criterion function. Applications on synthetic and real-world data corroborate the proposed procedure

    Basic statistics for probabilistic symbolic variables: a novel metric-based approach

    Full text link
    In data mining, it is usually to describe a set of individuals using some summaries (means, standard deviations, histograms, confidence intervals) that generalize individual descriptions into a typology description. In this case, data can be described by several values. In this paper, we propose an approach for computing basic statics for such data, and, in particular, for data described by numerical multi-valued variables (interval, histograms, discrete multi-valued descriptions). We propose to treat all numerical multi-valued variables as distributional data, i.e. as individuals described by distributions. To obtain new basic statistics for measuring the variability and the association between such variables, we extend the classic measure of inertia, calculated with the Euclidean distance, using the squared Wasserstein distance defined between probability measures. The distance is a generalization of the Wasserstein distance, that is a distance between quantile functions of two distributions. Some properties of such a distance are shown. Among them, we prove the Huygens theorem of decomposition of the inertia. We show the use of the Wasserstein distance and of the basic statistics presenting a k-means like clustering algorithm, for the clustering of a set of data described by modal numerical variables (distributional variables), on a real data set. Keywords: Wasserstein distance, inertia, dependence, distributional data, modal variables.Comment: 19 pages, 3 figure

    Measure based metrics for aggregated data

    Get PDF
    Aggregated data arises commonly from surveys and censuses where groups of individuals are studied as coherent entities. The aggregated data can take many forms including sets, intervals, distributions and histograms. The data analyst needs to measure the similarity between such aggregated data items and a range of metrics are reported in the literature to achieve this (e.g. the Jaccard metric for sets and the Wasserstein metric for histograms). In this paper, a unifying theory based on measure theory is developed that establishes not only that known metrics are essentially similar but also suggests new metrics

    Archetypes for histogram-valued data

    Get PDF
    Il principale sviluppo innovativo del lavoro è quello di propone una estensione dell'analisi archetipale per dati ad istogramma. Per quanto concerne l'impianto metodologico nell'approccio all'analisi di dati ad istogramma, che sono di natura complessa, il presente lavora utilizza le intuizioni della "Symbolic Data Analysis" (SDA) e le relazioni intrinseche tra dati valutati ad intervallo e dati valutati ad istogramma. Dopo aver discusso la tecnica sviluppata in ambiente Matlab, il suo funzionamento e le sue proprietà su di un esempio di comodo, tale tecnica viene proposta, nella sezione applicativa, come strumento per effettuare una analisi di tipo "benchmarking" quantitativo. Nello specifico, si propongono i principali risultati ottenuti da una applicazione degli archetipi per dati ad istogramma ad un caso di benchmarking interno del sistema scolastico, utilizzando dati provenienti dal test INVALSI relativi all'anno scolastico 2015/2016. In questo contesto l'unità di analisi è considerata essere la singola scuola, definita operativamente attraverso le distribuzioni dei punteggi dei propri alunni valutate, congiuntamente, sotto forma di oggetti simbolici ad istogramma

    Multilevel mixed-type data analysis for validating partitions of scrapie isolates

    Get PDF
    The dissertation arises from a joint study with the Department of Food Safety and Veterinary Public Health of the Istituto Superiore di Sanità. The aim is to investigate and validate the existence of distinct strains of the scrapie disease taking into account the availability of a priori benchmark partition formulated by researchers. Scrapie of small ruminants is caused by prions, which are unconventional infectious agents of proteinaceous nature a ecting humans and animals. Due to the absence of nucleic acids, which precludes direct analysis of strain variation by molecular methods, the presence of di erent sheep scrapie strains is usually investigated by bioassay in laboratory rodents. Data are collected by an experimental study on scrapie conducted at the Istituto Superiore di Sanità by experimental transmission of scrapie isolates to bank voles. We aim to discuss the validation of a given partition in a statistical classification framework using a multi-step procedure. Firstly, we use unsupervised classification to see how alternative clustering results match researchers’ understanding of the heterogeneity of the isolates. We discuss whether and how clustering results can be eventually exploited to extend the preliminary partition elicited by researchers. Then we motivate the subsequent partition validation based on the predictive performance of several supervised classifiers. Our data-driven approach contains two main methodological original contributions. We advocate the use of partition validation measures to investigate a given benchmark partition: firstly we discuss the issue of how the data can be used to evaluate a preliminary benchmark partition and eventually modify it with statistical results to find a conclusive partition that could be used as a “gold standard” in future studies. Moreover, collected data have a multilevel structure and for each lower-level unit, mixed-type data are available. Each step in the procedure is then adapted to deal with multilevel mixed-type data. We extend distance-based clustering algorithms to deal with multilevel mixed-type data. Whereas in supervised classification we propose a two-step approach to classify the higher-level units starting from the lower-level observations. In this framework, we also need to define an ad-hoc cross validation algorithm
    • …
    corecore