
    Automatic topography of high-dimensional data sets by non-parametric density peak clustering

    Data analysis in high-dimensional spaces aims at obtaining a synthetic description of a data set, revealing its main structure and its salient features. Here we introduce an approach that provides this description in the form of a topography of the data, namely a human-readable chart of the probability density from which the data are harvested. The approach is based on an unsupervised extension of Density Peak clustering and on a non-parametric density estimator that measures the probability density in the manifold containing the data. This allows automatically finding the number and heights of the peaks of the probability density, and the depths of the “valleys” separating them. Importantly, the density estimator provides a measure of the error, which allows distinguishing genuine density peaks from density fluctuations due to finite sampling. The approach thus provides robust, visual information about the heights of the density peaks, their statistical reliability, and their hierarchical organization, offering a conceptually powerful extension of standard clustering partitions. We show that this framework is particularly useful in the analysis of complex data sets.
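
    The core decision-graph idea underlying Density Peak clustering can be illustrated with a minimal sketch: each point gets a local density rho and a distance delta to the nearest denser point, and cluster centres stand out as having both values large. This is a toy rendering of the standard Rodriguez–Laio scheme, not the unsupervised extension or the manifold-aware, error-estimating density estimator described in the abstract; the function name and cutoff parameter `dc` are illustrative.

    ```python
    import numpy as np

    def density_peaks(X, dc):
        """Toy sketch of the Density Peak decision graph.

        X  : (n, d) array of points
        dc : cutoff radius for the crude local-density estimate
        Returns (rho, delta): for each point, its local density and its
        distance to the nearest point of higher density.
        """
        n = len(X)
        # full pairwise distance matrix (fine for small toy data sets)
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        # local density: number of neighbours within dc, excluding the point itself
        rho = (dist < dc).sum(axis=1) - 1
        # delta: distance to the nearest higher-density point
        delta = np.empty(n)
        order = np.argsort(-rho)               # indices by decreasing density
        delta[order[0]] = dist[order[0]].max()  # convention for the global peak
        for rank, i in enumerate(order[1:], start=1):
            higher = order[:rank]               # all points denser than point i
            delta[i] = dist[i, higher].min()
        return rho, delta
    ```

    Points maximising the product rho * delta are the density peaks; everything else is assigned to the same cluster as its nearest denser neighbour.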


    Intrinsic Dimension Estimation for non-Euclidean manifolds: from metagenomics to unweighted networks

    Within the field of unsupervised manifold learning, Intrinsic Dimension estimators are among the most important analysis tools. The Intrinsic Dimension provides a measure of the dimensionality of the hidden manifold from which data are sampled, even if the manifold is embedded in a space with a much higher number of features. The present thesis tackles the still unanswered problem of computing the Intrinsic Dimension (ID) of spaces characterised by non-Euclidean metrics. In particular, we focus on datasets where the distances between points are measured by means of Manhattan, Hamming, or shortest-path metrics and can therefore only assume discrete values. This peculiarity has deep consequences for the way datapoints populate neighbourhoods and for the structure of the manifold. For this reason, we develop a general-purpose, nearest-neighbours-based ID estimator with two distinctive features: the ability to explicitly select the scale at which the Intrinsic Dimension is computed, and a validation procedure to check the reliability of the resulting estimate. We then specialise the estimator to lattice spaces, where volumes are measured by means of Ehrhart polynomials. After testing the reliability of the estimator on artificial datasets, we apply it to genomic sequences and discover an unexpectedly low ID, suggesting that evolutionary pressure exerts strong restraints on the ways the nucleotide bases are allowed to mutate. The same framework is then employed to profile the scaling of the ID of unweighted networks. The diversity of the resulting ID profiles prompted us to use the ID as a signature to characterise the networks. Concretely, we employ the ID as a summary statistic within an Approximate Bayesian Computation framework in order to pinpoint the parameters of mechanistic network generative models of increasing complexity. We discover that, by targeting the ID of a given network, other typical network properties are also fairly well retrieved.
As a final methodological development, we improve the ID estimator by adaptively selecting, for each datapoint, the largest neighbourhood with an approximately constant density. This offers a quantitative criterion for automatically selecting a meaningful scale at which the ID is computed and, at the same time, allows the hypotheses of the method to be enforced, yielding more reliable estimates.
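
The flavour of nearest-neighbours ID estimation can be conveyed with the classic TWO-NN estimator (Facco et al.), shown here as a minimal Euclidean sketch; it is not the thesis's discrete-metric, scale-selecting estimator. For each point, the ratio mu = r2/r1 of the distances to its second and first neighbours follows, under a locally constant density, a Pareto law with exponent equal to the intrinsic dimension d, yielding the maximum-likelihood estimate d = N / sum(log mu_i).

```python
import numpy as np

def twonn_id(X):
    """Toy TWO-NN intrinsic-dimension estimate for Euclidean data.

    X : (n, d) array of points sampled from some hidden manifold.
    Returns the maximum-likelihood ID from the two-neighbour
    distance ratios mu = r2 / r1.
    """
    # pairwise distances; row-sorted so column 0 is the self-distance 0
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dist_sorted = np.sort(dist, axis=1)
    r1, r2 = dist_sorted[:, 1], dist_sorted[:, 2]  # first and second neighbours
    mu = r2 / r1
    return len(X) / np.log(mu).sum()
```

Because only the two nearest neighbours of each point are used, the estimate probes the manifold at the smallest available scale; the thesis's estimators generalise this picture to discrete metrics, explicit scale selection, and validated neighbourhoods.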