14 research outputs found

    Geodesic distances in the intrinsic dimensionality estimation using packing numbers

    Get PDF
    Dimensionality reduction is a very important tool in data mining. An intrinsic dimensionality of a data set is a key parameter in many dimensionality reduction algorithms. When the intrinsic dimensionality of a data set is known, it is possible to reduce the dimensionality of the data without losing much information. To this end, it is reasonable to find out the intrinsic dimensionality of the data. In this paper, one of the global estimators of intrinsic dimensionality, the packing numbers estimator (PNE), is explored experimentally. We propose the modification of the PNE method that uses geodesic distances in order to improve the estimates of the intrinsic dimensionality by the PNE method

    Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework

    Get PDF
    When dealing with datasets comprising high-dimensional points, it is usually advantageous to discover some data structure. A fundamental information needed to this aim is the minimum number of parameters required to describe the data while minimizing the information loss. This number, usually called intrinsic dimension, can be interpreted as the dimension of the manifold from which the input data are supposed to be drawn. Due to its usefulness in many theoretical and practical problems, in the last decades the concept of intrinsic dimension has gained considerable attention in the scientific community, motivating the large number of intrinsic dimensionality estimators proposed in the literature. However, the problem is still open since most techniques cannot efficiently deal with datasets drawn from manifolds of high intrinsic dimension and nonlinearly embedded in higher dimensional spaces. This paper surveys some of the most interesting, widespread used, and advanced state-of-the-art methodologies. Unfortunately, since no benchmark database exists in this research field, an objective comparison among different techniques is not possible. Consequently, we suggest a benchmark framework and apply it to comparatively evaluate relevant state-of-the-art estimators

    Estimating the intrinsic dimension of datasets by a minimal neighborhood information

    Get PDF
    Analyzing large volumes of high-dimensional data is an issue of fundamental importance in data science, molecular simulations and beyond. Several approaches work on the assumption that the important content of a dataset belongs to a manifold whose Intrinsic Dimension (ID) is much lower than the crude large number of coordinates. Such manifold is generally twisted and curved; in addition points on it will be non-uniformly distributed: two factors that make the identification of the ID and its exploitation really hard. Here we propose a new ID estimator using only the distance of the first and the second nearest neighbor of each point in the sample. This extreme minimality enables us to reduce the effects of curvature, of density variation, and the resulting computational cost. The ID estimator is theoretically exact in uniformly distributed datasets, and provides consistent measures in general. When used in combination with block analysis, it allows discriminating the relevant dimensions as a function of the block size. This allows estimating the ID even when the data lie on a manifold perturbed by a high-dimensional noise, a situation often encountered in real world data sets. We demonstrate the usefulness of the approach on molecular simulations and image analysis

    Supervised classification via constrained subspace and tensor sparse representation

    Get PDF
    SRC, a supervised classifier via sparse representation, has rapidly gained popularity in recent years and can be adapted to a wide range of applications based on the sparse solution of a linear system. First, we offer an intuitive geometric model called constrained subspace to explain the mechanism of SRC. The constrained subspace model connects the dots of NN, NFL, NS, NM. Then, inspired from the constrained subspace model, we extend SRC to its tensor-based variant, which takes as input samples of high-order tensors which are elements of an algebraic ring. A tensor sparse representation is used for query tensors. We verify in our experiments on several publicly available databases that the tensor-based SRC called tSRC outperforms traditional SRC in classification accuracy. Although demonstrated for image recognition, tSRC is easily adapted to other applications involving underdetermined linear systems

    Multiscale Geometric Methods for Data Sets I: Multiscale SVD, Noise and Curvature

    Get PDF
    Large data sets are often modeled as being noisy samples from probability distributions in R^D, with D large. It has been noticed that oftentimes the support M of these probability distributions seems to be well-approximated by low-dimensional sets, perhaps even by manifolds. We shall consider sets that are locally well approximated by k-dimensional planes, with k << D, with k-dimensional manifolds isometrically embedded in R^D being a special case. Samples from this distribution; are furthermore corrupted by D-dimensional noise. Certain tools from multiscale geometric measure theory and harmonic analysis seem well-suited to be adapted to the study of samples from such probability distributions, in order to yield quantitative geometric information about them. In this paper we introduce and study multiscale covariance matrices, i.e. covariances corresponding to the distribution restricted to a ball of radius r, with a fixed center and varying r, and under rather general geometric assumptions we study how their empirical, noisy counterparts behave. We prove that in the range of scales where these covariance matrices are most informative, the empirical, noisy covariances are close to their expected, noiseless counterparts. In fact, this is true as soon as the number of samples in the balls where the covariance matrices are computed is linear in the intrinsic dimension of M. As an application, we present an algorithm for estimating the intrinsic dimension of M

    Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework

    Get PDF
    When dealing with datasets comprising high-dimensional points, it is usually advantageous to discover some data structure. A fundamental information needed to this aim is the minimum number of parameters required to describe the data while minimizing the information loss. This number, usually called intrinsic dimension, can be interpreted as the dimension of the manifold from which the input data are supposed to be drawn. Due to its usefulness in many theoretical and practical problems, in the last decades the concept of intrinsic dimension has gained considerable attention in the scientific community, motivating the large number of intrinsic dimensionality estimators proposed in the literature. However, the problem is still open since most techniques cannot efficiently deal with datasets drawn from manifolds of high intrinsic dimension and nonlinearly embedded in higher dimensional spaces. This paper surveys some of the most interesting, widespread used, and advanced state-of-the-art methodologies. Unfortunately, since no benchmark database exists in this research field, an objective comparison among different techniques is not possible. Consequently, we suggest a benchmark framework and apply it to comparatively evaluate relevant stateof-the-art estimators
    corecore