1,554 research outputs found

    Estimation of instrinsic dimension via clustering

    Full text link
    The problem of estimating the intrinsic dimension of a set of points in high dimensional space is a critical issue for a wide range of disciplines, including genomics, finance, and networking. Current estimation techniques are dependent on either the ambient or intrinsic dimension in terms of computational complexity, which may cause these methods to become intractable for large data sets. In this paper, we present a clustering-based methodology that exploits the inherent self-similarity of data to efficiently estimate the intrinsic dimension of a set of points. When the data satisfies a specified general clustering condition, we prove that the estimated dimension approaches the true Hausdorff dimension. Experiments show that the clustering-based approach allows for more efficient and accurate intrinsic dimension estimation compared with all prior techniques, even when the data does not conform to obvious self-similarity structure. Finally, we present empirical results which show the clustering-based estimation allows for a natural partitioning of the data points that lie on separate manifolds of varying intrinsic dimension

    The Data Big Bang and the Expanding Digital Universe: High-Dimensional, Complex and Massive Data Sets in an Inflationary Epoch

    Get PDF
    Recent and forthcoming advances in instrumentation, and giant new surveys, are creating astronomical data sets that are not amenable to the methods of analysis familiar to astronomers. Traditional methods are often inadequate not merely because of the size in bytes of the data sets, but also because of the complexity of modern data sets. Mathematical limitations of familiar algorithms and techniques in dealing with such data sets create a critical need for new paradigms for the representation, analysis and scientific visualization (as opposed to illustrative visualization) of heterogeneous, multiresolution data across application domains. Some of the problems presented by the new data sets have been addressed by other disciplines such as applied mathematics, statistics and machine learning and have been utilized by other sciences such as space-based geosciences. Unfortunately, valuable results pertaining to these problems are mostly to be found only in publications outside of astronomy. Here we offer brief overviews of a number of concepts, techniques and developments, some "old" and some new. These are generally unknown to most of the astronomical community, but are vital to the analysis and visualization of complex datasets and images. In order for astronomers to take advantage of the richness and complexity of the new era of data, and to be able to identify, adopt, and apply new solutions, the astronomical community needs a certain degree of awareness and understanding of the new concepts. One of the goals of this paper is to help bridge the gap between applied mathematics, artificial intelligence and computer science on the one side and astronomy on the other.Comment: 24 pages, 8 Figures, 1 Table. Accepted for publication: "Advances in Astronomy, special issue "Robotic Astronomy

    Methods for Estimation of Intrinsic Dimensionality

    Get PDF
    Dimension reduction is an important tool used to describe the structure of complex data (explicitly or implicitly) through a small but sufficient number of variables, and thereby make data analysis more efficient. It is also useful for visualization purposes. Dimension reduction helps statisticians to overcome the ‘curse of dimensionality’. However, most dimension reduction techniques require the intrinsic dimension of the low-dimensional subspace to be fixed in advance. The availability of reliable intrinsic dimension (ID) estimation techniques is of major importance. The main goal of this thesis is to develop algorithms for determining the intrinsic dimensions of recorded data sets in a nonlinear context. Whilst this is a well-researched topic for linear planes, based mainly on principal components analysis, relatively little attention has been paid to ways of estimating this number for non–linear variable interrelationships. The proposed algorithms here are based on existing concepts that can be categorized into local methods, relying on randomly selected subsets of a recorded variable set, and global methods, utilizing the entire data set. This thesis provides an overview of ID estimation techniques, with special consideration given to recent developments in non–linear techniques, such as charting manifold and fractal–based methods. Despite their nominal existence, the practical implementation of these techniques is far from straightforward. The intrinsic dimension is estimated via Brand’s algorithm by examining the growth point process, which counts the number of points in hyper-spheres. The estimation needs to determine the starting point for each hyper-sphere. In this thesis we provide settings for selecting starting points which work well for most data sets. Additionally we propose approaches for estimating dimensionality via Brand’s algorithm, the Dip method and the Regression method. Other approaches are proposed for estimating the intrinsic dimension by fractal dimension estimation methods, which exploit the intrinsic geometry of a data set. The most popular concept from this family of methods is the correlation dimension, which requires the estimation of the correlation integral for a ball of radius tending to 0. In this thesis we propose new approaches to approximate the correlation integral in this limit. The new approaches are the Intercept method, the Slop method and the Polynomial method. In addition we propose a new approach, a localized global method, which could be defined as a local version of global ID methods. The objective of the localized global approach is to improve the algorithm based on a local ID method, which could significantly reduce the negative bias. Experimental results on real world and simulated data are used to demonstrate the algorithms and compare them to other methodology. A simulation study which verifies the effectiveness of the proposed methods is also provided. Finally, these algorithms are contrasted using a recorded data set from an industrial melter process

    Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework

    Get PDF
    When dealing with datasets comprising high-dimensional points, it is usually advantageous to discover some data structure. A fundamental information needed to this aim is the minimum number of parameters required to describe the data while minimizing the information loss. This number, usually called intrinsic dimension, can be interpreted as the dimension of the manifold from which the input data are supposed to be drawn. Due to its usefulness in many theoretical and practical problems, in the last decades the concept of intrinsic dimension has gained considerable attention in the scientific community, motivating the large number of intrinsic dimensionality estimators proposed in the literature. However, the problem is still open since most techniques cannot efficiently deal with datasets drawn from manifolds of high intrinsic dimension and nonlinearly embedded in higher dimensional spaces. This paper surveys some of the most interesting, widespread used, and advanced state-of-the-art methodologies. Unfortunately, since no benchmark database exists in this research field, an objective comparison among different techniques is not possible. Consequently, we suggest a benchmark framework and apply it to comparatively evaluate relevant state-of-the-art estimators
    corecore