35 research outputs found

    Intrinsic dimension of a dataset: what properties does one expect?

    We propose an axiomatic approach to the concept of an intrinsic dimension of a dataset, based on a viewpoint of geometry of high-dimensional structures. Our first axiom postulates that high values of dimension be indicative of the presence of the curse of dimensionality (in a certain precise mathematical sense). The second axiom requires the dimension to depend smoothly on a distance between datasets (so that the dimension of a dataset and that of an approximating principal manifold would be close to each other). The third axiom is a normalization condition: the dimension of the Euclidean n-sphere S^n is Θ(n). We give an example of a dimension function satisfying our axioms, even though it is in general computationally unfeasible, and discuss a computationally cheap function satisfying most but not all of our axioms (the "intrinsic dimensionality" of Chávez et al.).
    Comment: 6 pages, 6 figures, 1 table, LaTeX with IEEE macros, final submission to Proceedings of the 22nd IJCNN (Orlando, FL, August 12-17, 2007).
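The "computationally cheap function" mentioned in this abstract is usually stated as the squared mean of the pairwise distances divided by twice their variance. The sketch below assumes that definition and Euclidean distances; the function names and the Gaussian test clouds are illustrative and do not come from the paper.

```python
import math
import random
import statistics
from itertools import combinations


def intrinsic_dimensionality(points):
    """Return mu^2 / (2 * sigma^2) over all pairwise Euclidean distances.

    Assumption: this is the commonly quoted form of the statistic of
    Chavez et al.; it is not taken from the abstract itself.
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    mu = statistics.fmean(dists)
    var = statistics.pvariance(dists, mu)
    return mu * mu / (2.0 * var)


def gaussian_cloud(n, dim, seed):
    """Hypothetical test data: n standard-normal points in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


low = intrinsic_dimensionality(gaussian_cloud(200, 2, seed=0))
high = intrinsic_dimensionality(gaussian_cloud(200, 10, seed=0))
print(f"2-D cloud: {low:.2f}, 10-D cloud: {high:.2f}")
```

The statistic grows as distances concentrate around their mean, which is why the higher-dimensional cloud should score higher. The cost is quadratic in the number of points, so a random subsample of pairs would be a reasonable substitute on large datasets.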

    A New Estimator of Intrinsic Dimension Based on the Multipoint Morisita Index

    The size of datasets has been increasing rapidly both in terms of number of variables and number of events. As a result, the empty space phenomenon and the curse of dimensionality complicate the extraction of useful information. But, in general, data lie on non-linear manifolds of much lower dimension than that of the spaces in which they are embedded. In many pattern recognition tasks, learning these manifolds is a key issue and it requires the knowledge of their true intrinsic dimension. This paper introduces a new estimator of intrinsic dimension based on the multipoint Morisita index. It is applied to both synthetic and real datasets of varying complexities and comparisons with other existing estimators are carried out. The proposed estimator turns out to be fairly robust to sample size and noise, unaffected by edge effects, able to handle large datasets and computationally efficient

    Dimension Detection with Local Homology

    Detecting the dimension of a hidden manifold from a point sample has become an important problem in the current data-driven era. Indeed, estimating the shape dimension is often the first step in studying the processes or phenomena associated to the data. Among the many dimension detection algorithms proposed in various fields, a few can provide theoretical guarantee on the correctness of the estimated dimension. However, the correctness usually requires certain regularity of the input: the input points are either uniformly randomly sampled in a statistical setting, or they form the so-called (ε,δ)-sample which can be neither too dense nor too sparse. Here, we propose a purely topological technique to detect dimensions. Our algorithm is provably correct and works under a more relaxed sampling condition: we do not require uniformity, and we also allow Hausdorff noise. Our approach detects dimension by determining local homology. The computation of this topological structure is much less sensitive to the local distribution of points, which leads to the relaxation of the sampling conditions. Furthermore, by leveraging various developments in computational topology, we show that this local homology at a point z can be computed exactly for manifolds using Vietoris-Rips complexes whose vertices are confined within a local neighborhood of z. We implement our algorithm and demonstrate the accuracy and robustness of our method using both synthetic and real data sets

    Geodesic distances in the intrinsic dimensionality estimation using packing numbers

    Dimensionality reduction is a very important tool in data mining. The intrinsic dimensionality of a data set is a key parameter in many dimensionality reduction algorithms. When the intrinsic dimensionality of a data set is known, it is possible to reduce the dimensionality of the data without losing much information. To this end, it is reasonable to find out the intrinsic dimensionality of the data. In this paper, one of the global estimators of intrinsic dimensionality, the packing numbers estimator (PNE), is explored experimentally. We propose a modification of the PNE method that uses geodesic distances in order to improve its estimates of the intrinsic dimensionality

    Supervised classification via constrained subspace and tensor sparse representation

    SRC, a supervised classifier via sparse representation, has rapidly gained popularity in recent years and can be adapted to a wide range of applications based on the sparse solution of a linear system. First, we offer an intuitive geometric model called constrained subspace to explain the mechanism of SRC. The constrained subspace model connects the dots of NN, NFL, NS, NM. Then, inspired from the constrained subspace model, we extend SRC to its tensor-based variant, which takes as input samples of high-order tensors which are elements of an algebraic ring. A tensor sparse representation is used for query tensors. We verify in our experiments on several publicly available databases that the tensor-based SRC called tSRC outperforms traditional SRC in classification accuracy. Although demonstrated for image recognition, tSRC is easily adapted to other applications involving underdetermined linear systems

    Intrinsic dimension estimation for locally undersampled data

    Identifying the minimal number of parameters needed to describe a dataset is a challenging problem known in the literature as intrinsic dimension estimation. None of the existing intrinsic dimension estimators is reliable when the dataset is locally undersampled, and this is at the core of the so-called curse of dimensionality. Here we introduce a new intrinsic dimension estimator that leverages simple properties of the tangent space of a manifold and extends the usual correlation integral estimator to alleviate the extreme undersampling problem. Based on this insight, we explore a multiscale generalization of the algorithm that is capable of (i) identifying multiple dimensionalities in a dataset, and (ii) providing accurate estimates of the intrinsic dimension of extremely curved manifolds. We test the method on manifolds generated from global transformations of high-contrast images, relevant for invariant object recognition and considered a challenge for state-of-the-art intrinsic dimension estimators
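For context, the "usual correlation integral estimator" that this abstract extends can be sketched as follows. This is only the classical baseline (often attributed to Grassberger and Procaccia), not the new estimator the paper proposes. It assumes the correlation integral C(r) is the fraction of point pairs closer than r and reads the dimension off the slope of log C(r) against log r between two radii. The function names, the radii, and the test data are illustrative.

```python
import math
import random
from itertools import combinations


def correlation_integral(dists, r):
    """Fraction of pairwise distances strictly smaller than r."""
    return sum(d < r for d in dists) / len(dists)


def correlation_dimension(points, r1, r2):
    """Two-radius slope estimate of log C(r) versus log r.

    Assumes both radii capture at least one pair, otherwise log(0) fails.
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    c1 = correlation_integral(dists, r1)
    c2 = correlation_integral(dists, r2)
    return (math.log(c2) - math.log(c1)) / (math.log(r2) - math.log(r1))


# Hypothetical data: a flat 2-D square embedded in 5-D space.
rng = random.Random(0)
plane_in_5d = [[rng.random(), rng.random(), 0.0, 0.0, 0.0] for _ in range(400)]
print(correlation_dimension(plane_in_5d, r1=0.05, r2=0.1))
```

On well-sampled data like this, the estimate should land near 2 even though the ambient dimension is 5. When few pairs fall inside the smaller radius the slope becomes noisy, which is the undersampling regime the abstract is concerned with.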

    ATHMoS: Automated Telemetry Health Monitoring System at GSOC using Outlier Detection and Supervised Machine Learning

    Knowing which telemetry parameters are behaving as expected and which are behaving out of the ordinary is vital for continued mission success. For a large number of different parameters, it is not possible to monitor all of them manually. One of the simplest methods of monitoring the behavior of telemetry is the Out Of Limit (OOL) check, which monitors whether a value exceeds its upper or lower limit. A fundamental problem occurs when a telemetry parameter is showing signs of abnormal behavior, yet the values are not extreme enough for the OOL check to detect the problem. By the time the OOL threshold is reached, it could be too late for the operators to react. To solve this problem, the Automated Telemetry Health Monitoring System (ATHMoS) is in development at the German Space Operation Center (GSOC). At the heart of the framework is a novel algorithm for statistical outlier detection which makes use of the so-called Intrinsic Dimensionality (ID) of a data set. Using an ID measure as the core data mining technique allows us not only to run ATHMoS on a parameter-by-parameter basis, but also to monitor and flag anomalies for multi-parameter interactions. By aggregating past telemetry data and employing these techniques, ATHMoS uses a supervised machine learning approach to construct three databases: Historic Nominal data, Recent Nominal data and past Anomaly data. Once new telemetry is received, the algorithm distinguishes between nominal behaviour and new, potentially dangerous behaviour, the latter of which is then flagged to mission engineers. ATHMoS continually learns to distinguish between new nominal behavior and true anomaly events throughout the mission lifetime. To this end, we present an overview of the algorithms ATHMoS uses, as well as an example in which we successfully detected both previously unknown and known anomalies for an ongoing mission at GSOC
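The limitation of the OOL check described in this abstract can be illustrated with a small sketch. The function name, the limits, and the telemetry values are all hypothetical and are not taken from ATHMoS. The check is assumed to flag a sample only when it falls outside a fixed closed interval.

```python
def out_of_limit(values, lower, upper):
    """Return the indices of samples outside the closed interval [lower, upper]."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical telemetry: stable at first, then a steady upward drift that
# stays inside the limits until the very last sample.
telemetry = [20.0, 20.1, 20.0, 21.5, 23.0, 24.5, 26.0, 27.5, 29.0, 30.5]

flagged = out_of_limit(telemetry, lower=10.0, upper=30.0)
print(flagged)  # prints [9]
```

The drift starts at index 3, but the check only fires at index 9, after six abnormal samples have already passed unflagged. That delay is the gap an outlier-detection approach is meant to close.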