Intrinsic dimension of a dataset: what properties does one expect?
We propose an axiomatic approach to the concept of an intrinsic dimension of
a dataset, based on a viewpoint of geometry of high-dimensional structures. Our
first axiom postulates that high values of dimension be indicative of the
presence of the curse of dimensionality (in a certain precise mathematical
sense). The second axiom requires the dimension to depend smoothly on a
distance between datasets (so that the dimension of a dataset and that of an
approximating principal manifold would be close to each other). The third axiom
is a normalization condition: the dimension of the Euclidean $n$-sphere $\mathbb{S}^n$
is $n$. We give an example of a dimension function satisfying our
axioms, even though it is in general computationally unfeasible, and discuss a
computationally cheap function satisfying most but not all of our axioms (the
``intrinsic dimensionality'' of Ch\'avez et al.).
Comment: 6 pages, 6 figures, 1 table, latex with IEEE macros, final submission
to Proceedings of the 22nd IJCNN (Orlando, FL, August 12-17, 2007).
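The computationally cheap measure mentioned above, the ``intrinsic dimensionality'' of Ch\'avez et al., is defined from the histogram of pairwise distances as $\rho = \mu^2 / (2\sigma^2)$, where $\mu$ and $\sigma^2$ are the mean and variance of the pairwise distances. A minimal pure-Python sketch (the sample data below are purely illustrative):

```python
import math
import itertools

def chavez_intrinsic_dimensionality(points):
    """Intrinsic dimensionality rho = mu^2 / (2 sigma^2) of the
    pairwise-distance distribution (Chavez et al.)."""
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    mu = sum(dists) / len(dists)
    var = sum((d - mu) ** 2 for d in dists) / len(dists)
    return mu * mu / (2.0 * var)

# Points concentrated along a line yield a lower rho than points
# spread over a higher-dimensional region.
line = [(0.01 * i, 0.0) for i in range(100)]
print(chavez_intrinsic_dimensionality(line))  # close to 1
```

Note that $\rho$ grows with the concentration of the distance histogram, which is exactly the curse-of-dimensionality symptom the first axiom above refers to.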
A New Estimator of Intrinsic Dimension Based on the Multipoint Morisita Index
The size of datasets has been increasing rapidly both in terms of number of
variables and number of events. As a result, the empty space phenomenon and the
curse of dimensionality complicate the extraction of useful information. But,
in general, data lie on non-linear manifolds of much lower dimension than that
of the spaces in which they are embedded. In many pattern recognition tasks,
learning these manifolds is a key issue and it requires the knowledge of their
true intrinsic dimension. This paper introduces a new estimator of intrinsic
dimension based on the multipoint Morisita index. It is applied to both
synthetic and real datasets of varying complexities and comparisons with other
existing estimators are carried out. The proposed estimator turns out to be
fairly robust to sample size and noise, unaffected by edge effects, able to
handle large datasets, and computationally efficient.
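The two-point Morisita index $I_2(\delta)$ partitions the embedding space into cells of side $\delta$ and compares cell co-occupancy against a random pattern; for a $D$-dimensional manifold in an $E$-dimensional space it scales roughly as $I_2(\delta) \propto \delta^{D-E}$, so the slope of $\log I_2$ versus $\log \delta$ carries the intrinsic dimension. The sketch below is an illustrative pure-Python rendition of this scaling argument, not the paper's exact multipoint estimator; the cell sizes and data are invented:

```python
import math
from collections import Counter

def morisita_I2(points, delta):
    """Two-point Morisita index at cell size delta, for points in [0,1)^E."""
    E = len(points[0])
    q_per_axis = int(round(1.0 / delta))
    Q = q_per_axis ** E  # total number of grid cells
    counts = Counter(
        tuple(min(int(c / delta), q_per_axis - 1) for c in p) for p in points
    )
    N = len(points)
    return Q * sum(n * (n - 1) for n in counts.values()) / (N * (N - 1))

def morisita_dimension(points, deltas):
    """Least-squares slope of log I_2 vs log delta; under the scaling
    I_2 ~ delta^(D-E), the intrinsic dimension is roughly D = E + slope."""
    xs = [math.log(d) for d in deltas]
    ys = [math.log(morisita_I2(points, d)) for d in deltas]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
            sum((x - xbar) ** 2 for x in xs)
    return len(points[0]) + slope

# Points on a line (D = 1) embedded in the unit square (E = 2).
line = [(i / 512.0, i / 512.0) for i in range(512)]
print(morisita_dimension(line, [1/4, 1/8, 1/16, 1/32]))  # close to 1
```

The grid-based counting is what makes this family of estimators cheap on large datasets: one pass per cell size, with no pairwise distance matrix.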
Dimension Detection with Local Homology
Detecting the dimension of a hidden manifold from a point sample has become
an important problem in the current data-driven era. Indeed, estimating the
shape dimension is often the first step in studying the processes or phenomena
associated with the data. Among the many dimension detection algorithms proposed
in various fields, a few can provide theoretical guarantees on the correctness
of the estimated dimension. However, the correctness usually requires certain
regularity of the input: the input points are either uniformly randomly sampled
in a statistical setting, or they form the so-called
$\varepsilon$-sample, which can be neither too dense nor too sparse.
Here, we propose a purely topological technique to detect dimensions. Our
algorithm is provably correct and works under a more relaxed sampling
condition: we do not require uniformity, and we also allow Hausdorff noise. Our
approach detects dimension by determining local homology. The computation of
this topological structure is much less sensitive to the local distribution of
points, which leads to the relaxation of the sampling conditions. Furthermore,
by leveraging various developments in computational topology, we show that this
local homology at a point $z$ can be computed \emph{exactly} for manifolds
using Vietoris-Rips complexes whose vertices are confined within a local
neighborhood of $z$. We implement our algorithm and demonstrate the accuracy
and robustness of our method using both synthetic and real data sets.
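As a toy illustration of why local topology reflects dimension (this is not the paper's algorithm, which computes local homology exactly via Vietoris-Rips complexes): a small annulus around a point of a $d$-manifold deformation-retracts to the link $S^{d-1}$, so for a curve ($d=1$) the annulus has two connected components, while for a surface ($d=2$) it has one. Even the zeroth homology of an $\varepsilon$-neighborhood graph on the annulus separates these two cases; all radii below are illustrative choices:

```python
import math

def annulus_components(points, center, r_in, r_out, eps):
    """Connected components of the eps-neighborhood graph on the points
    lying in the annulus r_in < |p - center| < r_out (union-find)."""
    ring = [p for p in points if r_in < math.dist(p, center) < r_out]
    parent = list(range(len(ring)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(ring)):
        for j in range(i + 1, len(ring)):
            if math.dist(ring[i], ring[j]) < eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(ring))})

# A curve (1-manifold): the annulus around a point splits into two arcs.
circle = [(math.cos(0.02 * k), math.sin(0.02 * k)) for k in range(315)]
print(annulus_components(circle, (1.0, 0.0), 0.2, 0.5, 0.1))  # 2

# A surface patch (2-manifold): the annulus is a single connected ring.
grid = [(0.05 * i - 1.0, 0.05 * j - 1.0) for i in range(41) for j in range(41)]
print(annulus_components(grid, (0.0, 0.0), 0.2, 0.5, 0.1))  # 1
```

Counting components only distinguishes $d=1$ from $d \geq 2$; the paper's use of full local homology (higher homology groups of the link) is what recovers the dimension in general.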
Geodesic distances in the intrinsic dimensionality estimation using packing numbers
Dimensionality reduction is an important tool in data mining, and the intrinsic dimensionality of a data set is a key parameter in many dimensionality reduction algorithms. When the intrinsic dimensionality of a data set is known, the dimensionality of the data can be reduced without losing much information, so estimating it is a natural first step. In this paper, one of the global estimators of intrinsic dimensionality, the packing numbers estimator (PNE), is explored experimentally. We propose a modification of the PNE method that uses geodesic distances in order to improve its estimates of the intrinsic dimensionality.
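The packing numbers estimator exploits the scaling $M(r) \propto r^{-D}$ of the $r$-packing number $M(r)$ (the largest number of points with pairwise distances greater than $r$), giving $\hat D = -\big(\log M(r_2) - \log M(r_1)\big) / \big(\log r_2 - \log r_1\big)$. Below is a greedy-packing sketch in pure Python using Euclidean distances; the modification proposed in the paper would replace these with geodesic (neighborhood-graph) distances, and the radii here are illustrative:

```python
import math

def greedy_packing_number(points, r):
    """Greedy estimate of the r-packing number: keep a point if it lies
    farther than r from every center kept so far."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > r for c in centers):
            centers.append(p)
    return len(centers)

def packing_dimension(points, r1, r2):
    """PNE-style dimension estimate from packing numbers at scales r1 < r2."""
    m1 = greedy_packing_number(points, r1)
    m2 = greedy_packing_number(points, r2)
    return -(math.log(m2) - math.log(m1)) / (math.log(r2) - math.log(r1))

# A circle is a 1-manifold embedded in the plane: the estimate is near 1.
circle = [(math.cos(0.001 * k), math.sin(0.001 * k)) for k in range(6284)]
print(packing_dimension(circle, 0.05, 0.1))  # close to 1
```

Greedy packing only lower-bounds the true packing number, which is one reason practical implementations average the estimate over permutations of the data and over several scale pairs.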
Supervised classification via constrained subspace and tensor sparse representation
SRC, a supervised classifier via sparse representation,
has rapidly gained popularity in recent years and can be
adapted to a wide range of applications based on the sparse
solution of a linear system. First, we offer an intuitive geometric
model, the constrained subspace, to explain the mechanism
of SRC. The constrained subspace model connects the dots among the
nearest neighbor (NN), nearest feature line (NFL), nearest subspace (NS),
and nearest mean (NM) classifiers. Then, inspired by the constrained
subspace model, we extend SRC to a tensor-based variant
whose input samples are high-order tensors, elements of an
algebraic ring. A tensor sparse representation is
used for query tensors. We verify in our experiments on several
publicly available databases that the tensor-based SRC called
tSRC outperforms traditional SRC in classification accuracy.
Although demonstrated for image recognition, tSRC is easily
adapted to other applications involving underdetermined linear
systems.
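SRC represents a query as a sparse combination of training samples stacked as dictionary columns, then assigns the class whose own samples yield the smallest reconstruction residual. The original formulation solves an $\ell_1$-minimization; the sketch below substitutes a simple orthogonal matching pursuit for the sparse solve (a simplification for brevity, not the paper's solver), and the dictionary is toy data:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gauss_solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def omp(atoms, y, k):
    """Orthogonal matching pursuit: greedy sparse solve with k unit-norm
    atoms (stands in for the l1-minimization used by SRC proper)."""
    support, coef = [], [0.0] * len(atoms)
    residual = list(y)
    for _ in range(k):
        i = max((j for j in range(len(atoms)) if j not in support),
                key=lambda j: abs(dot(atoms[j], residual)))
        support.append(i)
        G = [[dot(atoms[a], atoms[b]) for b in support] for a in support]
        x = gauss_solve(G, [dot(atoms[a], y) for a in support])
        for a, v in zip(support, x):
            coef[a] = v
        residual = [yi - sum(coef[j] * atoms[j][d] for j in support)
                    for d, yi in enumerate(y)]
    return coef

def src_classify(atoms, labels, y, k=2):
    """Assign y to the class whose atoms best reconstruct it."""
    coef = omp(atoms, y, k)
    best_class, best_res = None, float("inf")
    for c in set(labels):
        approx = [sum(coef[j] * atoms[j][d]
                      for j in range(len(atoms)) if labels[j] == c)
                  for d in range(len(y))]
        res = math.dist(y, approx)
        if res < best_res:
            best_class, best_res = c, res
    return best_class

# Toy dictionary: two unit-norm training atoms per class.
r2 = math.sqrt(2) / 2
atoms = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (r2, 0, r2)]
labels = [0, 0, 1, 1]
print(src_classify(atoms, labels, (0.9, 0.3, 0.0)))  # 0
print(src_classify(atoms, labels, (0.1, 0.0, 0.99)))  # 1
```

The tensor variant tSRC keeps the same classify-by-residual logic but replaces the vector dictionary and matrix-vector products with their ring-structured tensor counterparts.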
Intrinsic dimension estimation for locally undersampled data
Identifying the minimal number of parameters needed to describe a dataset is a challenging problem known in the literature as intrinsic dimension estimation. Existing intrinsic dimension estimators become unreliable whenever the dataset is locally undersampled, and this is at the core of the so-called curse of dimensionality. Here we introduce a new intrinsic dimension estimator that leverages simple properties of the tangent space of a manifold and extends the usual correlation integral estimator to alleviate the extreme undersampling problem. Based on this insight, we explore a multiscale generalization of the algorithm that is capable of (i) identifying multiple dimensionalities in a dataset, and (ii) providing accurate estimates of the intrinsic dimension of extremely curved manifolds. We test the method on manifolds generated from global transformations of high-contrast images, which are relevant for invariant object recognition and considered a challenge for state-of-the-art intrinsic dimension estimators.
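The baseline being extended here is the correlation integral estimator (Grassberger-Procaccia): the fraction of point pairs within distance $r$ scales as $C(r) \propto r^D$, so the slope of $\log C(r)$ against $\log r$ estimates the intrinsic dimension $D$. A pure-Python sketch, with illustrative radii and data (the paper's contribution lies in correcting this estimator under local undersampling, which this sketch does not do):

```python
import math
import itertools

def correlation_integral(points, r):
    """Fraction of point pairs at distance below r."""
    pairs = list(itertools.combinations(points, 2))
    return sum(1 for p, q in pairs if math.dist(p, q) < r) / len(pairs)

def correlation_dimension(points, r1, r2):
    """Two-scale slope of log C(r) vs log r."""
    c1 = correlation_integral(points, r1)
    c2 = correlation_integral(points, r2)
    return (math.log(c2) - math.log(c1)) / (math.log(r2) - math.log(r1))

# A circle is a 1-dimensional manifold embedded in the plane.
circle = [(math.cos(0.01 * k), math.sin(0.01 * k)) for k in range(629)]
print(correlation_dimension(circle, 0.1, 0.2))  # close to 1
```

The undersampling failure mode is visible in this formula: when few pairs fall below $r$, $C(r)$ is dominated by sampling noise and the slope becomes meaningless, which is precisely the regime the proposed estimator targets.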
ATHMoS: Automated Telemetry Health Monitoring System at GSOC using Outlier Detection and Supervised Machine Learning
Knowing which telemetry parameters are behaving as expected and which are behaving out of the ordinary is vital information for continued mission success. For a large number of different parameters, it is not possible to monitor all of them manually. One of the simplest methods of monitoring the behavior of telemetry is the Out Of Limit (OOL) check, which monitors whether a value exceeds its upper or lower limit. A fundamental problem occurs when a telemetry parameter is showing signs of abnormal behavior, yet the values are not extreme enough for the OOL check to detect the problem. By the time the OOL threshold is reached, it could be too late for the operators to react.
To solve this problem, the Automated Telemetry Health Monitoring System (ATHMoS) is in development at the German Space Operations Center (GSOC). At the heart of the framework is a novel algorithm for statistical outlier detection which makes use of the so-called Intrinsic Dimensionality (ID) of a data set. Using an ID measure as the core data mining technique allows us not only to run ATHMoS on a parameter-by-parameter basis, but also to monitor and flag anomalies for multi-parameter interactions.
By aggregating past telemetry data and employing these techniques, ATHMoS uses a supervised machine learning approach to construct three databases: Historic Nominal data, Recent Nominal data, and past Anomaly data. Once new telemetry is received, the algorithm distinguishes between nominal behavior and new, potentially dangerous behavior; the latter is then flagged to mission engineers. ATHMoS continually learns to distinguish between new nominal behavior and true anomaly events throughout the mission lifetime. To this end, we present an overview of the algorithms ATHMoS uses, as well as an example where we successfully detected both previously unknown and known anomalies for an ongoing mission at GSOC.
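The limitation of the OOL check described above is easy to illustrate: a value can stay inside its hard limits while drifting far from the parameter's nominal history. A minimal sketch follows, where the thresholds, limits, and telemetry values are all invented for illustration, and a simple z-score stands in for ATHMoS's intrinsic-dimensionality-based outlier score:

```python
import math

def ool_check(value, lower, upper):
    """Classic Out Of Limit check: flag only hard-limit violations."""
    return not (lower <= value <= upper)

def zscore_outlier(value, nominal_history, threshold=5.0):
    """Flag values far from the nominal history, in standard deviations."""
    n = len(nominal_history)
    mu = sum(nominal_history) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in nominal_history) / n)
    return abs(value - mu) / sigma > threshold

# A temperature reading drifting upward: still within its [0, 40] limits,
# but far outside its nominal history around 20 degrees.
history = [20.0 + 0.1 * math.sin(k) for k in range(200)]
reading = 28.0
print(ool_check(reading, 0.0, 40.0))     # False: the OOL check misses the drift
print(zscore_outlier(reading, history))  # True: the statistical check flags it
```

The separation into historic-nominal, recent-nominal, and anomaly databases then lets such a statistical check adapt as new nominal behavior accumulates, rather than flagging every slow seasonal drift forever.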