Data segmentation based on the local intrinsic dimension
One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
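The abstract's segmentation procedure is more involved than can be shown here, but the flavor of a per-point ID statistic can be sketched with the closely related two-NN ratio estimator of Facco et al.: the ratio of each point's second to first nearest-neighbour distance follows a Pareto law whose exponent is the ID. This is a minimal illustration, not the paper's actual algorithm; the Gaussian test data and all parameter values are purely illustrative.

```python
import math
import random

def two_nn_id(points):
    """Id estimate from the ratio mu = r2/r1 of each point's two nearest-
    neighbour distances: on a d-dimensional manifold mu is Pareto with
    exponent d, so the maximum-likelihood estimate is N / sum(log mu)."""
    log_mu = []
    for i, p in enumerate(points):
        # Distances to all other points; keep the two smallest.
        r1, r2 = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:2]
        if r1 > 0.0:
            log_mu.append(math.log(r2 / r1))
    return len(log_mu) / sum(log_mu)

random.seed(0)
# 500 points from a 2-d Gaussian: the estimate should land close to 2.
sample = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]
d_hat = two_nn_id(sample)
```

Computing the same statistic over local neighbourhoods, rather than globally, yields the kind of per-region ID signal the paper segments on.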
Multiscale Geometric Methods for Data Sets I: Multiscale SVD, Noise and Curvature
Large data sets are often modeled as being noisy samples from probability distributions in R^D, with D large. It has been noticed that oftentimes the support M of these probability distributions seems to be well-approximated by low-dimensional sets, perhaps even by manifolds. We shall consider sets that are locally well approximated by k-dimensional planes, with k << D, with k-dimensional manifolds isometrically embedded in R^D being a special case. Samples from this distribution are furthermore corrupted by D-dimensional noise. Certain tools from multiscale geometric measure theory and harmonic analysis seem well-suited to be adapted to the study of samples from such probability distributions, in order to yield quantitative geometric information about them. In this paper we introduce and study multiscale covariance matrices, i.e. covariances corresponding to the distribution restricted to a ball of radius r, with a fixed center and varying r, and under rather general geometric assumptions we study how their empirical, noisy counterparts behave. We prove that in the range of scales where these covariance matrices are most informative, the empirical, noisy covariances are close to their expected, noiseless counterparts. In fact, this is true as soon as the number of samples in the balls where the covariance matrices are computed is linear in the intrinsic dimension of M. As an application, we present an algorithm for estimating the intrinsic dimension of M.
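A single-scale slice of this construction can be sketched directly: restrict the sample to a ball of radius r and inspect the covariance spectrum. In the informative scale range, k "tangent" eigenvalues of order r^2 separate from the D − k noise eigenvalues of order sigma^2, so k can be read off the largest spectral gap. This is a toy sketch under those assumptions, not the paper's algorithm; the circle data, radius, and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, sigma = 10, 2000, 0.01
# Noisy sample from a 1-dimensional manifold (a circle) embedded in R^D.
t = rng.uniform(0.0, 2.0 * np.pi, n)
X = np.zeros((n, D))
X[:, 0], X[:, 1] = np.cos(t), np.sin(t)
X += sigma * rng.standard_normal((n, D))     # ambient D-dimensional noise

def local_cov_spectrum(X, center, r):
    """Eigenvalues (descending) of the covariance of X restricted to the
    ball of radius r around `center`."""
    ball = X[np.linalg.norm(X - center, axis=1) <= r]
    cov = np.cov(ball, rowvar=False)
    return np.sort(np.linalg.eigvalsh(cov))[::-1]

# One tangent eigenvalue of order r^2 should dominate the D-1 noise
# eigenvalues of order sigma^2, so the largest gap sits after position 1.
ev = local_cov_spectrum(X, X[0], r=0.3)
k_hat = int(np.argmax(ev[:-1] / ev[1:])) + 1
```

Sweeping r, as the paper does, reveals the range of scales over which this gap (and hence the estimate of k) is stable.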
Foundations of a Multi-way Spectral Clustering Framework for Hybrid Linear Modeling
The problem of Hybrid Linear Modeling (HLM) is to model and segment data using a mixture of affine subspaces. Different strategies have been proposed to solve this problem; however, rigorous analysis justifying their performance is missing. This paper suggests the Theoretical Spectral Curvature Clustering (TSCC) algorithm for solving the HLM problem, and provides careful analysis to justify it. The TSCC algorithm is essentially a combination of Govindu's multi-way spectral clustering framework (CVPR 2005) and Ng et al.'s spectral clustering algorithm (NIPS 2001). The main result of this paper states that if the given data is sampled from a mixture of distributions concentrated around affine subspaces, then with high sampling probability the TSCC algorithm segments the different underlying clusters well. The goodness of clustering depends on the within-cluster errors, the between-cluster interactions, and a tuning parameter applied by TSCC. The proof also provides new insights for the analysis of Ng et al. (NIPS 2001).
Comment: 40 pages. Minor changes to the previous version (mainly revised Sections 2.2 & 2.3, and added references). Accepted to the Journal of Foundations of Computational Mathematics.
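The Ng-Jordan-Weiss pipeline that TSCC builds on can be sketched in a few lines: Gaussian affinity, symmetric degree normalization, top-k eigenvectors, row normalization. This is a toy pairwise-affinity version, not TSCC's curvature-based multi-way affinity; the blob data, sigma, and the dot-product labeling shortcut (standing in for the usual k-means step) are all illustrative choices.

```python
import numpy as np

def njw_embedding(X, sigma=1.0, k=2):
    """Ng-Jordan-Weiss spectral embedding: Gaussian affinity, symmetric
    degree normalization, eigenvectors of the k largest eigenvalues,
    then row normalization."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv = 1.0 / np.sqrt(W.sum(axis=1))
    L = d_inv[:, None] * W * d_inv[None, :]   # D^{-1/2} W D^{-1/2}
    U = np.linalg.eigh(L)[1][:, -k:]          # eigh sorts ascending
    return U / np.linalg.norm(U, axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs; rows of the embedding are nearly
# identical within a cluster and nearly orthogonal across clusters, so a
# dot product with one reference row already separates them (in general
# a k-means step on the rows is used instead).
X = np.vstack([rng.normal(0.0, 0.3, (40, 2)),
               rng.normal(0.0, 0.3, (40, 2)) + [5.0, 0.0]])
U = njw_embedding(X)
labels = (U @ U[0] > 0.5).astype(int)
```

TSCC replaces the pairwise Gaussian affinity with a multi-way curvature affinity suited to affine subspaces, which is where the paper's analysis lives.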
NOVEL TECHNIQUES FOR INTRINSIC DIMENSION ESTIMATION
Since the 1950s, the rapid pace of technological advances has made it possible to measure and record ever-increasing amounts of data, motivating the urgent need for dimensionality reduction techniques that can be applied to datasets comprising high-dimensional points.
To this aim, a fundamental piece of information is provided by the intrinsic dimension (id), defined by Bennett [1] as the minimum number of parameters needed to generate a description of the data that maintains the "intrinsic" structure characterizing the dataset, so that the information loss is minimized.
More recently, a quite intuitive definition employed by several authors in the past has been reported by Bishop in [2], where the author writes that "a set in D dimensions is said to have an id equal to d if the data lies entirely within a d-dimensional subspace of D".
Though more specific and differing id definitions have been proposed in different research fields, throughout the pattern recognition literature the presently prevailing definition views a point set as a sample drawn uniformly from an unknown smooth (or locally smooth) manifold, possibly embedded in a higher-dimensional space through a non-linear smooth mapping; in this case, the id to be estimated is the manifold's topological dimension.
Due to the importance of the id in several theoretical and practical application fields, in the last two decades a great deal of research effort has been devoted to the development of effective id estimators. Though several techniques have been proposed in the literature, the problem is still open, for the following main reasons.
First, it must be highlighted that though Lebesgue's definition of topological dimension (reported in [5]) is quite clear, in practice its estimation is difficult when only a finite set of points is available. Therefore, id estimation techniques proposed in the literature are either founded on different notions of dimension (e.g. fractal dimensions) that approximate the topological one, or on various techniques aimed at preserving the characteristics of data-neighborhood distributions, which reflect the topology of the underlying manifold. Besides, the estimated id value changes markedly with the scale used to analyze the input dataset, and, the number of available points being limited in practice, several methods underestimate the id when its value is sufficiently high (namely id ≳ 10). Other serious problems arise when the dataset is embedded in a higher-dimensional space through a non-linear map. Finally, the excessive computational complexity of most estimators makes them impractical when the need is to process datasets comprising huge amounts of high-dimensional data.
The main subject of this thesis work is the development of efficient and effective id estimators. Precisely, two novel estimators, named MiND (Minimum Neighbor Distance estimator of intrinsic dimension, [6]) and DANCo (Dimensionality from Angle and Norm Concentration, [4]), are described. These techniques are based on the exploitation of statistics characterizing the hidden structure of high-dimensional spaces, such as the distributions of norms and angles, which are informative of the id and can therefore be exploited for its estimation. A simple practical example showing the informative power of these features is the clustering system proposed in [3]: based on the assumption that each class is represented by one manifold, the clustering procedure codes the input data by means of local id estimates and features related to them. This coding makes it possible to obtain reliable results by applying classic, basic clustering algorithms.
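MiND and DANCo fit full distributional models of neighbour distances and angles; as a minimal illustration of why nearest-neighbour distance statistics carry id information, here is the closely related Levina-Bickel maximum-likelihood estimator. This is a sketch of the general idea, not the thesis's algorithms; the cube dataset, k, and seed are illustrative.

```python
import heapq
import math
import random

def mle_id(points, k=10):
    """Levina-Bickel maximum-likelihood id estimate: for each point, the
    inverse mean log-ratio of its k-th to j-th nearest-neighbour
    distances, averaged over all points."""
    inv = []
    for i, p in enumerate(points):
        # k smallest distances from p to the other points, ascending.
        d = heapq.nsmallest(
            k, (math.dist(p, q) for j, q in enumerate(points) if j != i))
        inv.append((k - 1) / sum(math.log(d[-1] / d[j]) for j in range(k - 1)))
    return sum(inv) / len(inv)

random.seed(1)
# 500 points uniform on a 3-d cube, embedded in R^8 by freezing 5 coordinates.
cube = [tuple(random.random() for _ in range(3)) + (0.0,) * 5
        for _ in range(500)]
d_hat = mle_id(cube)       # should land near 3
```

The thesis's estimators refine this picture by correcting exactly the kind of negative bias such distance-based statistics exhibit at high id.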
To evaluate the proposed estimators by objectively comparing them with relevant state-of-the-art techniques, a benchmark framework is proposed. The need for this framework is highlighted by the fact that in the literature each method has been assessed on different datasets and with different evaluation measures; it is therefore difficult to provide an objective comparison solely by analyzing the results reported by the authors. Based on this observation, the proposed benchmark employs publicly available synthetic and real datasets that have been used by several authors in the literature for their interesting and challenging peculiarities. Moreover, some synthetic datasets have been added to test the estimators' performance more thoroughly on high-dimensional datasets characterized by a similarly high id. The application of this benchmark has been shown to provide an objective comparative assessment in terms of robustness w.r.t. parameter settings, high-dimensional datasets, datasets characterized by a high intrinsic dimension, and noisy datasets. The achieved results show that DANCo provides the most reliable estimates on both synthetic and real datasets.
The thesis is organized as follows: in Chapter 1, a brief theoretical description of the various definitions of dimension is presented, along with the problems related to id estimation and interesting application domains that profitably exploit knowledge of the id; in Chapter 2, notable state-of-the-art id estimators are surveyed and grouped according to the employed methods; in Chapter 3, MiND and DANCo are described; in Chapter 4, after summarizing the most commonly used experimental settings, we propose a benchmark framework and employ it to objectively assess and compare relevant intrinsic dimensionality estimators; in Chapter 5, conclusions and open research problems are briefly reported.
References
[1] R. S. Bennett. The Intrinsic Dimensionality of Signal Collections. IEEE Trans. on Information Theory, IT-15(5):517–525, 1969.
[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1995.
[3] P. Campadelli, E. Casiraghi, C. Ceruti, G. Lombardi, and A. Rozza. Local intrinsic dimensionality based features for clustering. In Alfredo Petrosino, editor, ICIAP (1), volume 8156 of Lecture Notes in Computer Science, pages 41–50. Springer, 2013.
[4] C. Ceruti, S. Bassis, A. Rozza, G. Lombardi, E. Casiraghi, and P. Campadelli. DANCo: an intrinsic Dimensionality estimator exploiting Angle and Norm Concentration. Pattern Recognition, 2014.
[5] M. Katetov and P. Simon. Origins of dimension theory. Handbook of the History of General Topology, 1997.
[6] A. Rozza, G. Lombardi, C. Ceruti, E. Casiraghi, and P. Campadelli. Novel high intrinsic dimensionality estimators. Machine Learning Journal, 89(1-2):37–65, May 2012.
Translated poisson mixture model for stratification learning
A framework for the regularized and robust estimation of non-uniform dimensionality and density in high-dimensional noisy data is introduced in this work. This leads to learning stratifications, that is, mixtures of manifolds representing different characteristics and complexities in the data set. The basic idea relies on modeling the high-dimensional sample points as a process of Translated Poisson mixtures, with regularizing restrictions, leading to a model which accounts for the presence of noise. Theoretical asymptotic results for the model are presented as well. The presentation of the theoretical framework is complemented with artificial and real examples showing the importance of regularized stratification learning in high-dimensional data analysis in general, and in computer vision and image analysis in particular.