The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search
This paper reconsiders common benchmarking approaches to nearest neighbor
search. It is shown that the concept of local intrinsic dimensionality (LID)
makes it possible to choose query sets spanning a wide range of difficulty for
real-world datasets. Moreover, the effect of different LID distributions on the
running-time performance of implementations is studied empirically. To this
end, different visualization concepts are introduced that give a more
fine-grained overview of the inner workings of nearest neighbor search
principles. The paper closes with remarks about the diversity of datasets
commonly used for nearest neighbor search benchmarking. It is shown that such
real-world datasets are not diverse: results on a single dataset predict
results on all other datasets well.
Comment: Preprint of the paper accepted at SISAP 201
On the Selection of Anchors and Targets for Video Hyperlinking
A problem not well understood in video hyperlinking is what qualifies a
fragment as an anchor or target. Ideally, anchors provide good starting points
for navigation, and targets supplement anchors with additional details without
distracting users with irrelevant, false, or redundant information. The
problem is not trivial because of the intertwined relationship between data
characteristics and user expectations. Imagine that in a large dataset there
are clusters of fragments spread over the feature space. The nature of each
cluster can be described by its size (implying popularity) and structure
(implying complexity). A principled way of hyperlinking can be carried out by
picking the centers of clusters as anchors and from there reaching out to
targets within or outside the clusters, taking neighborhood complexity into
account. The question is which fragments should be selected as anchors or
targets so as to reflect the rich content of a dataset while minimizing the
risk of a frustrating user experience. This paper provides some insights into
this question from the perspective of hubness and local intrinsic
dimensionality, two statistical properties for assessing the popularity and
complexity of a data space. Based on these properties, two novel algorithms
are proposed for low-risk automatic selection of anchors and targets.
Comment: ACM International Conference on Multimedia Retrieval (ICMR), 2017.
(Oral
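Hubness can be measured directly as the k-occurrence N_k: how often a fragment appears in other fragments' k-NN lists. A brute-force sketch of this statistic follows; the function name and the raw feature-vector input are assumptions, and the paper's anchor/target selection builds on this statistic in a more involved way:

```python
import numpy as np

def k_occurrence(X, k=10):
    """N_k hubness score: count how often each point appears among the
    k nearest neighbors of the other points (brute-force O(n^2))."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    nn = np.argsort(D, axis=1)[:, :k]    # k-NN list of every point
    return np.bincount(nn.ravel(), minlength=len(X))
```

Fragments with large N_k are natural hub (anchor) candidates, since many other fragments already point to them.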
Dimension Estimation Using Random Connection Models
Information about intrinsic dimension is crucial to perform dimensionality
reduction, compress information, design efficient algorithms, and do
statistical adaptation. In this paper we propose an estimator for the intrinsic
dimension of a data set. The estimator is based on binary neighbourhood
information about the observations in the form of two adjacency matrices, and
does not require any explicit distance information. The underlying graph is
modelled according to a subset of a specific random connection model, sometimes
referred to as the Poisson blob model. Computationally, the estimator scales
like n log n, and we specify its asymptotic distribution and rate of
convergence. Experiments on both real and simulated data show that our
approach compares favourably with several competing methods from the
literature, including approaches that rely on distance information.
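The key idea, that purely binary adjacency information suffices, can be illustrated with a much cruder estimator than the paper's: for locally uniform data, the expected degree in an r-radius neighborhood graph grows like r^d, so comparing average degrees at radii r and 2r yields a dimension estimate. Everything here (the function name, the specific two-radius trick) is an illustrative assumption, not the paper's construction:

```python
import numpy as np

def dim_from_adjacency(X, r):
    """Crude dimension estimate using only two adjacency matrices
    (radii r and 2r): average degree scales like r^d for uniform data,
    so d is roughly log2 of the degree ratio."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    deg_r = (D < r).sum(axis=1).mean()
    deg_2r = (D < 2 * r).sum(axis=1).mean()
    return np.log2(deg_2r / deg_r)
```

Boundary effects bias the estimate slightly downward; the paper's estimator handles such issues rigorously and achieves n log n scaling rather than the O(n^2) of this sketch.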
Adversarial Estimation of Topological Dimension with Harmonic Score Maps
Quantification of the number of variables needed to locally explain complex
data is often the first step toward understanding it better. Existing
techniques for intrinsic dimension estimation leverage statistical models to
glean this information from samples within a neighborhood. However, these
methods often rely on well-chosen hyperparameters and ample data as manifold
dimension and curvature increase. Leveraging insight into the fixed point of
the score-matching objective when the score map is regularized by its
Dirichlet energy, we show that it is possible to retrieve the topological
dimension of the manifold learned by the score map. We then introduce a novel
method to measure the learned manifold's topological dimension (i.e., local
intrinsic dimension) using adversarial attacks, thereby generating useful
interpretations of the learned manifold.
Comment: Accepted to the NeurIPS'23 Workshop on Diffusion Model
Intrinsic dimension estimation for locally undersampled data
Identifying the minimal number of parameters needed to describe a dataset is a challenging problem known in the literature as intrinsic dimension estimation. None of the existing intrinsic dimension estimators is reliable when the dataset is locally undersampled, and this lies at the core of the so-called curse of dimensionality. Here we introduce a new intrinsic dimension estimator that leverages simple properties of the tangent space of a manifold and extends the usual correlation integral estimator to alleviate the extreme undersampling problem. Based on this insight, we explore a multiscale generalization of the algorithm that is capable of (i) identifying multiple dimensionalities in a dataset, and (ii) providing accurate estimates of the intrinsic dimension of extremely curved manifolds. We test the method on manifolds generated from global transformations of high-contrast images, which are relevant for invariant object recognition and considered a challenge for state-of-the-art intrinsic dimension estimators.
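The "usual correlation integral estimator" that the abstract extends can be sketched in a few lines (Grassberger-Procaccia style): the correlation integral C(r) is the fraction of point pairs closer than r, and for small r it scales like r^d. The function name and the simple two-scale slope are illustrative choices, not the paper's algorithm:

```python
import numpy as np

def correlation_dimension(X, r1, r2):
    """Correlation-integral dimension estimate: C(r) = fraction of point
    pairs with distance < r; since C(r) ~ r^d for small r, d is the slope
    of log C(r) between the two scales r1 < r2."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    pair = D[np.triu_indices(len(X), k=1)]   # each pair counted once
    c1 = np.mean(pair < r1)
    c2 = np.mean(pair < r2)
    return (np.log(c2) - np.log(c1)) / (np.log(r2) - np.log(r1))
```

When the data are locally undersampled, few pairs fall below the small radius r1 and the estimate degrades, which is exactly the regime the paper's tangent-space correction targets.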
Data segmentation based on the local intrinsic dimension
One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
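A minimal version of such a segmentation can be built by estimating a per-point local ID and applying a hard threshold. This sketch uses a per-point maximum-likelihood LID estimate; the paper's actual approach is more robust than this, and all names here are illustrative:

```python
import numpy as np

def local_id(X, k=20):
    """Per-point local ID: the MLE estimator applied to each point's
    k nearest-neighbor distances (brute-force O(n^2))."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    knn = np.sort(D, axis=1)[:, :k]
    return -1.0 / np.mean(np.log(knn / knn[:, -1:]), axis=1)

def segment_by_id(X, threshold, k=20):
    """Binary segmentation: True for points lying in high-ID regions."""
    return local_id(X, k) > threshold
```

On a dataset mixing a 1-D curve with a 3-D blob, thresholding the local ID separates the two regions even though a distance-based clustering sees only one or two blobs of points.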
The role of local dimensionality measures in benchmarking nearest neighbor search
This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concepts of local intrinsic dimensionality (LID), local relative contrast (RC), and query expansion make it possible to choose query sets spanning a wide range of difficulty for real-world datasets. Moreover, the effect of the distribution of these dimensionality measures on the running-time performance of implementations is studied empirically. To this end, different visualization concepts are introduced that give a more fine-grained overview of the inner workings of nearest neighbor search principles. Interactive visualizations are available on the companion website. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.
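Local relative contrast is commonly defined per query as the mean distance to the dataset divided by the nearest-neighbor distance; a query with RC near 1 is hard, because its nearest neighbor is barely closer than a random point. A sketch under that common definition (the paper may use a variant):

```python
import numpy as np

def relative_contrast(X, q):
    """RC of a query q against dataset X: mean distance over nearest
    distance. Large RC means an easy query; RC near 1 means a hard one."""
    d = np.linalg.norm(X - q, axis=1)
    return d.mean() / d.min()
```

Binning queries by RC (or by LID) is one concrete way to assemble the graded-difficulty query sets the paper advocates.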