The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search
This paper reconsiders common benchmarking approaches to nearest neighbor
search. It is shown that the concept of local intrinsic dimensionality (LID)
makes it possible to choose query sets spanning a wide range of difficulty for
real-world datasets. Moreover, the effect of different LID distributions on the
running-time performance of implementations is studied empirically. To this
end, different visualization concepts are introduced that give a more
fine-grained overview of the inner workings of nearest neighbor search
principles. The paper closes with remarks about the diversity of datasets
commonly used for nearest neighbor search benchmarking. It is shown that such
real-world datasets are not diverse: results on a single dataset predict
results on all other datasets well.
Comment: Preprint of the paper accepted at SISAP 201
On the Selection of Anchors and Targets for Video Hyperlinking
A problem not well understood in video hyperlinking is what qualifies a
fragment as an anchor or target. Ideally, anchors provide good starting points
for navigation, and targets supplement anchors with additional details without
distracting users with irrelevant, false, or redundant information. The
problem is not trivial because of the intertwined relationship between data
characteristics and user expectations. Imagine that in a large dataset there
are clusters of fragments spread over the feature space. The nature of each
cluster can be described by its size (implying popularity) and structure
(implying complexity). A principled way of hyperlinking can be carried out by
picking the centers of clusters as anchors and from there reaching out to
targets within or outside the clusters, taking neighborhood complexity into
account. The question is which fragments should be selected as anchors or
targets so as to reflect the rich content of a dataset while minimizing the
risk of a frustrating user experience. This paper provides some insights into
this question from the perspective of hubness and local intrinsic
dimensionality, two statistical properties for assessing the popularity and
complexity of a data space. Based on these properties, two novel algorithms
are proposed for low-risk automatic selection of anchors and targets.
Comment: ACM International Conference on Multimedia Retrieval (ICMR), 2017.
(Oral
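Hubness can be measured directly as the k-occurrence N_k: how often a fragment appears in other fragments' k-NN lists. A brute-force sketch of this statistic follows; the function name and the raw feature-vector input are assumptions, and the paper's anchor/target selection builds on this statistic in a more involved way:

```python
import numpy as np

def k_occurrence(X, k=10):
    """N_k hubness score: count how often each point appears among the
    k nearest neighbors of the other points (brute-force O(n^2))."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    nn = np.argsort(D, axis=1)[:, :k]    # k-NN list of every point
    return np.bincount(nn.ravel(), minlength=len(X))
```

Fragments with large N_k are natural hub (anchor) candidates, since many other fragments already point to them.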
Dimension Estimation Using Random Connection Models
Information about intrinsic dimension is crucial to perform dimensionality
reduction, compress information, design efficient algorithms, and do
statistical adaptation. In this paper we propose an estimator for the intrinsic
dimension of a data set. The estimator is based on binary neighbourhood
information about the observations in the form of two adjacency matrices, and
does not require any explicit distance information. The underlying graph is
modelled according to a subset of a specific random connection model, sometimes
referred to as the Poisson blob model. Computationally, the estimator scales
like n log n, and we specify its asymptotic distribution and rate of
convergence. Experiments on both real and simulated data show that our
approach compares favourably with several competing methods from the
literature, including approaches that rely on distance information.
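The key idea, that purely binary adjacency information suffices, can be illustrated with a much cruder estimator than the paper's: for locally uniform data, the expected degree in an r-radius neighborhood graph grows like r^d, so comparing average degrees at radii r and 2r yields a dimension estimate. Everything here (the function name, the specific two-radius trick) is an illustrative assumption, not the paper's construction:

```python
import numpy as np

def dim_from_adjacency(X, r):
    """Crude dimension estimate using only two adjacency matrices
    (radii r and 2r): average degree scales like r^d for uniform data,
    so d is roughly log2 of the degree ratio."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    deg_r = (D < r).sum(axis=1).mean()
    deg_2r = (D < 2 * r).sum(axis=1).mean()
    return np.log2(deg_2r / deg_r)
```

Boundary effects bias the estimate slightly downward; the paper's estimator handles such issues rigorously and achieves n log n scaling rather than the O(n^2) of this sketch.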
Adversarial Estimation of Topological Dimension with Harmonic Score Maps
Quantification of the number of variables needed to locally explain complex
data is often the first step toward understanding it better. Existing
techniques for intrinsic dimension estimation leverage statistical models to
glean this information from samples within a neighborhood. However, these
methods often rely on well-chosen hyperparameters and ample data as manifold
dimension and curvature increase. Leveraging insight into the fixed point of
the score-matching objective when the score map is regularized by its
Dirichlet energy, we show that it is possible to retrieve the topological
dimension of the manifold learned by the score map. We then introduce a novel
method to measure the learned manifold's topological dimension (i.e., local
intrinsic dimension) using adversarial attacks, thereby generating useful
interpretations of the learned manifold.
Comment: Accepted to the NeurIPS'23 Workshop on Diffusion Model
Intrinsic dimension estimation for locally undersampled data
Identifying the minimal number of parameters needed to describe a dataset is a challenging problem known in the literature as intrinsic dimension estimation. None of the existing intrinsic dimension estimators is reliable when the dataset is locally undersampled, and this lies at the core of the so-called curse of dimensionality. Here we introduce a new intrinsic dimension estimator that leverages simple properties of the tangent space of a manifold and extends the usual correlation integral estimator to alleviate the extreme undersampling problem. Based on this insight, we explore a multiscale generalization of the algorithm that is capable of (i) identifying multiple dimensionalities in a dataset, and (ii) providing accurate estimates of the intrinsic dimension of extremely curved manifolds. We test the method on manifolds generated from global transformations of high-contrast images, which are relevant for invariant object recognition and considered a challenge for state-of-the-art intrinsic dimension estimators.
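The "usual correlation integral estimator" that the abstract extends can be sketched in a few lines (Grassberger-Procaccia style): the correlation integral C(r) is the fraction of point pairs closer than r, and for small r it scales like r^d. The function name and the simple two-scale slope are illustrative choices, not the paper's algorithm:

```python
import numpy as np

def correlation_dimension(X, r1, r2):
    """Correlation-integral dimension estimate: C(r) = fraction of point
    pairs with distance < r; since C(r) ~ r^d for small r, d is the slope
    of log C(r) between the two scales r1 < r2."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    pair = D[np.triu_indices(len(X), k=1)]   # each pair counted once
    c1 = np.mean(pair < r1)
    c2 = np.mean(pair < r2)
    return (np.log(c2) - np.log(c1)) / (np.log(r2) - np.log(r1))
```

When the data are locally undersampled, few pairs fall below the small radius r1 and the estimate degrades, which is exactly the regime the paper's tangent-space correction targets.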
Data segmentation based on the local intrinsic dimension
One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
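A minimal version of such a segmentation can be built by estimating a per-point local ID and applying a hard threshold. This sketch uses a per-point maximum-likelihood LID estimate; the paper's actual approach is more robust than this, and all names here are illustrative:

```python
import numpy as np

def local_id(X, k=20):
    """Per-point local ID: the MLE estimator applied to each point's
    k nearest-neighbor distances (brute-force O(n^2))."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    knn = np.sort(D, axis=1)[:, :k]
    return -1.0 / np.mean(np.log(knn / knn[:, -1:]), axis=1)

def segment_by_id(X, threshold, k=20):
    """Binary segmentation: True for points lying in high-ID regions."""
    return local_id(X, k) > threshold
```

On a dataset mixing a 1-D curve with a 3-D blob, thresholding the local ID separates the two regions even though a distance-based clustering sees only one or two blobs of points.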
The role of local dimensionality measures in benchmarking nearest neighbor search
This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concepts of local intrinsic dimensionality (LID), local relative contrast (RC), and query expansion make it possible to choose query sets spanning a wide range of difficulty for real-world datasets. Moreover, the effect of the distribution of these dimensionality measures on the running-time performance of implementations is studied empirically. To this end, different visualization concepts are introduced that give a more fine-grained overview of the inner workings of nearest neighbor search principles. Interactive visualizations are available on the companion website. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.
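Local relative contrast is commonly defined per query as the mean distance to the dataset divided by the nearest-neighbor distance; a query with RC near 1 is hard, because its nearest neighbor is barely closer than a random point. A sketch under that common definition (the paper may use a variant):

```python
import numpy as np

def relative_contrast(X, q):
    """RC of a query q against dataset X: mean distance over nearest
    distance. Large RC means an easy query; RC near 1 means a hard one."""
    d = np.linalg.norm(X - q, axis=1)
    return d.mean() / d.min()
```

Binning queries by RC (or by LID) is one concrete way to assemble the graded-difficulty query sets the paper advocates.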