1,639 research outputs found

The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search

Author: E Chávez
G Casanova
H Jégou
H Kriegel
I Jolliffe
K Smith-Miles
Laurent Amsaleg
M Aumüller
ME Houle
RR Curtin
WB Johnson
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concept of local intrinsic dimensionality (LID) makes it possible to choose query sets of a wide range of difficulty for real-world datasets. Moreover, the effect of different LID distributions on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that give a more fine-grained overview of the inner workings of nearest neighbor search principles. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well. Comment: Preprint of the paper accepted at SISAP 201

arXiv.org e-Print Archive

Crossref

The IT University of Copenhagen's Repository

Archivio istituzionale della ricerca - Università di Padova
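
The abstract above relies on estimating the local intrinsic dimensionality around a query. As a rough illustration, the sketch below implements one commonly used maximum-likelihood estimator, computed from the distances between a query and its k nearest neighbors. The abstract does not say which estimator the paper uses, so treat this choice, and the hypothetical function name `lid_mle`, as assumptions.

```python
import math


def lid_mle(distances):
    """Maximum-likelihood LID estimate from k nearest-neighbor distances.

    Assumed formula: LID = -1 / mean(ln(r_i / r_k)), where r_k is the
    largest of the k distances. Zero distances are ignored.
    """
    r = sorted(d for d in distances if d > 0)
    if len(r) < 2:
        raise ValueError("need at least two positive distances")
    r_k = r[-1]
    mean_log = sum(math.log(d / r_k) for d in r) / len(r)
    if mean_log == 0:
        raise ValueError("all distances are equal; estimate is undefined")
    return -1.0 / mean_log


# Distances from a query to its three nearest neighbors (made-up values).
print(lid_mle([1.0, 2.0, 4.0]))  # equals 1 / ln 2, about 1.4427
```

Under this estimator, queries whose neighbor distances are nearly equal receive a high LID, which is one way to rank queries from easy to hard.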

The role of local dimensionality measures in benchmarking nearest neighbor search

Author: Ahle
Amsaleg
Asuero
Aumüller
Beyer
Casanova
Chávez
Curtin
François
He
Houle
Iwasaki
Johnson
Jolliffe
Karger
Kriegel
Li
Malkov
Smith-Miles
Spring
Xiao
Publication venue: 'Elsevier BV'
Publication date: 01/01/2021

This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concepts of local intrinsic dimensionality (LID), local relative contrast (RC), and query expansion make it possible to choose query sets of a wide range of difficulty for real-world datasets. Moreover, the effect of the distribution of these dimensionality measures on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that give a more fine-grained overview of the inner workings of nearest neighbor search principles. Interactive visualizations are available on the companion website. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.

Crossref

The IT University of Copenhagen's Repository

Archivio istituzionale della ricerca - Università di Padova
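
The abstract above names local relative contrast (RC) as a second difficulty measure. A minimal sketch follows, assuming the usual definition of relative contrast as the mean distance from the query to the dataset divided by the distance to its k-th nearest neighbor. The abstract does not spell out the definition, so both the formula and the hypothetical function name `relative_contrast` are assumptions.

```python
import math


def relative_contrast(query, points, k=1):
    """Mean distance from query to all points, divided by the distance
    to the k-th nearest neighbor (assumed definition of local RC)."""
    dists = sorted(math.dist(query, p) for p in points)
    if not 1 <= k <= len(dists):
        raise ValueError("k must be between 1 and the number of points")
    d_k = dists[k - 1]
    if d_k == 0:
        raise ValueError("k-th neighbor coincides with the query")
    return (sum(dists) / len(dists)) / d_k


# Toy data: distances from the origin are 1, 2 and 3, so the mean is 2.
points = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0)]
print(relative_contrast((0.0, 0.0), points, k=1))  # 2.0
```

A value close to 1 means the k-th neighbor is barely closer than an average point, which suggests a hard query; larger values suggest easier ones.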

Terminology mining in social media

Author: Karlgren Jussi
Sahlgren Magnus
Publication date: 01/01/2009

The highly variable and dynamic word usage in social media presents serious challenges for both research and those commercial applications that are geared towards blogs or other user-generated non-editorial texts. This paper discusses and exemplifies a terminology mining approach for dealing with the productive character of the textual environment in social media. We explore the challenges of practically acquiring new terminology, and of modeling similarity and relatedness of terms from observing realistic amounts of data. We also discuss semantic evolution and density, and investigate novel measures for characterizing the preconditions for terminology mining.

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive

Algorithm Engineering for High-Dimensional Similarity Search Problems (Invited Talk)

Author: Aumüller Martin
Publication date: 01/01/2020

Similarity search problems in high-dimensional data arise in many areas of computer science such as databases, image analysis, machine learning, and natural language processing. One of the most prominent problems is finding the k nearest neighbors of a data point q ∈ ℝ^d in a large set of data points S ⊂ ℝ^d, under some distance measure such as Euclidean distance. In contrast to lower dimensional settings, we do not know of worst-case efficient data structures for such search problems in high-dimensional data, i.e., data structures that are faster than a linear scan through the data set. However, there is a rich body of (often heuristic) approaches that solve nearest neighbor search problems much faster than such a scan on many real-world data sets. As a necessity, the term "solve" means that these approaches give approximate results that are close to the true k nearest neighbors. In this talk, we survey recent approaches to nearest neighbor search and related problems. The talk consists of three parts: (1) What makes nearest neighbor search difficult? (2) How do current state-of-the-art algorithms work? (3) What are recent advances regarding similarity search on GPUs, in distributed settings, or in external memory?

Dagstuhl Research Online Publication Server

The IT University of Copenhagen's Repository
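
The abstract above contrasts approximate methods with a linear scan through the data set. The sketch below shows that baseline under Euclidean distance, together with recall, a common way to quantify how close an approximate answer is to the true k nearest neighbors. The function names `knn_linear_scan` and `recall` are hypothetical, and the recall measure is an assumption, since the abstract does not name a quality metric.

```python
import heapq
import math


def knn_linear_scan(query, points, k):
    """Indices of the k points closest to query, found by scanning all points."""
    return heapq.nsmallest(
        k, range(len(points)), key=lambda i: math.dist(query, points[i])
    )


def recall(approx_ids, true_ids):
    """Fraction of the true neighbors that the approximate answer contains."""
    if not true_ids:
        raise ValueError("true_ids must not be empty")
    return len(set(approx_ids) & set(true_ids)) / len(true_ids)


# Toy data set S and query q (made-up values).
S = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
q = (0.0, 0.0)
exact = knn_linear_scan(q, S, k=2)
print(exact)                  # [0, 3]
print(recall([0, 1], exact))  # 0.5
```

The scan computes one distance per data point for every query, which is the cost that the heuristic approaches surveyed in the talk try to avoid.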

Structure identification methods for atomistic simulations of crystalline materials

Author: Stukowski Alexander
Publication venue: 'IOP Publishing'
Publication date: 11/06/2012

We discuss existing and new computational analysis techniques to classify local atomic arrangements in large-scale atomistic computer simulations of crystalline solids. This article includes a performance comparison of typical analysis algorithms such as Common Neighbor Analysis, Centrosymmetry Analysis, Bond Angle Analysis, Bond Order Analysis, and Voronoi Analysis. In addition, we propose a simple extension to the Common Neighbor Analysis method that makes it suitable for multi-phase systems. Finally, we introduce a new structure identification algorithm, the Neighbor Distance Analysis, that is designed to identify atomic structure units in grain boundaries.

arXiv.org e-Print Archive

Crossref

A Harmonic Extension Approach for Collaborative Ranking

Author: Bertozzi Andrea
Kuang Da
Osher Stanley
Shi Zuoqiang
Publication date: 16/02/2016

We present a new perspective on graph-based methods for collaborative ranking for recommender systems. Unlike user-based or item-based methods that compute a weighted average of ratings given by the nearest neighbors, or low-rank approximation methods using convex optimization and the nuclear norm, we formulate matrix completion as a series of semi-supervised learning problems, and propagate the known ratings to the missing ones on the user-user or item-item graph globally. The semi-supervised learning problems are expressed as Laplace-Beltrami equations on a manifold, that is, as harmonic extension, and can be discretized by a point integral method. We show that our approach does not impose a low-rank Euclidean subspace on the data points, but instead minimizes the dimension of the underlying manifold. Our method, named LDM (low dimensional manifold), turns out to be particularly effective in generating rankings of items, showing decent computational efficiency and robust ranking quality compared to state-of-the-art methods.

arXiv.org e-Print Archive

eScholarship - University of California
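
The abstract above describes propagating known ratings to missing ones as a harmonic extension on a graph. The sketch below illustrates the idea with the plain graph Laplacian: each unknown node is repeatedly set to the weighted average of its neighbors while known nodes stay fixed. This is not the point integral method the abstract mentions, and the function name `harmonic_extension`, the dense weight matrix, and the fixed number of sweeps are all assumptions made for illustration.

```python
def harmonic_extension(weights, known, n_sweeps=500):
    """Fill in unknown node values on a weighted graph.

    weights: symmetric matrix (list of lists) of non-negative edge weights.
    known:   dict mapping node index to its fixed value (e.g. a rating).
    Uses Gauss-Seidel sweeps on the discrete Laplace equation.
    """
    n = len(weights)
    values = [float(known.get(i, 0.0)) for i in range(n)]
    for _ in range(n_sweeps):
        for i in range(n):
            if i in known:
                continue  # known ratings act as boundary conditions
            total = sum(w for j, w in enumerate(weights[i]) if j != i)
            if total == 0:
                continue  # isolated node: nothing to propagate
            weighted = sum(
                w * values[j] for j, w in enumerate(weights[i]) if j != i
            )
            values[i] = weighted / total
    return values


# Path graph 0 - 1 - 2 - 3 with known values at both ends (toy example).
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(harmonic_extension(W, {0: 0.0, 3: 3.0}))  # close to [0, 1, 2, 3]
```

On the path graph the result interpolates linearly between the two known values, which is what a harmonic function does in one dimension.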

A meta-learning configuration framework for graph-based similarity search indexes

Author: Barbon S.
Kaster D. S.
Oyamada R. S.
Shimomura L. C.
Publication date: 01/01/2023

Similarity searches retrieve elements in a dataset with similar characteristics to the input query element. Recent works show that graph-based methods have outperformed others in the literature, such as tree-based and hash-based methods. However, graphs are highly parameter-sensitive for indexing and searching, which usually demands extra time for finding a suitable trade-off for specific user requirements. Current approaches to select parameters rely on observing published experimental results or Grid Search procedures. While the former has no guarantees that good settings for a dataset will also perform well on a different one, the latter is computationally expensive and limited to a small range of values. In this work, we propose a meta-learning-based recommender framework capable of providing a suitable graph configuration according to the characteristics of the input dataset. We present two instantiations of the framework: a global instantiation that uses the whole meta-database to train meta-models and a dataset-similarity-based instantiation that relies on clustering to generate meta-models tailored to datasets with similar characteristics. We also developed generic and tuned versions of the instantiations. The generic versions can satisfy user requirements orders of magnitude faster than the traditional Grid Search. The tuned versions provide more accurate predictions at a higher cost. Our results show that the tuned methods outperform the Grid Search in most cases, providing recommendations close to the optimal one and being a suitable alternative, particularly for more challenging datasets.

Archivio istituzionale della ricerca - Università di Trieste