325 research outputs found

Accurate and Fast Retrieval for Complex Non-metric Data via Neighborhood Graphs

Authors: B Naidan, DD Lewis, DM Blei, DW Jacobs, E Chávez, G Chechik, GR Hjaltason, GT Toussaint, H Samet, L Boytsov, M Aumüller, S Kullback, S Robertson, T Skopal, Y Malkov
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 08/10/2019

We demonstrate that a graph-based search algorithm, which relies on the construction of an approximate neighborhood graph, can work directly with challenging non-metric and/or non-symmetric distances without resorting to metric-space mapping and/or distance symmetrization, which, in turn, lead to substantial performance degradation. Although straightforward metrization and symmetrization are usually ineffective, we find that constructing an index using a modified (e.g., symmetrized) distance can improve performance. This observation paves the way for a new line of research on designing index-specific graph-construction distance functions.
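The idea in this abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes KL divergence as the non-symmetric distance, builds an exact k-NN graph by brute force using a symmetrized distance, and then runs a simple greedy walk using the original distance. The function names, the argument order of the distance at query time, and the single-entry-point greedy walk are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """Non-symmetric KL divergence KL(p || q); assumes strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetrized_kl(p, q):
    """Symmetrized variant, used here only while building the graph."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def build_knn_graph(points, k, dist):
    """Exact brute-force k-NN graph (illustration only; real indexes approximate this)."""
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        graph[i] = others[:k]
    return graph


def greedy_search(points, graph, query, dist, start=0):
    """Move to the neighbor closest to the query until no neighbor improves.

    Returns a local minimum of dist(point, query) over the graph, which is
    not guaranteed to be the true nearest neighbor.
    """
    current = start
    current_d = dist(points[current], query)
    while True:
        best, best_d = current, current_d
        for j in graph[current]:
            d = dist(points[j], query)
            if d < best_d:
                best, best_d = j, d
        if best == current:
            return current, current_d
        current, current_d = best, best_d
```

Calling `build_knn_graph(points, k, symmetrized_kl)` and then `greedy_search(points, graph, query, kl_divergence)` separates the construction-time distance from the query-time distance, which is the distinction the abstract draws.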

arXiv.org e-Print Archive

Crossref

The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search

Authors: E Chávez, G Casanova, H Jégou, H Kriegel, I Jolliffe, K Smith-Miles, Laurent Amsaleg, M Aumüller, ME Houle, RR Curtin, WB Johnson
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concept of local intrinsic dimensionality (LID) makes it possible to choose query sets spanning a wide range of difficulty for real-world datasets. Moreover, the effect of different LID distributions on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that give a more fine-grained overview of the inner workings of nearest neighbor search principles. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well. Comment: Preprint of the paper accepted at SISAP 201
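For readers unfamiliar with LID, the following is a minimal sketch of a commonly used maximum-likelihood estimator computed from the distances between a query and its k nearest neighbors. Whether the paper uses exactly this form is an assumption, and the function name `lid_mle` is hypothetical.

```python
import math


def lid_mle(neighbor_distances):
    """Estimate local intrinsic dimensionality from k-NN distances.

    Uses the estimator  -1 / mean(ln(r_i / r_k)),  where r_k is the largest
    of the given distances. Zero distances are dropped because ln(0) is
    undefined.
    """
    dists = sorted(d for d in neighbor_distances if d > 0)
    if len(dists) < 2:
        raise ValueError("need at least two positive distances")
    r_max = dists[-1]
    mean_log = sum(math.log(d / r_max) for d in dists) / len(dists)
    if mean_log == 0:
        raise ValueError("all distances are equal; estimate is undefined")
    return -1.0 / mean_log
```

Under this estimator, a query whose neighbor distances are spread widely below the k-th distance receives a low estimate, while one whose neighbors all sit at nearly the same distance receives a high estimate, which is the sense in which LID can be used to grade query difficulty.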

arXiv.org e-Print Archive

Crossref

The IT University of Copenhagen's Repository

Archivio istituzionale della ricerca - Università di Padova

ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms

Authors: Alexander Faithfull, Alexandr Andoni, Amsaleg, Bentley, Christiani, Ciaccia, Curtin, Edel, Erik Bernhardsson, Heo, Herlocker, Houle, Hyvönen, Iwasaki, Johnson, Kirner, Kriegel, Laarhoven, LeCun, Levina, Malkov, Martin Aumüller, Pálmason, Van Rijn, Wang, Williams, Zezula
Publication venue: 'Elsevier BV'
Publication date: 01/01/2019

Crossref

The IT University of Copenhagen's Repository

High-Dimensional Similarity Search with Quantum-Assisted Variational Autoencoder

Authors: Board S. S., C. K., Cao Y., Carreira-Perpinan M. A., Gionis A., Hoffman M. D., Indyk P., Kamath C., Long P., Solano R., Steinhaeuser K., Yagoubi D. E.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 13/06/2020

Recent progress in quantum algorithms and hardware indicates the potential importance of quantum computing in the near future. However, finding suitable application areas remains an active area of research. Quantum machine learning is touted as a potential approach to demonstrate quantum advantage within both the gate-model and the adiabatic schemes. For instance, the Quantum-assisted Variational Autoencoder (QVAE) has been proposed as a quantum enhancement to the discrete VAE. We extend previous work and study the real-world applicability of a QVAE by presenting a proof-of-concept for similarity search in large-scale high-dimensional datasets. While exact and fast similarity search algorithms are available for low-dimensional datasets, scaling to high-dimensional data is non-trivial. We show how to construct a space-efficient search index based on the latent space representation of a QVAE. Our experiments show a correlation between the Hamming distance in the embedded space and the Euclidean distance in the original space on the Moderate Resolution Imaging Spectroradiometer (MODIS) dataset. Further, we find real-world speedups compared to linear search and demonstrate memory-efficient scaling to half a billion data points.
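Searching by Hamming distance over binary latent codes can be sketched as follows. This is an illustrative assumption about how such codes might be compared, not the paper's index structure: it packs each binary code into a Python integer and scans all codes, whereas the paper reports an index that is faster than linear search. The names `pack_bits`, `hamming`, and `hamming_search` are hypothetical.

```python
import heapq


def pack_bits(bits):
    """Pack a sequence of 0/1 values into one integer (first bit is most significant)."""
    code = 0
    for b in bits:
        code = (code << 1) | (1 if b else 0)
    return code


def hamming(a, b):
    """Number of bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")


def hamming_search(codes, query_code, k):
    """Return the indices of the k codes closest to query_code in Hamming distance."""
    return heapq.nsmallest(
        k, range(len(codes)), key=lambda i: hamming(codes[i], query_code)
    )
```

Packing codes into integers keeps the memory footprint at one bit per latent dimension, which reflects the space efficiency the abstract emphasizes.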

arXiv.org e-Print Archive

Crossref

PRILJ: an efficient two-step method based on embedding and clustering for the identification of regularities in legal case judgments

Authors: Gianvito Pio, Graziella De Martino, Michelangelo Ceci
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021

In an era characterized by fast technological progress that introduces new unpredictable scenarios every day, working in the legal field may appear very difficult if not supported by the right tools. In this respect, some systems based on Artificial Intelligence methods have been proposed in the literature to support several tasks in the legal sector. Following this line of research, in this paper we propose a novel method, called PRILJ, that identifies paragraph regularities in legal case judgments, to support legal experts during the drafting of legal documents. Methodologically, PRILJ adopts a two-step approach that first groups documents into clusters according to their semantic content, and then identifies regularities in the paragraphs for each cluster. Embedding-based methods are adopted to properly represent documents and paragraphs in a semantic numerical feature space, and an Approximated Nearest Neighbor Search method is adopted to efficiently retrieve the paragraphs most similar to those of a document under preparation. Our extensive experimental evaluation, performed on a real-world dataset provided by EUR-Lex, demonstrates the effectiveness and the efficiency of the proposed method. In particular, its ability to model different topics of legal documents, as well as to capture the semantics of the textual content, appears very beneficial for the considered task and makes PRILJ very robust to the possible presence of noise in the data.

Archivio istituzionale della ricerca - Università di Bari