Accurate and Fast Retrieval for Complex Non-metric Data via Neighborhood Graphs
We demonstrate that a graph-based search algorithm, which relies on the
construction of an approximate neighborhood graph, can work directly with
challenging non-metric and/or non-symmetric distances without resorting to
metric-space mapping and/or distance symmetrization, which, in turn, lead to
substantial performance degradation. Although straightforward metrization
and symmetrization are usually ineffective, we find that constructing an index
using a modified, e.g., symmetrized, distance can improve performance. This
observation paves the way for a new line of research on designing index-specific
graph-construction distance functions.
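The idea of building the graph with one distance and searching with another can be illustrated with a minimal sketch. It is not the authors' implementation: KL divergence is only one example of a non-symmetric distance, the brute-force `build_knn_graph` stands in for an approximate graph builder, and all function names are hypothetical.

```python
import heapq
import math


def kl_divergence(p, q):
    """Non-symmetric KL divergence between two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetrized(dist):
    """Return a symmetrized distance: the average of both argument orders."""
    return lambda a, b: 0.5 * (dist(a, b) + dist(b, a))


def build_knn_graph(points, k, dist):
    """Brute-force k-nearest-neighbor graph (assumed stand-in for an approximate builder)."""
    graph = []
    for i, p in enumerate(points):
        others = [(dist(p, q), j) for j, q in enumerate(points) if j != i]
        graph.append([j for _, j in heapq.nsmallest(k, others)])
    return graph


def greedy_search(points, graph, query, dist, start=0):
    """Move to the closest neighbor of the current node until no neighbor is closer.

    The distance is always evaluated as dist(data_point, query); with a
    non-symmetric distance the argument order is a design decision.
    """
    current, best = start, dist(points[start], query)
    while True:
        improved = False
        for j in graph[current]:
            d = dist(points[j], query)
            if d < best:
                current, best, improved = j, d, True
        if not improved:
            return current, best


# Index construction uses the symmetrized distance; search uses the original one.
points = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.3, 0.4, 0.3]]
graph = build_knn_graph(points, 3, symmetrized(kl_divergence))
nearest, distance = greedy_search(points, graph, [0.25, 0.25, 0.5], kl_divergence)
```

On this toy data the graph is fully connected, so the greedy walk reaches the exact nearest point; on a sparse graph it may stop at a local minimum, which is why practical systems use several restarts or a candidate queue.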
The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search
This paper reconsiders common benchmarking approaches to nearest neighbor
search. It is shown that the concept of local intrinsic dimensionality (LID)
makes it possible to choose query sets covering a wide range of difficulty for
real-world datasets. Moreover, the effect of different LID distributions on the
running-time performance of implementations is studied empirically. To this end,
different visualization concepts are introduced that give a more
fine-grained overview of the inner workings of nearest neighbor search
principles. The paper closes with remarks about the diversity of datasets
commonly used for nearest neighbor search benchmarking. It is shown that such
real-world datasets are not diverse: results on a single dataset predict
results on all other datasets well.
Comment: Preprint of the paper accepted at SISAP 201
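As a rough illustration of how the LID of a query can be estimated, the sketch below uses the widely used maximum-likelihood estimator over the distances to the k nearest neighbors. The abstract does not state which estimator the paper uses, so this choice, and the function name `lid_mle`, are assumptions.

```python
import math


def lid_mle(distances):
    """Maximum-likelihood LID estimate from distances to the k nearest neighbors.

    Computes -1 / mean(log(r_i / r_k)), where r_k is the largest distance.
    Zero distances are dropped because their logarithm is undefined.
    """
    r = sorted(d for d in distances if d > 0)
    if len(r) < 2:
        raise ValueError("need at least two positive distances")
    r_k = r[-1]
    mean_log = sum(math.log(d / r_k) for d in r) / len(r)
    if mean_log == 0:
        # All neighbors are at the same distance: the estimate diverges.
        return float("inf")
    return -1.0 / mean_log


# Neighbors concentrated near the k-th distance give a higher estimate,
# i.e. a harder query under this measure.
easy = lid_mle([0.5, 1.0])
hard = lid_mle([0.9, 1.0])
```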
High-Dimensional Similarity Search with Quantum-Assisted Variational Autoencoder
Recent progress in quantum algorithms and hardware indicates the potential
importance of quantum computing in the near future. However, finding suitable
application areas remains an active area of research. Quantum machine learning
is touted as a potential approach to demonstrate quantum advantage within both
the gate-model and the adiabatic schemes. For instance, the Quantum-assisted
Variational Autoencoder has been proposed as a quantum enhancement to the
discrete VAE. We extend previous work and study the real-world applicability
of a QVAE by presenting a proof-of-concept for similarity search in large-scale
high-dimensional datasets. While exact and fast similarity search algorithms
are available for low-dimensional datasets, scaling to high-dimensional data is
non-trivial. We show how to construct a space-efficient search index based on
the latent space representation of a QVAE. Our experiments show a correlation
between the Hamming distance in the embedded space and the Euclidean distance
in the original space on the Moderate Resolution Imaging Spectroradiometer
(MODIS) dataset. Further, we find real-world speedups compared to linear search
and demonstrate memory-efficient scaling to half a billion data points.
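A search index over binary latent codes can be sketched as a hash table that is probed within a small Hamming radius. This is only an assumed illustration of the general technique: the codes below are hand-written integers rather than QVAE outputs, and `build_index` and `probe` are hypothetical names, not the paper's API.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of differing bits between two codes stored as integers."""
    return bin(a ^ b).count("1")


def build_index(codes):
    """Map each binary code to the ids of the items that share it."""
    index = defaultdict(list)
    for item_id, code in enumerate(codes):
        index[code].append(item_id)
    return index


def probe(index, query_code, n_bits, radius):
    """Return ids of items whose code differs from the query in at most `radius` bits."""
    hits = []
    for r in range(radius + 1):
        for bits in combinations(range(n_bits), r):
            code = query_code
            for b in bits:
                code ^= 1 << b  # flip one bit of the query code
            hits.extend(index.get(code, []))
    return hits


codes = [0b0000, 0b0001, 0b0111, 0b1111]
index = build_index(codes)
candidates = probe(index, 0b0000, n_bits=4, radius=1)
```

The number of probed buckets grows combinatorially with the radius, so this scheme only pays off for small radii; candidates would typically be re-ranked with the original distance.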
PRILJ: an efficient two-step method based on embedding and clustering for the identification of regularities in legal case judgments
In an era characterized by fast technological progress that introduces new, unpredictable scenarios every day, working in the legal field may appear very difficult without the support of the right tools. In this respect, several systems based on Artificial Intelligence methods have been proposed in the literature to support tasks in the legal sector. Following this line of research, in this paper we propose a novel method, called PRILJ, that identifies paragraph regularities in legal case judgments, to support legal experts during the drafting of legal documents. Methodologically, PRILJ adopts a two-step approach that first groups documents into clusters according to their semantic content, and then identifies regularities in the paragraphs for each cluster. Embedding-based methods are adopted to represent documents and paragraphs in a semantic numerical feature space, and an Approximated Nearest Neighbor Search method is adopted to efficiently retrieve the paragraphs most similar to those of a document under preparation. Our extensive experimental evaluation, performed on a real-world dataset provided by EUR-Lex, demonstrates the effectiveness and the efficiency of the proposed method. In particular, its ability to model different topics of legal documents, as well as to capture the semantics of the textual content, appears very beneficial for the considered task and makes PRILJ very robust to the possible presence of noise in the data.
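The two-step retrieval described above can be sketched as follows. This is a toy illustration under stated assumptions, not PRILJ itself: the embeddings are hand-written vectors, the cluster centroids are given rather than learned, exact cosine ranking stands in for the approximate nearest neighbor search, and all names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_cluster(doc_vec, centroids):
    """Step 1: pick the cluster whose centroid is most similar to the document."""
    return max(range(len(centroids)), key=lambda c: cosine(doc_vec, centroids[c]))


def similar_paragraphs(par_vec, cluster_paragraphs, top_k):
    """Step 2: rank the paragraphs of that cluster by similarity to a query paragraph.

    `cluster_paragraphs` is a list of (paragraph_id, vector) pairs.
    """
    ranked = sorted(
        cluster_paragraphs,
        key=lambda item: cosine(par_vec, item[1]),
        reverse=True,
    )
    return [pid for pid, _ in ranked[:top_k]]


centroids = [[1.0, 0.0], [0.0, 1.0]]
cluster = nearest_cluster([0.9, 0.1], centroids)
paragraphs = [("a", [1.0, 0.0]), ("b", [0.8, 0.6]), ("c", [0.0, 1.0])]
top = similar_paragraphs([1.0, 0.1], paragraphs, top_k=2)
```

Restricting the paragraph search to one cluster shrinks the candidate set, which is where the efficiency of a two-step design would come from.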