Approximate Near Neighbors for General Symmetric Norms
We show that every symmetric normed space admits an efficient nearest
neighbor search data structure with doubly-logarithmic approximation.
Specifically, for every $n$, $d = n^{o(1)}$, and every $d$-dimensional
symmetric norm $\|\cdot\|$, there exists a data structure for
$\mathrm{poly}(\log\log n)$-approximate nearest neighbor search over $\|\cdot\|$
for $n$-point datasets achieving $n^{o(1)}$ query time and
$n^{1+o(1)}$ space. The main technical ingredient of the algorithm is a
low-distortion embedding of a symmetric norm into a low-dimensional iterated
product of top-$k$ norms.
We also show that our techniques cannot be extended to general norms.
Comment: 27 pages, 1 figure
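The top-$k$ norm mentioned above is the sum of the $k$ largest absolute coordinates of a vector; $k = 1$ gives $\ell_\infty$ and $k = d$ gives $\ell_1$. A minimal illustrative sketch (not the paper's code):

```python
def top_k_norm(x, k):
    """Top-k norm: sum of the k largest absolute coordinates of x."""
    mags = sorted((abs(v) for v in x), reverse=True)
    return sum(mags[:k])

v = [3.0, -1.0, 4.0, 1.0, -5.0]
print(top_k_norm(v, 1))  # 5.0  (l_infinity)
print(top_k_norm(v, 5))  # 14.0 (l_1)
```

Iterated products of such norms are simple enough to search in directly, which is what makes the embedding useful.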
Distance-Sensitive Hashing
Locality-sensitive hashing (LSH) is an important tool for managing
high-dimensional noisy or uncertain data, for example in connection with data
cleaning (similarity join) and noise-robust search (similarity search).
However, for a number of problems the LSH framework is not known to yield good
solutions, and instead ad hoc solutions have been designed for particular
similarity and distance measures. For example, this is true for
output-sensitive similarity search/join, and for indexes supporting annulus
queries that aim to report a point close to a certain given distance from the
query point.
In this paper we initiate the study of distance-sensitive hashing (DSH), a
generalization of LSH that seeks a family of hash functions such that the
probability of two points having the same hash value is a given function of the
distance between them. More precisely, given a distance space $(X, \mathrm{dist})$
and a "collision probability function" (CPF) $f\colon \mathbb{R}_+ \to [0, 1]$
we seek a distribution over pairs of functions $(h, g)$
such that for every pair of points $x, y \in X$ the collision
probability is $\Pr[h(x) = g(y)] = f(\mathrm{dist}(x, y))$. Locality-sensitive
hashing is the study of how fast a CPF can decrease as the distance grows. For
many spaces, $f$ can be made exponentially decreasing even if we restrict
attention to the symmetric case where $h = g$. We show that the asymmetry
achieved by having a pair of functions makes it possible to achieve CPFs that
are, for example, increasing or unimodal, and show how this leads to principled
solutions to problems not addressed by the LSH framework. This includes a novel
application to privacy-preserving distance estimation. We believe that the DSH
framework will find further applications in high-dimensional data management.
Comment: Accepted at PODS'18. Abstract shortened due to character limit
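For intuition, here is an illustrative sketch (not from the paper) of the simplest symmetric case: bit-sampling LSH on Hamming space, whose CPF is the decreasing function $f(r) = (1 - r/d)^k$. DSH asks for families, generally asymmetric pairs $(h, g)$, realizing other shapes of $f$.

```python
import random

def make_hash(d, k, rng):
    """Bit-sampling LSH: project onto k random coordinates."""
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda x: tuple(x[i] for i in coords)

def empirical_cpf(x, y, k, trials=20000, seed=0):
    """Estimate the collision probability for points x, y."""
    rng = random.Random(seed)
    d = len(x)
    hits = 0
    for _ in range(trials):
        h = make_hash(d, k, rng)
        hits += h(x) == h(y)
    return hits / trials

x = [0] * 16
y = [1] * 4 + [0] * 12            # Hamming distance 4, d = 16
print(empirical_cpf(x, y, k=2))   # close to (1 - 4/16)**2 = 0.5625
```

The estimate concentrates around the analytic CPF value; for, say, an annulus query one would instead want a family whose CPF peaks near the target distance.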
Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search
The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a
general technique for constructing a data structure to answer approximate near
neighbor queries by using a distribution over locality-sensitive
hash functions that partition space. For a collection of points, after
preprocessing, the query time is dominated by evaluations
of hash functions from and hash table lookups and
distance computations where is determined by the
locality-sensitivity properties of . It follows from a recent
result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive
hash functions can be reduced to , leaving the query time to be
dominated by distance computations and
additional word-RAM operations. We state this result as a general framework and
provide a simpler analysis showing that the number of lookups and distance
computations closely match the Indyk-Motwani framework, making it a viable
replacement in practice. Using ideas from another locality-sensitive hashing
framework by Andoni and Indyk (SODA 2006) we are able to reduce the number of
additional word-RAM operations to $O(n^{\rho})$.
Comment: 15 pages, 3 figures
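A simplified sketch of the pooling idea (hypothetical code; the actual constructions in Dahlgaard et al. and Andoni-Indyk differ in how pool entries are combined): rather than evaluating $K \cdot L$ independent LSH functions per query, evaluate a small shared pool once and assemble each of the $L$ table keys from $K$ pool entries fixed at preprocessing time.

```python
import random

def build_key_schedule(pool_size, K, L, seed=0):
    """For each of L tables, pick K pool indices to concatenate."""
    rng = random.Random(seed)
    return [[rng.randrange(pool_size) for _ in range(K)] for _ in range(L)]

def table_keys(point, pool, schedule):
    vals = [h(point) for h in pool]   # pool evaluated once per query
    return [tuple(vals[i] for i in idxs) for idxs in schedule]

# Toy LSH pool for Hamming space: each pool function samples one bit.
rng = random.Random(1)
d, pool_size, K, L = 32, 8, 4, 6
pool = [(lambda i: (lambda x: x[i]))(rng.randrange(d))
        for _ in range(pool_size)]
schedule = build_key_schedule(pool_size, K, L)
keys = table_keys([0, 1] * 16, pool, schedule)
print(len(keys))  # 6 table keys from only 8 hash evaluations
```

The trade-off is that keys built from a shared pool are correlated, which is exactly what the simpler analysis mentioned above has to control.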
Taylor Polynomial Estimator for Estimating Frequency Moments
We present a randomized algorithm for estimating the $p$th moment $F_p$ of
the frequency vector of a data stream in the general update (turnstile) model
to within a multiplicative factor of $1 \pm \epsilon$, for $p > 2$, with high
constant confidence. For $0 < \epsilon \le 1$, the algorithm uses space
$O(n^{1-2/p} \epsilon^{-2} + n^{1-2/p} \epsilon^{-4/p} \log n)$ words. This
improves over the current bound of $O(n^{1-2/p} \epsilon^{-2-4/p} \log n)$
words by Andoni et al. in \cite{ako:arxiv10}. Our space upper bound matches
the lower bound of Li and Woodruff \cite{liwood:random13} for
$\epsilon = \Omega(n^{-1/p})$ and the lower bound of Andoni et al.
\cite{anpw:icalp13} for $\epsilon = \Omega(1)$.
Comment: Supersedes arXiv:1104.4552. Extended Abstract of this paper to appear
in Proceedings of ICALP 2015
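The turnstile model and sketch-based moment estimation can be illustrated with the classic AMS sketch for $F_2$ (a far simpler estimator than the Taylor polynomial estimator above, shown only to fix the model: each update adds a signed delta to one frequency, and the sketch never stores the frequency vector explicitly).

```python
import random
import statistics

class AMSF2:
    """AMS sketch: each row keeps z_j = sum_i s_j(i) * f_i with random signs."""
    def __init__(self, n, rows, seed=0):
        rng = random.Random(seed)
        self.signs = [[rng.choice((-1, 1)) for _ in range(n)]
                      for _ in range(rows)]
        self.z = [0] * rows

    def update(self, i, delta):            # turnstile update: f_i += delta
        for j, s in enumerate(self.signs):
            self.z[j] += s[i] * delta

    def estimate(self):                    # each z_j^2 is unbiased for F_2
        return statistics.median(zj * zj for zj in self.z)

sketch = AMSF2(n=8, rows=9, seed=42)
freq = [0] * 8
rng = random.Random(7)
for _ in range(200):
    i, delta = rng.randrange(8), rng.choice((-2, -1, 1, 2))
    sketch.update(i, delta)
    freq[i] += delta
print(sketch.estimate(), sum(f * f for f in freq))  # estimate vs exact F_2
```

For $p > 2$ polynomial space in $n$ is unavoidable, which is where the $n^{1-2/p}$ terms in the bounds above come from.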
On the segmentation and classification of hand radiographs
This research is part of a wider project to build predictive models of bone age using hand radiograph images. We examine ways of finding the outline of a hand from an X-ray as the first stage in segmenting the image into constituent bones. We assess a variety of algorithms, including contouring, which has not previously been used in this context. We introduce a novel ensemble algorithm for combining outlines using two voting schemes, a likelihood ratio test and dynamic time warping (DTW). Our goal is to minimize the human intervention required, hence we investigate alternative ways of training a classifier to determine whether an outline is in fact correct or not. We evaluate outlining and classification on a set of 1370 images. We conclude that ensembling with DTW improves performance of all outlining algorithms, that the contouring algorithm used with the DTW ensemble performs the best of those assessed, and that the most effective classifier of hand outlines assessed is a random forest applied to outlines transformed into principal components.
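Dynamic time warping, used in the ensemble voting above, can be sketched as the standard dynamic program (generic code, not the paper's implementation):

```python
def dtw(a, b):
    """DTW distance between two 1-D sequences with |.| local cost."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: skip in a, skip in b, match both
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: warping absorbs the repeat
```

Because the alignment may stretch one outline against another, DTW tolerates the local speed differences between outlines of the same hand that a pointwise distance would penalize.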
Development of a Low-Cost Optical Sensor to Detect Eutrophication in Irrigation Reservoirs
In irrigation ponds, an excess of nutrients can cause eutrophication, a massive growth of microscopic algae. This can cause various problems in the irrigation infrastructure and should be monitored. In this paper, we present a low-cost sensor based on optical absorption for determining the concentration of algae in irrigation ponds. The sensor is composed of 5 LEDs with different wavelengths and light-dependent resistors as photoreceptors. Data are gathered for the calibration of the prototype from two turbidity sources, sediment and algae, including pure samples and mixed samples. Samples were measured at different concentrations from 15 mg/L to 4000 mg/L. Multiple regression models and artificial neural networks, with a training and a validation phase, are compared as two alternative methods to classify the tested samples. Our results indicate that using multiple regression models, it is possible to estimate the concentration of algae with an average absolute error of 32.0 mg/L and an average relative error of 11.0%. On the other hand, it is possible to classify up to 100% of the samples in the validation phase with the artificial neural network. Thus, a novel prototype capable of distinguishing turbidity sources, and two classification methodologies which can be adapted to different node features, are proposed.
This work is partially funded by the Ministerio de Educacion, Cultura y Deporte through the "Ayudas para contratacion pre-doctoral de Formacion del Profesorado Universitario FPU (Convocatoria 2016)", grant number FPU16/05540, and by the Conselleria de Educacion, Cultura y Deporte through the "Subvenciones para la contratacion de personal investigador en fase postdoctoral", grant number APOSTD/2019/04.
Rocher-Morant, J.; Parra-Boronat, L.; Jimenez, JM.; Lloret, J.; Basterrechea-Chertudi, DA. (2021). Development of a Low-Cost Optical Sensor to Detect Eutrophication in Irrigation Reservoirs. Sensors. 21(22):1-20. https://doi.org/10.3390/s21227637
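The regression-based calibration can be illustrated with a least-squares fit on (attenuation, concentration) pairs. All numbers below are synthetic, and the paper fits multiple regression over five LED channels rather than this single-channel sketch:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# synthetic (attenuation, mg/L) calibration points
atten = [0.05, 0.10, 0.20, 0.40, 0.80]
conc = [15, 60, 250, 1000, 4000]
a, b = fit_line(atten, conc)
pred = [a * x + b for x in atten]
err = sum(abs(p - c) for p, c in zip(pred, conc)) / len(conc)
print(round(err, 1))  # mean absolute calibration error on the fit
```

With the real five-channel data, one regressor per turbidity source lets the node separate sediment from algae before estimating concentration.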
Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search
Retrieval pipelines commonly rely on a term-based search to obtain candidate
records, which are subsequently re-ranked. Some candidates are missed by this
approach, e.g., due to a vocabulary mismatch. We address this issue by
replacing the term-based search with a generic k-NN retrieval algorithm, where
a similarity function can take into account subtle term associations. While an
exact brute-force k-NN search using this similarity function is slow, we
demonstrate that an approximate algorithm can be nearly two orders of magnitude
faster at the expense of only a small loss in accuracy. A retrieval pipeline
using an approximate k-NN search can be more effective and efficient than the
term-based pipeline. This opens up new possibilities for designing effective
retrieval pipelines. Our software (including data-generating code) and
derivative data based on the Stack Overflow collection are available online.
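The exact brute-force k-NN baseline mentioned above amounts to scoring every record against the query (illustrative sketch using cosine similarity; the paper's similarity function is more complex):

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn(query, docs, k):
    """Exact brute force: score every document, keep the top k indices."""
    return heapq.nlargest(k, range(len(docs)),
                          key=lambda i: cosine(query, docs[i]))

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(knn([1.0, 0.05], docs, k=2))  # [0, 1]: the two most similar vectors
```

An approximate index visits only a small fraction of the candidates this loop scores, which is the source of the near two-orders-of-magnitude speedup.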
Hardness of Approximate Nearest Neighbor Search
We prove conditional near-quadratic running time lower bounds for approximate
Bichromatic Closest Pair with Euclidean, Manhattan, Hamming, or edit distance.
Specifically, unless the Strong Exponential Time Hypothesis (SETH) is false,
for every $\delta > 0$ there exists a constant $\epsilon(\delta) > 0$ such that computing a
$(1+\epsilon)$-approximation to the Bichromatic Closest Pair requires $n^{2-\delta}$
time. In particular, this implies a near-linear query time lower bound for
Approximate Nearest Neighbor search with polynomial preprocessing time.
Our reduction uses the Distributed PCP framework of [ARW'17], but obtains
improved efficiency using Algebraic Geometry (AG) codes. Efficient PCPs from AG
codes have been constructed in other settings before [BKKMS'16, BCGRS'17], but
our construction is the first to yield new hardness results.
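The lower bound says the naive quadratic algorithm is essentially optimal under SETH, even for approximate answers; the exact baseline it benchmarks against looks like this (illustrative sketch, Hamming distance):

```python
def bcp(A, B):
    """Exact Bichromatic Closest Pair in O(|A| * |B| * d) time.

    Returns (distance, a, b) for the closest a in A, b in B
    under Hamming distance.
    """
    best = None
    for a in A:
        for b in B:
            dist = sum(x != y for x, y in zip(a, b))
            if best is None or dist < best[0]:
                best = (dist, a, b)
    return best

A = [(0, 0, 1, 1), (1, 1, 1, 1)]
B = [(0, 1, 1, 1), (0, 0, 0, 0)]
print(bcp(A, B))  # (1, (0, 0, 1, 1), (0, 1, 1, 1))
```

Beating the double loop by a polynomial factor, even with a $(1+\epsilon)$ slack on the answer, is exactly what the reduction rules out under SETH.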
The prevalence of axial spondyloarthritis in the UK: a cross-sectional cohort study
Background: Accurate prevalence data are important when interpreting diagnostic tests and planning for the health needs of a population, yet no such data exist for axial spondyloarthritis (axSpA) in the UK. In this cross-sectional cohort study we aimed to estimate the prevalence of axSpA in a UK primary care population.
Methods: A validated self-completed questionnaire was used to screen primary care patients with low back pain for inflammatory back pain (IBP). Patients with a verifiable pre-existing diagnosis of axSpA were included as positive cases. All other patients meeting the Assessment of SpondyloArthritis international Society (ASAS) IBP criteria were invited to undergo further assessment, including MRI scanning, allowing classification according to the European Spondyloarthropathy Study Group (ESSG) and ASAS axSpA criteria, and the modified New York (mNY) criteria for ankylosing spondylitis (AS).
Results: Of 978 questionnaires sent to potential participants, 505 were returned (response rate 51.6 %). Six subjects had a prior diagnosis of axSpA, 4 of whom met mNY criteria. Thirty-eight of 75 subjects meeting ASAS IBP criteria attended review (mean age 53.5 years, 37 % male). The number of subjects satisfying classification criteria was 23 for ESSG, 3 for ASAS (2 clinical, 1 radiological) and 1 for mNY criteria. This equates to a prevalence of 5.3 % (95 % CI 4.0, 6.8) using ESSG, 1.3 % (95 % CI 0.8, 2.3) using ASAS, and 0.66 % (95 % CI 0.28, 1.3) using mNY criteria in chronic back pain patients, and 1.2 % (95 % CI 0.9, 1.4) using ESSG, 0.3 % (95 % CI 0.13, 0.48) using ASAS, and 0.15 % (95 % CI 0.02, 0.27) using mNY criteria in the general adult primary care population.
Conclusions: These are the first prevalence estimates for axSpA in the UK, and will be of importance in planning for the future healthcare needs of this population.
Trial registration: Current Controlled Trials ISRCTN7687321
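As an illustration of interval estimation for prevalence figures like those above, here is a Wilson score interval for a binomial proportion (hedged: the study's CI method is not stated in the abstract, and its population-level estimates involve weighting, so these values need not match the quoted intervals):

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score 95% interval for observing k successes in n trials."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_ci(4, 505)   # e.g. 4 mNY cases among 505 responders
print(f"{100 * lo:.2f}%-{100 * hi:.2f}%")
```

The Wilson interval behaves better than the normal approximation when the count is small, which matters here since several criteria were met by only a handful of subjects.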