9,240 research outputs found
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
SANNS: Scaling Up Secure Approximate k-Nearest Neighbors Search
The -Nearest Neighbor Search (-NNS) is the backbone of several
cloud-based services such as recommender systems, face recognition, and
database search on text and images. In these services, the client sends the
query to the cloud server and receives the response in which case the query and
response are revealed to the service provider. Such data disclosures are
unacceptable in several scenarios due to the sensitivity of data and/or privacy
laws.
In this paper, we introduce SANNS, a system for secure -NNS that keeps
client's query and the search result confidential. SANNS comprises two
protocols: an optimized linear scan and a protocol based on a novel sublinear
time clustering-based algorithm. We prove the security of both protocols in the
standard semi-honest model. The protocols are built upon several
state-of-the-art cryptographic primitives such as lattice-based additively
homomorphic encryption, distributed oblivious RAM, and garbled circuits. We
provide several contributions to each of these primitives which are applicable
to other secure computation tasks. Both of our protocols rely on a new circuit
for the approximate top- selection from numbers that is built from comparators.
We have implemented our proposed system and performed extensive experimental
results on four datasets in two different computation environments,
demonstrating more than faster response time compared to
optimally implemented protocols from the prior work. Moreover, SANNS is the
first work that scales to the database of 10 million entries, pushing the limit
by more than two orders of magnitude.Comment: 18 pages, to appear at USENIX Security Symposium 202
Efficient Match Pair Retrieval for Large-scale UAV Images via Graph Indexed Global Descriptor
SfM (Structure from Motion) has been extensively used for UAV (Unmanned
Aerial Vehicle) image orientation. Its efficiency is directly influenced by
feature matching. Although image retrieval has been extensively used for match
pair selection, high computational costs are consumed due to a large number of
local features and the large size of the used codebook. Thus, this paper
proposes an efficient match pair retrieval method and implements an integrated
workflow for parallel SfM reconstruction. First, an individual codebook is
trained online by considering the redundancy of UAV images and local features,
which avoids the ambiguity of training codebooks from other datasets. Second,
local features of each image are aggregated into a single high-dimension global
descriptor through the VLAD (Vector of Locally Aggregated Descriptors)
aggregation by using the trained codebook, which remarkably reduces the number
of features and the burden of nearest neighbor searching in image indexing.
Third, the global descriptors are indexed via the HNSW (Hierarchical Navigable
Small World) based graph structure for the nearest neighbor searching. Match
pairs are then retrieved by using an adaptive threshold selection strategy and
utilized to create a view graph for divide-and-conquer based parallel SfM
reconstruction. Finally, the performance of the proposed solution has been
verified using three large-scale UAV datasets. The test results demonstrate
that the proposed solution accelerates match pair retrieval with a speedup
ratio ranging from 36 to 108 and improves the efficiency of SfM reconstruction
with competitive accuracy in both relative and absolute orientation
Evaluating the Differences of Gridding Techniques for Digital Elevation Models Generation and Their Influence on the Modeling of Stony Debris Flows Routing: A Case Study From Rovina di Cancia Basin (North-Eastern Italian Alps)
Debris \ufb02ows are among the most hazardous phenomena in mountain areas. To cope
with debris \ufb02ow hazard, it is common to delineate the risk-prone areas through
routing models. The most important input to debris \ufb02ow routing models are the
topographic data, usually in the form of Digital Elevation Models (DEMs). The quality
of DEMs depends on the accuracy, density, and spatial distribution of the sampled
points; on the characteristics of the surface; and on the applied gridding methodology.
Therefore, the choice of the interpolation method affects the realistic representation
of the channel and fan morphology, and thus potentially the debris \ufb02ow routing
modeling outcomes. In this paper, we initially investigate the performance of common
interpolation methods (i.e., linear triangulation, natural neighbor, nearest neighbor,
Inverse Distance to a Power, ANUDEM, Radial Basis Functions, and ordinary kriging)
in building DEMs with the complex topography of a debris \ufb02ow channel located
in the Venetian Dolomites (North-eastern Italian Alps), by using small footprint full-
waveform Light Detection And Ranging (LiDAR) data. The investigation is carried
out through a combination of statistical analysis of vertical accuracy, algorithm
robustness, and spatial clustering of vertical errors, and multi-criteria shape reliability
assessment. After that, we examine the in\ufb02uence of the tested interpolation algorithms
on the performance of a Geographic Information System (GIS)-based cell model for
simulating stony debris \ufb02ows routing. In detail, we investigate both the correlation
between the DEMs heights uncertainty resulting from the gridding procedure and
that on the corresponding simulated erosion/deposition depths, both the effect of
interpolation algorithms on simulated areas, erosion and deposition volumes, solid-liquid
discharges, and channel morphology after the event. The comparison among the tested
interpolation methods highlights that the ANUDEM and ordinary kriging algorithms
are not suitable for building DEMs with complex topography. Conversely, the linear
triangulation, the natural neighbor algorithm, and the thin-plate spline plus tension and completely regularized spline functions ensure the best trade-off among accuracy
and shape reliability. Anyway, the evaluation of the effects of gridding techniques on
debris \ufb02ow routing modeling reveals that the choice of the interpolation algorithm does
not signi\ufb01cantly affect the model outcomes
- …