Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
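The coarse-then-fine search idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation (Ammolite, MICA, and esFragBag each use domain-specific distances and clustering). The function names, the greedy clustering rule, and the Euclidean distance are all assumptions chosen for illustration. Points are grouped into covering hyperspheres of a fixed radius. A query first checks only the sphere centers, then scans the members of spheres that the triangle inequality cannot rule out.

```python
import math

def dist(a, b):
    # Euclidean distance; any true metric would work for the pruning argument.
    return math.dist(a, b)

def build_clusters(points, cluster_radius):
    """Greedy covering: a point joins the first center within cluster_radius,
    otherwise it becomes a new center (illustrative rule, an assumption)."""
    centers = []
    members = {}
    for p in points:
        for i, c in enumerate(centers):
            if dist(p, c) <= cluster_radius:
                members[i].append(p)
                break
        else:
            centers.append(p)
            members[len(centers) - 1] = [p]
    return centers, members

def range_search(query, radius, centers, members, cluster_radius):
    """Return every point within `radius` of `query`.

    Coarse step: a cluster can hold a hit only if its center lies within
    radius + cluster_radius of the query (triangle inequality).
    Fine step: scan only the members of the surviving clusters.
    """
    hits = []
    for i, c in enumerate(centers):
        if dist(query, c) <= radius + cluster_radius:
            hits.extend(p for p in members[i] if dist(query, p) <= radius)
    return hits

if __name__ == "__main__":
    pts = [(float(x), float(y)) for x in range(10) for y in range(10)]
    centers, members = build_clusters(pts, cluster_radius=2.0)
    print(len(centers), "centers cover", len(pts), "points")
    print(range_search((3.2, 4.1), 1.5, centers, members, 2.0))
```

In this sketch the coarse step costs one distance computation per center, so the work grows with the number of covering spheres rather than with the number of points, provided few spheres survive the coarse filter.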
Lower Bounds on the Oracle Complexity of Nonsmooth Convex Optimization via Information Theory
We present an information-theoretic approach to lower bound the oracle
complexity of nonsmooth black box convex optimization, unifying previous lower
bounding techniques by identifying a combinatorial problem, namely string
guessing, as a single source of hardness. As a measure of complexity we use
distributional oracle complexity, which subsumes randomized oracle complexity
as well as worst-case oracle complexity. We obtain strong lower bounds on
distributional oracle complexity for the box $[-1,1]^n$, as well as for the
$L^p$-ball for $p \geq 1$ (for both low-scale and large-scale regimes),
matching worst-case upper bounds, and hence we close the gap between
distributional complexity, and in particular, randomized complexity, and
worst-case complexity. Furthermore, the bounds remain essentially the same for
high-probability and bounded-error oracle complexity, and even for the
combination of the two, i.e., bounded-error high-probability oracle complexity.
This considerably extends the applicability of known bounds.
Design of Combined Coverage Area Reporting and Geo-casting of Queries for Wireless Sensor Networks
To deal efficiently with queries and other location-dependent information, the wireless sensor network must inform gateways which geographical area is serviced by which gateway. The gateways can then, for example, efficiently route queries that are valid only in particular regions of the deployment. The proposed algorithms combine coverage area reporting with geographical routing of queries injected by gateways.
Approximate Closest Community Search in Networks
Recently, there has been significant interest in the study of the community
search problem in social and information networks: given one or more query
nodes, find densely connected communities containing the query nodes. However,
most existing studies do not address the "free rider" issue, that is, nodes far
away from query nodes and irrelevant to them are included in the detected
community. Some state-of-the-art models have attempted to address this issue,
but not only are their formulated problems NP-hard, they also do not admit any
approximations without restrictive assumptions, which may not always hold in
practice.
In this paper, given an undirected graph G and a set of query nodes Q, we
study community search using the k-truss based community model. We formulate
the problem of finding a closest truss community (CTC) as finding a connected
k-truss subgraph with the largest k that contains Q, and has the minimum
diameter among such subgraphs. We prove this problem is NP-hard. Furthermore,
it is NP-hard to approximate the problem within a factor $(2-\varepsilon)$, for
any $\varepsilon > 0$. However, we develop a greedy algorithmic framework,
which first finds a CTC containing Q, and then iteratively removes from the
graph the nodes furthest from Q. The method achieves a 2-approximation to the
optimal solution. To further improve the efficiency, we make use of a compact
truss index and develop efficient algorithms for k-truss identification and
maintenance as nodes get eliminated. In addition, using bulk deletion
optimization and local exploration strategies, we propose two more efficient
algorithms. One of them trades some approximation quality for efficiency while
the other is a very efficient heuristic. Extensive experiments on 6 real-world
networks show the effectiveness and efficiency of our community model and
search algorithms.
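The greedy framework outlined in this abstract can be sketched in simplified form. This is an illustration under stated assumptions, not the authors' algorithm: it takes k as a given parameter instead of searching for the largest k, recomputes the k-truss from scratch instead of using a truss index, and omits the bulk deletion and local exploration optimizations. The function names and the adjacency-dict graph representation are hypothetical. Nodes are assumed to be mutually comparable (for example, integers).

```python
from collections import deque

def k_truss(adj, k):
    """Keep only edges that lie in at least k-2 triangles; drop isolated nodes."""
    g = {u: set(vs) for u, vs in adj.items()}
    changed = True
    while changed:
        changed = False
        for u in list(g):
            for v in list(g[u]):
                if u < v and len(g[u] & g[v]) < k - 2:
                    g[u].discard(v)
                    g[v].discard(u)
                    changed = True
    return {u: vs for u, vs in g.items() if vs}

def bfs_dist(g, src):
    """Hop distances from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def component_with(g, Q):
    """Connected component containing all of Q, or None if Q is split or missing."""
    q0 = next(iter(Q))
    if q0 not in g:
        return None
    reach = bfs_dist(g, q0)
    if not all(q in reach for q in Q):
        return None
    return {u: set(g[u]) for u in reach}

def query_distance(g, Q):
    """For each node, its largest hop distance to any query node."""
    far = {u: 0 for u in g}
    for q in Q:
        d = bfs_dist(g, q)
        for u in g:
            far[u] = max(far[u], d[u])
    return far

def greedy_ctc(adj, Q, k):
    """Shrink a connected k-truss around Q by repeatedly deleting the non-query
    node furthest from Q; return the intermediate graph of smallest radius."""
    Q = set(Q)
    g = component_with(k_truss(adj, k), Q)
    best, best_radius = None, None
    while g is not None:
        far = query_distance(g, Q)
        radius = max(far.values())
        if best is None or radius < best_radius:
            best, best_radius = g, radius
        candidates = [u for u in g if u not in Q]
        if not candidates:
            break
        victim = max(candidates, key=far.get)
        pruned = {u: vs - {victim} for u, vs in g.items() if u != victim}
        g = component_with(k_truss(pruned, k), Q)
    return best
```

For example, on two 4-cliques that share one node, calling `greedy_ctc` with a query node in the first clique and k = 4 discards the second clique, because deleting any of its far nodes breaks the 4-truss condition for the rest of that clique.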