2,732 research outputs found
Memory-Constrained Algorithms for Simple Polygons
A constant-workspace algorithm has read-only access to an input array and may
use only O(1) additional words of O(log n) bits, where n is the size of
the input. We assume that a simple n-gon is given by the ordered sequence of
its vertices. We show that we can find a triangulation of a plane straight-line
graph in O(n^2) time. We also consider preprocessing a simple polygon for
shortest path queries when the space constraint is relaxed to allow s words
of working space. After a preprocessing of O(n^2) time, we are able to solve
shortest path queries between any two points inside the polygon in O(n^2/s)
time.
Comment: Preprint appeared in EuroCG 201
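The read-only model above can be illustrated with a minimal sketch (not the paper's algorithm): the polygon's vertex array is never modified, and the routine keeps only a constant number of extra words. Finding a lexicographically extreme vertex, which is always convex and hence a natural starting point for triangulation, fits the model; the function name and test polygon are illustrative assumptions.

```python
# Constant-workspace model: input array is read-only; the algorithm
# uses only O(1) extra words (here: one best-index and one loop counter).

def extreme_vertex(poly):
    """Index of the lexicographically smallest vertex of poly."""
    best = 0                       # one word of workspace
    for i in range(1, len(poly)):  # one more word for the counter
        if poly[i] < poly[best]:
            best = i
    return best

square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(extreme_vertex(square))  # -> 0
```

Because no auxiliary array is ever allocated, the space bound holds regardless of the input size; the time cost is a single linear scan.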
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 box
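The coarse-then-fine idea behind entropy-scaling search can be sketched as follows (a hedged illustration of the covering-hypersphere principle, not the code of Ammolite, MICA, or esFragBag; the 1-D metric and function names are assumptions). The dataset is covered by metric balls of radius r around representatives; a radius-q query then scans only clusters whose representative lies within q + r of the query, which the triangle inequality shows is safe.

```python
# Two-level similarity search over a metric space: greedy covering by
# radius-r balls, then pruned range queries via the triangle inequality.

def cover(points, r, dist):
    """Greedily cover points; returns a list of (representative, members)."""
    clusters = []
    for p in points:
        for rep, members in clusters:
            if dist(p, rep) <= r:   # p fits in an existing ball
                members.append(p)
                break
        else:
            clusters.append((p, [p]))  # p becomes a new representative
    return clusters

def range_search(clusters, query, q, r, dist):
    """All points within distance q of query, scanning few clusters."""
    hits = []
    for rep, members in clusters:
        if dist(query, rep) <= q + r:  # triangle-inequality prune
            hits.extend(p for p in members if dist(query, p) <= q)
    return hits

d = lambda a, b: abs(a - b)
data = [0.1, 0.2, 0.25, 5.0, 5.1, 9.7]
cl = cover(data, r=0.5, dist=d)
print(sorted(range_search(cl, 0.22, q=0.1, r=0.5, dist=d)))  # -> [0.2, 0.25]
```

The number of representative comparisons scales with the metric entropy (how many balls cover the data), which is what makes the approach fast when the fractal dimension is low.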
Composite repetition-aware data structures
In highly repetitive strings, like collections of genomes from the same
species, distinct measures of repetition all grow sublinearly in the length of
the text, and indexes targeted to such strings typically depend only on one of
these measures. We describe two data structures whose size depends on multiple
measures of repetition at once, and that provide competitive tradeoffs between
the time for counting and reporting all the exact occurrences of a pattern, and
the space taken by the structure. The key component of our constructions is the
run-length encoded BWT (RLBWT), which takes space proportional to the number of
BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it
with data structures from LZ77 indexes, which take space proportional to the
number of LZ77 factors, and with the compact directed acyclic word graph
(CDAWG), which takes space proportional to the number of extensions of maximal
repeats. The combination of CDAWG and RLBWT enables also a new representation
of the suffix tree, whose size depends again on the number of extensions of
maximal repeats, and that is powerful enough to support matching statistics and
constant-space traversal.
Comment: (the name of the third co-author was inadvertently omitted from
previous version)
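The key size measure above, the number of BWT runs, is easy to see in a minimal sketch (naive O(n^2 log n) construction for illustration only; real RLBWT indexes build the BWT in near-linear time and add much more machinery).

```python
# Run-length encoded BWT (RLBWT) in miniature: the BWT of a repetitive
# text has few equal-letter runs, so (letter, run-length) pairs take
# space proportional to the number of runs, not the text length.

def bwt(text):
    """Last column of the sorted rotations of text + '$'."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def rle(s):
    """Run-length encoding as (char, length) pairs."""
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [(c, n) for c, n in runs]

repetitive = "GATTACA" * 8
runs = rle(bwt(repetitive))
print(len(repetitive) + 1, len(runs))  # text length vs. number of BWT runs
```

On the repeated string the run count stays far below the text length, which is exactly the gap the composite structures in the abstract exploit, combining the RLBWT with LZ77- and CDAWG-based components instead of suffix array samples.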
The Future of Computation
``The purpose of life is to obtain knowledge, use it to live with as much
satisfaction as possible, and pass it on with improvements and modifications to
the next generation.'' This may sound philosophical, and the interpretation of
words may be subjective, yet it is fairly clear that this is what all living
organisms--from bacteria to human beings--do in their lifetime. Indeed, this
can be adopted as the information theoretic definition of life. Over billions
of years, biological evolution has experimented with a wide range of physical
systems for acquiring, processing and communicating information. We are now in
a position to make the principles behind these systems mathematically precise,
and then extend them as far as laws of physics permit. Therein lies the future
of computation, of ourselves, and of life.
Comment: 7 pages, Revtex. Invited lecture at the Workshop on Quantum
Information, Computation and Communication (QICC-2005), IIT Kharagpur, India,
February 200
Quasi-Chemical Theory and Implicit Solvent Models for Simulations
A statistical thermodynamic development is given of a new implicit solvent
model that avoids the traditional system size limitations of computer
simulation of macromolecular solutions with periodic boundary conditions. This
implicit solvent model is based upon the quasi-chemical approach, distinct from
the common integral equation trunk of the theory of liquid solutions. The
physical content of this theory is the hypothesis that a small set of solvent
molecules are decisive for these solvation problems. A detailed derivation of
the quasi-chemical theory escorts the development of this proposal. The
numerical application of the quasi-chemical treatment to Li+ ion hydration
in liquid water is used to motivate and exemplify the quasi-chemical theory.
Those results underscore the fact that the quasi-chemical approach refines the
path for utilization of ion-water cluster results for the statistical
thermodynamics of solutions.
Comment: 30 pages, contribution to Santa Fe Workshop on Treatment of
Electrostatic Interactions in Computer Simulation of Condensed Media