2,732 research outputs found
Memory-Constrained Algorithms for Simple Polygons
A constant-workspace algorithm has read-only access to an input array and may
use only O(1) additional words of O(log n) bits, where n is the size of
the input. We assume that a simple n-gon is given by the ordered sequence of
its vertices. We show that we can find a triangulation of a plane straight-line
graph in O(n^2) time. We also consider preprocessing a simple polygon for
shortest path queries when the space constraint is relaxed to allow s words
of working space. After a preprocessing of O(n^2) time, we are able to solve
shortest path queries between any two points inside the polygon in O(n^2/s)
time.
Comment: Preprint appeared in EuroCG 201
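The read-only model above can be illustrated with a minimal sketch (not the paper's algorithm): the polygon's vertex array is never modified, and the routine keeps only a constant number of extra words. Finding a lexicographically extreme vertex, which is always convex and hence a natural starting point for triangulation, fits the model; the function name and test polygon are illustrative assumptions.

```python
# Constant-workspace model: input array is read-only; the algorithm
# uses only O(1) extra words (here: one best-index and one loop counter).

def extreme_vertex(poly):
    """Index of the lexicographically smallest vertex of poly."""
    best = 0                       # one word of workspace
    for i in range(1, len(poly)):  # one more word for the counter
        if poly[i] < poly[best]:
            best = i
    return best

square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(extreme_vertex(square))  # -> 0
```

Because no auxiliary array is ever allocated, the space bound holds regardless of the input size; the time cost is a single linear scan.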
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 box
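The coarse-then-fine idea behind entropy-scaling search can be sketched as follows (a hedged illustration of the covering-hypersphere principle, not the code of Ammolite, MICA, or esFragBag; the 1-D metric and function names are assumptions). The dataset is covered by metric balls of radius r around representatives; a radius-q query then scans only clusters whose representative lies within q + r of the query, which the triangle inequality shows is safe.

```python
# Two-level similarity search over a metric space: greedy covering by
# radius-r balls, then pruned range queries via the triangle inequality.

def cover(points, r, dist):
    """Greedily cover points; returns a list of (representative, members)."""
    clusters = []
    for p in points:
        for rep, members in clusters:
            if dist(p, rep) <= r:   # p fits in an existing ball
                members.append(p)
                break
        else:
            clusters.append((p, [p]))  # p becomes a new representative
    return clusters

def range_search(clusters, query, q, r, dist):
    """All points within distance q of query, scanning few clusters."""
    hits = []
    for rep, members in clusters:
        if dist(query, rep) <= q + r:  # triangle-inequality prune
            hits.extend(p for p in members if dist(query, p) <= q)
    return hits

d = lambda a, b: abs(a - b)
data = [0.1, 0.2, 0.25, 5.0, 5.1, 9.7]
cl = cover(data, r=0.5, dist=d)
print(sorted(range_search(cl, 0.22, q=0.1, r=0.5, dist=d)))  # -> [0.2, 0.25]
```

The number of representative comparisons scales with the metric entropy (how many balls cover the data), which is what makes the approach fast when the fractal dimension is low.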
Composite repetition-aware data structures
In highly repetitive strings, like collections of genomes from the same
species, distinct measures of repetition all grow sublinearly in the length of
the text, and indexes targeted to such strings typically depend only on one of
these measures. We describe two data structures whose size depends on multiple
measures of repetition at once, and that provide competitive tradeoffs between
the time for counting and reporting all the exact occurrences of a pattern, and
the space taken by the structure. The key component of our constructions is the
run-length encoded BWT (RLBWT), which takes space proportional to the number of
BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it
with data structures from LZ77 indexes, which take space proportional to the
number of LZ77 factors, and with the compact directed acyclic word graph
(CDAWG), which takes space proportional to the number of extensions of maximal
repeats. The combination of CDAWG and RLBWT enables also a new representation
of the suffix tree, whose size depends again on the number of extensions of
maximal repeats, and that is powerful enough to support matching statistics and
constant-space traversal.
Comment: (the name of the third co-author was inadvertently omitted from
previous version)
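The key size measure above, the number of BWT runs, is easy to see in a minimal sketch (naive O(n^2 log n) construction for illustration only; real RLBWT indexes build the BWT in near-linear time and add much more machinery).

```python
# Run-length encoded BWT (RLBWT) in miniature: the BWT of a repetitive
# text has few equal-letter runs, so (letter, run-length) pairs take
# space proportional to the number of runs, not the text length.

def bwt(text):
    """Last column of the sorted rotations of text + '$'."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def rle(s):
    """Run-length encoding as (char, length) pairs."""
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [(c, n) for c, n in runs]

repetitive = "GATTACA" * 8
runs = rle(bwt(repetitive))
print(len(repetitive) + 1, len(runs))  # text length vs. number of BWT runs
```

On the repeated string the run count stays far below the text length, which is exactly the gap the composite structures in the abstract exploit, combining the RLBWT with LZ77- and CDAWG-based components instead of suffix array samples.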
The Future of Computation
``The purpose of life is to obtain knowledge, use it to live with as much
satisfaction as possible, and pass it on with improvements and modifications to
the next generation.'' This may sound philosophical, and the interpretation of
words may be subjective, yet it is fairly clear that this is what all living
organisms--from bacteria to human beings--do in their lifetime. Indeed, this
can be adopted as the information theoretic definition of life. Over billions
of years, biological evolution has experimented with a wide range of physical
systems for acquiring, processing and communicating information. We are now in
a position to make the principles behind these systems mathematically precise,
and then extend them as far as laws of physics permit. Therein lies the future
of computation, of ourselves, and of life.
Comment: 7 pages, Revtex. Invited lecture at the Workshop on Quantum
Information, Computation and Communication (QICC-2005), IIT Kharagpur, India,
February 200
Quasi-Chemical Theory and Implicit Solvent Models for Simulations
A statistical thermodynamic development is given of a new implicit solvent
model that avoids the traditional system size limitations of computer
simulation of macromolecular solutions with periodic boundary conditions. This
implicit solvent model is based upon the quasi-chemical approach, distinct from
the common integral equation trunk of the theory of liquid solutions. The
physical content of this theory is the hypothesis that a small set of solvent
molecules are decisive for these solvation problems. A detailed derivation of
the quasi-chemical theory escorts the development of this proposal. The
numerical application of the quasi-chemical treatment to Li+ ion hydration
in liquid water is used to motivate and exemplify the quasi-chemical theory.
Those results underscore the fact that the quasi-chemical approach refines the
path for utilization of ion-water cluster results for the statistical
thermodynamics of solutions.
Comment: 30 pages, contribution to Santa Fe Workshop on Treatment of
Electrostatic Interactions in Computer Simulation of Condensed Media