Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
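The coarse-then-fine idea behind searching a covering of the data can be illustrated with a minimal sketch. It is not the implementation used in Ammolite, MICA, or esFragBag; the function names (`build_clusters`, `range_search`), the greedy covering, and the use of plain Euclidean distance are all assumptions made for illustration. The sketch relies only on the triangle inequality: if a point lies within `radius` of the query and within `cluster_radius` of its cluster center, the center must lie within `radius + cluster_radius` of the query, so all other clusters can be skipped.

```python
import math


def build_clusters(points, cluster_radius):
    """Greedy cover (illustrative assumption): each point joins the first
    existing center within cluster_radius, otherwise it becomes a new center."""
    clusters = []  # list of (center, members) pairs
    for p in points:
        for center, members in clusters:
            if math.dist(p, center) <= cluster_radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters


def range_search(clusters, cluster_radius, query, radius):
    """Return every stored point within `radius` of `query`."""
    hits = []
    for center, members in clusters:
        # Coarse step: by the triangle inequality, a cluster can contain a
        # hit only if its center is within radius + cluster_radius.
        if math.dist(query, center) <= radius + cluster_radius:
            # Fine step: scan only the members of surviving clusters.
            hits.extend(m for m in members if math.dist(query, m) <= radius)
    return hits


if __name__ == "__main__":
    pts = [(float(i), float(j)) for i in range(10) for j in range(10)]
    cl = build_clusters(pts, 1.5)
    print(len(cl), "clusters;", len(range_search(cl, 1.5, (4.2, 4.7), 2.0)), "hits")
```

In this toy setting the coarse step costs one distance evaluation per cluster, so the work grows with the number of covering balls rather than with the number of points, provided few clusters survive the coarse filter.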
Euclidean distance geometry and applications
Euclidean distance geometry is the study of Euclidean geometry based on the
concept of distance. This is useful in several applications where the input
data consists of an incomplete set of distances, and the output is a set of
points in Euclidean space that realizes the given distances. We survey some of
the theory of Euclidean distance geometry and some of the most important
applications: molecular conformation, localization of sensor networks and
statics.
Comment: 64 pages, 21 figure
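As a small illustration of what "realizing" a set of distances means, the sketch below places points in the plane from a complete distance matrix. It covers only the easy case and is not taken from the survey: it assumes every pairwise distance is known, that the distances are exactly realizable in two dimensions, and that the first three points are not collinear. The function name `realize_in_plane` is hypothetical. The incomplete-distance problems described in the abstract are substantially harder.

```python
import math


def realize_in_plane(D):
    """Return 2D coordinates whose pairwise distances reproduce D.

    Assumptions (illustrative only): D is a complete, symmetric matrix given
    as a list of lists, n >= 3, the distances are exactly realizable in the
    plane, and points 0, 1, 2 are not collinear.
    """
    n = len(D)
    d01 = D[0][1]
    # Fix the frame: point 0 at the origin, point 1 on the positive x-axis.
    coords = [(0.0, 0.0), (d01, 0.0)]
    for k in range(2, n):
        # Subtracting the two circle equations around points 0 and 1
        # gives x directly; y is then determined up to sign.
        x = (D[0][k] ** 2 - D[1][k] ** 2 + d01 ** 2) / (2.0 * d01)
        y = math.sqrt(max(D[0][k] ** 2 - x ** 2, 0.0))
        if k > 2:
            # Point 2 (placed with y >= 0) resolves the reflection ambiguity.
            x2, y2 = coords[2]
            err_pos = abs(math.hypot(x - x2, y - y2) - D[2][k])
            err_neg = abs(math.hypot(x - x2, -y - y2) - D[2][k])
            if err_neg < err_pos:
                y = -y
        coords.append((x, y))
    return coords


if __name__ == "__main__":
    original = [(0.0, 0.0), (3.0, 0.0), (1.0, 2.0), (4.0, -1.0), (-2.0, 5.0)]
    D = [[math.dist(p, q) for q in original] for p in original]
    for point in realize_in_plane(D):
        print(point)
```

The recovered coordinates are unique only up to a rigid motion and reflection; the sketch removes that freedom by pinning the first two points and requiring the third to have a non-negative y-coordinate.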
Gains in Power from Structured Two-Sample Tests of Means on Graphs
We consider multivariate two-sample tests of means, where the location shift
between the two populations is expected to be related to a known graph
structure. An important application of such tests is the detection of
differentially expressed genes between two patient populations, as shifts in
expression levels are expected to be coherent with the structure of graphs
reflecting gene properties such as biological process, molecular function,
regulation, or metabolism. For a fixed graph of interest, we demonstrate that
accounting for graph structure can yield more powerful tests under the
assumption of smooth distribution shift on the graph. We also investigate the
identification of non-homogeneous subgraphs of a given large graph, which poses
both computational and multiple testing problems. The relevance and benefits of
the proposed approach are illustrated on synthetic data and on breast cancer
gene expression data analyzed in the context of KEGG pathways.
Basic Understanding of Condensed Phases of Matter via Packing Models
Packing problems have been a source of fascination for millennia and their
study has produced a rich literature that spans numerous disciplines.
Investigations of hard-particle packing models have provided basic insights
into the structure and bulk properties of condensed phases of matter, including
low-temperature states (e.g., molecular and colloidal liquids, crystals and
glasses), multiphase heterogeneous media, granular media, and biological
systems. The densest packings are of great interest in pure mathematics,
including discrete geometry and number theory. This perspective reviews
pertinent theoretical and computational literature concerning the equilibrium,
metastable and nonequilibrium packings of hard particles in various
Euclidean space dimensions. In the case of jammed packings, emphasis will be
placed on the "geometric-structure" approach, which provides a powerful and
unified means to quantitatively characterize individual packings via jamming
categories and "order" maps. It incorporates extremal jammed states, including
the densest packings, maximally random jammed states, and lowest-density jammed
structures. Packings of identical spheres, spheres with a size distribution,
and nonspherical particles are also surveyed. We close this review by
identifying challenges and open questions for future research.
Comment: 33 pages, 20 figures, Invited "Perspective" submitted to the Journal
of Chemical Physics. arXiv admin note: text overlap with arXiv:1008.298
A geometric method for model reduction of biochemical networks with polynomial rate functions
Model reduction of biochemical networks relies on the knowledge of slow and
fast variables. We provide a geometric method, based on the Newton polytope, to
identify slow variables of a biochemical network with polynomial rate
functions. The gist of the method is the notion of tropical equilibration that
provides approximate descriptions of slow invariant manifolds. Compared to
extant numerical algorithms such as the intrinsic low dimensional manifold
method, our approach is symbolic and utilizes orders of magnitude instead of
precise values of the model parameters. Application of this method to a large
collection of biochemical network models supports the idea that the number of
dynamical variables in minimal models of cell physiology can be small, in spite
of the large number of molecular regulatory actors.