21,107 research outputs found

    Entropy-scaling search of massive biological data

    Get PDF
    Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo

    Euclidean distance geometry and applications

    Full text link
    Euclidean distance geometry is the study of Euclidean geometry based on the concept of distance. This is useful in several applications where the input data consists of an incomplete set of distances, and the output is a set of points in Euclidean space that realizes the given distances. We survey some of the theory of Euclidean distance geometry and some of the most important applications: molecular conformation, localization of sensor networks and statics.Comment: 64 pages, 21 figure

    Gains in Power from Structured Two-Sample Tests of Means on Graphs

    Get PDF
    We consider multivariate two-sample tests of means, where the location shift between the two populations is expected to be related to a known graph structure. An important application of such tests is the detection of differentially expressed genes between two patient populations, as shifts in expression levels are expected to be coherent with the structure of graphs reflecting gene properties such as biological process, molecular function, regulation, or metabolism. For a fixed graph of interest, we demonstrate that accounting for graph structure can yield more powerful tests under the assumption of smooth distribution shift on the graph. We also investigate the identification of non-homogeneous subgraphs of a given large graph, which poses both computational and multiple testing problems. The relevance and benefits of the proposed approach are illustrated on synthetic data and on breast cancer gene expression data analyzed in context of KEGG pathways

    Basic Understanding of Condensed Phases of Matter via Packing Models

    Full text link
    Packing problems have been a source of fascination for millenia and their study has produced a rich literature that spans numerous disciplines. Investigations of hard-particle packing models have provided basic insights into the structure and bulk properties of condensed phases of matter, including low-temperature states (e.g., molecular and colloidal liquids, crystals and glasses), multiphase heterogeneous media, granular media, and biological systems. The densest packings are of great interest in pure mathematics, including discrete geometry and number theory. This perspective reviews pertinent theoretical and computational literature concerning the equilibrium, metastable and nonequilibrium packings of hard-particle packings in various Euclidean space dimensions. In the case of jammed packings, emphasis will be placed on the "geometric-structure" approach, which provides a powerful and unified means to quantitatively characterize individual packings via jamming categories and "order" maps. It incorporates extremal jammed states, including the densest packings, maximally random jammed states, and lowest-density jammed structures. Packings of identical spheres, spheres with a size distribution, and nonspherical particles are also surveyed. We close this review by identifying challenges and open questions for future research.Comment: 33 pages, 20 figures, Invited "Perspective" submitted to the Journal of Chemical Physics. arXiv admin note: text overlap with arXiv:1008.298

    A geometric method for model reduction of biochemical networks with polynomial rate functions

    Full text link
    Model reduction of biochemical networks relies on the knowledge of slow and fast variables. We provide a geometric method, based on the Newton polytope, to identify slow variables of a biochemical network with polynomial rate functions. The gist of the method is the notion of tropical equilibration that provides approximate descriptions of slow invariant manifolds. Compared to extant numerical algorithms such as the intrinsic low dimensional manifold method, our approach is symbolic and utilizes orders of magnitude instead of precise values of the model parameters. Application of this method to a large collection of biochemical network models supports the idea that the number of dynamical variables in minimal models of cell physiology can be small, in spite of the large number of molecular regulatory actors
    • …
    corecore