Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs
Laplacian mixture models identify overlapping regions of influence in
unlabeled graph and network data in a scalable and computationally efficient
way, yielding useful low-dimensional representations. By combining Laplacian
eigenspace and finite mixture modeling methods, they provide probabilistic or
fuzzy dimensionality reductions or domain decompositions for a variety of input
data types, including mixture distributions, feature vectors, and graphs or
networks. Provable optimal recovery using the algorithm is analytically shown
for a nontrivial class of cluster graphs. Heuristic approximations for scalable
high-performance implementations are described and empirically tested.
Connections to PageRank and community detection in network analysis demonstrate
the wide applicability of this approach. The origins of fuzzy spectral methods,
beginning with generalized heat or diffusion equations in physics, are reviewed
and summarized. Comparisons to other dimensionality reduction and clustering
methods for challenging unsupervised machine learning problems are also
discussed. Comment: 13 figures, 35 reference
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology. Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
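The coarse-then-fine idea behind searching over a cover of hyperspheres can be sketched as follows. This is a minimal illustration under assumed conditions (Euclidean points, a greedy cover), not the authors' implementation; the names `build_cover`, `search`, and `cluster_radius` are hypothetical.

```python
import math


def build_cover(points, cluster_radius):
    """Greedy cover: a point joins the first center within cluster_radius,
    otherwise it becomes the center of a new cluster."""
    clusters = []  # list of (center, members)
    for p in points:
        for center, members in clusters:
            if math.dist(p, center) <= cluster_radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters


def search(clusters, query, radius, cluster_radius):
    """Return all points within `radius` of `query`.

    Coarse step: by the triangle inequality, any hit must belong to a
    cluster whose center lies within radius + cluster_radius of the query.
    Fine step: scan only the members of those clusters.
    """
    hits = []
    for center, members in clusters:
        if math.dist(query, center) <= radius + cluster_radius:
            hits.extend(p for p in members if math.dist(query, p) <= radius)
    return hits
```

The coarse step costs one distance evaluation per cluster center, so its cost grows with the number of covering spheres rather than with the number of points; the fine step only touches clusters that could contain a hit.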
Strange Bedfellows: Quantum Mechanics and Data Mining
Last year, in 2008, I gave a talk titled "Quantum Calisthenics". This
year I am going to tell you about how the work I described then has spun off
into a most unlikely direction. What I am going to talk about is how one maps
the problem of finding clusters in a given data set into a problem in quantum
mechanics. I will then use the tricks I described to let the clusters come
together on their own under quantum evolution. Comment: 11 pages, 7 figures, Invited Talk at Light Cone 200
Rigidity and flexibility of biological networks
The network approach has become a widely used tool for understanding the behaviour of
complex systems in the last decade. We start from a short description of
structural rigidity theory. A detailed account of the combinatorial rigidity
analysis of protein structures, as well as local flexibility measures of
proteins and their applications in explaining allostery and thermostability is
given. We also briefly discuss the network aspects of cytoskeletal tensegrity.
Finally, we show the importance of the balance between functional flexibility
and rigidity in protein-protein interaction, metabolic, gene regulatory and
neuronal networks. Our summary raises the possibility that the concepts of
flexibility and rigidity can be generalized to all networks. Comment: 21 pages, 4 figures, 1 tabl
Alpha, Betti and the Megaparsec Universe: on the Topology of the Cosmic Web
We study the topology of the Megaparsec Cosmic Web in terms of the
scale-dependent Betti numbers, which formalize the topological information
content of the cosmic mass distribution. While the Betti numbers do not fully
quantify topology, they extend the information beyond conventional cosmological
studies of topology in terms of genus and Euler characteristic. The richer
information content of Betti numbers goes along with the availability of fast
algorithms to compute them.
For continuous density fields, we determine the scale-dependence of Betti
numbers by invoking the cosmologically familiar filtration of sublevel or
superlevel sets defined by density thresholds. For the discrete galaxy
distribution, however, the analysis is based on the alpha shapes of the
particles. These simplicial complexes constitute an ordered sequence of nested
subsets of the Delaunay tessellation, a filtration defined by the scale
parameter, α. As they are homotopy equivalent to the sublevel sets of
the distance field, they are an excellent tool for assessing the topological
structure of a discrete point distribution. In order to develop an intuitive
understanding for the behavior of Betti numbers as a function of α, and
their relation to the morphological patterns in the Cosmic Web, we first study
them within the context of simple heuristic Voronoi clustering models.
Subsequently, we address the topology of structures emerging in the standard
LCDM scenario and in cosmological scenarios with alternative dark energy
content. The evolution and scale-dependence of the Betti numbers is shown to
reflect the hierarchical evolution of the Cosmic Web and yields a promising
measure of cosmological parameters. We also discuss the expected Betti numbers
as a function of the density threshold for superlevel sets of a Gaussian random
field. Comment: 42 pages, 14 figure
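For intuition on how a Betti number depends on a density threshold, the zeroth Betti number (the number of connected components) of a superlevel set can be counted directly on a gridded field. The sketch below is a simplified illustration, not the paper's method: it assumes a 2D grid stored as a list of lists and 4-neighbour connectivity, and the function name `betti0_superlevel` is hypothetical.

```python
def betti0_superlevel(field, threshold):
    """Count connected components of the cells with value >= threshold,
    using 4-neighbour connectivity and an iterative flood fill."""
    rows, cols = len(field), len(field[0])
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if field[r][c] < threshold or seen[r][c]:
                continue
            # Unvisited cell above the threshold: start of a new component.
            components += 1
            seen[r][c] = True
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and field[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return components
```

Evaluating this function over a decreasing sequence of thresholds traces how isolated density peaks appear and then merge as the superlevel set grows.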
A self-learning algorithm for biased molecular dynamics
A new self-learning algorithm for accelerated dynamics, reconnaissance
metadynamics, is proposed that is able to work with a very large number of
collective coordinates. Acceleration of the dynamics is achieved by
constructing a bias potential in terms of a patchwork of one-dimensional,
locally valid collective coordinates. These collective coordinates are obtained
from trajectory analyses so that they adapt to any new features encountered
during the simulation. We show how this methodology can be used to enhance
sampling in real chemical systems, citing examples both from the physics of
clusters and from the biological sciences. Comment: 6 pages, 5 figures + 9 pages of supplementary informatio
Statistical Methods in Topological Data Analysis for Complex, High-Dimensional Data
The utilization of statistical methods and their applications within the new
field of study known as Topological Data Analysis has tremendous potential
for broadening our exploration and understanding of complex, high-dimensional
data spaces. This paper provides an introductory overview of the mathematical
underpinnings of Topological Data Analysis, the workflow to convert samples of
data to topological summary statistics, and some of the statistical methods
developed for performing inference on these topological summary statistics. The
intention of this non-technical overview is to motivate statisticians who are
interested in learning more about the subject. Comment: 15 pages, 7 Figures, 27th Annual Conference on Applied Statistics in
Agricultur
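One simple instance of turning a data sample into a topological summary is zero-dimensional persistence: track when connected components of a point cloud merge as a distance scale grows. The sketch below is an illustrative assumption, not taken from the paper. It uses the convention that two points connect once the scale reaches their Euclidean distance, and the name `h0_persistence` is hypothetical.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return the scales at which connected components merge.

    Every component is born at scale 0. Processing edges in order of
    length with a union-find structure, each edge that joins two
    different components records one death. For n points this yields
    n - 1 finite death scales; one component never dies.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    return deaths
```

The list of death scales is one possible summary statistic: large gaps between consecutive values suggest well-separated groups in the sample.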