Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
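The coarse-then-fine idea behind covering hyperspheres can be illustrated with a minimal sketch. It is not the implementation of Ammolite, MICA, or esFragBag. It assumes Euclidean distance, a simple greedy cover, and hypothetical function names (`build_cover`, `range_search`). Search cost is driven by the number of cluster centers that must be compared against the query, plus the members of the few clusters that survive the coarse step.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.dist(a, b)


def build_cover(points, cluster_radius, dist=euclidean):
    """Greedy cover (an assumption of this sketch): each point joins the
    first center within cluster_radius, otherwise it becomes a new center."""
    centers, clusters = [], []
    for p in points:
        for i, c in enumerate(centers):
            if dist(p, c) <= cluster_radius:
                clusters[i].append(p)
                break
        else:
            centers.append(p)
            clusters.append([p])
    return centers, clusters


def range_search(query, radius, centers, clusters, cluster_radius, dist=euclidean):
    """Return all points within `radius` of `query`.

    Coarse step: keep only clusters whose center lies within
    radius + cluster_radius of the query. By the triangle inequality,
    any point within `radius` of the query belongs to such a cluster.
    Fine step: scan the members of the surviving clusters.
    """
    hits = []
    for c, members in zip(centers, clusters):
        if dist(query, c) <= radius + cluster_radius:
            hits.extend(p for p in members if dist(query, p) <= radius)
    return hits


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
    centers, clusters = build_cover(pts, cluster_radius=2.0)
    print(range_search((0.5, 0.5), 1.0, centers, clusters, cluster_radius=2.0))
```

Because the coarse step in this sketch relies on the triangle inequality, it discards no true hits when the distance is a true metric. The fewer centers needed to cover the data, the fewer comparisons the coarse step performs.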
Robust Low-Rank Subspace Segmentation with Semidefinite Guarantees
A recent line of research proposes employing Spectral Clustering (SC) to
segment (group) high-dimensional structural data, such as data (approximately)
lying on subspaces or low-dimensional manifolds. (Throughout the paper, we use
segmentation, clustering, and grouping, and their verb forms, interchangeably.
We follow {liu2010robust} and use the term "subspace" to denote both linear
subspaces and affine subspaces; there is a trivial conversion between the two,
as mentioned therein.) By learning the affinity matrix in the form of sparse
reconstruction, techniques proposed in this vein often considerably boost
performance in subspace settings where traditional SC can fail. Despite this
success, fundamental problems have been left unsolved: the spectral properties
of the learned affinity matrix cannot be gauged in advance, and there is often
an ugly symmetrization step that post-processes the affinity matrix before it
is passed to SC. Hence we advocate enforcing the symmetric positive
semidefinite constraint explicitly during learning (Low-Rank Representation
with Positive SemiDefinite constraint, or LRR-PSD), and show that it can in
fact be solved efficiently by an elegant scheme, rather than by general-purpose
SDP solvers, which usually scale poorly. We provide rigorous mathematical
derivations to show that, in its canonical form, LRR-PSD is equivalent to the
recently proposed Low-Rank Representation (LRR) scheme {liu2010robust}, and
hence offer theoretical and practical insights into both LRR-PSD and LRR,
inviting future research. As for the computational cost, our proposal is at
most comparable to that of LRR, if not less. We validate our theoretical
analysis and optimization scheme by experiments on both synthetic and real
data sets.
Comment: 10 pages, 4 figures. Accepted by ICDM Workshop on Optimization Based
Methods for Emerging Data Mining Problems (OEDM), 2010. Main proof simplified
and typos corrected. Experimental data slightly adde
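The equivalence claim in the abstract can be made concrete with a short worked statement. This is a sketch and not the authors' derivation. It assumes the noiseless formulation from the cited LRR work, and the abstract itself does not define any of the symbols: $X$ is a data matrix whose columns are samples, $Z$ is the representation matrix, and $\|\cdot\|_{*}$ is the nuclear norm.

```latex
% Noiseless (canonical) LRR, as posed in the cited LRR work:
\min_{Z} \; \|Z\|_{*} \quad \text{s.t.} \quad X = XZ

% The same objective with an explicit symmetric PSD constraint (LRR-PSD):
\min_{Z} \; \|Z\|_{*} \quad \text{s.t.} \quad X = XZ, \;\; Z \succeq 0

% With the skinny SVD X = U \Sigma V^{\top}, the known minimizer of the
% first problem is
Z^{\star} = V V^{\top}
```

Since $V V^{\top}$ is an orthogonal projector, it is symmetric positive semidefinite and therefore feasible for the second problem. This is consistent with the abstract's statement that the two canonical forms are equivalent.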
Complexity, BioComplexity, the Connectionist Conjecture and Ontology of Complexity
This paper develops and integrates major ideas and concepts on complexity and biocomplexity: the connectionist conjecture, universal ontology of complexity, irreducible complexity of totality and inherent randomness, perpetual evolution of information, emergence of criticality, and equivalence of symmetry and complexity. The paper introduces the Connectionist Conjecture, which states that the one and only representation of Totality is the connectionist one, i.e. in terms of nodes and edges. It also introduces the idea of a Universal Ontology of Complexity and develops concepts in that direction. The paper further develops ideas and concepts on the perpetual evolution of information and on the irreducibility and computability of totality, all in the context of the Connectionist Conjecture. The paper indicates that control and communication are the prime functionals responsible for the symmetry and complexity of complex phenomena. The paper takes the stand that the phenomenon of life (including its evolution) is probably the nearest to what we can describe with the term “complexity”. The paper also assumes that signaling and communication within the living world, and of the living world with the environment, create the connectionist structure of biocomplexity. With life and its evolution as the substrate, the paper develops ideas towards the ontology of complexity. The paper introduces new complexity-theoretic interpretations of fundamental biomolecular parameters. The paper also develops ideas on the methodology to determine the complexity of “true” complex phenomena.