Algorithms to Exploit Data Sparsity
While data in the real world is very high-dimensional, it generally has some underlying structure; for instance, if we think of an image as a set of pixels with associated color values, most possible settings of those values look more like random noise than what we typically think of as a picture. With an appropriate change of basis, this underlying structure can often be converted into sparsity: an equivalent representation of the data whose magnitude is large in only a few directions relative to the ambient dimension. This motivates a variety of theoretical questions around designing algorithms that exploit data sparsity to achieve better performance than would be possible naively, and in this thesis we tackle several such questions.

We first examine the question of approximating the sparsity level of a signal under several different measurement models, a natural first step if the sparsity is to be exploited by other algorithms. Second, we study a particular sparse signal recovery problem, nonadaptive probabilistic group testing, and investigate exactly how sparse the signal needs to be before the methods used for recovering sparse signals outperform those used for non-sparse signals. Third, we prove novel upper bounds on the number of measurements needed to recover a sparse signal in the universal one-bit compressed sensing model of sparse signal recovery. Fourth, we give some approximations of an information-theoretic quantity called the index coding rate of a network modeled by a graph, in the special case that the graph is sparse or otherwise highly structured. For each of the problems considered, we also discuss some remaining open questions and conjectures, as well as possible directions towards their solutions.
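As an illustration of the basis-change idea in this abstract (a toy sketch, not code from the thesis), the following builds a signal that looks dense in its original coordinates but is exactly 3-sparse in the DCT basis; the indices and values are arbitrary choices for the example:

```python
import numpy as np
from scipy.fft import dct, idct

n = 256
# Build a signal that is exactly 3-sparse in the DCT basis
# (indices and amplitudes chosen arbitrarily for illustration).
coeffs_true = np.zeros(n)
coeffs_true[[5, 40, 120]] = [2.0, -1.5, 0.7]
signal = idct(coeffs_true, norm="ortho")  # dense in the standard basis

# In the original coordinates nearly every entry is nonzero...
dense_count = int(np.sum(np.abs(signal) > 1e-8))

# ...but transforming back exposes the sparsity.
coeffs = dct(signal, norm="ortho")
sparsity = int(np.sum(np.abs(coeffs) > 1e-8))
```

Here estimating `sparsity` by thresholding coefficient magnitudes is a simplistic stand-in for the sparsity-approximation problem the thesis studies under restricted measurement models.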
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.

Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
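The coarse-then-fine search the abstract describes can be sketched in a few lines (an assumed toy implementation, not Ammolite/MICA/esFragBag themselves): greedily cover the dataset with hyperspheres, whose count approximates the metric entropy, then use the triangle inequality to prune whole spheres at query time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: points clustered around a few centers, so a handful
# of hyperspheres covers it (low metric entropy / fractal dimension).
centers = rng.normal(size=(5, 8)) * 10.0
data = np.vstack([c + rng.normal(scale=0.5, size=(200, 8)) for c in centers])

def greedy_cover(points, radius):
    """Greedy covering: pick centers until every point lies within
    `radius` of one; the number picked approximates metric entropy."""
    uncovered = np.ones(len(points), dtype=bool)
    cover = []
    while uncovered.any():
        i = int(np.flatnonzero(uncovered)[0])
        cover.append(i)
        uncovered &= np.linalg.norm(points - points[i], axis=1) > radius
    return cover

radius = 3.0
cover = greedy_cover(data, radius)
# Assign each point to its nearest cover center (distance <= radius).
assign = np.argmin(
    np.linalg.norm(data[:, None, :] - data[cover][None, :, :], axis=2),
    axis=1)

def range_search(query, r):
    """Coarse-then-fine range search: the triangle inequality prunes
    any cluster whose covering sphere cannot meet the query ball,
    then surviving clusters are scanned exactly (no false negatives)."""
    hits = []
    for j, ci in enumerate(cover):
        if np.linalg.norm(data[ci] - query) <= r + radius:
            members = np.flatnonzero(assign == j)
            d = np.linalg.norm(data[members] - query, axis=1)
            hits.extend(members[d <= r].tolist())
    return sorted(hits)
```

Because pruned spheres provably contain no points within `r` of the query, the exact fine-stage check preserves specificity while scanning only a fraction of the data when the cover is small.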