35 research outputs found
Sketching via hashing: from heavy hitters to compressed sensing to sparse fourier transform
Sketching via hashing is a popular and useful method for processing large data sets. Its basic idea is as follows. Suppose that we have a large multi-set of elements S=[formula], and we would like to identify the elements that occur “frequently" in S. The algorithm starts by selecting a hash function h that maps the elements into an array c[1…m]. The array entries are initialized to 0. Then, for each element a ∈ S, the algorithm increments c[h(a)]. At the end of the process, each array entry c[j] contains the count of all data elements a ∈ S mapped to j
Group testing:an information theory perspective
The group testing problem concerns discovering a small number of defective
items within a large population by performing tests on pools of items. A test
is positive if the pool contains at least one defective, and negative if it
contains no defectives. This is a sparse inference problem with a combinatorial
flavour, with applications in medical testing, biology, telecommunications,
information technology, data science, and more. In this monograph, we survey
recent developments in the group testing problem from an information-theoretic
perspective. We cover several related developments: efficient algorithms with
practical storage and computation requirements, achievability bounds for
optimal decoding methods, and algorithm-independent converse bounds. We assess
the theoretical guarantees not only in terms of scaling laws, but also in terms
of the constant factors, leading to the notion of the {\em rate} of group
testing, indicating the amount of information learned per test. Considering
both noiseless and noisy settings, we identify several regimes where existing
algorithms are provably optimal or near-optimal, as well as regimes where there
remains greater potential for improvement. In addition, we survey results
concerning a number of variations on the standard group testing problem,
including partial recovery criteria, adaptive algorithms with a limited number
of stages, constrained test designs, and sublinear-time algorithms.Comment: Survey paper, 140 pages, 19 figures. To be published in Foundations
and Trends in Communications and Information Theor
Inference And Learning: Computational Difficulty And Efficiency
In this thesis, we mainly investigate two collections of problems: statistical network inference and model selection in regression. The common feature shared by these two types of problems is that they typically exhibit an interesting phenomenon in terms of computational difficulty and efficiency.
For statistical network inference, our goal is to infer the network structure based on a noisy observation of the network. Statistically, we model the network as generated from the structural information with the presence of noise, for example, planted submatrix model (for bipartite weighted graph), stochastic block model, and Watts-Strogatz model. As the relative amount of ``signal-to-noise\u27\u27 varies, the problems exhibit different stages of computational difficulty. On the theoretical side, we investigate these stages through characterizing the transition thresholds on the ``signal-to-noise\u27\u27 ratio, for the aforementioned models. On the methodological side, we provide new computationally efficient procedures to reconstruct the network structure for each model.
For model selection in regression, our goal is to learn a ``good\u27\u27 model based on a certain model class from the observed data sequences (feature and response pairs), when the model can be misspecified. More concretely, we study two model selection problems: to learn from general classes of functions based on i.i.d. data with minimal assumptions, and to select from the sparse linear model class based on possibly adversarially chosen data in a sequential fashion. We develop new theoretical and algorithmic tools beyond empirical risk minimization to study these problems from a learning theory point of view
Model-Based and Graph-Based Priors for Group Testing
The goal of the group testing problem is to identify a set of defective items
within a larger set of items, using suitably-designed tests whose outcomes
indicate whether any defective item is present. In this paper, we study how the
number of tests can be significantly decreased by leveraging the structural
dependencies between the items, i.e., by incorporating prior information. To do
so, we pursue two different perspectives: (i) As a generalization of the
uniform combinatorial prior, we consider the case that the defective set is
uniform over a \emph{subset} of all possible sets of a given size, and study
how this impacts the information-theoretic limits on the number of tests for
approximate recovery; (ii) As a generalization of the i.i.d.~prior, we
introduce a new class of priors based on the Ising model, where the associated
graph represents interactions between items. We show that this naturally leads
to an Integer Quadratic Program decoder, which can be converted to an Integer
Linear Program and/or relaxed to a non-integer variant for improved
computational complexity, while maintaining strong empirical recovery
performance.Comment: IEEE Transactions on Signal Processin