
    Phase Transitions in the Pooled Data Problem

    In this paper, we study the pooled data problem of identifying the labels associated with a large collection of items, based on a sequence of pooled tests revealing the counts of each label within the pool. In the noiseless setting, we identify an exact asymptotic threshold on the required number of tests with optimal decoding, and prove a phase transition between complete success and complete failure. In addition, we present a novel noisy variation of the problem, and provide an information-theoretic framework for characterizing the required number of tests for general random noise models. Our results reveal that noise can make the problem considerably more difficult, with strict increases in the scaling laws even at low noise levels. Finally, we demonstrate similar behavior in an approximate recovery setting, where a given number of errors is allowed in the decoded labels.
    Comment: Accepted to NIPS 201
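To make the observation model in the abstract above concrete, here is a minimal Python sketch of a single noiseless pooled test: the test reveals how many items of each label are in the pool, but not which item carries which label. The function name `pooled_test` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def pooled_test(labels, pool):
    """Return the count of each label among the items whose indices are in `pool`.

    Hypothetical helper: models one noiseless pooled test only.
    """
    return Counter(labels[i] for i in pool)

# Toy instance (assumed for illustration): six items, three possible labels.
labels = ["a", "b", "a", "c", "b", "a"]
outcome = pooled_test(labels, [0, 1, 2, 5])
print(dict(outcome))
```

The decoder sees only such count vectors across many pools and must recover the full label assignment from them.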

    Group Testing with Runlength Constraints for Topological Molecular Storage

    Motivated by applications in topological DNA-based data storage, we introduce and study a novel setting of Non-Adaptive Group Testing (NAGT) with runlength constraints on the columns of the test matrix, in the sense that any two 1's must be separated by a run of at least d 0's. We describe and analyze a probabilistic construction of a runlength-constrained scheme in the zero-error and vanishing error settings, and show that the number of tests required by this construction is optimal up to logarithmic factors in the runlength constraint d and the number of defectives k in both cases. Surprisingly, our results show that runlength-constrained NAGT is not more demanding than unconstrained NAGT when d=O(k), and that for almost all choices of d and k it is not more demanding than NAGT with a column Hamming weight constraint only. Towards obtaining runlength-constrained Quantitative NAGT (QNAGT) schemes with good parameters, we also provide lower bounds for this setting and a nearly optimal probabilistic construction of a QNAGT scheme with a column Hamming weight constraint.
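The runlength constraint described in the abstract above can be checked mechanically. The sketch below tests whether a single column of a binary test matrix has every pair of consecutive 1's separated by at least d 0's. The function name `satisfies_runlength` and the example columns are assumptions for illustration; this is only a validity check, not the paper's probabilistic construction.

```python
def satisfies_runlength(column, d):
    """Return True if any two 1's in `column` are separated by at least `d` 0's."""
    last_one = None  # index of the most recent 1 seen so far
    for idx, bit in enumerate(column):
        if bit == 1:
            # Number of 0's between this 1 and the previous one.
            if last_one is not None and idx - last_one - 1 < d:
                return False
            last_one = idx
    return True

# Illustrative column: the gaps between consecutive 1's are 2 and 3 zeros.
column = [1, 0, 0, 1, 0, 0, 0, 1]
print(satisfies_runlength(column, 2), satisfies_runlength(column, 3))
```

Checking only consecutive 1's is enough, since any non-adjacent pair of 1's is separated by an even longer span.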

    Learning from Label Proportions: Bootstrapping Supervised Learners via Belief Propagation

    Learning from Label Proportions (LLP) is a learning problem where only aggregate level labels are available for groups of instances, called bags, during training, and the aim is to get the best performance at the instance-level on the test data. This setting arises in domains like advertising and medicine due to privacy considerations. We propose a novel algorithmic framework for this problem that iteratively performs two main steps. For the first step (Pseudo Labeling) in every iteration, we define a Gibbs distribution over binary instance labels that incorporates a) covariate information through the constraint that instances with similar covariates should have similar labels and b) the bag level aggregated label. We then use Belief Propagation (BP) to marginalize the Gibbs distribution to obtain pseudo labels. In the second step (Embedding Refinement), we use the pseudo labels to provide supervision for a learner that yields a better embedding. Further, we iterate on the two steps again by using the second step's embeddings as new covariates for the next iteration. In the final iteration, a classifier is trained using the pseudo labels. Our algorithm displays strong gains against several SOTA baselines (up to 15%) for the LLP Binary Classification problem on various dataset types, both tabular and image. We achieve these improvements with minimal computational overhead above standard supervised learning due to Belief Propagation, for large bag sizes, even for a million samples.
    Comment: Accepted at Regulatable ML @ NeurIPS 202

    Group testing: an information theory perspective

    The group testing problem concerns discovering a small number of defective items within a large population by performing tests on pools of items. A test is positive if the pool contains at least one defective, and negative if it contains no defectives. This is a sparse inference problem with a combinatorial flavour, with applications in medical testing, biology, telecommunications, information technology, data science, and more. In this monograph, we survey recent developments in the group testing problem from an information-theoretic perspective. We cover several related developments: efficient algorithms with practical storage and computation requirements, achievability bounds for optimal decoding methods, and algorithm-independent converse bounds. We assess the theoretical guarantees not only in terms of scaling laws, but also in terms of the constant factors, leading to the notion of the rate of group testing, indicating the amount of information learned per test. Considering both noiseless and noisy settings, we identify several regimes where existing algorithms are provably optimal or near-optimal, as well as regimes where there remains greater potential for improvement. In addition, we survey results concerning a number of variations on the standard group testing problem, including partial recovery criteria, adaptive algorithms with a limited number of stages, constrained test designs, and sublinear-time algorithms.
    Comment: Survey paper, 140 pages, 19 figures. To be published in Foundations and Trends in Communications and Information Theor
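As a small illustration of the test model stated in the abstract above (a test is positive exactly when its pool contains at least one defective), the Python sketch below simulates noiseless non-adaptive tests and applies a simple elimination decoder, commonly known as COMP: every item that appears in a negative test is declared non-defective, and all remaining items are declared defective. The pools, the defective set, and the function names are assumptions chosen for illustration; this is not presented as the monograph's own formulation.

```python
def run_tests(pools, defectives):
    """Noiseless outcomes: a test is positive iff its pool contains a defective item."""
    return [any(item in defectives for item in pool) for pool in pools]

def comp_decode(num_items, pools, outcomes):
    """Elimination decoding: discard every item that appears in a negative test."""
    candidates = set(range(num_items))
    for pool, positive in zip(pools, outcomes):
        if not positive:
            candidates -= set(pool)
    return candidates

# Hypothetical toy design: 6 items, 4 pools, item 2 is the only defective.
pools = [[0, 1, 2], [2, 3, 4], [0, 4, 5], [1, 3, 5]]
defectives = {2}
outcomes = run_tests(pools, defectives)
estimate = comp_decode(6, pools, outcomes)
print(outcomes, estimate)
```

In the noiseless setting this decoder never discards a true defective, because a defective item cannot appear in a negative test; its errors can only be false positives.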

    Algorithmic and Coding-theoretic Methods for Group Testing and Private Information Retrieval

    In the first part of this dissertation, we consider the Group Testing (GT) problem and its two variants, the Quantitative GT (QGT) problem and the Coin Weighing (CW) problem. An instance of the GT problem includes a ground set of items that includes a small subset of defective items. The GT procedure consists of a number of tests, such that each test indicates whether or not a given subset of items includes one or more defective items. The goal of the GT procedure is to identify the subset of defective items with the minimum number of tests. Motivated by practical scenarios where the outcome of the tests can be affected by noise, we focus on the noisy GT setting, in which the outcome of a test can be flipped with some probability. In the noisy GT setting, the goal is to identify the set of defective items with high probability. We investigate the performance of two variants of the Belief Propagation (BP) algorithm for decoding of noisy non-adaptive GT under the combinatorial model for defective items. Through extensive simulations, we show that the proposed algorithms achieve higher success probability and lower false-negative and false-positive rates when compared to the traditional BP algorithm. We also consider a variation of the probabilistic GT model in which the prior probability of each item to be defective is not uniform and in which there is a certain amount of side information on the distribution of the defective items available to the GT algorithm. This dissertation focuses on leveraging the side information for improving the performance of decoding algorithms for noisy GT. First, we propose a probabilistic model, referred to as an interaction model, that captures the side information about the probability distribution of the defective items. Next, we present a decoding scheme, based on BP, that leverages the interaction model to improve the decoding accuracy. 
Our results indicate that the proposed algorithm achieves higher success probability and lower false-negative and false-positive rates when compared to the traditional BP, especially in the high noise regime. In the QGT problem, the result of a test reveals the number of defective items in the tested group. This is in contrast to the standard GT where the result of each test is either 1 or 0 depending on whether the tested group contains any defective items or not. In this dissertation, we study the QGT problem for the combinatorial and probabilistic models of defective items. We propose non-adaptive QGT algorithms using sparse graph codes over bi-regular and irregular bipartite graphs, and binary t-error-correcting BCH codes. The proposed schemes provide exact recovery with a probabilistic guarantee, i.e. recover all the defective items with high probability. The proposed schemes outperform existing non-adaptive QGT schemes for the sub-linear regime in terms of the number of tests required to identify all defective items with high probability. The CW problem lies at the intersection of GT and compressed sensing problems. Given a collection of coins and the total weight of the coins, where the weight of each coin is an unknown integer, the problem is to determine the weight of each coin by weighing subsets of coins on a spring scale. The goal is to minimize the average number of weighings over all possible weight configurations. Toward this goal, we propose and analyze a simple and effective adaptive weighing strategy. This is the first non-trivial achievable upper bound on the minimum expected required number of weighings. In the second part of this dissertation, we focus on the private information retrieval problem. In many practical settings, the user needs to retrieve information messages from a server in a periodic manner, over multiple rounds of communication. 
The messages are retrieved one at a time and the identity of future requests is not known to the server. We study the private information retrieval protocols that ensure that the identities of all the messages retrieved from the server are protected. This scenario can occur in practical settings such as periodic content download from text and multimedia repositories. We refer to this problem of minimizing the rate of data download as the online private information retrieval problem. Following the previous line of work by Kadhe et al., we assume that the user knows a subset of messages in the database as side information. The identities of these messages are initially unknown to the server. Focusing on scalar-linear settings, we characterize the per-round capacity, i.e., the maximum achievable download rate at each round. The key idea of our achievability scheme is to combine the data downloaded during the current round and the previous rounds with the original side information messages and use the resulting data as side information for the subsequent rounds.