3 research outputs found

    Support Estimation with Sampling Artifacts and Errors

    The problem of estimating the support of a distribution is of great importance in many areas of machine learning, computer science, physics and biology. Most of the existing work in this domain has focused on settings that assume perfectly accurate sampling approaches, which is seldom true in practical data science. Here we introduce the first known approach to support estimation in the presence of sampling artifacts and errors, where each sample is assumed to arise from a Poisson repeat channel that simultaneously captures repetitions and deletions of samples. The proposed estimator is based on regularized weighted Chebyshev approximations, with weights governed by evaluations of so-called Touchard (Bell) polynomials. The supports in the presence of sampling artifacts are calculated using discretized semi-infinite programming methods. The estimation approach is tested on synthetic and textual data, as well as on GISAID data collected to address a new problem in computational biology: mutational support estimation in genes of the SARS-CoV-2 virus. In the latter setting, the Poisson channel captures the fact that many individuals are tested multiple times for the presence of viral RNA, thereby leading to repeated samples, while other individuals' results are not recorded due to test errors. For all experiments performed, we observed significant improvements of our integrated methods compared to those obtained through adequate modifications of state-of-the-art noiseless support estimation methods.
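
As a rough illustration of the sampling model described in this abstract, the sketch below simulates a Poisson repeat channel in Python. It assumes that each clean sample is emitted a Poisson-distributed number of times, with a count of zero acting as a deletion; that reading, along with the names `poisson`, `poisson_repeat_channel`, and the rate parameter `lam`, is an assumption made for illustration and is not taken from the paper. The estimator itself (weighted Chebyshev approximation, semi-infinite programming) is not reproduced here.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def poisson_repeat_channel(samples, lam, rng):
    """Pass samples through a hypothetical Poisson repeat channel.

    Assumption: each input sample is independently repeated N ~ Poisson(lam)
    times, so N == 0 deletes the sample and N >= 2 duplicates it.
    """
    noisy = []
    for s in samples:
        noisy.extend([s] * poisson(lam, rng))
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)
    # Clean draws from a small uniform distribution over five symbols.
    clean = [rng.choice("abcde") for _ in range(20)]
    noisy = poisson_repeat_channel(clean, lam=1.0, rng=rng)
    # The noisy counts differ from the clean ones, which is what a
    # channel-aware support estimator would have to correct for.
    print(Counter(clean))
    print(Counter(noisy))
```

The channel can only hide symbols, never invent them, so the set of symbols seen in the noisy output is always a subset of those in the clean input.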

    Sharp Analytical Capacity Upper Bounds for Sticky and Related Channels

    We study natural examples of binary channels with synchronization errors. These include the duplication channel, which independently outputs a given bit once or twice, and geometric channels that repeat a given bit according to a geometric rule, with or without the possibility of bit deletion. We apply the general framework of Cheraghchi (STOC 2018) to obtain sharp analytical upper bounds on the capacity of these channels. Previously, upper bounds were known via numerical computations involving the computation of finite approximations of the channels by a computer and then using the obtained numerical results to upper bound the actual capacity. While leading to sharp numerical results, further progress on the full understanding of the channel capacity inherently remains elusive using such methods. Our results can be regarded as a major step towards a complete understanding of the capacity curves. Quantitatively, our upper bounds sharply approach, and in some cases surpass, the bounds that were previously only known by purely numerical methods. Among our results, we notably give a completely analytical proof that, when the number of repetitions per bit is geometric (supported on $\{0,1,2,\dots\}$) with mean growing to infinity, the channel capacity remains substantially bounded away from $1$.
    Comment: 37 pages, 12 figures. Fixed some typos and reorganized parts of Section
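
To make the two channel families named in this abstract concrete, here is a minimal Python sketch that simulates them. The function names and the parameters `p` (duplication probability) and `q` (geometric continuation probability) are hypothetical, as is the specific geometric parameterization $P(G = k) = (1 - q)\,q^k$; the paper may use a different convention. The sketch only simulates the channels and does not compute any capacity bound.

```python
import random


def duplication_channel(bits, p, rng):
    """Output each bit once, or twice with probability p (assumed parameter)."""
    out = []
    for b in bits:
        out.append(b)
        if rng.random() < p:
            out.append(b)
    return out


def geometric_repeats(q, rng):
    """Draw G with P(G = k) = (1 - q) * q**k for k = 0, 1, 2, ... (0 <= q < 1)."""
    k = 0
    while rng.random() < q:
        k += 1
    return k


def geometric_channel(bits, q, rng, allow_deletion=True):
    """Repeat each bit a geometric number of times.

    With allow_deletion=True the repeat count is supported on {0, 1, 2, ...},
    so a bit can vanish; otherwise the count is shifted to {1, 2, ...}.
    """
    out = []
    for b in bits:
        count = geometric_repeats(q, rng)
        if not allow_deletion:
            count += 1
        out.extend([b] * count)
    return out


if __name__ == "__main__":
    rng = random.Random(1)
    message = [0, 1, 1, 0, 1]
    print(duplication_channel(message, p=0.5, rng=rng))
    print(geometric_channel(message, q=0.5, rng=rng))
```

Under this parameterization the mean number of repetitions is $q/(1-q)$, so the regime with mean growing to infinity corresponds to $q \to 1$.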

    An Overview of Capacity Results for Synchronization Channels

    Synchronization channels, such as the well-known deletion channel, are surprisingly harder to analyze than memoryless channels, and they are a source of many fundamental problems in information theory and theoretical computer science. One of the most basic open problems regarding synchronization channels is the derivation of an exact expression for their capacity. Unfortunately, most of the classic information-theoretic techniques at our disposal fail spectacularly when applied to synchronization channels. Therefore, new approaches must be considered to tackle this problem. This survey gives an account of the great effort made over the past few decades to better understand the (broadly defined) capacity of synchronization channels, including both the main results and the novel techniques underlying them. Besides the usual notion of channel capacity, we also discuss the zero-error capacity of synchronization channels.
    Comment: 40 pages, 11 figures. Corrected some typos and a reference. Survey, comments are welcome