
    Distribution Re-weighting and Voting Paradoxes

    We explore a specific type of distribution shift called domain expertise, in which training is limited to a subset of all possible labels. This setting is common among specialized human experts and in narrowly focused studies. We show how the standard approach to distribution shift, which involves re-weighting data, can result in paradoxical disagreements among differing domain expertise. We also demonstrate how standard adjustments for causal inference lead to the same paradox. We prove that the characteristics of these paradoxes exactly mimic another set of paradoxes which arise among sets of voter preferences.

    On the Information Capacity of Nearest Neighbor Representations

    The von Neumann computer architecture has a distinction between computation and memory. In contrast, the brain has an integrated architecture where computation and memory are indistinguishable. Motivated by the architecture of the brain, we propose a model of associative computation where memory is defined by a set of vectors in $\mathbb{R}^n$ (that we call anchors), computation is performed by convergence from an input vector to a nearest neighbor anchor, and the output is a label associated with an anchor. Specifically, in this paper, we study the representation of Boolean functions in the associative computation model, where the inputs are binary vectors and the corresponding outputs are the labels (0 or 1) of the nearest neighbor anchors. The information capacity of a Boolean function in this model is associated with two quantities: (i) the number of anchors (called Nearest Neighbor (NN) Complexity) and (ii) the maximal number of bits representing entries of anchors (called Resolution). We study symmetric Boolean functions and present constructions that have optimal NN complexity and resolution. Comment: The conference version is submitted to and accepted by ISIT 202
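
    The associative computation model described above can be illustrated with a small sketch. It is not taken from the paper: the function name `nn_label` and both anchor sets are assumptions chosen for illustration. Majority on 3 bits is represented with 2 anchors, and parity on n bits with n + 1 anchors placed on the diagonal, where the anchor (i/n, ..., i/n) carries the label i mod 2.

```python
from fractions import Fraction
from itertools import product


def nn_label(x, anchors):
    """Return the label of the anchor nearest to x (squared Euclidean distance).

    `anchors` is a list of (vector, label) pairs; this layout is an
    illustrative assumption, not the paper's notation.
    """
    def sqdist(a):
        return sum((xi - ai) ** 2 for xi, ai in zip(x, a))

    return min(anchors, key=lambda pair: sqdist(pair[0]))[1]


# Majority on 3 bits: two anchors, each entry needs a single bit.
maj_anchors = [((0, 0, 0), 0), ((1, 1, 1), 1)]


def parity_anchors(n):
    """n + 1 diagonal anchors (i/n, ..., i/n), each labeled i mod 2.

    An input of Hamming weight w is uniquely closest to the anchor with i = w,
    so the nearest-neighbor label equals the parity of the input.
    """
    return [(tuple(Fraction(i, n) for _ in range(n)), i % 2) for i in range(n + 1)]


if __name__ == "__main__":
    for x in product([0, 1], repeat=3):
        print(x, nn_label(x, maj_anchors))
```

    In this sketch the majority representation uses 2 anchors with 0/1 entries, while the parity representation uses n + 1 anchors whose entries are multiples of 1/n, which shows how the number of anchors and the precision of their entries vary separately.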

    Robust Indexing for the Sliced Channel: Almost Optimal Codes for Substitutions and Deletions

    Encoding data as a set of unordered strings is receiving great attention as it captures one of the basic features of DNA storage systems. However, the challenge of constructing optimal redundancy codes for this channel has remained elusive. In this paper, we address this problem and present an order-wise optimal construction of codes that are capable of correcting multiple substitution, deletion, and insertion errors for this channel model. The key ingredient in the code construction is a technique we call robust indexing: simultaneously assigning indices to unordered strings (hence, creating order) and also embedding information in these indices. The encoded indices are resilient to substitution, deletion, and insertion errors, and therefore, so is the entire code.

    Evolution of k-mer Frequencies and Entropy in Duplication and Substitution Mutation Systems

    Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two kinds of systems, each representing a set of mutations. In the first system, tandem duplications and substitution mutations are allowed and in the other, interspersed duplications. We provide stochastic models and, via stochastic approximation, study the evolution of substring frequencies for these two systems separately. Specifically, we show that k-mer frequencies converge almost surely and determine the limit set. Furthermore, we present a method for finding upper bounds on entropy for such systems.
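
    To make the objects in this abstract concrete, the following minimal sketch shows one tandem duplication step and the k-mer frequencies of the resulting string. The helper names `tandem_duplicate` and `kmer_frequencies` and the example string are assumptions for illustration; the stochastic models and the convergence analysis of the paper are not reproduced here.

```python
from collections import Counter


def tandem_duplicate(s, start, length):
    """Copy the substring s[start:start+length] and insert the copy right after it."""
    if start < 0 or length <= 0 or start + length > len(s):
        raise ValueError("duplicated segment must lie inside the string")
    segment = s[start:start + length]
    return s[:start + length] + segment + s[start + length:]


def kmer_frequencies(s, k):
    """Return the relative frequency of every k-mer (length-k substring) of s."""
    total = len(s) - k + 1
    if k <= 0 or total <= 0:
        raise ValueError("k must be between 1 and len(s)")
    counts = Counter(s[i:i + k] for i in range(total))
    return {kmer: c / total for kmer, c in counts.items()}


if __name__ == "__main__":
    mutated = tandem_duplicate("ACGT", 1, 2)  # duplicates the segment "CG"
    print(mutated)
    print(kmer_frequencies(mutated, 2))
```

    Repeating such steps with randomly chosen positions and lengths gives a simple simulation of a duplication system, under whatever sampling distribution one chooses to assume.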

    What is the Value of Data? On Mathematical Methods for Data Quality Estimation

    Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods of assessing the quality of data are lacking. In this paper, we propose a formal definition for the quality of a given dataset. We assess a dataset’s quality by a quantity we call the expected diameter, which measures the expected disagreement between two randomly chosen hypotheses that explain it, and has recently found applications in active learning. We focus on Boolean hyperplanes, and utilize a collection of Fourier analytic, algebraic, and probabilistic methods to come up with theoretical guarantees and practical solutions for the computation of the expected diameter. We also study the behaviour of the expected diameter on algebraically structured datasets, conduct experiments that validate this notion of quality, and demonstrate the feasibility of our techniques.