Improved Asymptotic Bounds for Codes Correcting Insertions and Deletions
This paper studies the cardinality of codes correcting insertions and
deletions. We give an asymptotically improved upper bound on code size. The
bound is obtained by utilizing the asymmetric property of list decoding for
insertions and deletions. Comment: 9 pages, 2 figures
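As an illustration of what "correcting deletions" means for a code, the following is a minimal brute-force sketch and is not taken from the paper. It assumes a code is a list of equal-length strings and checks that the sets of strings reachable by deleting exactly t symbols (the deletion balls) are pairwise disjoint. The names `deletion_ball`, `corrects_deletions`, and `toy_code` are hypothetical.

```python
from itertools import combinations


def deletion_ball(word, t):
    """All strings obtained from `word` by deleting exactly t symbols."""
    n = len(word)
    return {
        "".join(word[i] for i in keep)
        for keep in combinations(range(n), n - t)
    }


def corrects_deletions(code, t):
    """True if the t-deletion balls of distinct codewords are pairwise disjoint."""
    balls = [deletion_ball(c, t) for c in code]
    return all(a.isdisjoint(b) for a, b in combinations(balls, 2))


# Hypothetical toy code of length 4 (illustration only).
toy_code = ["0000", "1111", "0011"]
print(corrects_deletions(toy_code, 1))  # True: no two codewords share a length-3 subsequence
print(corrects_deletions(toy_code, 2))  # False: "00" is reachable from both "0000" and "0011"
```

The check is exponential in the word length, so it is only suitable for toy examples; bounds such as the one in the abstract concern how large a code can be while still passing this kind of test.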
Fundamental Bounds and Approaches to Sequence Reconstruction from Nanopore Sequencers
Nanopore sequencers are emerging as promising new platforms for
high-throughput sequencing. As with other technologies, sequencer errors pose a
major challenge for their effective use. In this paper, we present a novel
information theoretic analysis of the impact of insertion-deletion (indel)
errors in nanopore sequencers. In particular, we consider the following
problems: (i) for given indel error characteristics and rate, what is the
probability of accurate reconstruction as a function of sequence length; (ii)
what is the number of 'typical' sequences within the distortion bound induced
by indel errors; (iii) using replicated extrusion (the process of passing a DNA
strand through the nanopore), what is the number of replicas needed to reduce
the distortion bound so that only one typical sequence exists within the
distortion bound.
Our results provide a number of important insights: (i) the maximum length of
a sequence that can be accurately reconstructed in the presence of indel and
substitution errors is relatively small; (ii) the number of typical sequences
within the distortion bound is large; and (iii) replicated extrusion is an
effective technique for unique reconstruction. In particular, we show that the
number of replicas grows slowly (logarithmically) with sequence length,
implying that through replicated extrusion, we can sequence large reads using
nanopore sequencers. Our model considers indel and substitution errors
separately. In this sense, it can be viewed as providing (tight) bounds on
reconstruction lengths and repetitions for accurate reconstruction when the two
error modes are considered in a single model. Comment: 12 pages, 5 figures
Deletion codes in the high-noise and high-rate regimes
The noise model of deletions poses significant challenges in coding theory,
with basic questions like the capacity of the binary deletion channel still
being open. In this paper, we study the harder model of worst-case deletions,
with a focus on constructing efficiently decodable codes for the two extreme
regimes of high-noise and high-rate. Specifically, we construct polynomial-time
decodable codes with the following trade-offs (for any eps > 0):
(1) Codes that can correct a fraction 1-eps of deletions with rate poly(eps)
over an alphabet of size poly(1/eps);
(2) Binary codes of rate 1-O~(sqrt(eps)) that can correct a fraction eps of
deletions; and
(3) Binary codes that can be list decoded from a fraction (1/2-eps) of
deletions with rate poly(eps).
Our work is the first to achieve the qualitative goals of correcting a
deletion fraction approaching 1 over bounded alphabets, and correcting a
constant fraction of bit deletions with rate approaching 1. The above results
bring our understanding of deletion code constructions in these regimes to a
similar level as worst-case errors.
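The worst-case deletion model can be made concrete with a standard fact: two length-n codewords can be confused after k deletions exactly when they share a common subsequence of length n-k. The sketch below is an illustration and not the paper's construction. It computes the largest number of worst-case deletions a small code tolerates from the longest common subsequence (LCS) over all codeword pairs. The function names and the toy code are hypothetical.

```python
from itertools import combinations


def lcs_length(x, y):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def max_correctable_deletions(code):
    """Largest k such that no two codewords share a subsequence of length n - k."""
    n = len(code[0])
    worst = max(lcs_length(x, y) for x, y in combinations(code, 2))
    return n - worst - 1


# Hypothetical toy code of length 4 (illustration only).
toy_code = ["0000", "1111", "0011"]
print(max_correctable_deletions(toy_code))  # 1
```

This criterion also suggests why a deletion fraction of 1/2 is a natural barrier for binary codes: among any three binary words of length n, two share a majority symbol and hence a common subsequence of length at least n/2.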
Database Matching Under Adversarial Column Deletions
The de-anonymization of users from anonymized microdata through matching or
aligning with publicly-available correlated databases has been of scientific
interest recently. While most of the rigorous analyses of database matching
have focused on random-distortion models, the adversarial-distortion models
have been wanting in the relevant literature. In this work, motivated by
synchronization errors in the sampling of time-indexed microdata, matching
(alignment) of random databases under adversarial column deletions is
investigated. It is assumed that a constrained adversary, which observes the
anonymized database, can delete up to a $\delta$ fraction of the columns
(attributes) to hinder matching and preserve privacy. Column histograms of the
two databases are utilized as permutation-invariant features to detect the
column deletion pattern chosen by the adversary. The detection of the column
deletion pattern is then followed by an exact row (user) matching scheme. The
worst-case analysis of this two-phase scheme yields a sufficient condition for
the successful matching of the two databases, under the near-perfect recovery
condition. A more detailed investigation of the error probability leads to a
tight necessary condition on the database growth rate, and in turn, to a
single-letter characterization of the adversarial matching capacity. This
adversarial matching capacity is shown to be significantly lower than the
random matching capacity, where the column deletions occur randomly. Overall,
our results analytically demonstrate the privacy-wise advantages of adversarial
mechanisms over random ones during the publication of anonymized time-indexed
data.
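The two-phase idea described above (histogram-based detection of the deletion pattern, then exact row matching) can be sketched as follows. This is a simplified illustration under strong assumptions that are not taken from the paper: the observed database is a noiseless copy of the original with rows shuffled and some columns deleted, all column histograms of the original are distinct, and the rows remain distinct after deletion. All function names and the example data are hypothetical.

```python
from collections import Counter


def column_histograms(db):
    """One value-count histogram per column; unchanged by any row permutation."""
    return [Counter(col) for col in zip(*db)]


def detect_kept_columns(original, observed):
    """Phase 1: find which original columns survived, scanning left to right."""
    orig_h = column_histograms(original)
    kept, start = [], 0
    for h in column_histograms(observed):
        for idx in range(start, len(orig_h)):
            if orig_h[idx] == h:
                kept.append(idx)
                start = idx + 1
                break
        else:
            raise ValueError("no consistent deletion pattern found")
    return kept


def match_rows(original, observed, kept):
    """Phase 2: match each observed row to the original row with the same kept entries."""
    projected = {tuple(row[i] for i in kept): r for r, row in enumerate(original)}
    return [projected[tuple(row)] for row in observed]


# Hypothetical 4-user, 4-attribute database (illustration only).
original = [[0, 1, 2, 2], [1, 1, 0, 0], [0, 2, 2, 1], [0, 1, 1, 0]]
# Column 1 deleted and rows shuffled into the order 2, 0, 3, 1.
observed = [[0, 2, 1], [0, 2, 2], [0, 1, 0], [1, 0, 0]]

kept = detect_kept_columns(original, observed)
print(kept)                                  # [0, 2, 3]
print(match_rows(original, observed, kept))  # [2, 0, 3, 1]
```

With correlated rather than identical databases, exact histogram equality would have to be replaced by a statistical test, which this sketch does not attempt.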