35 research outputs found
A New Algebraic Approach for String Reconstruction from Substring Compositions
We consider the problem of binary string reconstruction from the multiset of
its substring compositions, referred to as the substring composition
multiset, first introduced and studied by Acharya et al. We introduce a new
algorithm for the problem of string reconstruction from its substring
composition multiset which relies on the algebraic properties of the equivalent
bivariate polynomial formulation of the problem. We then characterize specific
algebraic conditions for the binary string to be reconstructed that guarantee
the algorithm does not require any backtracking through the reconstruction,
and, consequently, the time complexity is bounded polynomially. More
specifically, in the case of no backtracking, our algorithm has a time
complexity of [...] compared to the algorithm by Acharya et al., which has a
time complexity of [...], where [...] is the length of the binary
string. Furthermore, it is shown that larger sets of binary strings are
uniquely reconstructable by the new algorithm and without the need for
backtracking leading to codebooks of reconstruction codes that are larger, by a
linear factor in size, compared to the previously known construction by
Pattabiraman et al., while having [...] reconstruction complexity.
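To make the notion of a substring composition multiset concrete, here is a minimal brute-force sketch in Python. It only computes the multiset for a given binary string; it is not the algebraic reconstruction algorithm described in the abstract, and the function name `composition_multiset` is a hypothetical choice made for illustration.

```python
from collections import Counter

def composition_multiset(s: str) -> Counter:
    """Return the multiset of compositions (number of 0s, number of 1s)
    taken over all contiguous substrings of the binary string s.

    Hypothetical helper for illustration only (brute force, O(n^2) substrings).
    """
    n = len(s)
    comps = Counter()
    for i in range(n):
        zeros = ones = 0
        # Extend the substring s[i..j] one symbol at a time.
        for j in range(i, n):
            if s[j] == "1":
                ones += 1
            else:
                zeros += 1
            comps[(zeros, ones)] += 1
    return comps

# A string and its reversal always share the same composition multiset,
# so reconstruction can only ever be unique up to reversal.
same = composition_multiset("1011") == composition_multiset("1101")
```

A string of length n has n(n+1)/2 contiguous substrings, so the multiset for a length-4 string contains 10 compositions in total.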
Generalized Unique Reconstruction from Substrings
This paper introduces a new family of reconstruction codes which is motivated
by applications in DNA data storage and sequencing. In such applications, DNA
strands are sequenced by reading some subset of their substrings. While
previous works considered two extreme cases in which all substrings of
pre-defined lengths are read or substrings are read with no overlap for the
single string case, this work studies two extensions of this paradigm. The
first extension considers the setup in which consecutive substrings are read
with some given minimum overlap. First, an upper bound is provided on the
attainable rates of codes that guarantee unique reconstruction. Then, efficient
constructions of codes that asymptotically meet that upper bound are presented.
In the second extension, we study the setup where multiple strings are
reconstructed together. Given the number of strings and their length, we first
derive a lower bound on the read substrings' length that is necessary
for the existence of multi-strand reconstruction codes with non-vanishing
rates. We then present two constructions of such codes and show that their
rates approach 1 for values of this length that asymptotically behave like the lower
bound.
Comment: arXiv admin note: text overlap with arXiv:2205.0393
Reconstruction from Noisy Substrings
This paper studies the problem of encoding messages into sequences which can
be uniquely recovered from some noisy observations about their substrings. The
observed reads comprise consecutive substrings with some given minimum overlap.
This coded reconstruction problem has applications to DNA storage. We consider
both single-strand reconstruction codes and multi-strand reconstruction codes,
where the message is encoded into a single strand or a set of multiple strands,
respectively. Various parameter regimes are studied. New codes are constructed,
some of whose rates asymptotically attain the upper bounds.
Comment: 35 page
On Codes for the Noisy Substring Channel
We consider the problem of coding for the substring channel, in which
information strings are observed only through their (multisets of) substrings.
Owing to applications in DNA-based data storage and the DNA sequencing
techniques they rely on, interest in this channel has been renewed in recent years. In contrast
to existing literature, we consider a noisy channel model, where information is
subject to noise before its substrings are sampled, motivated by in-vivo
storage.
We study two separate noise models, substitutions or deletions. In both
cases, we examine families of codes which may be utilized for error-correction
and present combinatorial bounds. Through a generalization of the concept of
repeat-free strings, we show that the added required redundancy due to this
imperfect observation assumption is sublinear, either when the fraction of
errors in the observed substring length is sufficiently small, or when that
length is sufficiently long. This suggests that no asymptotic cost in rate is
incurred by this channel model in these cases.
Comment: ISIT 2021 version (including all proofs
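As a rough illustration of the noiseless substring channel and of repeat-free strings, the following Python sketch computes the multiset of all substrings of a fixed length and checks whether any of them repeats. It assumes the usual definition in which a string is repeat-free for a given length when every substring of that length occurs at most once. The function names are hypothetical, and the sketch models neither the noise nor the codes studied in the paper.

```python
from collections import Counter

def substring_multiset(s: str, length: int) -> Counter:
    """Multiset of all contiguous substrings of s with the given length
    (what the noiseless substring channel would reveal about s)."""
    return Counter(s[i:i + length] for i in range(len(s) - length + 1))

def is_repeat_free(s: str, length: int) -> bool:
    """True if no substring of the given length occurs more than once in s."""
    return all(count == 1 for count in substring_multiset(s, length).values())

# "AC" occurs twice among the length-2 substrings of "ACGTAC",
# while all of its length-3 substrings are distinct.
short_ok = is_repeat_free("ACGTAC", 2)
long_ok = is_repeat_free("ACGTAC", 3)
```

Longer observed substrings make repeats less likely, which is one intuitive reason the observed substring length appears as a parameter in results of this kind.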
Coding for storage and testing
The problem of reconstructing strings from substring information has found many applications due to its importance in genomic data sequencing and DNA- and polymer-based data storage. Motivated by platforms that use chains of binary synthetic polymers as the recording media and read the content via tandem mass spectrometers, we propose a new family of codes that allows for both unique string reconstruction and correction of multiple mass errors.
We first consider the paradigm where the masses of substrings of the input string form the evidence set. We consider two approaches: The first approach pertains to asymmetric errors and the error-correction is achieved by introducing redundancy that scales linearly with the number of errors and logarithmically with the length of the string. The proposed construction allows for the string to be uniquely reconstructed based only on its erroneous substring composition multiset. The asymptotic code rate of the scheme is one, and decoding is accomplished via a simplified version of the Backtracking algorithm used for the Turnpike problem. For symmetric errors, we use a polynomial characterization of the mass information and adapt polynomial evaluation code constructions for this setting. In the process, we develop new efficient decoding algorithms for a constant number of composition errors.
The second part of this dissertation addresses a practical paradigm that requires reconstructing mixtures of strings based on the union of compositions of their prefixes and suffixes, generated by mass spectrometry devices. We describe new coding methods that allow for unique joint reconstruction of subsets of strings selected from a code and provide upper and lower bounds on the asymptotic rate of the underlying codebooks. Our code constructions combine properties of binary and Dyck strings and can be extended to accommodate missing substrings in the pool.
In the final chapter of this dissertation, we focus on group testing. We begin with a review of the gold-standard testing protocol for Covid-19, real-time reverse transcription PCR, and its properties and associated measurement data, such as amplification curves, that can guide the development of appropriate and accurate adaptive group testing protocols. We then proceed to examine various off-the-shelf group testing methods for Covid-19, and identify their strengths and weaknesses for the application at hand. Finally, we present a collection of new analytical results for adaptive semiquantitative group testing with combinatorial priors, including performance bounds, algorithmic solutions, and noisy testing protocols. The worst-case paradigm extends and improves upon prior work on semiquantitative group testing with and without specialized PCR noise models.