Beyond Single-Deletion Correcting Codes: Substitutions and Transpositions
We consider the problem of designing low-redundancy codes in settings where one must correct deletions in conjunction with substitutions or adjacent transpositions, a combination of errors that is commonly observed in DNA-based data storage. One of the most basic versions of this problem was settled more than 50 years ago by Levenshtein, who proved that binary Varshamov-Tenengolts codes correct one arbitrary edit error, i.e., one deletion or one substitution, with nearly optimal redundancy. However, this approach fails to extend to many simple and natural variations of the binary single-edit error setting. In this work, we make progress on the code design problem above in three such variations:
- We construct linear-time encodable and decodable length-n non-binary codes correcting a single edit error with nearly optimal redundancy log n+O(log log n), providing an alternative simpler proof of a result by Cai, Chee, Gabrys, Kiah, and Nguyen (IEEE Trans. Inf. Theory 2021). This is achieved by employing what we call weighted VT sketches, a new notion that may be of independent interest.
- We show the existence of a binary code correcting one deletion or one adjacent transposition with nearly optimal redundancy log n+O(log log n).
- We construct linear-time encodable and list-decodable binary codes with list-size 2 for one deletion and one substitution with redundancy 4 log n+O(log log n). This matches the existential bound up to an O(log log n) additive term.
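As background for the abstract above, the following is a minimal sketch of the classical binary Varshamov-Tenengolts code and Levenshtein's single-deletion decoder. It is not the weighted VT sketch construction of the paper; the function names `vt_syndrome` and `vt_decode_deletion` are hypothetical, and only the deletion case (not substitution) is handled.

```python
def vt_syndrome(bits):
    """Weighted checksum sum(i * x_i) mod (n + 1), positions counted from 1."""
    n = len(bits)
    return sum(i * b for i, b in enumerate(bits, start=1)) % (n + 1)


def vt_decode_deletion(received, n, a=0):
    """Recover the length-n codeword with syndrome `a` from a copy
    that has lost exactly one bit (hypothetical helper, lists of 0/1)."""
    assert len(received) == n - 1
    weight = sum(received)  # number of ones that survived
    checksum = sum(i * b for i, b in enumerate(received, start=1))
    deficiency = (a - checksum) % (n + 1)

    if deficiency <= weight:
        # A 0 was deleted; `deficiency` ones lie to its right.
        pos = len(received)
        ones = 0
        while ones < deficiency:
            pos -= 1
            ones += received[pos]
        return received[:pos] + [0] + received[pos:]

    # A 1 was deleted; `deficiency - weight - 1` zeros lie to its left.
    zeros_needed = deficiency - weight - 1
    pos = 0
    zeros = 0
    while zeros < zeros_needed:
        zeros += 1 - received[pos]
        pos += 1
    return received[:pos] + [1] + received[pos:]
```

A codeword here is any bit list `x` with `vt_syndrome(x) == a`; deleting any single bit and calling `vt_decode_deletion` with the same `n` and `a` should return `x`.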
DoDo-Code: a Deep Levenshtein Distance Embedding-based Code for IDS Channel and DNA Storage
Recently, DNA storage has emerged as a promising data storage solution,
offering significant advantages in storage density, maintenance cost
efficiency, and parallel replication capability. Mathematically, the DNA
storage pipeline can be viewed as an insertion, deletion, and substitution
(IDS) channel. Because of the mathematical terra incognita of the Levenshtein
distance, designing an IDS-correcting code is still a challenge. In this paper,
we propose an innovative approach that utilizes deep Levenshtein distance
embedding to bypass these mathematical challenges. By representing the
Levenshtein distance between two sequences as a conventional distance between
their corresponding embedding vectors, the inherent structural property of
Levenshtein distance is revealed in the friendly embedding space. Leveraging
this embedding space, we introduce the DoDo-Code, an IDS-correcting code that
incorporates deep embedding of Levenshtein distance, deep embedding-based
codeword search, and deep embedding-based segment correcting. To address the
requirements of DNA storage, we also present a preliminary algorithm for long
sequence decoding. As far as we know, the DoDo-Code is the first IDS-correcting
code designed using plausible deep learning methodologies, potentially paving
the way for a new direction in error-correcting code research. It is also the
first IDS code that exhibits characteristics of being 'optimal' in terms of
redundancy, significantly outperforming the mainstream IDS-correcting codes of
the Varshamov-Tenengolts code family in code rate.
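For reference, the Levenshtein distance that the abstract above approximates with learned embeddings can be computed exactly by the standard dynamic program. The sketch below is that textbook computation, not the DoDo-Code embedding model; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence `a` into sequence `b`."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]
```

For example, `levenshtein("ACGT", "AGT")` is 1, since a single deletion of `C` suffices.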
Low-redundancy codes for correcting multiple short-duplication and edit errors
Due to its higher data density, longevity, energy efficiency, and ease of
generating copies, DNA is considered a promising storage technology for
satisfying future needs. However, a diverse set of errors including deletions,
insertions, duplications, and substitutions may arise in DNA at different
stages of data storage and retrieval. The current paper constructs
error-correcting codes for simultaneously correcting short (tandem)
duplications and at most edits, where a short duplication generates a copy
of a substring with length and inserts the copy following the original
substring, and an edit is a substitution, deletion, or insertion. Compared to
the state-of-the-art codes for duplications only, the proposed codes correct up
to edits (in addition to duplications) at the additional cost of roughly
symbols of redundancy, thus achieving the same
asymptotic rate, where is the alphabet size and is a constant.
Furthermore, the time complexities of both the encoding and decoding processes
are polynomial when is a constant with respect to the code length.
Comment: 21 pages. The paper has been submitted to IEEE Transactions on
Information Theory. Furthermore, the paper was presented in part at
ISIT2021 and ISIT202
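The tandem-duplication error described in the abstract above (a substring is copied and the copy is inserted right after the original) can be illustrated with a short sketch. The function name `tandem_duplicate` and its parameters are hypothetical, and the sketch places no bound on the duplication length.

```python
def tandem_duplicate(seq, start, length):
    """Return `seq` with seq[start:start+length] copied and the copy
    inserted immediately after the original substring."""
    end = start + length
    if start < 0 or length < 1 or end > len(seq):
        raise ValueError("duplicated block must lie inside the sequence")
    block = seq[start:end]
    return seq[:end] + block + seq[end:]
```

For example, `tandem_duplicate("ACGT", 1, 2)` duplicates `CG` and yields `"ACGCGT"`.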
Codes for Correcting Asymmetric Adjacent Transpositions and Deletions
Codes in the Damerau-Levenshtein metric have been extensively studied
recently owing to their applications in DNA-based data storage. In particular,
Gabrys, Yaakobi, and Milenkovic (2017) designed a length- code correcting a
single deletion and adjacent transpositions with at most
bits of redundancy. In this work, we consider a new setting where both
asymmetric adjacent transpositions (also known as right-shifts or left-shifts)
and deletions may occur. We present several constructions of the codes
correcting these errors in various cases. In particular, we design a code
correcting a single deletion, right-shift, and left-shift errors
with at most bits of redundancy where . In
addition, we investigate codes correcting -deletions, right-shift,
and left-shift errors with both uniquely-decoding and list-decoding
algorithms. Our main contribution here is the construction of a list-decodable
code with list size and with at most bits of redundancy, where . Finally, we construct
both non-systematic and systematic codes for correcting blocks of -deletions
with -limited-magnitude and adjacent transpositions
- …